I am looking for a program that will do the job of a stripper



  • Should clean HTML and strip out all CSS references, but leave basic HTML tags intact.

    I am googling and not finding anything yet.



  • You're probably using the wrong term in your googling search.

    Are plenty of scantily-clad images appearing in your results?



  • @Cassidy said:

    You're probably using the wrong term in your googling search.

    Are plenty of scantily-clad images appearing in your results?

    Yes. Also some industry chemical products are coming up.

    HTML Tidy is not working as desired. I don't want to strip out HTML completely. Leave the basic HTML tags in there.



  • Perhaps ask around and find a Java programmer who could knock up some cross-platform utility that could regex-replace out the stuff you're not interested in?



  • I came here for strippers, ready to recommend that one program that puts tits down in the corner by your clock.

    But I came away empty-handed.



  • This needs some modification


    @Cassidy said:

    Perhaps ask around and find a Java programmer who could knock up some cross-platform utility that could regex-replace out the stuff you're not interested in?



  • @Nagesh said:

    Yes. Also some industry chemical products are coming up.


    Anything from 3D graphics? I once spent a couple of days creating a Wavefront loader which stripped models.


  • Discourse touched me in a no-no place

    @Nagesh said:

    This needs some modification

    Something certainly does - the comments have (ironically) what appear to be HTML tags (amongst other stuff) in them, as if something has stripped the angle brackets from the comments:
    @mike said:
    Given the way we work around here, a conversion from WordML/.doc format to HTML

    @jake said:
    (Small blurb on project intentions here: a href="http://www.critical-masses.com/projects.html"http://www.critical-masses.com/projects.html/a --scroll down to HTMLMin)

    @Sam said:
    Made a version for .NET 1.1. A few bugs fixed (quoted classes were not removed, not all empty tags were deleted).



    a href="http://webdevel.blogspot.com/2006/01/clean-word-html-command-line-tool.html"http://webdevel.blogspot.com/2006/01/clean-word-html-command-line-tool.html/a

    etc.



  • Try Beautiful Soup; it is not that hard and works really nicely. If you are not in a hurry, I can code a basic filter by Tuesday.
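
    Something like this would be the general shape of it - just an untested sketch (Python 3, pip install beautifulsoup4): it drops <style> blocks and stylesheet <link>s and strips style/class attributes, but leaves the tags themselves alone. Adjust the attribute list to taste.

    from bs4 import BeautifulSoup  # pip install beautifulsoup4
    import sys

    def strip_css(html):
        soup = BeautifulSoup(html, "html.parser")

        # Remove <style> blocks entirely
        for tag in soup.find_all("style"):
            tag.decompose()

        # Remove <link rel="stylesheet"> references
        for tag in soup.find_all("link"):
            rel = tag.get("rel") or []
            if isinstance(rel, str):
                rel = rel.split()
            if any(r.lower() == "stylesheet" for r in rel):
                tag.decompose()

        # Drop style/class attributes but keep the tags themselves
        for tag in soup.find_all(True):
            tag.attrs.pop("style", None)
            tag.attrs.pop("class", None)

        return str(soup)

    if __name__ == "__main__":
        print(strip_css(sys.stdin.read()))

    Save it as, say, strip_css.py and run it like: python strip_css.py < infile.html > outfile.html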



  • I have downloaded it for now on my personal Windows 7 machine. First let me see how it works here, before I take it to the workplace and test it on the office machine.

    @spamcourt said:

    Try Beautiful Soup; it is not that hard and works really nicely. If you are not in a hurry, I can code a basic filter by Tuesday.

    Thank you for providing the link.



  • @Nagesh said:

    Should clean HTML and strip out all CSS references, but leave basic HTML tags intact.

    I am googling and not finding anything yet.

    Did you try something like:

    sed -E 's/(<[A-Z][^>]*)style=(("[^"]*")|([^[:space:]>]*))([^>]*>)/\1\5/Ig' infile > outfile
    

    or

    perl -pe 's/(<[A-Z][^>]*)style=(("[^"]*")|([^\s>]*))([^>]*>)/$1$5/ig' infile > outfile
    

    where you replace infile and outfile with the appropriate file names, obviously. I think this should match any opening HTML tag with a style attribute and simply delete the style attribute. You may want to run it repeatedly until it finds no more style attributes, in case somebody put more than one into an HTML tag.



  • @rad131304 said:

    @Nagesh said:

    Should clean HTML and strip out all CSS references, but leave basic HTML tags intact.

    I am googling and not finding anything yet.

    Did you try something like:

    sed -E 's/(<[A-Z][^>]*)style=(("[^"]*")|([^[:space:]>]*))([^>]*>)/\1\5/Ig' infile > outfile
    

    or

    perl -pe 's/(<[A-Z][^>]*)style=(("[^"]*")|([^\s>]*))([^>]*>)/$1$5/ig' infile > outfile
    

    where you replace infile and outfile with the appropriate file names, obviously. I think this should match any opening HTML tag with a style attribute and simply delete the style attribute. You may want to run it repeatedly until it finds no more style attributes, in case somebody put more than one into an HTML tag.


    Awww, holy mother of Mechanicsburg, not this again. While you can make a useful parser with sed/awk/perl, please stop advising the use of regular expressions to process HTML (like your example). These pages explain it better and with a nicer tone:
    http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html

    http://www.xtranormal.com/watch/12666177/you-cant-parse-html-with-regular-expressions

    Also, you "think this should match any(...)" ? gosh, it's a computer. What's next? praying and hoping that what you coded matches the specification?



  • @spamcourt said:

    Also, you "think this should match any(...)" ? gosh, it's a computer. What's next? praying and hoping that what you coded matches the specification?

    // I really, really hope this line isn't a syntax error.
    doSome thing()


  • @spamcourt said:

    Awww, holy mother of Mechanicsburg, not this again. While you can make a useful parser with sed/awk/perl, please stop advising the use of regular expressions to process HTML (like your example). These pages explain it better and with a nicer tone:

    http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html


    http://www.xtranormal.com/watch/12666177/you-cant-parse-html-with-regular-expressions

    Also, you "think this should match any(...)" ? gosh, it's a computer. What's next? praying and hoping that what you coded matches the specification?

    Shockingly? I agree with you that regex should not be used for complex file parsing, but I didn't claim that it actually parsed XHTML properly, just that it might solve one specific problem. Note that I said *should* because I didn't test it beyond checking it against a random HTML file. The only things I bothered to worry about from the spec were that sometimes the attribute's value isn't quoted, and that tag names are case-insensitive. He'd gotten one response and hadn't indicated that it actually solved his problem, so I provided another approach that might solve the specific problem at hand.

    Remind me never to try and help you.



  • @rad131304 said:

    Remind me never to try and help you.
     

    It was Nagesh that wanted the stripper. Spamcourt already owns his own harem.



  • Install Node.js from nodejs.org.
    I think that now makes npm available for use on the command line:

    npm install -g coffeescript
    npm install sax

    (sax is installed locally, in the directory you run from, so that require("sax") can find it.)

    Save the code below as nagesh.coffee and run it like: coffee nagesh.coffee

    ### nagesh.coffee ###
    sax = require("sax")
    strict = false
    parser = sax.parser(strict)

    parser.onerror = (err) ->
      console.log(err)

    parser.onopentag = (node) ->
      if node["name"] is 'STYLE' or (node["name"] is 'LINK' and node["attributes"]["REL"] in ['STYLESHEET','stylesheet'])
        return node = null
      atribs = {}
      for key, val of node["attributes"]
        if key not in ["STYLE", 'style']
          atribs[key] = val
      node["attributes"] = atribs
      console.log node

    test = '<!DOCTYPE html>
      <html>
      <head>
      <link rel="stylesheet" type="text/css" href="nagesh.css">
      <title></title>
      <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
      <style>
      div{
      text-align: center;
      color: #FF8822;
      }
      </style>
      </head>
      <body>
      <div id="bodyDiv" style="background:#00F0FF;" >
      Je suis content.
      </div>
      </body>
      </html>'

    parser.write(test).close();


  • This is one of the very few things that Adobe Dreamweaver is brilliant at, actually.  Although maybe a tad expensive if this is all you're going to use it for.



  • Last time I had to clean up HTML, I wrote a C# app that uses the HTML Agility Pack to parse the page. It looped through every element in the file, removing the ones that weren't in a whitelist (and cleaning up the attributes of the remaining elements). I used it to do the first pass of converting a Sharepoint wiki with horrid HTML to Dokuwiki/MediaWiki format. I can try to find the app if it'd be useful to you; a rough sketch of the idea is at the bottom of this post.

    @sprained said:

    This is one of the very few things that Adobe Dreamweaver is brilliant at, actually.  Although maybe a tad expensive if this is all you're going to use it for.

    Dreamweaver: The world's most expensive HTML cleanup tool.
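
    For what it's worth, here's the same whitelist idea sketched in Python with BeautifulSoup rather than C#/HTML Agility Pack (the tag and attribute whitelists below are only illustrative, not the ones my app used, and the sketch is untested against Sharepoint output):

    from bs4 import BeautifulSoup  # pip install beautifulsoup4
    import sys

    # Illustrative whitelists only; extend them to whatever counts as "basic" HTML for you
    KEEP_TAGS = {"html", "head", "title", "body", "p", "br", "a", "b", "i", "u",
                 "em", "strong", "ul", "ol", "li", "table", "tr", "td", "th",
                 "h1", "h2", "h3", "h4", "h5", "h6", "img", "div", "span"}
    KEEP_ATTRS = {"href", "src", "alt", "title"}
    DROP_CONTENTS = {"style", "script"}  # these get removed along with their contents

    def clean(html):
        soup = BeautifulSoup(html, "html.parser")
        # Walk every element: drop style/script, unwrap anything not whitelisted,
        # and clean the attributes on whatever remains
        for tag in soup.find_all(True):
            if tag.name in DROP_CONTENTS:
                tag.decompose()
            elif tag.name not in KEEP_TAGS:
                tag.unwrap()  # drop the tag itself but keep its children and text
            else:
                tag.attrs = {k: v for k, v in tag.attrs.items() if k in KEEP_ATTRS}
        return str(soup)

    if __name__ == "__main__":
        print(clean(sys.stdin.read()))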



  • @Daniel15 said:

    Last time I had to clean up HTML, I wrote a C# app that uses the HTML Agility Pack to parse the page. It looped through every element in the file, removing the ones that weren't in a whitelist (and cleaning up the attributes of the remaining elements). I used it to do the first pass of converting a Sharepoint wiki with horrid HTML to Dokuwiki/MediaWiki format. I can try to find the app if it'd be useful to you.

    @sprained said:

    This is one of the very few things that Adobe Dreamweaver is brilliant at, actually.  Although maybe a tad expensive if this is all you're going to use it for.

    Dreamweaver: The world's most expensive HTML cleanup tool.

    Please send me codez.......................

    Kthxbai,
    nagesh



  • @Nagesh said:

    Please send me codez.......................

    Kthxbai,
    nagesh

     

    That tickled me!



  • @Cassidy said:

    @Nagesh said:

    Please send me codez.......................

    Kthxbai,
    nagesh

     

    That tickled me!

    You must be ticklish naturally.



  • Maybe this is what you need: http://acme.com/software/decss/



  • He probably just meant Reset. Trolled again.

