I am looking for a program that will do the job of a stripper



  • Should clean HTML and strip out all CSS references, but leave basic HTML tags intact.

    I am googling and not finding anything yet.



  • You're probably using the wrong term in your googling search.

    Are plenty of scantily-clad images appearing in your results?



  • @Cassidy said:

    You're probably using the wrong term in your googling search.

    Are plenty of scantily-clad images appearing in your results?

    Yes. Also some industry chemical products are coming up.

    HTML Tidy is not working as desired. I don't want to strip out HTML completely. Leave the basic HTML tags in there.



  • Perhaps ask around and find a Java programmer who could knock up some cross-platform utility that could regex-replace out the stuff you're not interested in?



  • I came here for strippers, ready to recommend that one program that puts tits down in the corner by your clock.

    But I came away empty-handed.



  • This needs some modification


    @Cassidy said:

    Perhaps ask around and find a Java programmer who could knock up some cross-platform utility that could regex-replace out the stuff you're not interested in?



  • @Nagesh said:

    Yes. Also some industry chemical products are coming up.


    Anything from 3D graphics? I once spent a couple of days creating a Wavefront loader which stripped models.


  • Discourse touched me in a no-no place

    @Nagesh said:

    This needs some modification

    Something certainly does - the comments have (ironically) what appear to be HTML tags (amongst other stuff) in them, as if something has stripped the angle brackets from the comments:
    @mike said:
    Given the way we work around here, a conversion from WordML/.doc format to HTML

    @jake said:
    (Small blurb on project intentions here: a href="http://www.critical-masses.com/projects.html"http://www.critical-masses.com/projects.html/a --scroll down to HTMLMin)

    @Sam said:
    Made a version for .NET 1.1. A few bugs fixed (quoted classes were not removed, not all empty tags were deleted).



    a href="http://webdevel.blogspot.com/2006/01/clean-word-html-command-line-tool.html"http://webdevel.blogspot.com/2006/01/clean-word-html-command-line-tool.html/a

    etc.



  • Try Beautiful Soup; it is not that hard and works really nicely. If you are not in a hurry, I can code a basic filter by Tuesday.
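
    Something like this would be the general shape of it - just an untested sketch (Python 3, pip install beautifulsoup4): it drops <style> blocks and stylesheet <link>s and strips style/class attributes, but leaves the tags themselves alone. Adjust the attribute list to taste.

    from bs4 import BeautifulSoup  # pip install beautifulsoup4
    import sys

    def strip_css(html):
        soup = BeautifulSoup(html, "html.parser")

        # Remove <style> blocks entirely
        for tag in soup.find_all("style"):
            tag.decompose()

        # Remove <link rel="stylesheet"> references
        for tag in soup.find_all("link"):
            rel = tag.get("rel") or []
            if isinstance(rel, str):
                rel = rel.split()
            if any(r.lower() == "stylesheet" for r in rel):
                tag.decompose()

        # Drop style/class attributes but keep the tags themselves
        for tag in soup.find_all(True):
            tag.attrs.pop("style", None)
            tag.attrs.pop("class", None)

        return str(soup)

    if __name__ == "__main__":
        print(strip_css(sys.stdin.read()))

    Save it as, say, strip_css.py and run it like: python strip_css.py < infile.html > outfile.html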



  • I have downloaded it for now on my personal Windows 7 machine. First let me see how it works here, before I take it to the workplace and test it on the office machine.

    @spamcourt said:

    Try Beautiful Soup; it is not that hard and works really nicely. If you are not in a hurry, I can code a basic filter by Tuesday.

    Thank you for providing the link.



  • @Nagesh said:

    Should clean HTML and strip out all CSS references, but leave basic HTML tags intact.

    I am googling and not finding anything yet.

    Did you try something like:

    sed -E 's/(<[A-Z][^>]*)style=(("[^"]*")|([^[:space:]>]*))([^>]*>)/\1\5/Ig' infile > outfile
    

    or

    perl -pe 's/(<[A-Z][^>]*)style=(("[^"]*")|([^\s>]*))([^>]*>)/$1$5/ig' infile > outfile
    

    where you replace infile and outfile with the appropriate file names, obviously. I think this should match any opening HTML tag with a style attribute and simply delete the style attribute. You may want to run it repeatedly until it finds no more style attributes, in case somebody put more than one into an HTML tag.



  • @rad131304 said:

    @Nagesh said:

    Should clean HTML and strip out all CSS references, but leave basic HTML tags intact.

    I am googling and not finding anything yet.

    Did you try something like:

    sed -E 's/(<[A-Z][^>]*)style=(("[^"]*")|([^[:space:]>]*))([^>]*>)/\1\5/Ig' infile > outfile
    

    or

    perl -pe 's/(<[A-Z][^>]*)style=(("[^"]*")|([^\s>]*))([^>]*>)/$1$5/ig' infile > outfile
    

    where you replace infile and outfile with the appropriate file names, obviously. I think this should match any opening HTML tag with a style attribute and simply delete the style attribute. You may want to run it repeatedly until it finds no more style attributes, in case somebody put more than one into an HTML tag.


    Awww, holy mother of Mechanicsburg, not this again. While you can make a useful parser with sed/awk/perl, please stop advising the use of regular expressions to process HTML (like your example). These pages explain it better and with a nicer tone:
    http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html

    http://www.xtranormal.com/watch/12666177/you-cant-parse-html-with-regular-expressions

    Also, you "think this should match any(...)" ? gosh, it's a computer. What's next? praying and hoping that what you coded matches the specification?



  • @spamcourt said:

    Also, you "think this should match any(...)" ? gosh, it's a computer. What's next? praying and hoping that what you coded matches the specification?

    // I really, really hope this line isn't a syntax error.
    doSome thing()


  • @spamcourt said:

    Awww, holy mother of Mechanicsburg, not this again. While you can make a useful parser with sed/awk/perl, please stop advising the use of regular expressions to process HTML (like your example). These pages explain it better and with a nicer tone:

    http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html


    http://www.xtranormal.com/watch/12666177/you-cant-parse-html-with-regular-expressions

    Also, you "think this should match any(...)" ? gosh, it's a computer. What's next? praying and hoping that what you coded matches the specification?

    Shockingly? I agree with you that regex should not be used for complex file parsing, but I didn't claim that it actually parsed XHTML properly, just that it might solve one specific problem. Note that I said *should* because I didn't test it beyond checking it against a random HTML file. The only things I bothered to worry about from the spec were that sometimes the attribute's value isn't quoted, and that tag names are case-insensitive. He'd gotten one response and hadn't indicated that it actually solved his problem, so I provided another approach that might solve the specific problem at hand.

    Remind me never to try and help you.



  • @rad131304 said:

    Remind me never to try and help you.
     

    It was Nagesh that wanted the stripper. Spamcourt already owns his own harem.



  • Install Node.js from nodejs.org.
    I think that now makes npm available for use on the command line:

    npm install -g coffeescript
    npm install sax

    (sax is installed locally, in the directory you run from, so that require("sax") can find it.)

    Save the code below as nagesh.coffee and run it like: coffee nagesh.coffee

    ### nagesh.coffee ###
    sax = require("sax")
    strict = false
    parser = sax.parser(strict)

    parser.onerror = (err) ->
      console.log(err)

    parser.onopentag = (node) ->
      if node["name"] is 'STYLE' or (node["name"] is 'LINK' and node["attributes"]["REL"] in ['STYLESHEET','stylesheet'])
        return node = null
      atribs = {}
      for key, val of node["attributes"]
        if key not in ["STYLE", 'style']
          atribs[key] = val
      node["attributes"] = atribs
      console.log node

    test = '<!DOCTYPE html>
      <html>
      <head>
      <link rel="stylesheet" type="text/css" href="nagesh.css">
      <title></title>
      <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
      <style>
      div{
      text-align: center;
      color: #FF8822;
      }
      </style>
      </head>
      <body>
      <div id="bodyDiv" style="background:#00F0FF;" >
      Je suis content.
      </div>
      </body>
      </html>'

    parser.write(test).close();


  • This is one of the very few things that Adobe Dreamweaver is brilliant at, actually.  Although maybe a tad expensive if this is all you're going to use it for.



  • Last time I had to clean up HTML, I wrote a C# app that uses the HTML Agility Pack to parse the page. It looped through every element in the file, removing the ones that weren't in a whitelist (and cleaning up the attributes of the remaining elements). I used it to do the first pass of converting a Sharepoint wiki with horrid HTML to Dokuwiki/MediaWiki format. I can try to find the app if it'd be useful to you; a rough sketch of the idea is at the bottom of this post.

    @sprained said:

    This is one of the very few things that Adobe Dreamweaver is brilliant at, actually.  Although maybe a tad expensive if this is all you're going to use it for.

    Dreamweaver: The world's most expensive HTML cleanup tool.
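
    For what it's worth, here's the same whitelist idea sketched in Python with BeautifulSoup rather than C#/HTML Agility Pack (the tag and attribute whitelists below are only illustrative, not the ones my app used, and the sketch is untested against Sharepoint output):

    from bs4 import BeautifulSoup  # pip install beautifulsoup4
    import sys

    # Illustrative whitelists only; extend them to whatever counts as "basic" HTML for you
    KEEP_TAGS = {"html", "head", "title", "body", "p", "br", "a", "b", "i", "u",
                 "em", "strong", "ul", "ol", "li", "table", "tr", "td", "th",
                 "h1", "h2", "h3", "h4", "h5", "h6", "img", "div", "span"}
    KEEP_ATTRS = {"href", "src", "alt", "title"}
    DROP_CONTENTS = {"style", "script"}  # these get removed along with their contents

    def clean(html):
        soup = BeautifulSoup(html, "html.parser")
        # Walk every element: drop style/script, unwrap anything not whitelisted,
        # and clean the attributes on whatever remains
        for tag in soup.find_all(True):
            if tag.name in DROP_CONTENTS:
                tag.decompose()
            elif tag.name not in KEEP_TAGS:
                tag.unwrap()  # drop the tag itself but keep its children and text
            else:
                tag.attrs = {k: v for k, v in tag.attrs.items() if k in KEEP_ATTRS}
        return str(soup)

    if __name__ == "__main__":
        print(clean(sys.stdin.read()))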



  • @Daniel15 said:

    Last time I had to clean up HTML, I wrote a C# app that uses the HTML Agility Pack to parse the page. It looped through every element in the file, removing the ones that weren't in a whitelist (and cleaning up the attributes of the remaining elements). I used it to do the first pass of converting a Sharepoint wiki with horrid HTML to Dokuwiki/MediaWiki format. I can try to find the app if it'd be useful to you.

    @sprained said:

    This is one of the very few things that Adobe Dreamweaver is brilliant at, actually.  Although maybe a tad expensive if this is all you're going to use it for.

    Dreamweaver: The world's most expensive HTML cleanup tool.

    Please send me codez.......................

    Kthxbai,
    nagesh



  • @Nagesh said:

    Please send me codez.......................

    Kthxbai,
    nagesh

     

    That tickled me!



  • @Cassidy said:

    @Nagesh said:

    Please send me codez.......................

    Kthxbai,
    nagesh

     

    That tickled me!

    You must be ticklish naturally.



  • Maybe this is what you need: http://acme.com/software/decss/



  • He probably just meant Reset. Trolled again.

