Reinventing XML



  • every now and again i find a wtf post on the forums that isnt a homework...

     

    http://forum.java.sun.com/thread.jspa?threadID=5210236 



  • runs away screaming

    Seriously though, we had to do something similar at a previous job... and we eventually wrote a parser to convert it to/from XML.  That was sad.  What was sadder, though, is that the intern's parser took about 300 lines of code and only worked on certain formats.



  • Why, oh why do people insist on embarrassing themselves? I understand that the OP thinks he has no choice in the matter, but he probably does. Like have the originator of this idiotic idea do it him/herself.



  • What's amazing is that no one in the thread points out that as long as overlapping is forbidden, a simple stack-based parser will handle what he's doing without the need for recursion.
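
A minimal sketch of the stack-based approach mentioned above, in Python. The markup details are assumptions, since the original thread is not quoted here: tags look like `[name]` and `[/name]`, carry no attributes, and must nest without overlapping. The function name `parse_brackets` and the dict-based tree are made up for illustration.

```python
import re

# Assumed tag syntax: [name] opens, [/name] closes, no attributes.
TOKEN = re.compile(r"\[(/?)([A-Za-z][A-Za-z0-9]*)\]")


def parse_brackets(text):
    """Parse [tag]...[/tag] markup into nested dicts using an explicit stack."""
    root = {"tag": None, "children": []}
    stack = [root]  # stack[-1] is the element currently being filled
    pos = 0
    for match in TOKEN.finditer(text):
        if match.start() > pos:
            # Plain text between two tags belongs to the innermost open element.
            stack[-1]["children"].append(text[pos:match.start()])
        closing, name = match.group(1), match.group(2)
        if closing:
            # A closing tag must match the innermost open tag (no overlapping).
            if len(stack) == 1 or stack[-1]["tag"] != name:
                raise ValueError(f"unexpected [/{name}] at offset {match.start()}")
            stack.pop()
        else:
            node = {"tag": name, "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)
        pos = match.end()
    if pos < len(text):
        stack[-1]["children"].append(text[pos:])
    if len(stack) != 1:
        raise ValueError(f"unclosed [{stack[-1]['tag']}]")
    return root
```

The loop never calls itself; nesting depth is tracked only by the length of `stack`, and overlapping input such as `[b][i]x[/b][/i]` is rejected with a `ValueError`.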



  • Where did these idiots learn how to program? Recursion is part of learning programming, but reinventing the wheel is not. Sounds like the project lead is a dinosaur. Either way, maybe they could have used a regular expression parser.



  • Does it really matter what kind of parser can be used? I mean, I know that all of us programmers like a challenge and would want to solve the problem, but if the problem is idiotic, does that still apply?



  • Couldn't you just do "sed -e 's/</\&lt;/g' -e 's/>/\&gt;/g' -e 's/\[/</g' -e 's/]/>/g'" or equivalent and pass the result to a real XML parser? That seems to be the simplest solution.
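
The same idea as the sed one-liner, sketched in Python with the standard library's `xml.etree.ElementTree`. The function names are hypothetical, and two things are assumptions beyond what the post says: `&` is escaped as well (XML requires it), and the input has a single root element. Literal `[` or `]` characters in the text would still be misread as tags.

```python
import xml.etree.ElementTree as ET


def brackets_to_xml(text):
    """Escape XML-special characters, then turn [ and ] into < and >."""
    # Escape & first so the entities produced next are not escaped twice.
    escaped = text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")
    return escaped.replace("[", "<").replace("]", ">")


def parse_bracket_markup(text):
    """Hand the converted text to a real XML parser."""
    return ET.fromstring(brackets_to_xml(text))
```

For example, `parse_bracket_markup("[doc][b]1 < 2[/b][/doc]")` returns an element whose tag is `doc` and whose first child has the text `1 < 2`.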



  • @petvirus said:

    every now and again i find a wtf post on the forums that isnt a homework...

    http://forum.java.sun.com/thread.jspa?threadID=5210236 

    Why would anybody ever want some sort of markup language that was like XML, but used  [ and ] instead of < and >?



  • @vt_mruhlin said:

    Why would anybody ever want some sort of markup language that was like XML, but used  [ and ] instead of < and >?

    Because you could hurt yourself on the pointy bits.



  • Maybe they mean they want to implement their own BBCode system. I did that long, long ago, although now I'd just use one of the many WYSIWYG editors that are available.



  • I wonder why just about every piece of message board software out there uses [tags] instead of XML, then.



  • @Sunstorm said:

    I wonder why just about every piece of message board software out there uses [tags] instead of XML, then.

    Once upon a time, long ago when guestbooks made sense and people posted email addresses without javascript armor, the way proto-message-boards prevented content-injection attacks was to find everything that looked like "< tag >" and destroy it. Then, for simplicity, they added regexes that detect fixed things in square brackets, like [b], and emit "approved" HTML in their place. This let them add back whatever site-specific font/bold/color tags they wanted (later, CSS class definitions and styling) when interpreting the tags (e.g. [ code ]) before rendering to the page, which made more sense from a theming perspective than just accepting user-directed raw HTML tags. (Never mind that with proper CSS you can detect bare HTML inside of the thread DIV and style appropriately, but that would require understanding those COMPLICATED scoping rules and syntax).

    The lone exception is slashdot. This is because they actually know what they're doing, and have been around the longest.
    I wish it would go away. I like to put things in brackets for emphasis; I also tend to nest parens a lot and they help stuff stand out that's double nested.
     



  • Out of curiosity, what stops you from using it for emphasis still?  A forum I frequent has a [code] tag which wraps code in a formatted box.  When I'm instructing new posters to wrap their code in that tag I write it like this [code[i][/i]]/* your code goes here */[/code[i][/i]] which prevents the [code] tag from being rendered and instead outputs it as plain text so they can see what they're supposed to type, i.e. [code]/* your code goes here */[/code].  I know some other forums have a tag you wrap around other tags to prevent them being parsed; the one I use doesn't.



  • It does kinda make sense from a whitelist point of view. There are a lot of things to protect against in regular HTML. The other day I discovered you can actually stick JavaScript inside CSS rules in IE. You either have a very paranoid HTML rewriter that strips out or encodes all the tags it doesn't like (see: LiveJournal), or you go with BBCode, which you can secure simply by encoding <, & and >, and then add in the rest, without fear that some obscure tag will end up being used in some strange way to send all your passwords to Korean gangsters.
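
A rough sketch of that "encode everything, then add in the rest" approach, in Python. The tag table and the name `render_bbcode` are assumptions for illustration, not how any particular forum package does it. Only a handful of attribute-free tags are handled.

```python
import html
import re

# Hypothetical whitelist: BBCode tag name -> HTML element to emit.
SIMPLE_TAGS = {"b": "strong", "i": "em", "code": "code"}


def render_bbcode(text):
    """Escape all HTML, then expand a fixed set of [tag]...[/tag] pairs."""
    # After this call no raw <, >, & or quote characters from the user remain.
    out = html.escape(text, quote=True)
    for bb, tag in SIMPLE_TAGS.items():
        out = re.sub(
            rf"\[{bb}\](.*?)\[/{bb}\]",
            rf"<{tag}>\1</{tag}>",
            out,
            flags=re.DOTALL | re.IGNORECASE,
        )
    return out
```

Anything not in `SIMPLE_TAGS`, such as `[blink]`, is simply left on the page as literal text, so the only HTML that reaches the browser is what the table emits.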



  • The site killed my BB syntax and using ASCII didn't help at all.  It should look like this [ code [ i ] [ /i ] ] without all the spaces.



  • My suggestion would have been: Use a standard XML parser. Then present the finished module to the boss this afternoon and tell him that, if he still needs []ing, you will need the next 2 months to write a new XML parser.

    Another option: grab an open-source XML parser, and change the delimiter characters.  



  • @Sunstorm said:

    It does kinda make sense from a whitelist point of view. There are a lot of things to protect against in regular HTML. The other day I discovered you can actually stick JavaScript inside CSS rules in IE. You either have a very paranoid HTML rewriter that strips out or encodes all the tags it doesn't like (see: LiveJournal), or you go with BBCode, which you can secure simply by encoding <, & and >, and then add in the rest, without fear that some obscure tag will end up being used in some strange way to send all your passwords to Korean gangsters.

    That still makes no sense...

    1. Escape all HTML entities
    2. Unescape all whitelisted tags
    3. ???
    4. Profit

    Using &lt; and &gt; as delimiters requires nothing different than using [ and ].
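
Steps 1 and 2 of the list above could look roughly like this in Python. The whitelist and the name `sanitize` are hypothetical. This sketch only restores bare tags with no attributes and makes no attempt to check that the restored tags are balanced.

```python
import html
import re

# Hypothetical whitelist of attribute-free tags that may be restored.
ALLOWED = ("b", "i", "em", "strong", "code")

# Matches an escaped opening or closing tag such as &lt;b&gt; or &lt;/b&gt;.
_ESCAPED_TAG = re.compile(
    r"&lt;(/?)(" + "|".join(ALLOWED) + r")&gt;", re.IGNORECASE
)


def sanitize(text):
    """Step 1: escape everything. Step 2: unescape whitelisted bare tags."""
    escaped = html.escape(text, quote=True)
    return _ESCAPED_TAG.sub(
        lambda m: f"<{m.group(1)}{m.group(2).lower()}>", escaped
    )
```

A tag with attributes, such as `<b onclick="...">`, does not match the pattern and therefore stays escaped.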



  • @kirchhoff said:

    (Never mind that with proper CSS you can detect bare HTML inside of the thread DIV and style appropriately, but that would require understanding those COMPLICATED scoping rules and syntax).

     Can you elaborate on this and/or give some pointers? It sounds interesting, but I don't know what CSS construct you are referring to...

     

    Thanks!

    Sven
     



  • The poster sounds to me as if he doesn't even know what XML is and just assumes it's some slightly advanced version of HTML.

    There are a few justifications for using square brackets in certain cases, I believe. PHP, for example, doesn't have a fully-fledged XML parser available in most cases. In those cases it would be overkill to write one just for simple formatting. But on the other hand, it would be incredibly complex to filter out "good" and "bad" angle brackets. Those problems are mostly the result of the inherent WTFness of PHP, but that doesn't mean you can do much about it. After all, the reality is that many, many shared hosting packages today come with PHP as the main scripting language and that most message board software is built upon it.

    @Talchas said:

    Couldn't you just do "sed -e 's/</\&lt;/g' -e 's/>/\&gt;/g' -e 's/\[/</g' -e 's/]/>/g'" or equivalent and pass the result to a real XML parser? That seems to be the simplest solution.

    That's exactly what I thought too. But I think I know why they can't do that: The higher-up would reject the code "Only five lines??? That can't possibly work! It must be broken..."



  • (Double post due to stupid edit time out)

    Also, "pseudo XML" is nice if you want to cut down XML's notorious
    verbosity a bit. Common BBCode, for example, uses the form [ url=<link> ]<title>[ /url ]
    for hyperlinks. This is not well-formed XML even if written with angle
    brackets, but it's very handy in the contexts where BBCode appears.

    All of this is of course no excuse if you have an XML parser readily available and still decide not to use it.

     



  • Why is everybody assuming that, setting aside the use of [] vs. <> , the OP's problem would be XML compatible?

    It could be HTML or some other tag soup. Valid HTML is not parseable XML.

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"><html><head><title></title><link></head><body></body></html>
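
To see the point concretely, here is a small Python sketch that feeds the document above to a strict XML parser and to the standard library's lenient `html.parser`. The helper names are made up for illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

HTML_DOC = (
    '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">'
    "<html><head><title></title><link></head><body></body></html>"
)


def is_well_formed_xml(markup):
    """Return True only if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


class TagCollector(HTMLParser):
    """Records every start tag the lenient HTML parser sees."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def start_tags(markup):
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags
```

`is_well_formed_xml(HTML_DOC)` is `False` (the unclosed `<link>` alone is enough to stop an XML parser), while `start_tags(HTML_DOC)` walks the whole document without complaint.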



  • @PSWorx said:

    PHP, for example, doesn't have a fully-fledged XML
    parser available in most cases. In those cases it would be overkill to
    write one just for simple formatting. But on the other hand, it would
    be incredibly complex to filter out "good" and "bad" angle brackets.
    Those problems are mostly the result of the inherent WTFness of PHP, but
    that doesn't mean you can do much about it. After all, the reality is that
    many, many shared hosting packages today come with PHP as the main
    scripting language and that most message board software is built upon
    it.

    It has at least four. At least two are usually available by default in PHP4, all by default in > 5.1.


    @JvdL said:

    It could be HTML or some other tag soup. Valid HTML is not parseable XML.

    He says that it's for a CMS, so it's assumed that people are going to write it. So they might as well use XML. Outputting HTML based on XML is easy, you could even just pass it through an XSLT template or something.
     



  • How come nobody has suggested using regex?

    s/\[/&lt;/g

    s/]/&gt;/g

    ?!?‽‽
     



  • I refuse to believe that he's under imposed constraints to use recursion.  Regardless of what he says, it's either homework or busywork.  If he has to use the brackets for some reason, then so be it.  But if he's being told which programming technique to use, then I'll venture to assert that "by definition this is an assignment".

     



  • @Sunstorm said:


    @JvdL said:

    It could be HTML or some other tag soup. Valid HTML is not parseable XML.

    He says that it's for a CMS, so it's assumed that people are going to write it. So they might as well use XML. Outputting HTML based on XML is easy, you could even just pass it through an XSLT template or something.

    I've worked on a CMS of sorts that allowed certain users to upload HTML snippets extended with tags to render dynamic data in text, charts or tables. Something like this would show whoever is looking at the snippet his or her assets in a pie chart:

    
    <h2>Your assets</h2>
    Hello <dynamic>~/firstname</dynamic> <dynamic>~/lastname</dynamic>, these are your assets:
    <br>
    <dynamic:chart type="pie">select name,quantity from ~/assets</dynamic:chart>
    

    The users allowed to upload such snippets were authenticated "power users" trusted to not abuse this for evil injections. They were able to produce HTML with or without the aid of WYSIWYG editors, but it would certainly not have been acceptable to force them to use XML.

    Upon requesting a page with such snippets, the implementation would parse the "tag soup", replace the dynamic:elements with appropriate HTML and ship it.
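
The post does not say how that replacement step was implemented, so the following Python sketch is only a guess at the general shape. It handles just the plain `<dynamic>~/field</dynamic>` form, not `dynamic:chart`, and the names `render_snippet` and `user` are hypothetical.

```python
import html
import re

# Matches only the plain form, e.g. <dynamic>~/firstname</dynamic>.
DYNAMIC = re.compile(r"<dynamic>\s*~/(\w+)\s*</dynamic>")


def render_snippet(snippet, user):
    """Replace each <dynamic> element with the matching value from `user`."""

    def substitute(match):
        # Unknown fields render as an empty string in this sketch.
        value = user.get(match.group(1), "")
        # Escape the value so stored data cannot inject markup of its own.
        return html.escape(str(value))

    # Everything outside the <dynamic> elements is passed through untouched,
    # so the surrounding markup can remain tag soup.
    return DYNAMIC.sub(substitute, snippet)
```

Because the surrounding HTML is never parsed, an unclosed `<br>` in the snippet causes no trouble here.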
     



  • @suzilou said:

    I refuse to believe that he's under imposed constraints to use recursion.  Regardless of what he says, it's either homework or busywork.  If he has to use the brackets for some reason, then so be it.  But if he's being told which programming technique to use, then I'll venture to assert that "by definition this is an assignment".

     

    That's what I thought at first, although his explanation seems pretty good (i.e., not homework).  More likely he's an intern or junior-level programmer whose manager/tech lead/whoever said, "Hey, we need you to write this. You can probably even do it recursively -- you studied that, right?" and he took it as a requirement.

     



  • @Sunstorm said:

    It has at least four. At least two are usually available by default in PHP4, all by default in > 5.1.





    Also worthy of note: all four actually require one library (gnome-xml2-dev iirc), and if you want to build PHP5 without XML support you have to manually tell configure not to use each one individually (which required me to run configure 5 times).



  • @CDarklock said:

    @vt_mruhlin said:

    Why would anybody ever want some sort of markup language that was like XML, but used  [ and ] instead of < and >?

    Because you could hurt yourself on the pointy bits.

    I dunno. "[" and "]" still have sharp corners. You could consider "(" and ")" but those still have ends that could cause damage. I think we should just use "O" and "O", because there are no sharp corners or ends anywhere! 



  • @too_many_usernames said:

    @CDarklock said:

    @vt_mruhlin said:

    Why would anybody ever want some sort of markup language that was like XML, but used  [ and ] instead of < and >?

    Because you could hurt yourself on the pointy bits.

    I dunno. "[" and "]" still have sharp corners. You could consider "(" and ")" but those still have ends that could cause damage. I think we should just use "O" and "O", because there are no sharp corners or ends anywhere! 

    Personally, I use question marks to hang my coat on. They are the perfect shape!



  • @too_many_usernames said:

    I think we should just use "O" and "O", because there are no sharp corners or ends anywhere! 

    You, my friend, are clearly not skilled in the ways of the ninja... 



  • @kirchhoff said:

    The lone exception is slashdot. This is because they actually know what they're doing, and have been around the longest

    I was about to point out that there are loads of forums that use proper HTML for user comments - perlmonks, for example, is another that's been around for donkey's years and has a decent comment system.

    Then I noticed that I was having to use real HTML here, in this very post, because that's the only way I can find of putting a line break between paragraphs. Hmm, maybe it's not such a rare feature after all.



  • @rbowes said:

    @too_many_usernames said:
    @CDarklock said:

    @vt_mruhlin said:

    Why would anybody ever want some sort of markup language that was like XML, but used  [ and ] instead of < and >?

    Because you could hurt yourself on the pointy bits.

    I dunno. "[" and "]" still have sharp corners. You could consider "(" and ")" but those still have ends that could cause damage. I think we should just use "O" and "O", because there are no sharp corners or ends anywhere! 

    Personally, I use question marks to hang my coat on. They are the perfect shape!

    In Spain



  • @Iago said:

    @kirchhoff said:
    The lone exception is slashdot. This is because they actually know what they're doing, and have been around the longest

    I was about to point out that there are loads of forums that use proper HTML for user comments - perlmonks, for example, is another that's been around for donkey's years and has a decent comment system.

    Then I noticed that I was having to use real HTML here, in this very post, because that's the only way I can find of putting a line break between paragraphs. Hmm, maybe it's not such a rare feature after all.


    The forum software here is a real WTF.  I haven't had the patience to test, but I'm fairly sure it's quite vulnerable to XSS/javascript injection.



  • I run a forum and I'll say this: "Thank God for BBcode!"  I don't feel like having to explain what the hell <a href means when the users just want to post a [link].  Not to mention that there's [url=http://ha.ckers.org/xss.html]a crazy amount of xss attacks to protect against[/url].  Restricting the markup you accept as much as possible is the only sane thing to do.



  • @Cap'n Steve said:

    I run a forum and I'll say this: "Thank God for BBcode!" I don't feel like having to explain what the hell <a href means when the users just want to post a [link]. Not to mention that there's [url=http://ha.ckers.org/xss.html]a crazy amount of xss attacks to protect against[/url]. Restricting the markup you accept as much as possible is the only sane thing to do.

    Well, most of the injections on that page can be blocked quite simply with a whitelist. But I agree, the "why" is a big question: There isn't much point in using a markup language that's about ten shoe sizes too large for what you're intending to do. Hence the XML approach.



  • @Carnildo said:

    @Iago said:
    @kirchhoff said:
    The lone exception is slashdot. This is because they actually know what they're doing, and have been around the longest

    I was about to point out that there are loads of forums that use proper HTML for user comments - perlmonks, for example, is another that's been around for donkey's years and has a decent comment system.

    Then I noticed that I was having to use real HTML here, in this very post, because that's the only way I can find of putting a line break between paragraphs. Hmm, maybe it's not such a rare feature after all.


    The forum software here is a real WTF. I haven't had the patience to test, but I'm fairly sure it's quite vulnerable to XSS/javascript injection.

    Well, there's no time like the present. Who wants to try breaking stuff? 🙂

    <script >alert('test?');</script > Click me! </body> </html> Pants.


  • @rbowes said:

    @Carnildo said:
    @Iago said:
    @kirchhoff said:
    The lone exception is slashdot. This is because they actually know what they're doing, and have been around the longest

    I was about to point out that there are loads of forums that use proper HTML for user comments - perlmonks, for example, is another that's been around for donkey's years and has a decent comment system.

    Then I noticed that I was having to use real HTML here, in this very post, because that's the only way I can find of putting a line break between paragraphs. Hmm, maybe it's not such a rare feature after all.


    The forum software here is a real WTF. I haven't had the patience to test, but I'm fairly sure it's quite vulnerable to XSS/javascript injection.

    Well, there's no time like the present. Who wants to try breaking stuff? 🙂

    <script >alert('test?');</script > Click me! </body> </html> Pants.

    Well, there's no totally obvious way. I wouldn't at all be surprised if I could do more than break the left margin, put a bullet in front of my post, and screw up the links at the bottom, though. 🙂

    Edit: Kind of funny that it sort of hides my "Report Abuse" link. Try reporting me now, suckers! Let's see if I can hide it altogether...





  • @Thief^ said:

    http://forums.worsethanfailure.com/forums/post/124763.aspx

    Very nice! I didn't quite consider going that far, but that's very cool. 🙂

    I wonder if using WTF forum software on this site is intentional?



  • Test...



  • Test

    </html>


  • As long as we're breaking things, let's see if we can break all the way out
    </form></body></html> Where are we now



  • </form> </body> </html> This is as far as it goes apparently.


  • Thread over.



  • </form></body></html>

    OK, maybe that was a bit premature



  • </form> </body> </html> </form> </body> </html> </form> </body> </html> </form> </body> </html> Can I save the thread? test


  • OK, back to some semblance of normalcy



  • That was fun.





  • Well, managed to completely hide my "report abuse" button - I think I broke it again though... well, we'll be on another page any minute

