Yet another DiscoHTBBCoMLParser bug



  • @mott555 said:

    *easily not damage anything, not easily type

    Continuing the discussion from Why is Jeff everywhere:

    @mott555 said:

    @loopback0 said:
    @mott555 said:
    I can type 120 WPM.

    Do you type with a hammer?! :laughing:

    Since I mostly use mechanical keyboards I could easily* type with a hammer without damaging anything.

    * easily not damage anything, not easily type

    :unamused:

    * easily not damage anything, not easily type

    Check raw. See where "small" tags are, notice what text is actually small.



  • Nice one!



  • Erm, why are there two quotes? I didn't do that...not intentionally anyway.



  • @mott555 said:

    Check raw. See where "small" tags are, notice what text is actually small.

    :laughing:
    ...
    :facepalm:



  • I wonder if any other nested tags are broken...



  • I found this documentary of the DiscoDevs writing the parser:



  • _*)** abcdefghijklmnopqrstuvwxyz1234567890

    **View raw** :laughing:


  • _*what**

    **_*\*<small>what*</small>**



  • The MD5 fuckery topic is :arrows:



  • You mean :arrow_down:


  • BINNED

    The MD5 fuckery knows no bounds.



  • Why not write a program that finds all these parser bugs for us? Oh wait, nevermind, I guess unit tests are :doing_it_wrong:


  • mod

    Once you remember they are using RegEx to parse, this is easy.

    First, let's take a look at what you typed. Because the first asterisk is at the start of a word block, the parser treats it as the start of an italics sequence; the second is ignored; and the third, being at the end of a word block, is used as the close of the italics sequence. Because the <small> tag is now inside the generated <em> tag, the parser "helpfully" closes it before closing the <em>, which is what generates the results you see.

    *<small> *easily* not damage anything, not easily type </small>
    ^        ^      ^
    Start    |      End
    italics  |      italics
             Ignore
    

    Now, let's try the first solution of putting a space after the first asterisk. This method produces the results you want, but converts the first asterisk into a bullet:

    * *easily* not damage anything, not easily type

    Given sufficient knowledge of markup, this shouldn't be difficult to understand. So what is the solution? Why, escape the first asterisk, of course!

    \*<small> *easily* not damage anything, not easily type </small>
    
    \* *easily* not damage anything, not easily type

    Honestly, this one should probably be classified as user error. Trying to add a fix to the parser would add layers of complications and introduce lots of edge cases and bugs.
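
A minimal Python sketch of the behaviour described above, assuming a single regex pass for italics followed by a tag balancer. The names `ITALIC`, `TAG`, `balance`, and `render` are hypothetical and the patterns are illustrative guesses; this is not Discourse's actual code, only one way a regex-driven parser could arrive at the same result.

```python
import re

# Hypothetical italics rule: an unescaped '*' followed by a non-space opens,
# and the next '*' preceded by a non-space, non-backslash character closes.
ITALIC = re.compile(r'(?<!\\)\*(?=\S)(.+?)(?<=[^\s\\])\*')
TAG = re.compile(r'<(/?)(\w+)>')


def balance(html):
    """Close any tags still open inside an element before closing that element."""
    out, stack, pos = [], [], 0
    for m in TAG.finditer(html):
        out.append(html[pos:m.start()])
        pos = m.end()
        closing, name = m.group(1) == '/', m.group(2)
        if not closing:
            stack.append(name)
            out.append(m.group(0))
        elif name in stack:
            # "Helpfully" close inner tags first, e.g. <small> before </em>.
            while stack[-1] != name:
                out.append(f'</{stack.pop()}>')
            stack.pop()
            out.append(m.group(0))
        # A closing tag with no matching opener is dropped (an assumption).
    out.append(html[pos:])
    return ''.join(out)


def render(src):
    html = ITALIC.sub(r'<em>\1</em>', src)   # regex italics pass
    html = html.replace('\\*', '*')          # unescape literal asterisks
    return balance(html)


raw = '*<small> *easily* not damage anything, not easily type </small>'
print(render(raw))
# <em><small> *easily</small></em> not damage anything, not easily type

print(render('\\' + raw))
# *<small> <em>easily</em> not damage anything, not easily type </small>
```

With the unescaped input the italics span swallows the opening <small>, so the balancer closes it early and only " *easily" ends up small. Escaping the first asterisk leaves just *easily* for the pattern to match.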





  • Can you do mine next?


  • mod

    No, that's a legit bug.



  • That MD5 thing is the gift that keeps on giving.



  • Which is why we want you to dissect it! :smile:



  • @abarker said:

    Trying to add a fix to the parser would add additional layers of complications and introduce lots of even more edge cases and bugs.

    FTFD


  • mod

    @rc4 said:

    Which is why we want you to dissect it! :smile:

    :rolleyes: All right, let's see what we can find.

    So @hungrier started with:

    **_*\*<small>what*</small>** to get: _*what**

    So what happens if we drop the HTML?

    **_*\*what*** to get: _*what**

    Well, that's no fun. So keep the HTML. Let's try dropping the underscore instead:

    ***\*<small>what*</small>** to get: *what

    That's boring, too. Ok, underscore stays. Hmmm. Remove the bold level asterisks?

    _*\*<small>what*</small> to get: _*what

    That fracks up the fun too … Put the bold back in. Time to try removing the italics:

    **_\*<small>what</small>** to get: _*what

    Well belgium. There's only two things left to remove, the escaped asterisk and the "what". I can't imagine that the fun would work without either of those, but just to be thorough, let's remove the "what".

    **_*\*<small>*</small>** to get: _***

    Well, now. That provides some interesting insight (I should really have picked up on this earlier). Note that more of the hash is visible in the version without the "what" than the one with the "what". Let's try adding "what" back in, one letter at a time:

    1. **_*\*<small>w*</small>** to get: _*w**
    • **_*\*<small>wh*</small>** to get: _*wh**
    • **_*\*<small>wha*</small>** to get: _*wha**
    • **_*\*<small>what*</small>** to get: _*what**
    • **_*\*<small>what *</small>** to get: _*what
    • **_*\*<small>what t*</small>** to get: _*what t**
    • **_*\*<small>what th*</small>** to get: _*what th**
    • **_*\*<small>what the*</small>** to get: _*what the**
    • **_*\*<small>what the *</small>** to get: _*what the
    • **_*\*<small>what the b*</small>** to get: _*what the b**
    • **_*\*<small>what the be*</small>** to get: _*what the be**
    • **_*\*<small>what the bel*</small>** to get: _*what the bel**
    • **_*\*<small>what the belg*</small>** to get: _*what the belg**
    • **_*\*<small>what the belgi*</small>** to get: _*what the belgi**
    • **_*\*<small>what the belgiu*</small>** to get: _*what the belgiu**
    • **_*\*<small>what the belgium*</small>** to get: _*what the belgium**
    • **_*\*<small>what the belgium *</small>** to get: _*what the belgium
    • **_*\*<small>what the belgium n*</small>** to get: _*what the belgium n**
    • **_*\*<small>what the belgium no*</small>** to get: _*what the belgium no**
    • **_*\*<small>what the belgium now*</small>** to get: _*what the belgium now**
    • **_*\*<small>what the belgium now?*</small>** to get: _*what the belgium now?**

    Interestingly, there are additional complications noted in this progression:

    • Sequences ending in spaces completely hide the hash, cause an extra asterisk to appear in the baked text, and eliminate the duplicate text. (See #5, #9, and #17).
    • Sequences containing censored words completely hide the hash – likely because they are using the HTML escape sequence for the censoring blocks (&#9632;). This can result in the doubled sequence getting overwritten in addition to the hash, and the <small> tag being escaped early. (See #16 and #18 - #21).
    • Only the first item in the list above is actually numbered.
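
The one-letter-at-a-time progression above is easy to generate mechanically. A small Python sketch, where `progression` and `TEMPLATE` are hypothetical helper names and the template is copied from the test strings in the list:

```python
# Template taken from the test strings in the list; {} marks where the
# growing prefix of the phrase is inserted.
TEMPLATE = r'**_*\*<small>{}*</small>**'


def progression(phrase, template=TEMPLATE):
    """Return one test input per prefix of `phrase`, shortest first."""
    return [template.format(phrase[:n]) for n in range(1, len(phrase) + 1)]


phrase = 'what the belgium now?'
for number, case in enumerate(progression(phrase), start=1):
    # Flag the prefixes that end in a space, which the post singles out.
    marker = ' (ends in a space)' if phrase[:number].endswith(' ') else ''
    print(f'#{number}: {case}{marker}')
```

Each generated string still has to be pasted into the composer by hand to see what the parser does with it; the sketch only produces the inputs.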


  • That's...amazing. Where did the ; come from?



  • Looks like a partially-eaten &9632;.



  • That's what I was wondering...would make sense, I suppose.


  • mod

    @hungrier said:

    Looks like a partially-eaten &9632;.

    You missed a # in there. But yes, you would appear to be correct.



  • I wish I could like this post more than once.



  • Discourse wanted me to think I could



  • It knows you've been breaking the DiscoHTBBCoMLParser.


  • Discourse touched me in a no-no place

    @abarker said:

    they are using RegEx to parse

    :wtf:


  • mod

    I thought this was well known by now. Dickcorpse uses RegEx to parse DiscoHTBBCoML.



  • @abarker said:

    Once you remember they are using RegEx to parse, this is easy.

    Exactly. Anything you type is liable to produce something that is almost, but not entirely, unlike what you wanted to produce.

    @abarker said:

    Honestly, this one should probably be classified as user error shit design. Trying to add a fix to the parser rather than doing the job properly would add layers of complications and introduce lots of edge cases and bugs. "Trying to add a fix to the parser" is a large part of why the parser is already chock-full of edge cases and bugs

    .


  • Discourse touched me in a no-no place

    Not sure if insane ideas thread or stupid things that people have actually done...



  • Don't worry, they "escape" special characters (* and _, for example) with MD5! :tada:
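
For context, the placeholder trick being mocked here can be sketched as follows. This assumes, as the thread claims, that escaped characters are swapped for an MD5 hex digest before the Markdown passes run and swapped back afterwards; `placeholder`, `protect`, `restore`, and the table layout are hypothetical names, not the parser's real ones.

```python
import hashlib
import re


def placeholder(ch):
    """Stand-in token for an escaped character: the MD5 hex digest of it."""
    return hashlib.md5(ch.encode()).hexdigest()


def protect(src, table):
    """Replace backslash-escaped '*' and '_' with tokens, recording them in `table`."""
    def swap(match):
        token = placeholder(match.group(1))
        table[token] = match.group(1)
        return token
    return re.sub(r'\\([*_])', swap, src)


def restore(html, table):
    """Put the literal characters back once the other passes have run."""
    for token, ch in table.items():
        html = html.replace(token, ch)
    return html


table = {}
protected = protect(r'\*what', table)
print(restore(protected, table))  # *what

# If a later pass splits the token (here by injecting a tag halfway through),
# restore() no longer finds it and the hash fragments stay in the output.
damaged = protected[:16] + '</small>' + protected[16:]
print(restore(damaged, table) == damaged)  # True
```

The round trip only works while the token survives every intermediate pass untouched, which would be consistent with the partial hashes visible in the posts above.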

