Yet another DiscoHTBBCoMLParser bug

mott555

*easily not damage anything, not easily type

Continuing the discussion from Why is Jeff everywhere:

@loopback0 said:
@mott555 said:
I can type 120 WPM.

Do you type with a hammer?!

Since I mostly use mechanical keyboards I could easily* type with a hammer without damaging anything.

* easily not damage anything, not easily type

* easily not damage anything, not easily type

Check raw. See where "small" tags are, notice what text is actually small.

rc4

Nice one!

mott555

Erm, why are there two quotes? I didn't do that...not intentionally anyway.

loopback0

@mott555 said:

Check raw. See where "small" tags are, notice what text is actually small.

...

Nocha

I wonder if any ~~other~~ nested tags are broken...

mott555

I found this documentary of the DiscoDevs writing the parser:

http://i.imgur.com/dXFSate.gif

rc4

_*) abcdefghijklmnopqrstuvwxyz1234567890

**View raw**

hungrier

_*what

**_*\*what***

loopback0

The MD5 fuckery topic is

rc4

You mean

Onyx

The MD5 fuckery knows no bounds.

LB_

Why not write a program that finds all these parser bugs for us? Oh wait, nevermind, I guess unit tests are

abarker

Once you remember they are using RegEx to parse, this is easy.

First let's take a look at what you typed. Since the first asterisk is at the start of a word block, the parser expects it to be the start of an italics sequence; the second gets ignored; and the third, being at the end of a word block, is used as the close of the italics sequence. Because the <small> tag is now inside the generated <em> tag, the parser "helpfully" closes it before closing the <em>, which is what generates the results you see.

*<small> *easily* not damage anything, not easily type </small>
^        ^      ^
Start    |      End
italics  |      italics
         Ignore
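
A rough sketch of the rule described above, in Python, for anyone who wants to poke at it. This is a toy regex made up to mimic that behaviour, not Discourse's actual parser code: the EMPHASIS pattern and the render function are hypothetical names, and "word block" is assumed to mean a run of non-whitespace characters.

```python
import re

# Hypothetical emphasis rule (an assumption, not the real parser):
#   opening asterisk: at the start of a word block, i.e. not preceded by a
#                     non-space character and followed by a non-space character
#   closing asterisk: at the end of a word block, i.e. preceded by a non-space
#                     character and not followed by one
EMPHASIS = re.compile(r'(?<!\S)\*(?=\S)(.+?)(?<=\S)\*(?!\S)')

def render(text: str) -> str:
    """Wrap each matched span in <em>...</em>, paying no attention to HTML inside it."""
    return EMPHASIS.sub(r'<em>\1</em>', text)

raw = '*<small> *easily* not damage anything, not easily type </small>'
print(render(raw))
# <em><small> *easily</em> not damage anything, not easily type </small>
```

The second asterisk is skipped because it has a space in front of it, so the <em> opens before <small> and closes inside it. Any later pass that tries to balance the tags then has to close <small> early.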

Now, let's try the first solution of putting a space after the first asterisk. This method produces the results you want, but converts the first asterisk into a bullet:

* *easily* not damage anything, not easily type

Given sufficient knowledge of markup, this shouldn't be difficult to understand. So what is the solution? Why, escape the first asterisk, of course!

\*<small> *easily* not damage anything, not easily type </small>

\* *easily* not damage anything, not easily type
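
Why the escape helps can be sketched the same way. The snippet below is a toy tokenizer, assuming the parser looks for an escaped asterisk before it tries its emphasis rule; TOKEN and render are hypothetical names and the regex is an illustration, not Discourse's code.

```python
import re

# Hypothetical tokenizer (an assumption for illustration): the escaped
# asterisk is tried first, so it is consumed as a literal before the
# emphasis rule gets a chance to treat it as an opening marker.
TOKEN = re.compile(r'\\\*|(?<!\S)\*(?=\S)(.+?)(?<=\S)\*(?!\S)')

def render(text: str) -> str:
    def replace(match):
        if match.group(0) == '\\*':
            return '*'                       # escaped asterisk: emit it literally
        return f'<em>{match.group(1)}</em>'  # otherwise it is an emphasis span
    return TOKEN.sub(replace, text)

print(render(r'\* *easily* not damage anything, not easily type'))
# * <em>easily</em> not damage anything, not easily type
```

With the first asterisk out of the way, the only remaining pair is the one around "easily", which is what was wanted in the first place.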

Honestly, this one should probably be classified as user error. Trying to add a fix to the parser would add layers of complications and introduce lots of edge cases and bugs.

DogsB

https://www.youtube.com/watch?v=7NZ04BG7TfA

hungrier

Can you do mine next?

abarker

No, that's a legit bug.

hungrier

That MD5 thing is the gift that keeps on giving.

rc4

Which is why we want you to dissect it!

ChaosTheEternal

@abarker said:

Trying to add a fix to the parser would add additional layers of complications and introduce ~~lots of~~ even more edge cases and bugs.

FTFD

abarker

@rc4 said:

Which is why we want you to dissect it!

All right, let's see what we can find.

So @hungrier started with:

**_*\*what*** to get: _*what

So what happens if we drop the HTML?

**_*\*what*** to get: _*what

Well, that's no fun. So keep the HTML. Let's try dropping the underscore instead:

***\*what*** to get: *what

That's boring, too. Ok, underscore stays. Hmmm. Remove the bold level asterisks?

_*\*what* to get: _*what

That fracks up the fun too … Put the bold back in. Time to try removing the italics:

**_\*what** to get: _*what

Well belgium. There are only two things left to remove, the escaped asterisk and the "what". I can't imagine that the fun would work without either of those, but just to be thorough, let's remove the "what".

**_*\**** to get: _*

Well, now. That provides some interesting insight (I should really have picked up on this earlier). Note that more of the hash is visible in the version without the "what" than the one with the "what". Let's try adding "what" back in, one letter at a time:

**_*\*w*** to get: _*w

**_*\*wh*** to get: _*wh
**_*\*wha*** to get: _*wha
**_*\*what*** to get: _*what
**_*\*what *** to get: *_**what
**_*\*what t*** to get: _*what t
**_*\*what th*** to get: _*what th
**_*\*what the*** to get: _*what the
**_*\*what the *** to get: *_**what the
**_*\*what the b*** to get: _*what the b
**_*\*what the be*** to get: _*what the be
**_*\*what the bel*** to get: _*what the bel
**_*\*what the belg*** to get: _*what the belg
**_*\*what the belgi*** to get: _*what the belgi
**_*\*what the belgiu*** to get: _*what the belgiu
**_*\*what the belgium*** to get: _*what the belgium
**_*\*what the belgium *** to get: *_**what the belgium
**_*\*what the belgium n*** to get: _*what the belgium n
**_*\*what the belgium no*** to get: _*what the belgium no
**_*\*what the belgium now*** to get: _*what the belgium now
**_*\*what the belgium now?*** to get: _*what the belgium now?

Interestingly, there are additional complications noted in this progression:

Sequences ending in spaces completely hide the hash, cause an extra asterisk to appear in the baked text, and eliminate the duplicate text. (See #5, #9, and #17.)
Sequences containing censored words completely hide the hash, likely because they are using the HTML escape sequence for the censoring blocks (&#9632;, which renders as ■). This can result in the doubled sequence getting overwritten in addition to the hash, and the  tag being escaped early. (See #16 and #18 - #21.)
Only the first item in the list above is actually numbered.
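
For anyone wanting to repeat the experiment, the inputs can be generated rather than typed by hand. A small sketch, assuming the raw strings are exactly as shown in the progression above; the progression function and its prefix and suffix parameters are made-up names for illustration.

```python
PHRASE = 'what the belgium now?'

def progression(phrase, prefix='**_*\\*', suffix='***'):
    """Return one test input per leading slice of phrase, shortest first."""
    return [f'{prefix}{phrase[:n]}{suffix}' for n in range(1, len(phrase) + 1)]

cases = progression(PHRASE)
for number, case in enumerate(cases, start=1):
    print(f'#{number}: {case}')

# 1-based numbers of the inputs whose text ends in a space
trailing_space = [n for n in range(1, len(PHRASE) + 1) if PHRASE[n - 1] == ' ']
print(trailing_space)
# [5, 9, 17]
```

The generated numbering lines up with the #5, #9, and #17 references for the sequences ending in spaces.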

rc4

That's...amazing. Where did the ; come from?

hungrier

Looks like a partially-eaten &9632;.

rc4

That's what I was wondering...would make sense, I suppose.

abarker

@hungrier said:

Looks like a partially-eaten &9632;.

You missed a # in there. But yes, you would appear to be correct.

loopback0

I wish I could like this post more than once.

hungrier

Discourse wanted me to think I could

NedFodder

It knows you've been breaking the DiscoHTBBCoMLParser.

antiquarian

@abarker said:

they are using RegEx to parse

abarker

I thought this was well known by now. Dickcorpse uses RegEx to parse DiscoHTBBCoML.

tufty

@abarker said:

Once you remember they are using RegEx to parse, this is easy.

Exactly. Anything you type is liable to produce something that is almost, but not entirely, unlike what you wanted to produce.

@abarker said:

Honestly, this one should probably be classified as ~~user error~~ shit design. Trying to add a fix to the parser rather than doing the job properly would add layers of complications and introduce lots of edge cases and bugs. "Trying to add a fix to the parser" is a large part of why the parser is already chock-full of edge cases and bugs.

antiquarian

Not sure if insane ideas thread or stupid things that people have actually done...

rc4

Don't worry, they "escape" special characters (* and _, for example) with MD5!
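
For the curious, a minimal sketch of what placeholder-style escaping looks like. It assumes the scheme is: swap each escaped character for the MD5 hex digest of that character, run the emphasis pass, then swap the digests back. The function names are hypothetical and the emphasis regex is a deliberately naive stand-in, not Discourse's code.

```python
import hashlib
import re

SPECIALS = '*_'
# Assumed scheme: the placeholder for a character is the MD5 hex digest of it.
# Hex digests contain only 0-9 and a-f, so they cannot trigger the emphasis rule.
PLACEHOLDER = {ch: hashlib.md5(ch.encode()).hexdigest() for ch in SPECIALS}

def protect(text):
    """Replace each backslash-escaped special character with its placeholder."""
    for ch, digest in PLACEHOLDER.items():
        text = text.replace('\\' + ch, digest)
    return text

def emphasize(text):
    """Naive stand-in for the emphasis pass: *text* becomes <em>text</em>."""
    return re.sub(r'\*(.+?)\*', r'<em>\1</em>', text)

def restore(text):
    """Swap every intact placeholder back to the literal character."""
    for ch, digest in PLACEHOLDER.items():
        text = text.replace(digest, ch)
    return text

def render(text):
    return restore(emphasize(protect(text)))

print(render(r'*\*what*'))
# <em>*what</em>
```

The catch is that restore only recognises a placeholder that is still intact. If any pass in between cuts into the digest or overwrites part of it, the leftover hash characters stay in the output, which would be consistent with the partial hashes seen earlier in this topic.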