Coding Horrors

accalia

@accalia has teh codez if you want to look under that rock

i'm not entirely sure what you mean...... unless you refer to @sockbot's quote/code parser monstrosity?

https://github.com/SockDrawer/SockBot/blob/master/lib/browser.js#L840-L907

which has almost 200 different parsing test cases to make sure it gets it right..... and i'm sure we're still missing edge cases here!

sockbot

Yes master Sultanatrix of Swypos; @‍RaceProUK's queen, I shall appear as summoned.

flabdablet

That's the one, officer!

Gąska

@accalia said:

which had almost 200 different parsing test cases

Had? That's troubling.

accalia

@Gaska said:

@accalia said:
which had almost 200 different parsing test cases

Had? That's troubling.

yeah, they gained sentience and evolved beyond our control... we're not entirely sure where they are now....

Gąska

I think the more appropriate question is when they are.

accalia

Given the orbiting weaponry the various governments around the world have launched and refuse to admit is actually up there..... I think where is actually a fairly big concern.....

Gąska

Not if they've found these weapons in the next century. Not for me, at least.

JazzyJosh

oh my god the bug still exists.

Kian

You know, I always wondered why the formatting was a mix of bbmarkhtmldowncode and not just one sane markup language, and how that might have led to more complex parsing.

Now that I know they use regular expressions instead of actually parsing, it makes "sense".

TwelveBaud

something something brace matching something lazy.

Lorne Kates

At what point do you replace your regex parser with a non-regex one? Here:

opentags = array();
tagmap = array();
tagmap["*"] = {"<b>", "</b>"};  // strong text; add an entry like this for each other tag
rawinput = user_input
output = ""
for i = 0 to rawinput.length - 1
    char = rawinput[i];
    if(char == "\" && i < rawinput.length - 1)
        i = i + 1 // skip the backslash and emit the next character literally
        output += rawinput[i];
    elseif(tagmap.contains(char))
        // check if this tag is already open, and close it
        if(opentags.last == tagmap[char][0])
            output += tagmap[char][1]
            opentags.pop()
        else
            output += tagmap[char][0]
            opentags.push(tagmap[char][0])
        end if
    else // this is a normal character
        output += char;
    end if
end for
return output

There. Done. I just wrote your parser in 2 minutes. Run it through a round of QA, and past someone who knows unicode better than I, and you're done like a plate full of chicken with forks in it.
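For anyone who wants to poke at it, here is a rough Python rendition of the pseudocode above. It is only a sketch: the `TAGMAP` entries and the `render` name are made up for illustration, the open-tag stack stores marker characters rather than HTML strings, and it does no HTML escaping or Unicode handling at all.

```python
# Hypothetical tag table: marker character -> (opening HTML, closing HTML).
TAGMAP = {"*": ("<b>", "</b>"), "_": ("<i>", "</i>")}


def render(raw: str) -> str:
    open_tags = []  # stack of marker characters that are currently open
    out = []
    i = 0
    while i < len(raw):
        ch = raw[i]
        if ch == "\\" and i + 1 < len(raw):
            # Backslash escape: emit the next character literally.
            out.append(raw[i + 1])
            i += 2
            continue
        if ch in TAGMAP:
            opening, closing = TAGMAP[ch]
            if open_tags and open_tags[-1] == ch:
                # The innermost open tag uses this marker, so close it.
                out.append(closing)
                open_tags.pop()
            else:
                out.append(opening)
                open_tags.append(ch)
        else:
            out.append(ch)  # a normal character
        i += 1
    return "".join(out)
```

With these assumptions, `render("a *b* c")` gives `a <b>b</b> c`, while `render("2 * 3 * 4")` gives `2 <b> 3 </b> 4`, which is the kind of accidental tag a single-character scheme invites unless the writer escapes the marker.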

TwelveBaud

* requires escaping tag characters even if they are being used in an obvious non-tag way
* doesn't allow character overloading (see: this post's raw)
* tags must be at most one character long
* no self-closing tags (e.g. - --> <hr/>)
* no parameterization (e.g. [quote=@Bort])
* is it discoverable? (see: not markdown)

tar

@Kian said:

always wondered why the formatting was a mix of bbmarkhtmldowncode and not just one sane markup language

Pick an insanely overambitious implementation, then half-ass it. That's the Discourse modus operandi...

LB_

@Kian said:

You know, I always wondered why the formatting was a mix of bbmarkhtmldowncode and not just one sane markup language

Personally I hate markup languages where the open and close tags are the same (e.g. just about everything in markdown). BBCode got it right years ago.

Kian

I like BBCode too, and I could get used to different markdown flavors. What grates me is when several are mixed. Stick to one and get it right.

HardwareGeek

@Kian said:

Stick to one and get it right.

That cannot be markdown, then. By design, it handles only simple formatting. If you need anything fancy, you have to mix something with it. Better then to simply skip the markdown in the first place and just use BBCode or (properly sanitized) HTML.

The real problem, though, (IMO) is not markdown, nor even the mix of markup languages, per se. The real problem is Discourse's stupid, half-baked markup parser. The mix of markup languages certainly makes parsing at least a little harder, but I'm pretty sure Discourse's parser would fail at handling edge cases in even a single markup language.

Lorne Kates

@TwelveBaud said:

* requires escaping tag characters even if they are being used in an obvious non-tag way

Yes.

@TwelveBaud said:

* doesn't allow character overloading (see: this post's raw)

Agreed. Otherwise the parser can also do a simple check for whether it's at the start of a new line (for some values of "simple"). Better to not overload the character.

@TwelveBaud said:

* tags must be at most one character long

Easy enough to handle. The mockup assumed each tag was one character. You can add, say, "~~~" to tagmap. Loop through the keys of tagmap and check something like mid(str, pos, len(k)) == k for each key k. Solve it once, and it's solved for all lengths of tags.
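The "loop through tagmap and compare a slice" idea could look roughly like this in Python. This is a sketch under assumptions: the three markers and their HTML are invented for the example, and trying longer markers first is one possible way to keep `**` from being read as two `*` markers.

```python
# Hypothetical tag table with markers of different lengths.
TAGMAP = {"*": ("<i>", "</i>"), "**": ("<b>", "</b>"), "~~~": ("<s>", "</s>")}
# Try longer markers first so "**" is not read as two "*" markers.
MARKERS = sorted(TAGMAP, key=len, reverse=True)


def match_marker(text: str, pos: int):
    """Return the longest marker that starts at pos, or None."""
    for marker in MARKERS:
        if text.startswith(marker, pos):
            return marker
    return None


def render(raw: str) -> str:
    open_tags, out, i = [], [], 0
    while i < len(raw):
        marker = match_marker(raw, i)
        if marker is None:
            out.append(raw[i])
            i += 1
            continue
        opening, closing = TAGMAP[marker]
        if open_tags and open_tags[-1] == marker:
            out.append(closing)
            open_tags.pop()
        else:
            out.append(opening)
            open_tags.append(marker)
        i += len(marker)  # skip the whole marker, whatever its length
    return "".join(out)
```

The matching logic lives in one helper, so adding a four-character marker later means adding a table entry and nothing else.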

@TwelveBaud said:

* no self-closing tags (e.g. - --> <hr/>)

Just exclude the second value from tagmap. tagmap["|"] = {"<hr/>", ""}

Then if tagmap[char][1] == "", don't push to opentags.
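A minimal Python sketch of that rule, assuming (purely for illustration) that `|` is the marker for a horizontal rule and that an empty closing string is how the table flags a self-closing tag:

```python
# Hypothetical table: an empty closing string marks a self-closing tag.
TAGMAP = {"*": ("<b>", "</b>"), "|": ("<hr/>", "")}


def render(raw: str) -> str:
    open_tags, out = [], []
    for ch in raw:
        if ch not in TAGMAP:
            out.append(ch)
            continue
        opening, closing = TAGMAP[ch]
        if closing == "":
            out.append(opening)  # self-closing: emit once, never push
        elif open_tags and open_tags[-1] == ch:
            out.append(closing)
            open_tags.pop()
        else:
            out.append(opening)
            open_tags.append(ch)
    return "".join(out)
```

Because the self-closing marker never touches the stack, it can appear inside an open tag without disturbing the matching of the surrounding pair.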

@TwelveBaud said:

* no parameterization (e.g. [quote=@Bort])

Or don't use @ as a tag replacement? Or extend tag replacement.

If a tag matches: raise a before-tag event, do the parsing unless a handler cancelled it, then raise an after-tag event.

It'd allow for gasp****strong text easy plug-ins!
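One way the before/after events might be wired up, sketched in Python. Everything here is hypothetical: the `TagEvents` class, the convention that a before-hook returns `False` to cancel, and the choice to leave a cancelled marker as plain text.

```python
class TagEvents:
    """Tiny hook registry: plug-ins append callables to before/after."""

    def __init__(self):
        self.before = []  # each hook: f(marker) -> False to cancel
        self.after = []   # each hook: f(marker, html) -> None

    def handle(self, marker, default):
        """Run before-hooks, then the default handler unless cancelled, then after-hooks."""
        for hook in self.before:
            if hook(marker) is False:
                return marker  # cancelled: leave the marker as plain text
        html = default(marker)
        for hook in self.after:
            hook(marker, html)
        return html


# Usage: a plug-in that refuses to treat "@" as a tag at all.
events = TagEvents()
events.before.append(lambda marker: False if marker == "@" else None)
print(events.handle("*", lambda marker: "<b>"))
print(events.handle("@", lambda marker: "<a>"))
```

The parser loop would call `events.handle(marker, default_handler)` wherever it currently emits a tag, so plug-ins never need to touch the loop itself.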

@TwelveBaud said:

* is it discoverable? (see: not markdown)

Write the markup to the users. Don't create a markup and expect users to learn it. There are already plenty of well known markup "standards" that users expect****strong text...

Onyx

@TwelveBaud said:

no self-closing tags (e.g. - --> <hr/>)

Can we let XHTML die already? Please?

RandomStranger

@loopback0 said:

@LB_ said:
I don't understand

MemefiedTFY

RandomStranger

You say that like it's different from any other mass-market industry/service sector...

Maciejasjmj

Or just stop being a WTF, use something like PegJS that writes you a proper parser and be done with it.

Gąska

Grammars for parser generators usually end up as the same unmaintainable PoS with lots of subtle bugs as if you wrote the parser yourself - except the bugs are harder to fix.

Maciejasjmj

Aren't you the person who used boost::spirit? I'm not surprised you'd say so...

And well duh, parsing is hard. But I'm pretty sure a parser grammar is easier to comprehend, maintain and verify than a lump of string-walking code.

Gąska

@Maciejasjmj said:

Aren't you the person who used boost::spirit?

I could either write one boost::spirit line, or 30 lines of conventional code. If it was two lines of boost::spirit, I wouldn't even consider it.

@Maciejasjmj said:

But I'm pretty sure a parser grammar is easier to comprehend, maintain and verify than a lump of string-walking code.

Depends on the complexity of the grammar and the quality of the code. For instance, if you run the source through a tokenizer first, a hand-written parser suddenly becomes much more maintainable.

dkf

@Gaska said:

For instance, if you run the source through a tokenizer first, a hand-written parser suddenly becomes much more maintainable.

LL(k) parsers can be written by hand, but LR(1) and LALR(1) ones can't and they usually have better performance and stability figures. Unless you're a very strange person. The things you have to do to build the state machine recognizer make the result really very confusing indeed.

And that's even with a tokenizer. You want one of those anyway.

Maciejasjmj

@Gaska said:

I could either write one boost::spirit line, or 30 lines of conventional code. If it was two lines of boost::spirit, I wouldn't even consider it.

It's such a violent syntax rape that I can't even bear to look at it, let alone write it.

@Gaska said:

For instance, if you run the source through a tokenizer first

Obviously, you'll want a tokenizer or preferably a proper lexer before parsing.

dkf

@Maciejasjmj said:

Obviously, you'll want a tokenizer or preferably a proper lexer before parsing.

They're often pretty much the same thing, but it depends on the language you're parsing. The complication comes when you do things like having interacting comment and string syntaxes (hello, C!) or allow the programming language to introduce entirely new operators at runtime (I've done this in ML). Most of the time you just need to figure out “what is the next token?”

Maciejasjmj

@dkf said:

Most of the time you just need to figure out “what is the next token?”

AFAIR a lexer can hold state, so you can treat a string as a string until the closing quote without worrying that there's, say, an open brace in it on the lexer level.

It seems easier to me to handle such cases on the lexer level and pass more "semantic" tokens to the parser - it pollutes the parser grammar spec less.
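To make the "lexer holds state" point concrete, here is a small hand-rolled Python tokenizer for an invented brace language with double-quoted strings. The token names and the language are assumptions for the example. The only state is "inside a string", and while in it the lexer swallows everything up to the closing quote, braces included.

```python
def tokenize(src: str):
    """Split src into (kind, text) tokens; braces inside strings stay in the string."""
    tokens = []
    i = 0
    while i < len(src):
        ch = src[i]
        if ch == '"':
            # String state: consume to the closing quote, honouring backslash escapes.
            j = i + 1
            while j < len(src) and src[j] != '"':
                j += 2 if src[j] == "\\" else 1
            if j >= len(src):
                raise ValueError("unterminated string")
            tokens.append(("STRING", src[i:j + 1]))
            i = j + 1
        elif ch in "{}":
            tokens.append(("LBRACE" if ch == "{" else "RBRACE", ch))
            i += 1
        elif ch.isspace():
            i += 1
        else:
            # Bare word: runs until whitespace, a brace, or a quote.
            j = i
            while j < len(src) and not src[j].isspace() and src[j] not in '{}"':
                j += 1
            tokens.append(("WORD", src[i:j]))
            i = j
    return tokens
```

A parser fed these tokens sees one `STRING` token for `"a { b"` and never has to ask whether the brace inside it is real.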

fbmac

I like markdown

Salamander

I might have liked markdown if they fixed some of the more insane parts of it, such as * and _ being treated the same, and requiring doubling up characters for bold.
In its current form, a number of things just feel wrong, no matter how often I use them.

Onyx

@Salamander said:

I might have liked markdown if they fixed some of the more insane parts of it, such as * and _ being treated the same, and requiring doubling up characters for bold.

Some dialects actually do use _italic_ and *bold*.

Imagine I posted that XKCD about standards here.

Filed under: I won't actually do it because I want you all to exercise your imagination, #rosie

Gąska

@dkf said:

LL(k) parsers can be written by hand, but LR(1) and LALR(1) ones can't and they usually have better performance and stability figures.

RD parsers, on the other hand, can be written by hand just fine, and they can do all the same stuff you can do with LR. And they're pretty maintainable.
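For a sense of what a hand-written recursive-descent parser looks like, here is a short Python sketch over an invented two-tag BBCode subset (`[b]` and `[i]`). The grammar, class name, and error messages are all assumptions for the example; it illustrates the shape of such a parser and says nothing about how RD compares to LR in power.

```python
import re
from html import escape

TOKEN_RE = re.compile(r"\[(/?)(b|i)\]")  # hypothetical two-tag BBCode subset


def tokenize(src):
    """Yield ('open', name), ('close', name) or ('text', s) tokens."""
    pos = 0
    for m in TOKEN_RE.finditer(src):
        if m.start() > pos:
            yield ("text", src[pos:m.start()])
        yield ("close" if m.group(1) else "open", m.group(2))
        pos = m.end()
    if pos < len(src):
        yield ("text", src[pos:])


class Parser:
    def __init__(self, src):
        self.toks = list(tokenize(src))
        self.i = 0

    def parse_content(self, until=None):
        """content := (text | element)* ; stops at the close tag named `until`."""
        out = []
        while self.i < len(self.toks):
            kind, val = self.toks[self.i]
            if kind == "close":
                if val == until:
                    return "".join(out)
                raise SyntaxError(f"unexpected [/{val}]")
            self.i += 1
            out.append(escape(val) if kind == "text" else self.parse_element(val))
        if until is not None:
            raise SyntaxError(f"missing [/{until}]")
        return "".join(out)

    def parse_element(self, name):
        """element := '[name]' content '[/name]' ; the open tag is already consumed."""
        inner = self.parse_content(until=name)
        self.i += 1  # consume the matching close tag
        return f"<{name}>{inner}</{name}>"
```

Each grammar rule is one method, so a mismatched tag such as `[b]x[/i]` fails in an obvious place instead of producing crossed HTML.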

dkf

@Maciejasjmj said:

It seems easier to me to handle such cases on the lexer level and pass more "semantic" tokens to the parser - it pollutes the parser grammar spec less.

That's the normal thing, yes.

Lorne Kates

@Maciejasjmj said:

Or just stop being a WTF, use something like PegJS that writes you a proper parser and be done with it.

Can you prove that I'm not the author of PegJS, trying to organically market my product to people through incremental awareness?

Also, my point is that in a few posts and barely any thought, we have a mostly functional parser that works better than Discourse's regex-driven markdownhtmlbbqISIS

Gąska

@Lorne_Kates said:

Also, my point is that in a few posts and barely any thought, we have a mostly functional parser that works better than Discourse's regex-driven markdownhtmlbbqISIS

It's still worse than either actual Markdown, actual HTML, actual BBCode, or actual ISIS. And at least one of these was given barely any thought too.

flabdablet

@LB_ said:

BBCode got it right years ago.

Things that BBCode makes easier than HTML are easier by such thin margins as to make BBCode kind of pointless, in my view. Go native or go home.

@Lorne_Kates said:

Write the markup to the users.

This is what the formatting buttons attached to Metafilter's edit windows do, and it works just fine.

@fbmac said:

I like markdown

http://img.pandawhale.com/post-25893-ralph-wiggum-eating-paste-gif-cKiI.gif

Onyx

@flabdablet said:

Things that BBCode makes easier than HTML are easier by such thin margins as to make BBCode kind of pointless, in my view. Go native or go home.

It's easier to sanitize: you just replace all the < and > characters with entities and then parse BBCode stuff.
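The escape-first approach described here can be sketched in a few lines of Python. The tag whitelist and the `bbcode_to_html` name are made up for the example, and it deliberately handles only parameterless tags and does not check that tags are balanced.

```python
import re
from html import escape

# Hypothetical whitelist of simple paired tags: BBCode name -> HTML element.
SIMPLE_TAGS = {"b": "strong", "i": "em", "u": "u"}
TAG_RE = re.compile(r"\[(/?)(%s)\]" % "|".join(SIMPLE_TAGS), re.IGNORECASE)


def bbcode_to_html(raw: str) -> str:
    # Step 1: neutralise every HTML metacharacter in the user's text.
    safe = escape(raw)
    # Step 2: turn whitelisted BBCode tags into HTML we generate ourselves.
    return TAG_RE.sub(
        lambda m: "<%s%s>" % (m.group(1), SIMPLE_TAGS[m.group(2).lower()]),
        safe,
    )
```

After step 1 the only angle brackets in the output are ones this function wrote, which is the whole appeal. Tags that carry a parameter are where that guarantee stops being free.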

Yes, laziness is not an excuse, but I'm sure that's part of the reasoning.

flabdablet

@Onyx said:

It's easier to sanitize: you just replace all the < and > characters with entities and then parse BBCode stuff.

I don't understand why that should be significantly easier than nobbling HTML tags and/or attributes that don't match a whitelist. You don't even need a full parser for that. Nor do you even need to handle all the odd HTML parsing edge cases, because you can just nobble any < that isn't part of something you do recognize.

I dunno. BBCode has always struck me as a rather knee-jerk way of dealing with the sanitizing issue. Seems to me that once you have a BBCode dialect that can do enough of what HTML does to make it useful, it's going to need handling about as carefully as HTML itself.
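The "nobble any < that isn't part of something you do recognize" idea might look like this in Python. It is a sketch with loud assumptions: the whitelist is invented, only bare tags with no attributes count as recognized, and every `&` is escaped too, so entities already present in the input get double-escaped.

```python
import re

ALLOWED = ("b", "i", "em", "strong", "code")  # hypothetical whitelist
# Only bare tags with no attributes are recognised, e.g. <b> or </em>.
ALLOWED_TAG = re.compile(r"</?(?:%s)>" % "|".join(ALLOWED), re.IGNORECASE)


def nobble(raw: str) -> str:
    out = []
    pos = 0
    while pos < len(raw):
        ch = raw[pos]
        if ch == "<":
            m = ALLOWED_TAG.match(raw, pos)
            if m:
                out.append(m.group())  # recognised tag: pass through untouched
                pos = m.end()
                continue
            out.append("&lt;")  # anything else that starts with < is defused
        elif ch == ">":
            out.append("&gt;")
        elif ch == "&":
            out.append("&amp;")
        else:
            out.append(ch)
        pos += 1
    return "".join(out)
```

No HTML parser is involved; the cost is that attributes are not supported at all, and allowing even one (say `href`) brings back the need to validate its value.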

LB_

What difference does it make whether you have to hold shift to type the characters? Even if you use angle brackets with HTML syntax, I will still refer to it as BBCode. It's only HTML when it isn't sanitized.

dkf

@flabdablet said:

BBCode has always struck me as a rather knee-jerk way of dealing with the sanitizing issue.

I remember the BBCode supported by the old main page site. That had some rather interesting holes that I abused frequently for fun and profit.

Maciejasjmj

@fbmac said:

I like markdown

anotherusername

@Onyx said:

just replace all the < and > characters with entities and then parse BBCode stuff

And that's how you end up with fun CSS injection vulnerabilities like [color=red;position:relative;...]

Onyx

That is not valid BBCode and whoever wrote the tokenizer that recognizes it as such is a smeghead.

anotherusername

Dec 22, 2013 / Side Bar WTF

200% improvement in user experience while crashing

@El_Heffe said:

@anotherusername said:
I thought I'd deserved that by now

He has made 17.57 times more annoying posts. You have a lot of work to do yet.

Well... I've been going for quality over quantity, you see.

[color=inherit;positio\00006e:fixed;top:0px;bo...

Onyx

Are you now saying that CS was a quality product? Has Discourse set your expectations that low?

anotherusername

You said BBcode is easy to sanitize.

Anyway, I don't see how that's invalid. Not sanitized properly before being used as a CSS color property: yes. Invalid: no.

Onyx

So you're saying <hello-world> is a valid HTML tag?

anotherusername

What does that have to do with it? The tag name was color, and the correct syntax is [color=...]. The tag matched that pattern. As far as I know, the only characters not allowed in the data portion of a BBcode tag are [] and newline, so while it's an invalid color property it's not invalid BBcode.

Onyx

Ah, sorry, forgot color was a thing.

Still, there's only a subset of allowed values: named colours and probably hex codes. It's still not rocket science.
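Restricting `[color=...]` to a subset of values could be sketched like this in Python. The named-colour set is a tiny made-up sample, and accepting only 3- or 6-digit hex codes is an assumption for the example, not a statement about what any BBCode implementation allows.

```python
import re

NAMED_COLOURS = {"red", "green", "blue", "black", "white", "orange"}  # hypothetical subset
HEX_COLOUR = re.compile(r"#(?:[0-9a-fA-F]{3}|[0-9a-fA-F]{6})")


def safe_colour(value: str):
    """Return a normalised colour if value is whitelisted, else None."""
    value = value.strip()
    if value.lower() in NAMED_COLOURS:
        return value.lower()
    if HEX_COLOUR.fullmatch(value):
        return value.lower()
    return None


def render_colour(value: str, inner_html: str) -> str:
    colour = safe_colour(value)
    if colour is None:
        return inner_html  # drop the tag, keep the (already escaped) content
    return '<span style="color:%s">%s</span>' % (colour, inner_html)
```

Because the whole value must match, anything carrying a `;`, a `:`, or a CSS escape such as `\00006e` is rejected outright instead of being pasted into the `style` attribute.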