Coding Horrors
-
@accalia has teh codez if you want to look under that rock
i'm not entirely sure what you mean...... unless you refer to @sockbot's quote/code parser monstrosity?
https://github.com/SockDrawer/SockBot/blob/master/lib/browser.js#L840-L907
which has almost 200 different parsing test cases to make sure it gets it right..... and i'm sure we're still missing edge cases here!
-
Yes master Sultanatrix of Swypos; @RaceProUK's queen, I shall appear as summoned.
-
That's the one, officer!
-
-
@accalia said:
which had almost 200 different parsing test cases
Had? That's troubling.
yeah, they gained sentience and evolved beyond our control... we're not entirely sure where they are now....
-
I think the more appropriate question is when they are.
-
Given the orbiting weaponry the various governments around the world have launched and refuse to admit is actually up there..... I think where is actually a fairly big concern.....
-
Not if they've found these weapons in the next century. Not for me, at least.
-
oh my god the bug still exists.
-
You know, I always wondered why the formatting was a mix of bbmarkhtmldowncode and not just one sane markup language, and how that might have led to more complex parsing.
Now that I know they use regular expressions instead of actually parsing, it makes "sense".
-
something something brace matching something lazy.
-
At what point do you replace your parser with a non-regex one? Here:
opentags = array();
tagmap = array();
tagmap["*"] = {"<b>", "</b>"}; // strong text, etc for each other tag
rawinput = user_input
output = ""
for i = 0 to rawinput.length - 1
    char = rawinput[i];
    if(char == "\" && i < rawinput.length - 1)
        i = i + 1 // skip the backslash
        output += rawinput[i]; // and emit the escaped character as-is
    elseif(tagmap.contains(char)):
        // check if this tag is already open, and close it
        if(opentags.last == tagmap(char)[0])
            output += tagmap(char)[1]
            opentags.pop()
        else
            output += tagmap(char)[0]
            opentags.push(tagmap(char)[0])
        end if;
    else
        // this is a normal character
        output += rawinput[i];
    endif
end for
return output
There. Done. I just wrote your parser in 2 minutes. Run it through a round of QA, and past someone who knows unicode better than I, and you're done like a plate full of chicken with forks in it.
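For anyone who wants to poke at it, here is a minimal Python sketch of the same single-pass idea. The names `TAG_MAP` and `render` and the choice of `_` for italics are assumptions made for illustration, not anything Discourse or SockBot actually uses, and the sketch does no HTML escaping of the input.

```python
# Hypothetical tag table: marker character -> (opening HTML, closing HTML).
TAG_MAP = {
    "*": ("<b>", "</b>"),
    "_": ("<i>", "</i>"),
}


def render(raw: str) -> str:
    """Walk the input once, toggling tags with a stack of open markers."""
    open_tags = []  # markers that are currently open, innermost last
    out = []
    i = 0
    while i < len(raw):
        ch = raw[i]
        if ch == "\\" and i + 1 < len(raw):
            # Backslash escape: emit the next character literally.
            out.append(raw[i + 1])
            i += 2
            continue
        if ch in TAG_MAP:
            opening, closing = TAG_MAP[ch]
            if open_tags and open_tags[-1] == ch:
                # The innermost open tag uses this marker, so close it.
                open_tags.pop()
                out.append(closing)
            else:
                open_tags.append(ch)
                out.append(opening)
        else:
            # Ordinary character.
            out.append(ch)
        i += 1
    return "".join(out)


print(render("a *bold* and \\* literal"))
```

Tags left open at the end of the input are simply never closed here; a real implementation would have to decide what to do with them.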
-
- requires escaping tag characters even if they are being used in an obvious non-tag way
- doesn't allow character overloading (see: this post's raw)
- tags must be at most one character long
- no self-closing tags (e.g. - --> <hr/>)
- no parameterization (e.g. [quote=@Bort])
- is it discoverable? (see: not markdown)
-
always wondered why the formatting was a mix of bbmarkhtmldowncode and not just one sane markup language
Pick an insanely overambitious implementation, then half-ass it. That's the Discourse modus operandi...
-
You know, I always wondered why the formatting was a mix of bbmarkhtmldowncode and not just one sane markup language
Personally I hate markup languages where the open and close tags are the same (e.g. just about everything in markdown). BBCode got it right years ago.
-
I like BBCode too, and I could get used to different markdown flavors. What grates on me is when several are mixed. Stick to one and get it right.
-
Stick to one and get it right.
That cannot be markdown, then. By design, it handles only simple formatting. If you need anything fancy, you have to mix something with it. Better then to simply skip the markdown in the first place and just use BBCode or (properly sanitized) HTML.
The real problem, though, (IMO) is not markdown, nor even the mix of markup languages, per se. The real problem is Discourse's stupid, half-baked markup parser. The mix of markup languages certainly makes parsing at least a little harder, but I'm pretty sure Discourse's parser would fail at handling edge cases in even a single markup language.
-
* requires escaping tag characters even if they are being used in an obvious non-tag way
Yes.
@TwelveBaud said:
* doesn't allow character overloading (see: this post's raw)
Agreed. Otherwise the parser can also do a simple check (for some values of "simple") of whether it's at the start of a new line. Better to not overload the character.
* tags must be at most one character long
Easy enough to handle. The mockup assumed each tag was one character. You can add, say, "~~~" to tagmap. Loop through the keys of tagmap, and check if mid(str, pos, len(k)) == k for each key k, or something like that. Solve it once, and it's solved for all lengths of tags.
* no self-closing tags (e.g. - --> <hr/>)
Just exclude the second value from tagmap. tagmap["|"] = {"<hr/>", ""}
Then if tagmap[char](1) == "", then don't push to OpenTags.
* no parameterization (e.g. [quote=@Bort])
Or don't use @ as a tag replacement? Or extend tag replacement.
if match tag then (raise event before tag) then do parsing if not cancelled then (raise event after tag)
It'd allow for, gasp, easy plug-ins!
* is it discoverable? (see: not markdown)
Write the markup to the users. Don't create a markup and expect users to learn it. There are already plenty of well known markup "standards" that users expect...
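The multi-character, self-closing and "raise event before tag" ideas above can be sketched roughly like this in Python. Everything here is hypothetical: the `TAG_MAP` entries, the `before_tag` callback and its "return False to cancel" convention are assumptions chosen for the example, and parameterized tags are left out.

```python
from typing import Callable, Optional

# Hypothetical tag table: marker -> (opening HTML, closing HTML).
# An empty closing string marks a self-closing tag.
TAG_MAP = {
    "~~~": ("<s>", "</s>"),
    "*": ("<b>", "</b>"),
    "|": ("<hr/>", ""),
}
# Try longer markers first so a long marker is never shadowed by a short one.
MARKERS = sorted(TAG_MAP, key=len, reverse=True)


def render(raw: str, before_tag: Optional[Callable[[str], bool]] = None) -> str:
    open_tags = []  # markers currently open, innermost last
    out = []
    i = 0
    while i < len(raw):
        if raw[i] == "\\" and i + 1 < len(raw):
            out.append(raw[i + 1])  # escaped character, emitted literally
            i += 2
            continue
        marker = next((m for m in MARKERS if raw.startswith(m, i)), None)
        if marker is None:
            out.append(raw[i])  # ordinary character
            i += 1
            continue
        if before_tag is not None and not before_tag(marker):
            out.append(marker)  # a plug-in cancelled this tag: keep the raw text
            i += len(marker)
            continue
        opening, closing = TAG_MAP[marker]
        if closing == "":
            out.append(opening)  # self-closing: never pushed on the stack
        elif open_tags and open_tags[-1] == marker:
            open_tags.pop()
            out.append(closing)
        else:
            open_tags.append(marker)
            out.append(opening)
        i += len(marker)
    return "".join(out)


print(render("*a~~~b~~~*|"))
```

Matching on `raw.startswith(m, i)` is the "solve it once for all lengths" step; the hook is just one possible shape for a plug-in point.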
-
-
-
You say that like it's different from any other mass-market industry/service sector...
-
Or just stop being a WTF, use something like PegJS that writes you a proper parser and be done with it.
-
Grammars for parser generators usually end up as the same unmaintainable PoS with lots of subtle bugs as if you wrote the parser yourself - except the bugs are harder to fix.
-
Aren't you the person who used boost::spirit? I'm not surprised you'd say so...
And well duh, parsing is hard. But I'm pretty sure a parser grammar is easier to comprehend, maintain and verify than a lump of string-walking code.
-
Aren't you the person who used boost::spirit?
I could either write one boost::spirit line, or 30 lines of conventional code. If it was two lines of boost::spirit, I wouldn't even consider it.
But I'm pretty sure a parser grammar is easier to comprehend, maintain and verify than a lump of string-walking code.
Depends on the complexity of the grammar and the quality of the code. For instance, if you run the source through a tokenizer first, a hand-written parser suddenly becomes much more maintainable.
-
For instance, if you run the source through a tokenizer first, a hand-written parser suddenly becomes much more maintainable.
LL(k) parsers can be written by hand, but LR(1) and LALR(1) ones can't and they usually have better performance and stability figures. Unless you're a very strange person. The things you have to do to build the state machine recognizer make the result really very confusing indeed.
And that's even with a tokenizer. You want one of those anyway.
-
I could either write one boost::spirit line, or 30 lines of conventional code. If it was two lines of boost::spirit, I wouldn't even consider it.
It's such a violent syntax rape that I can't even bear to look at it, let alone write it.
For instance, if you run the source through a tokenizer first
Obviously, you'll want a tokenizer or preferably a proper lexer before parsing.
-
Obviously, you'll want a tokenizer or preferably a proper lexer before parsing.
They're often pretty much the same thing, but it depends on the language you're parsing. The complication comes when you do things like having interacting comment and string syntaxes (hello, C!) or allow the programming language to introduce entirely new operators at runtime (I've done this in ML). Most of the time you just need to figure out “what is the next token?”
-
Most of the time you just need to figure out “what is the next token?”
AFAIR a lexer can hold state, so you can treat a string as a string until the closing quote without worrying that there's, say, an open brace in it on the lexer level.
It seems easier to me to handle such cases on the lexer level and pass more "semantic" tokens to the parser - it pollutes the parser grammar spec less.
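A rough Python sketch of that idea, assuming a toy input language with words, braces and double-quoted strings (the token names are made up for the example). Once the lexer sees an opening quote it stays in "string mode" until the closing quote, so a brace inside a string never reaches the parser as a brace token.

```python
from typing import Iterator, Tuple

Token = Tuple[str, str]  # (kind, text)


def tokenize(src: str) -> Iterator[Token]:
    i = 0
    while i < len(src):
        ch = src[i]
        if ch.isspace():
            i += 1
        elif ch in "{}":
            yield ("LBRACE" if ch == "{" else "RBRACE", ch)
            i += 1
        elif ch == '"':
            # String state: consume up to the closing quote,
            # honouring backslash escapes along the way.
            j = i + 1
            chars = []
            while j < len(src) and src[j] != '"':
                if src[j] == "\\" and j + 1 < len(src):
                    j += 1  # skip the backslash, keep the escaped character
                chars.append(src[j])
                j += 1
            if j >= len(src):
                raise ValueError("unterminated string literal")
            yield ("STRING", "".join(chars))
            i = j + 1
        else:
            # A bare word runs until whitespace, a brace or a quote.
            j = i
            while j < len(src) and not src[j].isspace() and src[j] not in '{}"':
                j += 1
            yield ("WORD", src[i:j])
            i = j


for token in tokenize('block { "text with { inside" }'):
    print(token)
```

The parser then only ever sees one `STRING` token and two real brace tokens, which keeps the grammar free of quoting rules.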
-
I like markdown
-
I might have liked markdown if they fixed some of the more insane parts of it, such as * and _ being treated the same, and requiring doubling up characters for bold.
In its current form, a number of things just feel wrong, no matter how often I use them.
-
I might have liked markdown if they fixed some of the more insane parts of it, such as * and _ being treated the same, and requiring doubling up characters for bold.
Some dialects actually do use _italic_ and *bold*.
Imagine I posted that XKCD about standards here.
Filed under: I won't actually do it because I want you all to exercise your imagination, #rosie
-
LL(k) parsers can be written by hand, but LR(1) and LALR(1) ones can't and they usually have better performance and stability figures.
RD parsers, on the other hand, can be written by hand just fine, and they can do all the same stuff you can do with LR. And they're pretty maintainable.
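As a small illustration of the hand-written recursive descent style, here is a Python sketch for nested BBCode-like tags. The grammar, the lowercase-only tag names and the tuple-based tree are assumptions picked to keep the example short; each grammar rule maps to one method.

```python
import re

# Tokens: [name] opens, [/name] closes, anything else is text.
TOKEN_RE = re.compile(r"\[(/?)([a-z]+)\]|([^\[]+|\[)")


def tokenize(src):
    tokens = []
    for m in TOKEN_RE.finditer(src):
        if m.group(2):
            tokens.append(("CLOSE" if m.group(1) else "OPEN", m.group(2)))
        else:
            tokens.append(("TEXT", m.group(3)))
    return tokens


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def parse_nodes(self, until=None):
        """nodes := (TEXT | element)*, stopping at CLOSE(until) or end of input."""
        nodes = []
        while self.pos < len(self.tokens):
            kind, value = self.tokens[self.pos]
            if kind == "CLOSE":
                if value == until:
                    return nodes  # the caller consumes the closing tag
                raise ValueError(f"unexpected [/{value}]")
            self.pos += 1
            if kind == "TEXT":
                nodes.append(value)
            else:
                nodes.append(self.parse_element(value))
        if until is not None:
            raise ValueError(f"missing [/{until}]")
        return nodes

    def parse_element(self, name):
        """element := OPEN(name) nodes CLOSE(name); OPEN is already consumed."""
        children = self.parse_nodes(until=name)
        self.pos += 1  # consume the matching CLOSE
        return (name, children)


def parse(src):
    return Parser(tokenize(src)).parse_nodes()


print(parse("a[b]x[i]y[/i][/b]"))
```

Mismatched or unclosed tags raise `ValueError` here; a forum would more likely fall back to treating them as plain text.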
-
It seems easier to me to handle such cases on the lexer level and pass more "semantic" tokens to the parser - it pollutes the parser grammar spec less.
That's the normal thing, yes.
-
Or just stop being a WTF, use something like PegJS that writes you a proper parser and be done with it.
Can you prove that I'm not the author of PegJS, trying to organically market my product to people through incremental awareness?
Also, my point is that in a few posts and barely any thought, we have a mostly functional parser that works better than Discourse's regex-driven markdownhtmlbbqISIS
-
@Lorne_Kates said:
Also, my point is that in a few posts and barely any thought, we have a mostly functional parser that works better than Discourse's regex-driven markdownhtmlbbqISIS
It's still worse than either actual Markdown, actual HTML, actual BBCode, or actual ISIS. And at least one of these was given barely any thought too.
-
BBCode got it right years ago.
Things that BBCode makes easier than HTML are easier by such thin margins as to make BBCode kind of pointless, in my view. Go native or go home.
@Lorne_Kates said:
Write the markup to the users.
This is what the formatting buttons attached to Metafilter's edit windows do, and it works just fine.
I like markdown
http://img.pandawhale.com/post-25893-ralph-wiggum-eating-paste-gif-cKiI.gif
-
Things that BBCode makes easier than HTML are easier by such thin margins as to make BBCode kind of pointless, in my view. Go native or go home.
It's easier to sanitize: you just replace all the < and > characters with entities and then parse BBCode stuff.
Yes, laziness is not an excuse, but I'm sure that's part of the reasoning.
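In Python, the "entities first, BBCode second" order of operations might look like the sketch below. The tag list and the function name are assumptions, and the naive string replacement does not check that tags are balanced; it only shows why raw HTML in the input cannot survive the first step.

```python
import html

# Hypothetical list of parameterless BBCode tags to translate.
SIMPLE_TAGS = ("b", "i", "u")


def bbcode_to_html(text: str) -> str:
    # Step 1: neutralise all user-supplied markup (&, < and > become entities).
    safe = html.escape(text, quote=False)
    # Step 2: only now turn the known BBCode tags into real HTML tags.
    for tag in SIMPLE_TAGS:
        safe = safe.replace(f"[{tag}]", f"<{tag}>")
        safe = safe.replace(f"[/{tag}]", f"</{tag}>")
    return safe


print(bbcode_to_html("[b]<script>alert(1)</script>[/b]"))
```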
-
It's easier to sanitize: you just replace all the < and > characters with entities and then parse BBCode stuff.
I don't understand why that should be significantly easier than nobbling HTML tags and/or attributes that don't match a whitelist. You don't even need a full parser for that. Nor do you even need to handle all the odd HTML parsing edge cases, because you can just nobble any < that isn't part of something you do recognize.
I dunno. BBCode has always struck me as a rather knee-jerk way of dealing with the sanitizing issue. Seems to me that once you have a BBCode dialect that can do enough of what HTML does to make it useful, it's going to need handling about as carefully as HTML itself.
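The whitelist alternative can be sketched in a few lines of Python. The allowed tag set is an assumption, and attributes are deliberately not supported: anything that is not exactly one of the recognized tags has its angle brackets turned into entities.

```python
import html
import re

# Hypothetical whitelist: bare tags only, no attributes allowed.
ALLOWED = re.compile(r"</?(?:b|i|em|strong)>")


def sanitize(text: str) -> str:
    out = []
    pos = 0
    for m in ALLOWED.finditer(text):
        # Escape everything between recognized tags...
        out.append(html.escape(text[pos:m.start()], quote=False))
        # ...and pass the recognized tag through untouched.
        out.append(m.group(0))
        pos = m.end()
    out.append(html.escape(text[pos:], quote=False))
    return "".join(out)


print(sanitize('<b>hi</b><script>x</script><b onclick="x">'))
```

This does not balance tags either, so it would still need the same care as the BBCode route once more tags or attributes are allowed.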
-
What difference does it make whether you have to hold shift to type the characters? Even if you use angle brackets with HTML syntax, I will still refer to it as BBCode. It's only HTML when it isn't sanitized.
-
BBCode has always struck me as a rather knee-jerk way of dealing with the sanitizing issue.
I remember the BBCode supported by the old main page site. That had some rather interesting holes that I abused frequently for fun and profit.
-
-
just replace all the < and > characters with entities and then parse BBCode stuff
And that's how you end up with fun CSS injection vulnerabilities like [color=red;position:relative;...]
-
That is not valid BBCode and whoever wrote the tokenizer that recognizes it as such is a smeghead.
-
-
Are you now saying that CS was a quality product? Has Discourse set your expectations that low?
-
You said BBcode is easy to sanitize.
Anyway, I don't see how that's invalid. Not sanitized properly before being used as a CSS color property: yes. Invalid: no.
-
So you're saying <hello-world> is a valid HTML tag?
-
What does that have to do with it? The tag name was color, and the correct syntax is [color=...]. The tag matched that pattern. As far as I know, the only characters not allowed in the data portion of a BBcode tag are [] and newline, so while it's an invalid color property it's not invalid BBcode.
-
Ah, sorry, forgot color was a thing.
Still, there's only a subset of allowed values: named colours and probably hex codes. It's still not rocket science.
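A Python sketch of that kind of value check, assuming a small illustrative set of named colours plus 3- or 6-digit hex codes. The names and the "drop the tag, keep the text" fallback are choices made for the example, not how any particular forum does it.

```python
import html
import re

NAMED_COLOURS = {"red", "green", "blue", "black", "white"}  # illustrative subset
HEX_COLOUR = re.compile(r"#(?:[0-9a-fA-F]{3}|[0-9a-fA-F]{6})")
COLOR_TAG = re.compile(r"\[color=([^\]\n]*)\](.*?)\[/color\]", re.DOTALL)


def is_safe_colour(value: str) -> bool:
    return value.lower() in NAMED_COLOURS or HEX_COLOUR.fullmatch(value) is not None


def render_color(text: str) -> str:
    def replace(m):
        value, body = m.group(1), m.group(2)
        if not is_safe_colour(value):
            return body  # unknown value: drop the tag, keep the text
        return f'<span style="color:{value}">{body}</span>'

    # Escape raw HTML first, then translate only validated color tags.
    return COLOR_TAG.sub(replace, html.escape(text, quote=False))


print(render_color("[color=red;position:relative]hi[/color]"))
```

The point is that the value is matched against a whitelist before it ever reaches a style attribute, so the tag can be syntactically valid BBCode and still be rejected.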