System "downloaded local copies of images" job mangles img tags

DoctorJones

Bug: I made a post that contained an image from an external site. Discourse downloaded a local copy and edited my post to point at that copy, but in the process managed to mangle the img tag.

It changed this

<img src="http://exampledomain.com/imagename.jpg" />

to this

<img src="<img src='/uploads/default/2045/e2d65c2e8c6b59be.jpg'>">

Repro: Make a post that includes an image tag that points to an image on an external site.

Live example: Here's an example image, hopefully system will come along and edit this so this post can be your repro.

If you can't see the image below, then it means that system has been along an mangled the img tag. Take a look at the source and edit history of this post to see what I mean.

P.S. I think this is a cool feature. It used to work, but seems to have gotten borked at some point.

Edit: System has been along and mangled the tag. The old version of the tag looked like this

<img src="http://static2.fjcdn.com/comments/A+Repost+It+can+t+be+true+_e63731036759224c3c80d8f133a814c9.jpg" />

and the new mangled version is this

<img src="<img src='/uploads/default/2045/e2d65c2e8c6b59be.jpg'>" />

skotl

Ah - that explains it!

I saw this yesterday but thought I was misusing the upload functionality.

In that case, consider this as "reproduced".

riking

Found the bug. Kinda.

        src = "http:" + src if src.start_with?("//")

...

            # have we successfully downloaded that file?
            if downloaded_urls[src].present?
              url = downloaded_urls[src]
              escaped_src = src.gsub("?", "\\?").gsub(".", "\\.").gsub("+", "\\+")
              # there are 6 ways to insert an image in a post
              # HTML tag - <img src="http://...">
              raw.gsub!(/src=["']#{escaped_src}["']/i, "src='#{url}'")
              # BBCode tag - [img]http://...[/img]
              raw.gsub!(/\[img\]#{escaped_src}\[\/img\]/i, "[img]#{url}[/img]")
              # Markdown linked image - [![alt](http://...)](http://...)
              raw.gsub!(/\[!\[([^\]]*)\]\(#{escaped_src}\)\]/) { "[<img src='#{url}' alt='#{$1}'>]" }
              # Markdown inline - ![alt](http://...)
              raw.gsub!(/!\[([^\]]*)\]\(#{escaped_src}\)/) { "![#{$1}](#{url})" }
              # Markdown reference - [x]: http://
              raw.gsub!(/\[(\d+)\]: #{escaped_src}/) { "[#{$1}]: #{url}" }
              # Direct link
              raw.gsub!(src, "<img src='#{url}'>")
            end

Specifically, the assumption that if you have an <img> tag, the protocol is present.

So the first gsub! fails and the last one executes.

~~Workaround: Always specify your protocol (or a //) in the src=.~~ NOT_A_WORKAROUND

chubertdev

Bug: editor allows image tags without alt attribute

dhromed

@riking said:

Workaround: Always specify your protocol (or a //) in the src=.

DoctorJones clearly specified the protocol in his example img tags.

riking

It wasn't there before, so I made an incorrect inference. Will strike.

DoctorJones

My bad, it wasn't in my top example with the fake URL, but it was in my bottom real world example.

DoctorJones

To clarify, it fails if you specify the protocol. I haven't tested it without the protocol.

dkf

If I've understood jwz right, you've now got at least 2⁵ problems…

end

@zogstrip you should take a look at this.

Maciejasjmj

I wonder if you could throw this thing into an infinite loop...

fatbull

Yes, you can.

How do I report it? I'd rather not discuss this in public.

end

email team@discourse.org

fatbull

Done.

Also, the RegExps don't catch every possibility. For example, it is allowed to add whitespace around [img] tags. I'm sure there are many more subtle differences between this and the actual parser.

Wouldn't it be a lot easier to substitute images after a post has been "cooked"?

Onyx

@fatbull said:

Yes, you can.

How do I report it? I'd rather not discuss this in public.

Damn. There goes some of that precious little fun we can have here...

Filed under: Bring back signature guy!

DoctorJones

@Onyx said:

Filed under: Bring back signature guy!

Filed under: New feature request: petitions

subscript_error

@DoctorJones said:

Filed under: New feature request: petitions

Petitions are already supported, they're auto-detected just like polls. You just need to make sure your post starts with the phrase "I'VE GOT A LOT OF PROBLEMS WITH YOU PEOPLE" (case sensitive) and then a bulleted list of your demands. Ordered lists are supported too, but only if the numbers are non-sequential and sometimes decrease. Then include 15 or more consecutive <hr> elements at the bottom for signature lines, and you're done!

If you've formatted it correctly, the system user will edit your post to include the appropriate <manifesto> markup and will e-mail three copies of your post to all registered users. Easy!

Filed under: Give it a try!

lushr

@riking said:

raw.gsub!(/src=["']#{escaped_src}["']/i, "src='#{url}'")

TRWTF: using regex to parse non-regular languages (Markdown, HTML, BBCode, etc.).

Seriously, why? There are a number of projects that use RDP, LL, LR, LALR, and PEG parsers that produce ASTs that allow you do do things, like, say, pattern matching in a disciplined and rigorous manner. They also allow you to do quotes easily. There's a small additional overhead to using a complex parser, but much of this can be regained though clever caching, and you gain a lot of nice tools.

Maciejasjmj

Meh. Do you really need to switch to an actual parser just to eliminate a few edge cases? Sure, the general consensus is that parsing HTML with regexes is an ungodly abomination that opens up the gates of R'lyeh, but that's the general case - and for a fairly limited subset of HTML, regexes will work just fine (if not faster).

Seriously, I have some experience with YACC and stuff, but if somebody told me to write a grammar for Markdown, I'd probably go "fuck that shit" and do it with regexes.

lushr

Markdown is a bit of a nasty language to parse, because it's layout-sensitive, so the parsers have to be able to handle that. Otherwise, the grammar itself isn't all that bad.

There are a number of different approaches to the issue. The one that I'm most familiar with uses the Copper LALR parser and its context-sensitive scanning to create the correct layout tokens (linky). There are a number of other approaches, though, including one in JavaCC (LL(k), here) and a suprisingly nice JavaScript PEG here.

Markdown isn't a particularly terrible language to parse, other than its abysmal specification and the layout sensitivity.

Using an actual parser provides a few major advantages. The first is that (if you use an LR parser, that is) ambiguities are detected at parser generation time, instead of in a bug report. They also keep you from making as many mistakes writing the parser, as it's easier to debug a declarative grammar description language than 4000 lines of handwritten recursive descent parser.

Honestly, though, the main issue for Discourse is that their abomination of a Markdown parser doesn't have an intermediate AST phase. It directly produces HTML. This precludes them from doing quotes in any sane way, and also causes issues like the above, as they can't match on document symbols, instead having to examine the source code itself.

Edit to add:
I should note that I'm an academic working in parsing as one of my primary research topics, so my perspective is extremely biased. It would be nice if someone who isn't quite as wildly skewed in their perspective could give their opinion on what works in practice.

end

The Markdown spec itself is ambiguous, so you can't parse it formally, at least not without modifying the spec.

Filed under: John Gruber is not a computer scientist

Onyx

@codinghorror said:

The Markdown spec itself is ambiguous, so you can't parse it formally, at least not without modifying the spec.

error

@Onyx said:

These are emoji we desperately need.

Legitimate feature request: Admin-defined custom emoji.

HardwareGeek

@Onyx said:

If we're going to have lame emoji, we definitely need support for those two.

lushr

Well, any (sane, non GLR/GLL, and you can emulate these relatively easily) implementation of Markdown defines an unambiguous interpretation of the "Markdown" language (because everyone hates parse forests). Let's call the one that's implemented here Markdown-DIS. Markdown-DIS is probably in the set of languages recognizable by a predictive recursive descent parser, e.g. LL(k). Then, you could certainly use a PEG or JavaCC to generate a parser that recognized Markdown-DIS, albeit likely with an extra stage of lexing.

In any case, why not just add an AST intermediate form to your existing markdown parser? It would solve many of your issues, most notably quotes.

end

Well, this is why babelmark exists:

Babelmark 2 - Compare markdown implementations

The only "spec" is to gather results from (n) different implementations and decide which one is "more correct" based on consensus of result. Good times..

error

@codinghorror said:

The only "spec" is to gather results from (n) different implementations and decide which one is "more correct" based on consensus of result. Good times..

You've seen this. You saw that this was the case and said to yourself, "yes, this is the future. This is the language I will integrate into my software."

end

Well, we are working to fix the spec. But that takes time. You can ping John MacFarlane if you want to speed it up, he's currently doing the lion's share of the work at the moment.

John MacFarlane - Home

lushr

What formalism is he using?

Edit to ask: Still, why not use an AST intermediate form? As I mentioned before, it would make your life a whole lot easier.

Edit 2: sent email.

sam

Tracked here:

https://meta.discourse.org/t/system-is-mangling-img-tags/16432

Feel free to close

PJH