That's just metadata



  • Infiniscrollgate and the accompanying performance discussion reminded me of a WTF from a few years back:

    A tool was responsible for displaying varying amounts of HTML in an embedded browser control. The source data was stored in a different format elsewhere but converted to HTML for editing purposes. The workflow was, essentially:

    • load
    • convert to HTML
    • user edits
    • convert to other format
    • save

    99% of the time, performance was great (as great as can be expected with the data conversion going on in both directions, at least). But that one percent of the time, it could take 10-30 seconds to load even a paragraph of text. It didn't seem to depend on the length of the content at all. So I looked into the database and there was the paragraph of text, all three sentences... and 10.8 MB.

    Turns out that part of the non-HTML format involved storing metadata along with the text. In non-HTML, this used escape sequences (think \n but on steroids). In HTML mode, the metadata was maintained inside HTML comments. The parser had an obscure off-by-one when handling the escape sequences, so instead of going from \stuff to <!--stuff-->, it left in the \ like so: <!--\stuff-->.

    Which was then dutifully converted back to \\stuff for storage. Which re-expanded to <!--\\stuff--> during the next user session. And saved as \\\\stuff...

    The bad 1% of workflows involved three sentences of text, with a trailing HTML comment containing several million escape characters. And an execution time that doubled whenever anyone looked at it.


    Filed under: [No wonder Discourse is so speedy][1], [I had to double up the \ characters to get them to display correctly, coincidence?][2]
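
For anyone who wants to watch the doubling happen, here is a minimal sketch of that round trip. It is a toy model, not the actual tool: the function names, the `<!--...-->` wrapper, and the "double every backslash on save" rule are assumptions chosen to reproduce the \stuff → \\stuff → \\\\stuff progression described above.

```python
def to_html_buggy(stored: str) -> str:
    # Hypothetical converter. A correct one would consume the leading
    # escape character; this one copies the stored text into the HTML
    # comment unchanged, backslashes included.
    return "<!--" + stored + "-->"


def to_storage(html: str) -> str:
    # Hypothetical reverse conversion: take the comment body and escape
    # every literal backslash for storage by doubling it.
    body = html[len("<!--"):-len("-->")]
    return body.replace("\\", "\\\\")


def round_trips(stored: str, sessions: int) -> str:
    # One "session" is load -> convert to HTML -> convert back -> save.
    for _ in range(sessions):
        stored = to_storage(to_html_buggy(stored))
    return stored


if __name__ == "__main__":
    for n in range(4):
        print(n, round_trips("\\stuff", n))
    # Number of backslashes after 20 sessions: 2**20 = 1048576
    print(round_trips("\\stuff", 20).count("\\"))
```

Under these assumptions the backslash count is 2^n after n sessions, so a few dozen views of a record would be enough to reach millions of escape characters.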


  • I encountered a similar problem with the ERP system in use at a company I worked for. The system would allow export and import in CSV format. Unfortunately, the export and import routines followed different rules.

    When exporting, the system would escape double quotes in the data by doubling them up (which is correct behaviour as far as I'm aware). Thus an item with the name 12" Turnip Purifier would be exported as "12"" Turnip Purifier".

    The import routine, however, didn't seem to know about the quote escaping rule and would update the system record to 12"" Turnip Purifier. This would be exported as "12"""" Turnip Purifier" in the next run, leading to a vicious doubling-of-quotes cycle.

    To "fix" it, I had to write a VBA script to output the data in the broken format that the system expected. Sometimes you just have to go renegade in the turnip purification industry.
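
The mismatch can be sketched in a few lines of Python. This is an illustration, not the ERP system's code: `export_field`, `naive_import` and `correct_import` are hypothetical names, and the naive importer simply assumes the system stripped the outer quotes without collapsing the doubled ones.

```python
import csv
import io


def export_field(value: str) -> str:
    # Standard CSV quoting: wrap the field in quotes and double any
    # embedded double quote.
    buf = io.StringIO()
    csv.writer(buf, quoting=csv.QUOTE_ALL).writerow([value])
    return buf.getvalue().rstrip("\r\n")


def naive_import(field: str) -> str:
    # Assumed behaviour of the broken importer: drop the surrounding
    # quotes but leave the doubled quotes inside untouched.
    if len(field) >= 2 and field.startswith('"') and field.endswith('"'):
        return field[1:-1]
    return field


def correct_import(field: str) -> str:
    # What a conforming reader does: "" inside a quoted field becomes ".
    return next(csv.reader([field]))[0]


if __name__ == "__main__":
    name = '12" Turnip Purifier'
    for run in range(3):
        exported = export_field(name)
        name = naive_import(exported)
        print(run + 1, exported, "->", name)
```

With the naive importer the number of quotes in the stored name doubles on every export/import run, while `correct_import(export_field(name))` returns the name unchanged.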



  • As the saying goes: garbage in, ggaarrbbaaggee oouutt.

