Character encoding



  • So we're testing some new functionality, to make our systems work with Initrode's by exchanging XML messages.  Initrode sends us an initialization message (over 40 MB of XML) containing the initial state of their data so our system can sync up.  The message begins with:

    <?xml version="1.0"?>

    No encoding info.  According to the XML standard, that means it's supposed to be in UTF-8.  But our XML parser chokes and dies at several places in the message, on stuff like accented characters and em-dashes.  So I pull out my trusty hex editor and find out that the whole thing is encoded in ISO-8859-1; they just neglected to mention that in the actual XML message!

    So we write to Initrode requesting that, if they're going to use characters outside the ASCII range, they please ensure that their XML message is tagged with the proper encoding attribute, and we describe which entries were causing problems.

    A few days later, we get back a response from Initrode's developers containing a small XML snippet and saying that it looks like proper UTF-8 to them, that they're able to open the file just fine with no encoding attribute and they're not sure what our problem is.  And lo and behold, the hex editor reveals that the special characters in this new version are UTF-8 encoded, where the old one was not.  So they're either trying to pull a fast one or too technically ignorant to be working in this problem space.

    It's getting on towards 10 years now since Joel wrote The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).  And even if they haven't seen that particular article, you'd at least think developers working with accented characters would have absorbed this knowledge one way or another by now... and yet stuff like this still happens.




    [mod - fixed link; removed trailing space - PJH]
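    The mishap described above, a file that claims UTF-8 by omission but is really ISO-8859-1, can be caught mechanically before parsing. A minimal Python sketch (the function name is made up for illustration):

```python
# Sketch: check whether an XML payload with no encoding declaration really
# decodes as UTF-8, the default the XML spec mandates. On failure, report
# the first offending byte, much like the hex-editor hunt described above.
def check_declared_default(data: bytes) -> str:
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError as err:
        return f"not UTF-8: byte 0x{data[err.start]:02x} at offset {err.start}"

# An e-acute encoded as ISO-8859-1 (0xE9) is not a valid UTF-8 sequence.
print(check_declared_default("caf\u00e9".encode("iso-8859-1")))
# not UTF-8: byte 0xe9 at offset 3
```

    Accented characters and em-dashes are exactly the inputs where the two encodings disagree, which is why pure-ASCII test data sails through.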



  • Are you getting this over HTTP? Because it should be valid for the encoding to be set in the headers. Although, it's still a good idea for it to be included in the declaration, too.



  • @morbiuswilters said:

    Are you getting this over HTTP? Because it should be valid for the encoding to be set in the headers. Although, it's still a good idea for it to be included in the declaration, too.

    No, it's actually being generated by their program and written out to a specific folder as a text file, to be read in by our program.  (Long story.  We can accept HTTP, and we do in various other integrations, but their system can't handle it.)



  • This means they must have hand-rolled an XML library. I love it when people spend extra time to make something broken.

    Something similar happened to me a few years ago when another department asked me for an example file, which I provided for them in UTF-16 encoding with a proper declaration. They then created their output by parroting my example and ended up generating a UTF-8 XML file with UTF-16 encoding specified in the declaration. Any attempt to tell them what was wrong was met with blank stares.



  • Fuck it, everybody in the world just needs to learn to speak ASCII.



  • I've had similar happen with teams using PHP trying to write their own SOAP client.  About a week or so ago I spent half a day tracking down an extra space hidden somewhere before the "<?xml".

    In another instance, we changed the generation of the XML to use the proxy generated by .Net (as opposed to the previous hand-rolled version we had), and a client broke because the prefix on the namespace changed (i.e. xmlns:ns1 -> xmlns:ns2).  Trying to explain how namespaces work was like pulling teeth.  Yay string searches for "<ns1:NodeName"!!



  • @Sutherlands said:

    I've had similar happen with teams using PHP trying to write their own SOAP client.  About a week or so ago I spent half a day tracking down an extra space hidden somewhere before the "<?xml".

    Probably someone with whitespace after the closing ?> tag. I never use closing tags in PHP-only files precisely for this reason.
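    That stray-output bug is easy to detect mechanically; a small Python sketch (the helper name is hypothetical) that reports whatever precedes the declaration:

```python
# Sketch: report any stray bytes emitted before the XML declaration, e.g.
# whitespace leaking from after a PHP closing ?> tag as described above.
def junk_before_declaration(payload: bytes) -> bytes:
    idx = payload.find(b"<?xml")
    if idx <= 0:
        return b""  # declaration comes first (or is absent): nothing stray
    return payload[:idx]

print(junk_before_declaration(b' \n<?xml version="1.0"?><a/>'))  # b' \n'
```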


  • Discourse touched me in a no-no place

    @Sutherlands said:

    In another instance, we changed the generation of the XML to use the proxy generated by .Net (as opposed to the previous hand-rolled version we had), and a client broke because the prefix on the namespace changed (i.e. xmlns:ns1 -> xmlns:ns2).  Trying to explain how namespaces work was like pulling teeth.  Yay string searches for "<ns1:NodeName"!!

    At least you only changed the namespace prefix - not the fucking namespace. Announced by email on the day after the rollout. On a web service consumed by thousands of mission-critical, SLA-protected applications with millions of dollars in hourly penalties payable to external customers. In response to a dev team that couldn't figure out how to work with two variations on the same subtree with the same root element name and same namespace but different combinations of subnodes. (No, I have no idea how anyone who has any idea how XML works, or is even rolling their own parser, could screw that up.)

     

    My Monday was awesome.



  • Lol Weng.



  • That's nothing!
    We got cp1251 sent as either utf8 or iso-8859-15. Their machines were configured to use only cp1251, so everything worked there... On the other hand, mostly everyone else did their homework and used the proper encoding hint.



  • Speaking of encodings, I'm always surprised how some services manage to mangle č's in my last name to è's or e's (I'm looking at you, Microsoft) - this implies that the text gets stored as cp1250 somewhere and then read as cp1252, though I can't fathom how that happens.
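    That guess is easy to reproduce: the bytes of cp1250 text reinterpreted as cp1252 turn a c-caron into an e-grave, exactly as described. A two-line Python demonstration:

```python
# 'č' (U+010D) is byte 0xE8 in cp1250; the same byte is 'è' in cp1252, so
# storing as one and reading as the other produces exactly the mangling above.
mangled = "\u010d".encode("cp1250").decode("cp1252")
print(mangled)  # è
```

    The further decay to a plain 'e' would then just be some later component stripping accents.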



  • @Weng said:

    Announced by email on the day after the rollout. On a web service consumed by thousands of mission-critical, SLA-protected applications with millions of dollars in hourly penalties payable to external customers.
     

    I'd like to think that the response to this email was "You fucked up royally. Roll it back NOW," and they did, but a team that couldn't foresee the problems of what they did probably doesn't have anything in place that would allow them to do a clean rollback.

     



  • I had to deal with an application that generated XML with encoding information set to "ISO-8859-1" but actually encoded in CP1252. According to the original dev, "CP1252 is pretty much the same as iso" :'(



  • @tchize said:

    According to the original dev, "CP1252 is pretty much the same as iso" :'(

    In MySQL, it's exactly the same.. D:
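    They are close but not the same: the two encodings differ exactly in the 0x80-0x9F range, where cp1252 places printable punctuation and ISO-8859-1 has C1 control codes. A quick Python illustration:

```python
# cp1252 and ISO-8859-1 agree on every byte except 0x80-0x9F: cp1252 maps
# them to printable characters (smart quotes, em-dash), ISO-8859-1 to C1
# control codes -- exactly the characters that break a mislabeled document.
for b in (0x93, 0x94, 0x97):  # smart quotes and em-dash in cp1252
    print(hex(b), repr(bytes([b]).decode("cp1252")), repr(bytes([b]).decode("iso-8859-1")))
```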



  • @morbiuswilters said:

    ---- it, everybody in the world just needs to learn to speak ASCII.

    No, EBCDIC is superior <ducking>



  • @TheCPUWizard said:

    @morbiuswilters said:

    ---- it, everybody in the world just needs to learn to speak ASCII.

    No, EBCDIC is superior <ducking>

    Yeah, but EBCDIC isn't easy to pronounce. It's like the last name of one of them brides you can buy from Europe. "I Tatanya Ebcdic, me love you long time.."



  • Another XML/character encoding gripe: Why is it that, if the default encoding for XML is UTF-8, Microsoft's XML parser will throw an error if the document it's trying to parse begins with a bloody UTF-8 BOM?!?



  • Because having the BOM in UTF-8 files is an abomination (and specifically, because there must be nothing before <? in xml files)


  • Discourse touched me in a no-no place

    @Mason Wheeler said:

    Another XML/character encoding gripe: Why is it that, if the default encoding for XML is UTF-8, Microsoft's XML parser will throw an error if the document it's trying to parse begins with a bloody UTF-8 BOM?!?



    A better question would be: Why are you using a BOM if UTF-8 is the expected default?  UTF-8 doesn't require a BOM unless the default is ISO-8859.

    I've seen lots of stuff that bombs if UTF-8 is being used with a BOM.  SourceMod's translation system, for example.

     



  • @powerlord said:

    UTF-8 doesn't require a BOM unless the default is ISO-8859.

    I don't understand this sentence.



  • @ender said:

    Because having the BOM in UTF-8 files is an abomination (and specifically, because there must be nothing before <? in xml files)

    Are you sure about that?

    @W3C Recommendation 04 February 2004, edited in place 15 April 2004 said:


    Entities encoded in UTF-16 must and entities encoded in UTF-8 may begin with the Byte Order Mark described in ISO/IEC 10646 [ISO/IEC 10646] or Unicode [Unicode] (the ZERO WIDTH NO-BREAK SPACE character, #xFEFF). This is an encoding signature, not part of either the markup or the character data of the XML document. XML processors must be able to use this character to differentiate between UTF-8 and UTF-16 encoded documents.
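    The signature check the quoted passage describes is mechanical. A Python sketch (simplified: UTF-32 signatures and BOM-less UTF-16 are ignored):

```python
import codecs

# Use the optional Byte Order Mark as an encoding signature, per the spec
# text quoted above: UTF-16 entities must carry one, UTF-8 entities may.
def sniff_bom(data: bytes) -> str:
    if data.startswith(codecs.BOM_UTF8):      # EF BB BF
        return "utf-8"
    if data.startswith(codecs.BOM_UTF16_LE):  # FF FE
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):  # FE FF
        return "utf-16-be"
    return "utf-8"  # no signature: fall back to the spec default
```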



  • @powerlord said:

    I've seen lots of stuff that bombs if UTF-8 is being used with a BOM.

    Yay, bugs!



  • @ender said:

    Because having the BOM in UTF-8 files is an abomination (and specifically, because there must be nothing before <? in xml files)

    BOM is permitted in UTF-8 XML. It comes before the <?xml and parsers should be able to handle it. Also, I thought a lot of Microsoft tools added a BOM to UTF-8 files, so it's weird their parser can't work with that.



  • rest of the message that the forum ate.



  • @henke37 said:

    rest of the message that the forum ate.

    Shit, I forgot to encode my entities! Fixed now.



  • @Jaime said:

    This means they must have hand-rolled an XML library. I love it when people spend extra time to make something broken.

    I wouldn't give them that much credit-- they're probably just generating the XML via string concatenation.



  • @MiffTheFox said:

    @Jaime said:

    This means they must have hand-rolled an XML library. I love it when people spend extra time to make something broken.

    I wouldn't give them that much credit-- they're probably just generating the XML via string concatenation.

    It depends on the circumstances, but I'd say there are definitely times when string concatenation is the superior solution. Although I would prefer to avoid XML altogether.



  • @morbiuswilters said:

    Yeah, but EBCDIC isn't easy to pronounce.

    How is it hard? Ibby C. Dick.



  • @Spectre said:

    @morbiuswilters said:
    Yeah, but EBCDIC isn't easy to pronounce.

    How is it hard? Ibby C. Dick.

    I don't know where you got "Ibby". Ebb C Dick would be appropriate, but it doesn't flow off the tongue as neatly as ass-key.



  • I got your ASCII right here.

    Prepare to be boarded.



  • @morbiuswilters said:

    @Spectre said:
    @morbiuswilters said:
    Yeah, but EBCDIC isn't easy to pronounce. [DIC: http://www.youtube.com/watch?v=NHO84rOp8FQ]

    How is it hard? Ibby C. Dick.

    I don't know where you got "Ibby".

    Just spell EB out.



  • As if it wasn't bad enough that XML is used the way a bad carpenter uses a hammer for every job, people who write specs where vital file information isn't actually contained in the file itself but is considered "valid enough" in the transmission protocol wrapper need to be shot.

    XML is supposed to be portable, and HTTP isn't the only transmission protocol used to move it around from one place to another. FTP and its secure counterparts SCP and SFTP don't send headers; what you get is the actual file, nothing more.

    It's the same kind of idiocy where text documents are generated containing UTF-8 or UTF-16 and they neglect to include a BOM header, so upon receipt of the file we have to do all kinds of fancy scanning to determine which of multiple flavors the characters might be encoded in before we can even start to use it.

    If there's no encoding specified in the file, it should be rejected as invalid on general principles. If more people did this, maybe the people creating these files would be forced to show a little discipline. Hell, if you've already wrapped 8 bytes of data in 5k of unnecessary XML "wrapper", what's an extra 20 bytes to specify the encoding too?



  • @daveime said:

    As if it wasn't bad enough that XML is used the way a bad carpenter uses a hammer for every job, people who write specs where vital file information isn't actually contained in the file itself but is considered "valid enough" in the transmission protocol wrapper need to be shot.

    Is this your first experience with a W3C spec? Because XML is one of the better thought-out ones.. I'd favor shooting them, but they'd probably get off on it, the sick bastards.



  • @morbiuswilters said:

    @daveime said:
    As if it wasn't bad enough that XML is used the way a bad carpenter uses a hammer for every job, people who write specs where vital file information isn't actually contained in the file itself but is considered "valid enough" in the transmission protocol wrapper need to be shot.

    Is this your first experience with a W3C spec? Because XML is one of the better thought-out ones.. I'd favor shooting them, but they'd probably get off on it, the sick bastards.

    I think he was referring to people that write schema/DTDs (or the specs for one) for some XML, rather than write the W3C XML specs themselves.

    I've had this very issue - XML rejected by the customer because it was missing some other information, even though their schema claimed it was valid... because the schema said this information was optional and I had omitted it. So I asked for a revised schema, added in the missing content, validated it against this new XSD and all was fine, until they complained that something else was missing...

    When I asked what was wrong with the XML I sent, they responded with some reformatted XML of mine containing additional elements but lacking all the comments and formatting of mine. Right. Let's respond to the question with an example but no explanation, that ought to clear it up. Yet my XML still "passed" their updated schema.

    Grrrr....



  • @Cassidy said:

    @morbiuswilters said:

    @daveime said:
    As if it wasn't bad enough that XML is used the way a bad carpenter uses a hammer for every job, people who write specs where vital file information isn't actually contained in the file itself but is considered "valid enough" in the transmission protocol wrapper need to be shot.

    Is this your first experience with a W3C spec? Because XML is one of the better thought-out ones.. I'd favor shooting them, but they'd probably get off on it, the sick bastards.

    I think he was referring to people that write schema/DTDs (or the specs for one) for some XML, rather than write the W3C XML specs themselves.

    I don't.



  • @Spectre said:

    @Cassidy said:
    I think.

    I don't.

    Might explain quite a few things <grin>



  • @Cassidy said:

    I think he was referring to people that write schema/DTDs (or the specs for one) for some XML, rather than write the W3C XML specs themselves.

    No, he was referring to the fact that XML doesn't require an encoding be specified in-band. Specifically, you can specify the encoding in HTTP headers but leave it out of the XML itself. (I always include it in both headers and inline.) Combine that with the fact that the UTF-8 BOM is optional, and determining the encoding becomes much more difficult (hence his comment about having to scan the document looking for certain byte sequences that might indicate encoding). This is all valid according to the W3C. It's also the most sensible spec to come out of the W3C.



  • @morbiuswilters said:

    @Cassidy said:
    I think he was referring to people that write schema/DTDs (or the specs for one) for some XML, rather than write the W3C XML specs themselves.

    No, he was referring to the fact that XML doesn't require an encoding be specified in-band. Specifically, you can specify the encoding in HTTP headers but leave it out of the XML itself. (I always include it in both headers and inline.) Combine that with the fact that the UTF-8 BOM is optional, and determining the encoding becomes much more difficult (hence his comment about having to scan the document looking for certain byte sequences that might indicate encoding). This is all valid according to the W3C. It's also the most sensible spec to come out of the W3C.

    It seems to make sense to me.

    1. If the first two bytes are a BOM, then read it, note the endianness, and subsequently ignore it.
    2. If there is an encoding specified, use that encoding.
    3. If there is no encoding specified, use UTF-8.

    It's not as though the encoding is undefined if missing. The problem is people who don't understand the spec, and assume different defaults.
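    Those three steps fit in a few lines. A Python sketch (simplified to the UTF-8 BOM case; the regex for the encoding pseudo-attribute is deliberately loose):

```python
import codecs
import re

# Implement the defaulting rules listed above: honor a BOM, then any inline
# encoding pseudo-attribute, and fall back to UTF-8 when both are absent.
def xml_encoding(data: bytes) -> str:
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]  # note it, then ignore it
    m = re.match(rb'<\?xml[^>]*encoding=["\']([A-Za-z0-9._-]+)["\']', data)
    if m:
        return m.group(1).decode("ascii").lower()
    return "utf-8"

print(xml_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>'))  # iso-8859-1
```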



  • @morbiuswilters said:

    @Spectre said:

    @morbiuswilters said:
    Yeah, but EBCDIC isn't easy to pronounce.

    How is it hard? Ibby C. Dick.

    I don't know where you got "Ibby". Ebb C Dick would be appropriate, but it doesn't flow off the tongue as neatly as ass-key.

    I've always just said "ebb-kuh-dick".



  • @pkmnfrk said:

    It seems to make sense to me.

    1. If the first two bytes are a BOM, then read it, note the endianness, and subsequently ignore it.
    2. If there is an encoding specified, use that encoding.
    3. If there is no encoding specified, use UTF-8.

    It's not as though the encoding is undefined if missing. The problem is people who don't understand the spec, and assume different defaults.

    The first three bytes might be a BOM. Also, the problem is with default encodings. Do you people need this spelled out for you? The spec sucks (among other reasons) because it does not enforce consistency. There should always be an encoding explicitly specified and it should be a fatal error to not have one.



  • @morbiuswilters said:

    No, he was referring to the fact that XML doesn't require an encoding be specified in-band

    Ah, okay. I've always included it inline as the "final say" in case something happens to the encoding part-way.

    @morbiuswilters said:

    (I always include it in both headers and inline.)

    What if there's a conflict between the two?



  • @Cassidy said:

    @morbiuswilters said:

    (I always include it in both headers and inline.)

    What if there's a conflict between the two?

    There won't be because I'm not incompetent.



  • @morbiuswilters said:

    @Cassidy said:

    @morbiuswilters said:

    (I always include it in both headers and inline.)

    What if there's a conflict between the two?

    There won't be because I'm not incompetent.

    But some of the people I work with are, so what do I do?



  • @Sutherlands said:

    @morbiuswilters said:

    There won't be because I'm not incompetent.
    But some of the people I work with are, so what do I do?

    That was my question: I'm not assuming that you're always going to be in control of the feed (someone could have farted around with a webserver config so that the headers are now tainted, etc.), or that the message actually originates from you rather than from someone else who took a similar belt-and-braces approach.

    It's more the point of: if someone receives two items of data that conflict, which would take priority? I'd have thought the embedded XML declaration would win overall.



  • @Sutherlands said:

    @morbiuswilters said:

    @Cassidy said:

    @morbiuswilters said:

    (I always include it in both headers and inline.)

    What if there's a conflict between the two?

    There won't be because I'm not incompetent.
    But some of the people I work with are, so what do I do?

    What do you do if they can't figure out their pants zipper and piss all over the bathroom wall? Are you allowed to punch them until they learn? I really don't know, I don't spend my days babysitting retarded people.



  • @Cassidy said:

    It's more the point of: if someone receives two items of data that conflict, which would take priority? I'd have thought the embedded XML declaration would win overall.

    In a sensible spec, the inline encoding would be used. This is a W3C spec, however, so they just punted: basically it depends on whatever transport protocol is used. So for HTTP we refer to RFC 3023. Other transport protocols may have their own rules.


  • Discourse touched me in a no-no place

    @Sutherlands said:

    @morbiuswilters said:

    @Cassidy said:

    @morbiuswilters said:

    (I always include it in both headers and inline.)

    What if there's a conflict between the two?

    There won't be because I'm not incompetent.
    But some of the people I work with are, so what do I do?

    Wouldn't the sensible (hah!) thing to do be to use the most recent specification? i.e. use the one in the headers if specified, and replace it with the one inline if one is specified?



  • @PJH said:

    Wouldn't the sensible (hah!) thing to do be to use the most recent specification? i.e. use the one in the headers if specified, and replace it with the one inline if one is specified?

    HTTP headers take precedence over inline declarations. The theory being that the server may transcode the data without altering the declaration.
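    So, under HTTP, the resolution order can be sketched like this (a simplification: RFC 3023's full rules also depend on the media type, and the header string below is just an example):

```python
import re
from typing import Optional

# Pick the effective charset for XML served over HTTP: a charset in the
# Content-Type header wins over the inline declaration, which in turn
# wins over the UTF-8 default.
def effective_charset(content_type: str, inline: Optional[str]) -> str:
    m = re.search(r"charset=([^;\s]+)", content_type)
    if m:
        return m.group(1).lower()  # transport metadata takes precedence
    return (inline or "utf-8").lower()

print(effective_charset("application/xml; charset=ISO-8859-1", "UTF-8"))  # iso-8859-1
```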



  • @morbiuswilters said:

    @PJH said:
    Wouldn't the sensible (hah!) thing to do be to use the most recent specification? i.e. use the one in the headers if specified, and replace it with the one inline if one is specified?
    HTTP headers take precedence over inline declarations. The theory being that the server may transcode the data without altering the declaration.

    This shows why putting the encoding in the document is so stupid.  Any time an XML document is stored or transmitted, there is a risk that it will be transcoded.  The likelihood of the software responsible for transmitting/storing it being "XML aware" and fixing the document is quite low.  It's not like this is a new problem that the W3C couldn't have expected; FTP implementations are infamous for causing problems by doing improper CRLF translations.



  • @Jaime said:

    This shows why putting the encoding in the document is so stupid.  Any time an XML document is stored or transmitted, there is a risk that it will be transcoded.  The likelihood of the software responsible for transmitting/storing it being "XML aware" and fixing the document is quite low.

    Transmit? Sure. Store? Seems unlikely. But, see, the problem here is software that transcodes my data. Why are you doing that? You shouldn't be doing that (or if you absolutely must you should be XML-aware so you can make the appropriate change to the document).

    @Jaime said:

    It's not like this is a new problem that the W3C couldn't have expected; FTP implementations are infamous for causing problems by doing improper CRLF translations.

    I haven't used FTP for, like, 14 years, so I'll take your word on it. Also, the problem sounds like the CRLF translations: why are you doing that? Stop doing that.



  • @morbiuswilters said:

    Transmit? Sure. Store? Seems unlikely. But, see, the problem here is software that transcodes my data. Why are you doing that? You shouldn't be doing that (or if you absolutely must you should be XML-aware so you can make the appropriate change to the document).

    Databases are the biggest culprit here.  A lot of people store XML data as strings in databases and modern DBMSs do a lot of transcoding.

