Hello, anybody hear of XML?



  • So my company decided that it would like to display news and economic information on the company web site.  Somebody in Marketing (I know, doomed from the start) found a vendor for each type of feed and, blam, the project was under way. 

     

    After meeting (via teleconference) with the vendors, my team and I came up with some pretty standard design documents that modeled the process based on the specifications given to us by Marketing and on the sample XML documents the vendors provided (XML files being how they would deliver the feeds to us).  Everything was pretty standard, only a few XML tags, and this seemed like a very simple assignment.

     

    After about a day or so my team had managed to write the code that would read in the XML files and process them.  Since we use Java and XMLBeans, this was a trivial task.  Now, in order to receive actual data from the vendors we had to wait two weeks before they would be able to provide us with files to parse.  WAIT UP!!  So, you sold us a product that isn't finished?  These vendors told us time and time again how "our other customers love the feeds, they have no problems parsing the XML files... yada yada".  So, if your other customers "LOVE" the feeds: where are the feeds?

     

    Nevertheless, three weeks later we finally started to get files.  Now, we had agreed upon certain standards for indicating countries: i.e. United Kingdom (not UK, Britain, Great Britain, etc.) and many other standards such as United States (not USA, or United States of America) in order to ensure that we were on the same page.  Especially since both of us wanted to use the country names as a lookup table for referencing articles from the DB.  What did we start getting?  UK, US of A, Portuguals... etc.  Some were just typos and others were flat-out against the specs.  This confusion took two weeks for them to clear up in their feed (which, remember, EVERYBODY loved...).

     

    So the names are agreed upon and we are back in business parsing XML, when all of a sudden we get complete crap in parts of the files.  Some files had junk data and we could not figure out why.  Finally I decided to see if it was an encoding issue.  I changed the XML header from ISO to UTF-8 and *blam* the crap data started showing up as valid 1/2 symbols and many other neat characters.  I then informed the vendor of this slight error in their feed, to which they responded that it would take them a week to make sure that things were encoded in ISO instead of UTF-8.  All they had to do was change the XML header; in retrospect maybe I should have just changed it myself...

     

    A week later we are getting ISO headers and ISO-formatted XML, everybody was happy, and there was much rejoicing!  Wait... the Euro symbol and many others are not showing up correctly.  Could it be that these characters are only representable in UTF-8 and not in ISO?  Yes, it appears so.  Another week goes by while they re-encode things as UTF-8 and give the files a UTF-8 header.  (Hey, didn't I mention just changing the header a month ago?)

     

    Well, we finally got our nifty UTF-8 files, but now they won't process!!!  We are getting a neat little CDATA error.  Now, keep in mind the guys on my team are Java programmers.  Not XML experts, not web programmers.  Both of us are junior programmers (I myself just graduated in December).  A Google search finally found the problem: they are now encoding these files for Windows, which adds a character to the beginning of the file!  YAY!  Is it too much to ask for a vendor that knows what they are doing?

     

    This whole project has had me just wondering: WTF????

     

    New to this site and I LOVE it... I hope my story didn't bore you too much; sorry if I'm not that great of a storyteller :)



  • That extra character shouldn't be an issue--whitespace is whitespace is whitespace, even in XML. 



  • What is that extra character?  Is it a space, etc.?  They must be doing it as a custom thing, because this character is not really required.  And one character isn't enough to specify a byte order mark either, so throw that possibility out also.

     If only your parsing routine is failing, and the character is a space, just process the file as a big string and trim it.



  • @mrprogguy said:

    That extra character shouldn't be an issue--whitespace is whitespace is whitespace, even in XML. 

    It is an issue, actually. Can't remember what that character is, I think it was something called the byte-order mark that has no use outside of Windows-centric apps. Most XML readers/parsers will barf when they find these, because they are not valid UTF-8.



  • I totally feel your pain.  I work on a "legacy" product doing enhancements, so the bulk of my work is implementing new interfaces to hardware and 3rd party software applications.  I experience this kind of thing on almost every project, and it's made me seriously doubt the quality of programmers in the industry.  Well, to be fair, I doubt even more the quality of management in the industry because generally these are organizational, communication, or collaboration problems.

    For example, my most recent interface was to a 3rd party database app.  First I have to say that these vendors are chosen by my customer, who then asks us (me) to implement the interface.  We're easygoing and we've done it for decades, so we're always like "give us the specs and we'll rock it out".  So in May, I was asked to do this database interface.  Just dump the data from my system into a tab-delimited file and make a batch job to automatically ftp all the accumulated files over to the database system every few minutes.  We have the very same interface to the very same vendor in place at two other customers, and we implement this one exactly the same as those.  It took me about 2 weeks to complete in a leisurely fashion - ready to test.  The customer then tells me that the other vendor is still working on their end.  No problem, I work on other projects while I wait.  A few weeks later, the customer asks if I've had any contact with the vendor.  Nope.  She says she's frustrated with them because the project deadline was April 2007.  Jeez.

    Long story short, many weeks ensue with me and her and her boss pummelling the 3rd party people to try to get their issues resolved.  They kept pointing fingers at something wrong with my data.  I check it immediately every time they complain.  I count the fields, verify the data in the fields, count the tabs to make sure I have them right.  It's all good.  For the life of me I simply can't get them to tell me approximately which part of the data their import is choking on.  A few weeks go by where they dither by saying "we're looking into it".  Finally I beat up the vendor's Vice President (I doubt his qualifications) to the point where he tells me to just stick an additional tab at a certain point in the file.  No problem.  Makes no sense because I have the correct tabs in the right places according to the spec.  But whatever, I'll do it if it makes my customer happy.  It works.  But why did it take them over a year to complete their end of the project which they've done before, a full year of which passed by even before I was asked to do our end of it? 



  • @danixdefcon5 said:

    I think it was something called the byte-order mark that has no use outside of Windows-centric apps

    That is a surprising pain in the ass, and is actually mentioned in the non-normative section of the XML standard.  Unfortunately, it's an extension that generally only Windows readers implement.  



  • @danixdefcon5 said:

    It is an issue, actually. Can't remember what that character is, I think it was something called the byte-order mark that has no use outside of Windows-centric apps. Most XML readers/parsers will barf when they find these, because they are not valid UTF-8.

    Actually, a BOM is always valid in XML, and if your so-called XML parser barfs, it's not a valid XML parser. Time to contact your vendor.



  • There is a bright side to the OP's article: he does not have to deal with Oracle consultants. Trust me, just be thankful. Your job ain't so bad :)



  • AFAIK a BOM is optional for UTF-8 files and it has absolutely nothing to do with XML. It's just a text-encoding thing, and if I recall correctly it should decode to an existing Unicode char if it's interpreted incorrectly. So I find it weird that standard Java components barf on it. If this is indeed a weak spot for Java (which I doubt) you should simply write text-encoding detection that will verify whether the text is UTF-8, ISO-Latin-# or Windows-####. Because my guess is that they are just sending you a mix of whatever some data-entry person copy-pastes from Word without converting it to the encoding they actually say they will send.

    And just to help you out, here's the first hit from google: http://fredeaker.blogspot.com/2007/01/character-encoding-detection.html

     

    I'm quite sure the company that is supplying you these files is a big WTF, but you don't sound much better either. I mean, right from the start you agreed on full-text country names? ISO country codes too easy or something? And why invent a new format at all?  There are already dozens of formats to exchange this kind of information. The easiest would be RSS or Atom. 



  • <pedantic>
    Wrong: The BOM is not valid for UTF-8; it's only needed to mark UTF-16 little-endian vs. UTF-16 big-endian encodings.  UTF-8 can't be big- or little-endian; they're probably converting from UTF-16LE (Windows) to UTF-8 without stripping the BOM.

    </pedantic> 



  • @fluffy777 said:

    <pedantic>
    Wrong: The BOM is not valid for UTF-8; it's only needed to mark UTF-16 little-endian vs. UTF-16 big-endian encodings.  UTF-8 can't be big- or little-endian; they're probably converting from UTF-16LE (Windows) to UTF-8 without stripping the BOM.

    </pedantic> 

     

    If you want to be a nitpicker, at least get your facts straight.

    http://unicode.org/faq/utf_bom.html#29

    Q: Can a UTF-8 data stream contain the BOM character (in UTF-8 form)? If yes, then can I still assume the remaining UTF-8 bytes are in big-endian order?

    A: Yes, UTF-8 can contain a BOM. ...... (explaining that it isn't useful)

    But even if it wasn't valid, that wouldn't change a thing. In real-world situations you will at some point get UTF-8 text with a BOM. So either you support it, or your program is broken. (Supporting it shouldn't be more than a couple of lines of code anyway.)
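For illustration, those "couple of lines" might look like the following in Java: a sketch (not anyone's actual code from this thread) that wraps an InputStream and skips a leading EF BB BF sequence before handing the stream to an XML parser. The class and method names are invented for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

public class BomStripper {
    // The UTF-8 "BOM" is the three-byte sequence EF BB BF.
    private static final int[] UTF8_BOM = {0xEF, 0xBB, 0xBF};

    // Wraps a stream and silently skips a leading UTF-8 BOM if present.
    public static InputStream skipBom(InputStream in) throws IOException {
        PushbackInputStream pb = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int read = pb.read(head, 0, 3);
        boolean isBom = read == 3
                && (head[0] & 0xFF) == UTF8_BOM[0]
                && (head[1] & 0xFF) == UTF8_BOM[1]
                && (head[2] & 0xFF) == UTF8_BOM[2];
        if (!isBom && read > 0) {
            pb.unread(head, 0, read); // not a BOM: put the bytes back
        }
        return pb;
    }

    public static void main(String[] args) throws IOException {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', '?'};
        InputStream clean = skipBom(new ByteArrayInputStream(withBom));
        System.out.println((char) clean.read()); // prints '<'
    }
}
```

Feeding the wrapped stream to XMLBeans (or any other parser) then avoids the mystery character at the start of the file.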



  • @fluffy777 said:

    <pedantic>
    Wrong: The BOM is not valid for UTF-8; it's only needed to mark UTF-16 little-endian vs. UTF-16 big-endian encodings.  UTF-8 can't be big- or little-endian; they're probably converting from UTF-16LE (Windows) to UTF-8 without stripping the BOM.

    </pedantic> 

    There is a "BOM" for UTF-8, but it is actually a misnomer. This so-called BOM is only a sort of signature to tell the program that reads the data that it is UTF-8 encoded. I've worked with UTF-8 XML documents (made by myself) that had a BOM and I had no problem with them. Granted, I don't program in Java.

    EDIT: On the other hand, if there's whitespace before the XML declaration (<?xml ...), it is indeed an error.



  • @FraGag said:

    On the other hand, if there's whitespace before the XML declaration (<?xml ...), it is indeed an error.
     

    Damn, beat me to it; I've had this problem before.  Luckily the .NET parser gives you a very clear error message, i.e. "whitespace found before xml declaration".  Not sure why this is; you'd think the parser would just TrimStart and TrimEnd, but I think it is actually in the XML definition.



  • @element[0] said:

    Damn, beat me to it; I've had this problem before.  Luckily the .NET parser gives you a very clear error message, i.e. "whitespace found before xml declaration".  Not sure why this is; you'd think the parser would just TrimStart and TrimEnd, but I think it is actually in the XML definition.

     

    That's because whitespaces are essentially zero bits and lots of unnecessary zero bits put too much negative charge on CPU and might damage it. Turing machines are particularly vulnerable to negative charges, you know.



  • @arty said:

    @danixdefcon5 said:

    I think it was something called the byte-order mark that has no use outside of Windows-centric apps

    That is a surprising pain in the ass, and actually is mentioned in the non-normative section of the XML standard.  Unfortunately, it's an extension that generally only windows readers implement.  

     

    Yeah, stupid Microsoft following the published Unicode standard. WiNdoZe sUks.*nIx 4 eVa etc etc



  • @GrahamS said:

    @danixdefcon5 said:

    I think it was something called the byte-order mark that has no use outside of Windows-centric apps

    That is a surprising pain in the ass, and actually is mentioned in the non-normative section of the XML standard.  Unfortunately, it's an extension that generally only windows readers implement.  

    Some editors use the byte order mark to determine the encoding of any text file, whether it be XML or not. I think it's very helpful, because without it, some Chinese byte sequence could be identical to an English ASCII byte sequence.  It's a good way to know right from the start which encoding to use. Although I do agree it is kind of redundant in XML, since the declaration can specify the encoding; but the declaration would be in the document's encoding, so the BOM does serve a legitimate purpose.

     I don't see that using it makes something a Windows thing; it just helps to conform to a standard. Going the extra mile to make it easier to determine the encoding shouldn't be a pain in the ass. The worst thing you would have to do is read it all in and strip off the byte order mark. But most parsers shouldn't barf at this.



  • @FraGag said:

    On the other hand, if there's whitespace before the XML declaration (<?xml ...), it is indeed an error.
    That's exactly what the BOM is.



  • @ender said:

    @FraGag said:
    On the other hand, if there's whitespace before the XML declaration (<?xml ...), it is indeed an error.
    That's exactly what the BOM is.
     

    FYI, in XML whitespace is "spaces, tabs, and blank lines"; that's the definition. A BOM is none of these. It's a fact. Therefore a BOM, even if it is unexpected, is not whitespace. Therefore it does not violate the "no whitespace before the XML declaration" rule.



  • @amischiefr said:

    which adds a character to the begining of the file!

    Like I said in my earlier post, it doesn't seem like the byte order mark is correct. Here's a website with the BOMs listed: http://www.unicode.org/faq/utf_bom.html#BOM

    Note that a UTF-16 BOM would take up 2 bytes, or 1 wide character. But a UTF-8 BOM actually takes up 3 bytes, which would be 3 characters in your case.
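Those byte counts are easy to verify in Java, since the BOM is simply U+FEFF run through the relevant encoder (class name invented for the example):

```java
import java.nio.charset.StandardCharsets;

public class BomSizes {
    public static void main(String[] args) {
        String bom = "\uFEFF"; // the BOM is just U+FEFF, ZERO WIDTH NO-BREAK SPACE
        System.out.println(bom.getBytes(StandardCharsets.UTF_8).length);    // 3 bytes: EF BB BF
        System.out.println(bom.getBytes(StandardCharsets.UTF_16BE).length); // 2 bytes: FE FF
    }
}
```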



  • @DrJokepu said:

    @ender said:

    @FraGag said:
    On the other hand, if there's whitespace before the XML declaration (<?xml ...), it is indeed an error.
    That's exactly what the BOM is.
     

    FYI, in XML whitespace is "spaces, tabs, and blank lines"; that's the definition. A BOM is none of these. It's a fact. Therefore a BOM, even if it is unexpected, is not whitespace. Therefore it does not violate the "no whitespace before the XML declaration" rule.

    Actually, the BOM is U+FEFF which is, guess what? zero-width no-break space. And if it's unexpected, it's going to be treated as an ordinary ZWNBSP instead of a byte-order mark.

    (edit) it turns out I'm wrong - however, if even [i]whitespace[/i] before the declaration isn't allowed, do you think random other garbage is allowed?



  • @pitchingchris said:

    And one character isn't enough to specify a byte order mark either, so throw that possibility out also.

    @pitchingchris said:

    But UTF-8 BOM actually takes up 3 bytes, which would be 3 charcters in your case.

    Remember, at this point it's a UTF-8 file and he's reading it as UTF-8, so it's one character (which is three bytes)



  • Actually, a BOM is always valid in XML, and if your so-called XML parser barfs, it's not a valid XML parser. Time to contact your vendor.

    Well, last time I checked XMLBeans was a valid XML parser, but hey what do I know :)

     

    I'm quite sure the company that is supplying you these files is a big WTF, but you don't sound much better either. I mean just from the start you agreed on full text country names? iso country codes too easy or something? and why invent a new format at all?  there already are a dozens of formats to exchange this kind of information. The easiest would be RSS or ATOM. 

    And what is wrong with full-text country names?  So, you are trying to imply that I, or my team, are stupid because we agreed with the vendor to use United States instead of USA?  That's pretty... WTF in itself, kid :)

     



  • I've found a few examples on the net with people using xml beans with the UTF-8 BOM. Without more information, I still figure that the BOM included in your xml documents is incorrect.



  • That is quite possible.  I wonder why everybody has chosen to focus on one aspect of this article.  The entire UTF-8 BOM character was such a SMALL part of the WTF lol. 

     

    Oh well, sorry my wtf sucked I guess lol...



  • @amischiefr said:

    I wonder why everybody has chosen to focus on one aspect of this article. 

    The reason I was focused on that aspect is because even though there were probably other problems, something along these lines was causing your parsing to fail.  Once that works, the rest of the stuff is not so important cause mission accomplished.

     However, I really wonder how they were generating the xml file. If they were doing it manually, then you can't really trust the xml declaration if they didn't follow encoding rules. For example, I could say encoding="UTF-16" in my xml header declaration and encode the text file in UTF-8 and you would have big problems. Just an example.
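In Java, that kind of mismatch is avoided by serializing through a Writer pinned to the same charset the declaration names, so the bytes on disk can never contradict the header. A minimal sketch; the class and helper names are invented:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class DeclarationMatch {
    // Serialize with the same charset the declaration names, so the bytes
    // on the wire cannot contradict the header.
    public static byte[] serialize(String body) throws IOException {
        String doc = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>" + body;
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            w.write(doc);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // The Euro sign survives the round trip because bytes and header agree.
        byte[] bytes = serialize("<price>\u20ac5</price>");
        System.out.println(new String(bytes, StandardCharsets.UTF_8).contains("\u20ac")); // true
    }
}
```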



  • @amischiefr said:

    Well, last time I checked XMLBeans was a valid XML parser, but hey what do I know :)

    Define valid.

    A few months back I had to do some work on a product that used no fewer than 7 'valid' XML/XPath/XQuery parsers and processors. None could manage to work correctly with each other's output without tinkering, yet they were all 'valid'. The specifications for these standards are so large, ambiguous and vague that subtle differences in implementation can and will happen. And of course 'valid' is not synonymous with 'bug-free' either.

    @amischiefr said:

    And what is wrong with full-text country names?  So, you are trying to imply that I, or my team, are stupid because we agreed with the vendor to use United States instead of USA?

    1. Yes, you are stupid for wasting bandwidth on full-text names.
    2. Yes, you are stupid for intending to use full-text names as indices for your database.
    3. Yes, you are stupid for inventing your own, probably ill-documented standard instead of using a well established, easily found standard like two-letter or three-letter ISO abbreviations.
    4. Yes, you are stupid for giving the formatting for your country names out of hand, instead of keeping them in a local database table/XML file/etc., keyed on the ISO abbreviation. Pray you never end up having to provide localized country names.
    5. Yes, you are stupid for expecting your suppliers to actually stick to your invented 'standard' in a real life scenario.


  • @amischiefr said:

    Java programmers, not XML experts

    Why is XML so [i]insane[/i] that now someone can be considered an [i]expert[/i] in a text-based data format?

    S-expressions FTW.



  • @pitchingchris said:

    However, I really wonder how they were generating the xml file. If they were doing it manually, then you can't really trust the xml declaration if they didn't follow encoding rules. For example, I could say encoding="UTF-16" in my xml header declaration and encode the text file in UTF-8 and you would have big problems. Just an example.

    Possibly his vendor is using the .NET framework to generate the XML. It has known issues with respect to UTF-8 encoding in its XML writing classes. If you specify the encoding to be used as UTF-8 and configure the XML writer class to write to an internal string buffer instead of to a file, then the XML declaration will (correctly) state UTF-8 encoding, while the actual file will be encoded as UTF-16 as all .NET strings internally use that encoding.



  • @amischiefr said:

     

    I'm quite sure the company that is supplying you these files is a big WTF, but you don't sound much better either. I mean just from the start you agreed on full text country names? iso country codes too easy or something? and why invent a new format at all?  there already are a dozens of formats to exchange this kind of information. The easiest would be RSS or ATOM. 

    And what is wrong with full-text country names?  So, you are trying to imply that I, or my team, are stupid because we agreed with the vendor to use United States instead of USA?  That's pretty... WTF in itself, kid :)

    Maybe the WTF that he refers to is that there is already a standard for that, the ISO country codes. I've seen them used in usernames and DN's for LDAP entries at large organizations, as they are unique and guarantee some numpty won't type it wrong, like the company WTF you suffered. Check ISO 3166 :)
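As a sketch of how little work the ISO 3166 route would have been in Java: the JDK already ships the alpha-2 code list and English display names, so the feed could carry just "US"/"GB" and the pretty names could stay a local lookup. The class name here is invented for the example.

```java
import java.util.Arrays;
import java.util.Locale;

public class CountryCodes {
    public static void main(String[] args) {
        // The JDK ships the ISO 3166 alpha-2 list; no hand-rolled table needed.
        String[] codes = Locale.getISOCountries();
        System.out.println(Arrays.asList(codes).contains("US")); // true

        // Display names keyed on the code, so typos like "Portuguals" can't happen.
        System.out.println(new Locale("", "US").getDisplayCountry(Locale.ENGLISH));
        System.out.println(new Locale("", "GB").getDisplayCountry(Locale.ENGLISH));
    }
}
```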



  • @amischiefr said:

    And what is wrong with full-text country names?  So, you are trying to imply that I, or my team, are stupid because we agreed with the vendor to use United States instead of USA?  That's pretty... WTF in itself, kid :)

    Apart from anything else, you specified "United States" not "United States of America" and "United Kingdom" instead of "United Kingdom of Great Britain and Northern Ireland", so you are not even using the full names. Much better just to use the ISO codes designed for this.



  • [quote user="danixdefcon5"]

    Maybe the WTF that he refers to is that there is already a standard for that, the ISO country codes. I've seen them used in usernames and DN's for LDAP entries at large organizations, as they are unique and guarantee some numpty won't type it wrong, like the company WTF you suffered. Check ISO 3166 :)

     

    [/quote]

    Well, if it had been stated as politely as you just did, that would be fine.  His insistence on being some know-it-all and coming off as a total dick demeans whatever point he was trying to make, if there was any productive point at all. 

     

    I graduated in December with my BS in CS; I do NOT pretend to know everything :).  Maybe Ragnax knows everything, maybe he's Jesus!!! *shrug*  Who am I to say :) 

     

    Ragnax, calling people stupid because they did not know of some ISO standard for naming countries is completely ridiculous.  Just because you have personally heard of it does not mean that every programmer out there has.  Please refrain from commenting on any of my posts.  Your ignorant responses do nothing but provoke flaming responses.  Were you touched as a child?  Do you not get enough respect for your 'vast' knowledge at your job?  Well, whatever it is, please keep your ill-tempered, ignorant voice to yourself, ok? :)



  • @amischiefr said:

    snip disastrous explosion of HTML from quoting danixdefcon5
    Dear, God. That's an entirely new way to ruin the quote system. Do you people see what danixdefcon5's failed attempt at Signature Guy has wrought? NOW DO YOU SEE?!



  • @bstorer said:

    Dear, God. That's an entirely new way to ruin the quote system. Do you people see what danixdefcon5's failed attempt at Signature Guy has wrought? NOW DO YOU SEE?!
     

    AmmoQ is angry!



  • @bstorer said:

    Dear, God. That's an entirely new way to ruin the quote system. Do you people see what danixdefcon5's failed attempt at Signature Guy has wrought? NOW DO YOU SEE?!
    I had already done Sig Guy successfully in the past. Anyway, my FUBAR was on another thread :)



  • @danixdefcon5 said:

    @bstorer said:

    Dear, God. That's an entirely new way to ruin the quote system. Do you people see what danixdefcon5's failed attempt at Signature Guy has wrought? NOW DO YOU SEE?!
    I had already done Sig Guy successfully in the past. Anyway, my FUBAR was on another thread :)

    Oh, no! It's spreading between threads! It's like a rift in the universe that cannot be closed, and gets larger all the time. RUN FOR YOUR LIVES, PEOPLE!



  • @amischiefr said:

    Ragnax, calling people stupid because they did not know of some ISO standard for naming countries is completely ridiculous. Just because you have personally heard of it does not mean that every programmer out there has. Please refrain from commenting on any of my posts. Your ignorant responses do nothing but provoke flaming responses. Were you touched as a child? Do you not get enough respect for your 'vast' knowledge at your job? Well, whatever it is, please keep your ill-tempered, ignorant voice to yourself, ok? :)
     

    Lighten up, this is the internet.

    Secondly, ISO country codes aren't the most obscure subject around. That you don't know about text encoding I can understand; unless you're doing work where i18n is important you won't walk into the problems associated with it. But country codes? Come on, that's not even programmer-only stuff, it's just basic knowledge. 

    You say you have recently left school, so perhaps you simply didn't know, but it will be a long time before you find a third party that actually delivers what they promise, or delivers it to the spec they mailed you. The number of times I've seen a flawless integration with any third-party system can be counted on my fingers; the times something wasn't right, something wasn't documented or some corner case broke something are far more common.

    Just so you know if you didn't already.

    • 3 in 5 web companies don't have a clue about text encoding; with manual data entry expect Windows-#### encoding, or Mac Roman if they're a design shop.
    • As illustrated perfectly in this thread, not everyone who has a strong opinion has a clue.
    • A program/app that implements a spec is only half done. An app that swallows all the crap you can give it and still produces results is done.
    • There probably isn't much that doesn't already have a standard in one way or another. Learn to search for RFCs, know at least what kinds of ISO standards there are, and of course the domain knowledge of your projects.

    And finally, grow some damn balls. You are a programmer; that means you should have a god complex and an attitude to match. Don't just cry for mommy: research the facts and the problems with his proposed solution, and call him an idiot for even suggesting it, because of X and Y. This will ensure that at the end of the flame war both of you will know just about everything there is to know about that subject, which will make you a better programmer in the end. Flame wars aren't just fun! They are educational, and a good example of survival of the fittest in the IT world.


     



  • @pitchingchris said:

     For example, I could say encoding="UTF-16" in my xml header declaration and encode the text file in UTF-8 and you would have big problems.

     

     

    Which makes you wonder why the encoding is written as text in the actual encoding that it is supposed to hint about. That is about as helpful as encrypting the recipient address of your emails. I never understood why magic numbers couldn't be used for this. 



  • @Obfuscator said:

    Which makes you wonder why the encoding is written as text in the actual encoding that it is supposed to hint about.

    The encoding line is written in the default character set and once the parser encounters it it is supposed to restart parsing from the beginning with the newly-set character set.  With HTML this is done in meta tags, in XML it is done on the first line of the document.  It is useful in environments where you have control of the document but not the metadata of the document, like an HTML/XML doc you are distributing to web servers that might serve it up in dozens of different encodings.



  • @Obfuscator said:

    Which makes you wonder why the encoding is written as text in the actual encoding that it is supposed to hint about.

    Well, Ragnax was right on one point. If you use the parser to write the file, you won't have these problems: the parser will use the processing instruction from the declaration to do everything correctly. However, if you output the document to text, you get a string, as Ragnax mentioned. From that point on it's the programmer's responsibility to write the file correctly if he chooses to do so. And if the vendor was using .NET, then it would have output the string in UTF-16 by default unless you use the Encoding.UTF8.GetBytes thing.



  • @morbiuswilters said:



    The encoding line is written in the default character set and once the parser encounters it it is supposed to restart parsing from the beginning with the newly-set character set.  With HTML this is done in meta tags, in XML it is done on the first line of the document.

     

    IMO this is an inherently bad idea. A standard text editor is not able to display or write such files. Basically, such an XML file becomes a binary file that only tools which know the trick can decipher.



  • @amischiefr said:

    Well, if it was stated as politely as you just did, that would be fine. His insistance to be some 'know it all' and come off as a total dick demeans the point he was trying to make, if any productive point at all.

    I graduated in December with my BS in CS, I do NOT pretend to know everything :). Maybe Ragnax knows everything, maybe he's Jesus!!! *shrug* Who am i to say :)

    Ragnax, calling people stupid because they did not know of some ISO standard for naming of countries is compltely retarded. Just because you have personally heard of them does not mean that every programmer out there has. Please refrain from commenting on any of my posts. Your ignorant responses do nothing but provoke flaming responses. Were you touched as a child? Do you not get enough respect for your 'vast' knowledge at your job? Well, whatever it is, please keep your ill tempered ignorant voice to yourself ok?

    I'm not calling you stupid any more than you were asking if you were stupid: you asked if you or your team were stupid for doing this, I answered that you were stupid for doing this and gave you the reasons. Bearing in mind that your post wasn't made in an all that courteous manner (in fact; it was downright condescending, 'kid'...), I don't think it warranted a more polite answer than the straight-to-the-point I gave you.

    As for the ad hominem: that had me laughing. So; it's not okay to call people that ask for it stupid, but it is okay to attack the character of those whose answers you don't like? Pot; meet Kettle. Kettle; this is Pot.

    Just for clarification: I am in fact not the bitter or cynical cubicle dweller that you are making me out to be. In actuality I am still attending university: while I do already have a BSc in Computer Science, I'm continuing with a master's degree in Computer Science & Engineering (CSE). As such I certainly do not pretend to know everything, but I do know that the mistakes you made are things that should have been covered in any decent Computer Science BSc program.

    As you requested for me to butt out of your thread, I will respect that and stop posting in it. Some parting advice: the real WTF is you. Think about that for a while.



  • I guess while we're still on this topic, I'd mention that using XSL stylesheets for generation and validation would be more likely to solve the problem on both sides and make everything more consistent. Whenever you are dealing with an XML file generated by someone else, you have to be aware of subtle problems that could appear.

    While there might be some differences between the vendors parser and yours, it should be relatively painless to find a medium that would work for both sides.



  • @ammoQ said:

    IMO this is an inherently bad idea. A standard text editor is not able to display or write such files. Basically, such an XML file becomes a binary file that only tools which know the trick can decipher.

    What other solution do you propose to allow the content encoding to be included with the document?  The default XML character set is UTF-8.  Most text editors and programming libraries can handle ASCII which is what the encoding instructions will be written in anyway.  The same basically applies to HTML and the meta tag as well, since that will be written in ASCII.  If the software that is reading or writing the file can't understand the character set, then it doesn't make a difference if it is specified in file metadata or in the document itself.



  • Fortunately, the W3C actually has a recommendation here: http://www.w3.org/TR/REC-xml/#sec-guessing-no-ext-info - in a nutshell: look at the first four bytes for the start of "<?xml" in all possible charsets (where 'all possible charsets' is defined as "UCS-4 with four byte-order possibilities, UTF-16/UCS-2 with two byte order possibilities, UTF-8/ASCII/Latin1/etc, EBCDIC", which is enough to get you as far as the character set declaration) - you're definitely NOT supposed to put the encoding instruction in ASCII/UTF-8 in what is otherwise a UTF-16/UCS-4/EBCDIC document, and I've never seen that proposed anywhere but here. If there is no <?xml found, the document is to be interpreted as UTF-8 unless there's external information such as from a content-type header. Oh, and, as you can see, a BOM before the xml declaration is permitted in UTF-8.
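A toy version of that Appendix F sniffing in Java, covering only the BOM cases (real detection also checks the byte patterns of "<?xml" in the candidate encodings); the class name is invented for the example:

```java
public class CharsetSniffer {
    // A toy version of the W3C Appendix F sniff: look at the first bytes and
    // guess enough of a charset to read as far as the encoding declaration.
    public static String sniff(byte[] head) {
        if (head.length >= 3 && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB && (head[2] & 0xFF) == 0xBF) {
            return "UTF-8"; // UTF-8 "BOM" EF BB BF
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            return "UTF-16BE"; // BOM FE FF
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            return "UTF-16LE"; // BOM FF FE
        }
        return "UTF-8"; // the XML default when nothing else matches
    }

    public static void main(String[] args) {
        System.out.println(sniff(new byte[]{(byte) 0xEF, (byte) 0xBB, (byte) 0xBF})); // UTF-8
        System.out.println(sniff(new byte[]{(byte) 0xFE, (byte) 0xFF}));              // UTF-16BE
    }
}
```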



  • @GrahamS said:

    Yeah, stupid Microsoft following the published Unicode standard. WiNdoZe sUks.*nIx 4 eVa etc etc
     

    Sorry, just opining that other readers should implement this admittedly informational part of the standard. 

