Java/XML stupidity



  • Custom CRM integration, two systems (VB on the client, Java on the server, XML for data exchange). The server sends an XML document with encrypted credit card numbers. The client calls a Java utility program to decrypt them using a local key store. The decrypt utility opens the file and writes out a new one with the decrypted values substituted in.

    Original file: no XML header.

    New file: explicit UTF-8 declaration. 

    New file's actual encoding: system default. WTF.

    So, as you might expect, loading the resulting file (the one that contains the actual useful information) fails spectacularly as soon as it contains any high-ASCII characters, because, guess what, it's not actually UTF-8 encoded. (My guess at how the utility manages that is sketched at the end of this post.)

    WTF Solution: Parse the file using VB's binary reader, character by character. Strip the header. Load the resulting string into MSXML. Beat head on desk.

    The really frustrating part is that the server software devs claim this is part of "the platform", meaning Java. Funny, every XML parser I've seen lets you declare a content encoding as part of the document header. I assume this is complete BS, but I don't have any of the server software source, and apparently neither do they. WTF.
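
    If I had to guess at what the utility is doing internally (pure speculation, since I don't have the source; the class and file names below are invented), it's probably the classic version of this bug: open a writer on the platform default charset and prepend a declaration that claims UTF-8 anyway.

      import java.io.FileWriter;
      import java.io.IOException;
      import java.io.Writer;

      // Guesswork: I don't have the real source, so everything here is invented.
      // FileWriter encodes text with the JVM's default charset (the OS code page on
      // the JVMs of that era; newer JDKs default to UTF-8), so the bytes on disk come
      // out in the system encoding even though the prepended declaration claims UTF-8.
      public class DecryptUtilityGuess {
          public static void main(String[] args) throws IOException {
              String decryptedXml =
                      "<order><name>Müller</name><card>4111111111111111</card></order>";
              try (Writer out = new FileWriter("orders-decrypted.xml")) {
                  out.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"); // claims UTF-8...
                  out.write(decryptedXml); // ...but writes default-charset bytes
              }
              // On an ISO-8859-1 / Windows-1252 box the "ü" is written as the single byte
              // 0xFC, which is not a valid UTF-8 sequence, so a conforming parser rejects
              // the file the moment it hits anything above 0x7F.
          }
      }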



  • @sootzoo said:

    Original file: no XML header.

     Wouldn't it be simpler to just fix the original file?
     



  • Actually, that brings up a good point. I am not any kind of expert when it comes to this, but I assume somewhere along the process, the external app assumed the document it was modifying was UTF-8. Is that a default encoding assumption in the absence of a proper header?

    MSXML happily loaded the initial document (no encoding) assuming the system default (ISO-8859-1), which turned out to be correct. Slapping on UTF-8 without actually encoding the document confused the parser to all hell, but perhaps it was relying on a default encoding of some type.

    Or maybe I'm being too generous to MSXML and the server app's devs?
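
    For what it's worth, the dodge we used on the VB side has a plain-Java equivalent (the file name and encoding below are just illustrative): decode the bytes yourself with the encoding you know the file really has, then hand the parser a character stream, so whatever the declaration claims can't mislead it.

      import java.io.FileInputStream;
      import java.io.InputStreamReader;
      import java.io.Reader;
      import javax.xml.parsers.DocumentBuilderFactory;
      import org.w3c.dom.Document;
      import org.xml.sax.InputSource;

      // Decode the bytes with the encoding the file really uses, then give the parser a
      // character stream. Once the InputSource wraps a Reader, the characters are already
      // decoded, so the encoding named in the XML declaration no longer gets a say.
      public class ParseWithKnownEncoding {
          public static void main(String[] args) throws Exception {
              try (Reader in = new InputStreamReader(
                      new FileInputStream("orders-decrypted.xml"), "ISO-8859-1")) {
                  Document doc = DocumentBuilderFactory.newInstance()
                          .newDocumentBuilder()
                          .parse(new InputSource(in));
                  System.out.println(doc.getDocumentElement().getNodeName());
              }
          }
      }

    It's essentially the binary-read-and-strip trick, minus the head-on-desk part.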
     



  • @sootzoo said:

    Actually, that brings up a good point. I am not any kind of expert when it comes to this, but I assume somewhere along the process, the external app assumed the document it was modifying was UTF-8. Is that a default encoding assumption in the absence of a proper header?

    MSXML happily loaded the initial document (no encoding) assuming the system default (ISO-8859-1), which turned out to be correct. Slapping on UTF-8 without actually encoding the document confused the parser to all hell, but perhaps it was relying on a default encoding of some type.

    Or maybe I'm being too generous to MSXML and the server app's devs?
     

     

    The XML spec has a section about character encodings, though I don't know whether it's still current or how closely parsers follow it. (In short: with no byte order mark and no encoding declaration, a document is supposed to be UTF-8.)



  • @PSWorx said:

    The XML spec has a section about character encodings, though I don't know whether it's still current or how closely parsers follow it.

    It usually depends on who generated the file. Microsoft's tools will usually add the appropriate byte order mark to the file. Since the XML declaration is missing here, you may be out of luck, because it could have been stripped somewhere along the way. Most of the time, if the file is handcrafted, the author doesn't put one in (even though they should). Since we only use two encodings in our company, we usually put a byte order mark on our Unicode (UTF-16) files and nothing on UTF-8. Some editors won't display UTF-16 files nicely without the byte order mark.
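
    If you want to do the sniffing by hand, the byte order mark check is only a few bytes. A rough sketch (plain JDK; the class and method names are mine, not from any particular parser — wrap the stream as new PushbackInputStream(stream, 3) so there's room to push the extra bytes back):

      import java.io.IOException;
      import java.io.PushbackInputStream;
      import java.nio.charset.Charset;
      import java.nio.charset.StandardCharsets;

      // Checks a stream for a leading byte order mark, the way a parser would before
      // falling back to the encoding declaration or to the spec's UTF-8 default.
      public final class BomSniffer {

          /** Returns the charset implied by a leading BOM, or null if there isn't one. */
          public static Charset sniff(PushbackInputStream in) throws IOException {
              byte[] head = new byte[3];
              int n = in.readNBytes(head, 0, 3); // Java 9+; loop over read() on older JDKs
              Charset found = null;
              int bomLength = 0;
              if (n >= 3 && head[0] == (byte) 0xEF && head[1] == (byte) 0xBB && head[2] == (byte) 0xBF) {
                  found = StandardCharsets.UTF_8;
                  bomLength = 3;
              } else if (n >= 2 && head[0] == (byte) 0xFE && head[1] == (byte) 0xFF) {
                  found = StandardCharsets.UTF_16BE;
                  bomLength = 2;
              } else if (n >= 2 && head[0] == (byte) 0xFF && head[1] == (byte) 0xFE) {
                  found = StandardCharsets.UTF_16LE;
                  bomLength = 2;
              }
              if (n > bomLength) {
                  // Push back whatever wasn't part of a BOM so the caller can still read the document.
                  in.unread(head, bomLength, n - bomLength);
              }
              return found;
          }
      }

    If that turns up nothing, the next stop is the encoding named in the XML declaration, and failing that the spec says to assume UTF-8.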

