You tried to feed me invalid XML?
-
Argh. So the third-party product I'm integrating will allow users to fill out the form in Spanish. When it does, sometimes I get back "XML" with HTML entities in it. E.g., <tag>ann&oacute;ying</tag>.
MSXML takes offense. Anyone have to deal with this? "Preprocess the incoming XML" is a nonstarter, for reasons not the least of which is "the most straightforward thing to do is to remove all & characters, which might be problematic if there are any legit ones in there."
Edit: fixed markup to be visible.
-
Solution (if possible):
Complain about title question with third party vendor.
-
Is it not possible to give MSXML a list of entities? That way you can give it a list of HTML entities and it'll process them.
-
Well, the problem seems to be that by the time he can do that, all the ampersands have been parsed out, so there's no way to know what was an entity.
-
The real question is why is this not just using UTF-8.
-
That was in the question, and in my solution to complain to said third party vendor that they didn't implement correctly.
-
Complain about title question with third party vendor.
I've already asked them about it. Haven't heard back yet, but I only discovered this about long enough ago to rant about it to everyone in earshot who'd understand it, and then send them a question.
-
Is it not possible to give MSXML a list of entities?
Got me. I've never used it before this project. I'm looking at MSDN but that's a mook's game as I have no idea how many entities the third-party software might generate.
-
Well, the problem seems to be that by the time he can do that, all the ampersands have been parsed out, so there's no way to know what was an entity.
Nah. For reasons, the way it works is I'm getting a POST with an XML body. I immediately load it into MSXML (or another parser) and in either case, it literally fails with "I don't understand this entity."
-
The real question is why is this not just using UTF-8.
It claims to be UTF-8, right there in the XML tag.
-
That was in the question, and in my solution to complain to said third party vendor that they didn't implement correctly.
It claims to be UTF-8, right there in the XML tag.
Is there some image or code in here that work is firewalling on me or something... or am I supposed to be psychic and automatically know that the xml tag had the encoding set to UTF-8 as opposed to, say, ISO-8859?
-
Because Windows?
-
or am I supposed to be psychic
Naw, I'm just saying it was claiming to be UTF-8. As I said, the Googling I've done suggests that HTML entities like ó simply aren't allowed in XML, regardless of encoding. Whether putting the actual ó in there would help, I dunno. (Oh crap, reading back up, I forgot Discurse parsed that. The XML doesn't have an accented o, it has the 7 ascii characters that represent an accented o.)
-
Dischorse strikes again!
-
Anyone who's already gone up this far might go back and re-read the OP, which I have fixed to properly illustrate the problem.
-
If I recall correctly, MSXML only comes with the standard XML entities defined: &amp;, &apos;, &quot;, &gt; and &lt;. You need a DTD to bring in the HTML entities.
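To illustrate the DTD point, here is a minimal sketch in Python 3 rather than MSXML (an assumption made for convenience; MSXML's own behavior is not shown here). The document declares `oacute` in an internal DTD subset, so the parser knows how to expand it. The sample document is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the internal DTD subset declares the one
# HTML entity that the body uses, as a numeric character reference.
DOC = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE tag [
  <!ENTITY oacute "&#243;">
]>
<tag>ann&oacute;ying</tag>"""

root = ET.fromstring(DOC)
print(root.text)
```

The catch is that the sender has to emit the declaration (or the receiver has to splice it in), which is exactly the kind of preprocessing the OP wants to avoid.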
-
I don't know if that's even feasible.
That's how XHTML works. The syntax is very close to doing it in SGML (i.e., awful).
It might be worth adding in a line which states that there's an external DTD in that document, straight after the
<?xml…?>
stuff. That would be just WTFy, but not deeply so.
-
I really, really, really do not want to have to do any kind of preprocessing of the XML. I have opened up a "what is this I'm seeing?" ticket with them.
-
I really, really, really do not want to have to do any kind of preprocessing of the XML
You may have to…
-
That will be exceedingly awkward.
-
This is from the XML Spec:
**Well-formedness constraint: Entity Declared** In a document without any DTD, a document with only an internal DTD subset which contains no parameter entity references, or a document with " standalone='yes' ", for an entity reference that does not occur within the external subset or a parameter entity, the Name given in the entity reference MUST match that in an entity declaration that does not occur within the external subset or a parameter entity, except that well-formed documents need not declare any of the following entities: amp, lt, gt, apos, quot. The declaration of a general entity MUST precede any reference to it which appears in a default value in an attribute-list declaration.
The XML they are sending you violates the spec and a standards-conforming XML processor is obligated to error on it. This also means that the XML could not have been generated by a standards-conforming XML processor. Why do people never learn? The best quote I've seen regarding this is:
**Don’t think of XML as a text format**
Even people who have used compilers and seen the error and warning messages seem to think that text formats can be written casually and the piece of software in the other end will be able to fix small errors like a human reader. This is not the case with XML. If the document is not well-formed, it is not XML and an XML processor has to cease normal processing upon finding a fatal error.
It helps if you think of XML as a binary format like PNG—only with the added bonus that you can use text tools to see what is in the file for debugging.
If you ever encounter someone who is hand-building XML, ask them if they wrote their own zip file library too.
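For anyone who wants to see the "Entity Declared" constraint in action, a small sketch using Python 3's standard library parser (not MSXML; the helper name `parses` is made up for this example). With no DTD at all, only the five predefined entities and numeric character references get through.

```python
import xml.etree.ElementTree as ET

def parses(xml_text):
    """Return True if the parser accepts xml_text as well-formed."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

undeclared = parses("<tag>ann&oacute;ying</tag>")   # HTML entity, never declared
numeric = parses("<tag>ann&#243;ying</tag>")        # numeric character reference
predefined = parses("<tag>this &amp; that</tag>")   # one of the five built-ins

print(undeclared, numeric, predefined)
```

The first document is rejected outright, which matches the "I don't understand this entity" failure described earlier in the thread.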
-
The XML they are sending you violates the spec and a standards-conforming XML processor is obligated to error on it.
I'm not surprised. But unless they fix it it's going to give me heartburn to do it myself, and the reason why is both moderately complicated (more so than I have the energy to explain this evening, after discovering this) and moderately WTF[1], and centers around the phrase "too many bytes to fit into a ~31K buffer".
[1] if I get around to explaining it at least some people will plausibly argue "moderate" is too mild a word.
-
If you ever encounter someone who is hand-building XML, ask them if they wrote their own zip file library too.
I don't know that they are, in fact, hand-building it, although I could probably decompile the servlet and find out, not that I want to.
-
I don't know that they are, in fact, hand-building it, although I could probably decompile the servlet and find out, not that I want to.
You don't need to decompile it. An XML processor that spits out "&oacute;" is broken, unless it also embeds a DTD or includes a DTD reference. Therefore, they either aren't using one, or they are using a broken one.
-
Just popping it to say that I write XML by hand when libraries are not already available and I'm not going to apologize for it.
-
popping it
Not sure what you're popping, but that sounds painful and/or bad and/or kinky. We do not advise popping "it" in the future.
-
An XML processor that spits out "&oacute;" is broken, unless it also embeds a DTD or includes a DTD reference.
IIRC it does spit out standalone="true", which I don't care to bother to look up what that means. I did mention that Google suggests they're not supposed to do that, but who knows whether they'll admit it or not.
-
You could probably do s/&/&amp;/g without any problems.
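A blanket substitution on every ampersand would also hit any legitimate references in the document. If preprocessing ever does become possible, a more targeted option is to rewrite only the named HTML entities as numeric character references and leave the five XML built-ins alone. This is a sketch in Python 3 under that assumption; the function name is invented for the example, and it says nothing about how the OP's environment could run it.

```python
import re
from html.entities import name2codepoint

# The five entities every XML parser already understands.
XML_PREDEFINED = {"amp", "lt", "gt", "apos", "quot"}
ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")

def html_entities_to_char_refs(text):
    """Replace named HTML entities with numeric character references."""
    def repl(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # leave XML built-ins and unknown names alone
        return "&#%d;" % name2codepoint[name]
    return ENTITY_RE.sub(repl, text)

sample = "<tag>ann&oacute;ying &amp; more</tag>"
blind = sample.replace("&", "&amp;")        # double-escapes the legit &amp;
targeted = html_entities_to_char_refs(sample)
print(blind)
print(targeted)
```

Unknown names are passed through untouched, so the parser still complains about anything that is neither an XML built-in nor an HTML entity.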
-
which I don't care to bother to look up what that means.
standalone="yes" means ignore the DTD, if there is one.
-
You could probably do s/&/&amp;/g without any problems.
Imagine that my options to even get the XML are roughly equivalent to "load it into MSXML" or "attempt to read it directly into a variable that can only obtain the first 31K bytes, and it's always going to be at least 100K", if I want to edit it in any way.
That's not entirely accurate, but it's close. Now imagine that the second option isn't actually true, except in the development environment I have available to me, it actually is.
-
standalone="yes" means ignore the DTD, if there is one.
While a .XSD file has been provided under separate cover, the XML never contains a reference to it.
-
While a .XSD file has been provided under separate cover, the XML never contains a reference to it.
That's actually the sane way to do it. A consuming app doesn't care if a document is valid according to the author's rules, it cares if the document is valid according to the rules used to construct the consumer.
Also, entities can only be defined in a DTD, as XSD has no capability of defining one. So, your entity references aren't in there.
-
Well, I really hope they are willing to throw out the entities.
If the charset is UTF-8, is putting an actual ó in there legit? I know enough about character sets from reading stuff like Joel Spolsky's blog post and the like to know I don't know much about it.
-
If Go's encoding/xml library conforms to the standard, yes.
-
Ben, you're going to be charming the pants off the girls in a couple of years.
-
You mean because I can write code with fancy French words like póóp in it?
-
That is exactly what I mean.
-
If the charset is UTF-8, is putting an actual ó in there legit?
Provided it is encoded correctly (two bytes: 0xc3 0xb3), yes. Totally legit.
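A quick way to check both halves of that claim, sketched in Python 3 (the sample document is made up): the UTF-8 encoding of ó is the two bytes mentioned above, and a parser accepts those raw bytes in element text when the declaration says UTF-8.

```python
import xml.etree.ElementTree as ET

# U+00F3 (ó) encoded as UTF-8.
encoded = "\u00f3".encode("utf-8")
print(encoded)

# The same two bytes embedded directly in the document body.
doc = b'<?xml version="1.0" encoding="UTF-8"?><tag>ann\xc3\xb3ying</tag>'
root = ET.fromstring(doc)
print(root.text == "ann\u00f3ying")
```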
-
Yes, but it needs to be in CDATA IIRC
-
Not sure that's right, otherwise you can't use Unicode characters in tag/attribute names.
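To back that up, a small sketch in Python 3 with a made-up document: non-ASCII letters are accepted in element names, attribute names and plain text, with no CDATA section anywhere.

```python
import xml.etree.ElementTree as ET

# Hypothetical document with accented letters in the tag name,
# the attribute name, and the text content.
doc = '<p\u00f3\u00f3p tama\u00f1o="1">\u00f3</p\u00f3\u00f3p>'
root = ET.fromstring(doc)
print(root.tag, root.attrib, root.text)
```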
-
Don't think so. Charset is UTF-8; any UTF-8 is fine.
-
Yes, but it needs to be in CDATA IIRC
It’s fine as long as the input string is valid UTF-8 (or whatever encoding is specified in the encoding attribute of the XML declaration):

```python
>>> import xml.etree.ElementTree as ET
>>> item = ET.fromstring('<?xml version="1.0"?><test>é</test>')
>>> print repr(item.text), item.text
u'\xe9' é
>>> item = ET.fromstring('<?xml version="1.0" encoding="macroman" ?><test>\x8e</test>')
>>> print repr(item.text), item.text
u'\xe9' é
```
EDIT: Hanzo’d multiple times (but I took the time to actually check...)
-
Yes, but it needs to be in CDATA IIRC
It's legal in element names too, and probably other places like attribute and entity names too; I've just not checked the EBNF that thoroughly. You might want to avoid them though, as you need to be careful with being consistent in your normalization. (NFC recommended…)
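The normalization caveat is easy to trip over, so here is a short Python 3 sketch of it: the same visible ó can be one code point or two, the two spellings do not compare equal as names, and NFC folds them together.

```python
import unicodedata

composed = "\u00f3"      # ó as a single code point
decomposed = "o\u0301"   # o followed by a combining acute accent

same_before = composed == decomposed
normalized = unicodedata.normalize("NFC", decomposed)
same_after = composed == normalized

print(same_before, same_after, len(decomposed), len(normalized))
```

A parser compares names code point by code point, so a start tag in one form and an end tag in the other would be treated as different names.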
-
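All relevant characters are legal in XML text (among ASCII characters, the only ones not allowed are control characters below 0x20 other than tab, line feed and carriage return, e.g. backspace). The following are allowed in tag names (the entries from "-" onward cannot be the first character of a name):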
":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF] | "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]