HTML entities in XML

Weng

So, I have a system that feeds me XML. That's great!

Except it's not really XML.

They appear to be hand assembling it and using tools intended to build HTML.

Which leads to HTML entities appearing in the XML output.

Normally this isn't a problem - all the data contained therein either comes out of a mainframe in plain ASCII or is just an echo of what we had previously sent them.

But if we send them any "strange" (read: Unamerican) characters, they end up coming back to us as HTML entities, making the XML invalid.

Their stance on the matter is that it's our problem: We shouldn't be sending them those characters in the first place (excuse me, cockbags, you're accepting UTF8 encoded XML. Therefore it should be fine).

I'm not about to "fix" that. But I will, grudgingly, munge their output until it's valid XML.

Any thoughts on how?

lucas1

@Weng You could use HTML Parser to parse it to XHTML which is valid HTML and XML.

Weng

@lucas1 I don't want to change our actual parsing (which is actually .Net's XML deserializer)

accalia

@lucas1 said in HTML entities in XML:

@Weng You could use HTML Parser to parse it to XHTML which is valid HTML and XML.

that would be my recommendation.

lucas1

@Weng HTML Agility pack and then run a different endpoint via IP for that client, or a different UserAgent.

It is better to either tell them to use a different endpoint, or redirect their crap input to an endpoint that can deal with it properly, than to have that crap infesting the rest of your code.

EDIT: Clarity

Adynathos

@Weng Clear the entities before parsing the string.
In Python:

import html
print(html.unescape('&pound;&amp;'))

Result £&

(full question)

Maciejasjmj

@Weng said in HTML entities in XML:

@lucas1 I don't want to change our actual parsing (which is actually .Net's XML deserializer)

If the issue is only with the entities (and not, say, HTML's lovely unclosed tags), then maybe a simple preprocessing step which replaces the entities without looking at the document structure would do? Don't quite remember all the edge cases, though.

Weng

Ah. Let me rephrase. The endpoint doesn't matter, it's only used to communicate with this system anyway.

I want to continue actually running their output through the existing deserializer code because it should be fucking XML, and it nominally is XML, and it looks and works just like all 9000000 other places we deserialize XML and can be maintained exactly the same when fields are added/removed/etc.

What I actually want to do is to preprocess their output so that it actually is XML, and wrap the code that does that in a giant comment block that contains the entire email flamewar about this topic.

Weng

Take, for instance, the regex-based approach of "find all instances of &.....; that aren't on the short list of XML entities, look them up in a table somewhere and replace them with their plaintext equivalent"

Except I really don't want to use fucking regex and I really don't want to have to build a giant entity table, handle numeric entities, and handle hex entities.

lucas1

@Weng Honestly I've had stuff like this before. I gave them a different endpoint and just mudged it from there via User Agent or IP.

Is this really off the table?

My other option is to tell them to send the right stuff over.

PleegWat

@Weng If you can just track down the entities, you can use a standard library function to generate the replace string?
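A minimal Python sketch of that idea, not the actual .NET code: find named entities, skip the five XML already knows, and let the standard library's table (`html.entities.html5`) supply the replacement. The function name `fix_entities` is made up for illustration. It emits numeric character references rather than literal characters, on the assumption that those are safe for any XML parser and any encoding.

```python
import re
from html.entities import html5  # stdlib table: "pound;" -> "£", etc.

# The five entities XML predefines; these must be left untouched.
XML_PREDEFINED = {"quot", "amp", "apos", "lt", "gt"}

# Named references only; numeric ones (&#163; / &#xA3;) never match.
NAMED_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")

def fix_entities(text):
    """Rewrite HTML-only named entities as numeric character references."""
    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED:
            return match.group(0)
        value = html5.get(name + ";")
        if value is None:
            # Unknown name: leave it so the XML parser still complains.
            return match.group(0)
        # A few HTML entities expand to more than one code point.
        return "".join(f"&#{ord(ch)};" for ch in value)
    return NAMED_ENTITY.sub(replace, text)
```

Run on the raw text before it reaches the parser, `fix_entities('<a>&pound;5 &amp; more</a>')` gives `'<a>&#163;5 &amp; more</a>'`. One caveat: it works on the text blindly, so an entity-looking string inside a CDATA section would get rewritten too.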

lucas1

@PleegWat I am guessing the XML parser will throw a fit before then

PleegWat

@lucas1 Before it goes into the xml parser. Of course.

Hm, isn't it also possible to define additional entities in a DTD? But then you'd need to inject that, and maintain the list.

Maciejasjmj

@Weng said in HTML entities in XML:

Except I really don't want to use fucking regex and I really don't want to have to build a giant entity table, handle numeric entities, and handle hex entities.

System.Net.WebUtility.HtmlDecode?

Hm. Although that wouldn't take care of <s and >s correctly, I suppose.
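To illustrate the problem, here is a small Python sketch using `html.unescape` as a stand-in for `HtmlDecode` (an assumption; the two are similar in spirit but not identical). The sample document and the helper `parses` are invented for the demo.

```python
import html
import xml.etree.ElementTree as ET

def parses(text):
    """Return True if text is well-formed XML according to ElementTree."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

raw = "<order><note>1 &lt; 2 &amp; &pound;3</note></order>"

# A blanket decode fixes &pound; but also turns &lt; and &amp; into
# bare "<" and "&", which are not allowed in XML character data.
decoded = html.unescape(raw)

print(parses(raw))      # False: &pound; is undefined in XML
print(decoded)          # <order><note>1 < 2 & £3</note></order>
print(parses(decoded))  # False: bare "<" and "&" in the text
```

So a blanket decode trades one kind of invalid XML for another whenever the data contains an escaped `<` or `&`.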

lucas1

@Maciejasjmj But you would have to loop through that for each node.

System.Net.WebUtility.HtmlDecode(node.InnerText)

On everything. Providing you could parse it. It is massively inefficient.

lucas1

@Weng Honest question: if you've got a lot of other customers sending you valid requests, why can't you just tell them to change their end?

Maciejasjmj

@lucas1 said in HTML entities in XML:

@Maciejasjmj But you would have to loop through that for each node.
System.Net.WebUtility.HtmlDecode(node.InnerText)
On everything. Providing you could parse it. It is massively inefficient.

WTF? You don't have node.InnerText before you've parsed the XML, and you can't parse the XML because it's invalid. Obviously whatever preprocessing would run on the input as-is.

lucas1

@Maciejasjmj Sorry It was kinda pseudo code. I didn't bother actually looking whether it was valid C#

Weng

@lucas1 said in HTML entities in XML:

@Weng Honest question: if you've got a lot of other customers sending you valid requests, why can't you just tell them to change their end?

WtfCorp internal system. The lead programmer is their fucking director. Therefore they have the power to resist change.

They're actually an integration gateway to a bunch of other WtfCorp systems, so we can't just cut them out.

Maciejasjmj

@lucas1 said in HTML entities in XML:

@Maciejasjmj Sorry It was kinda pseudo code. I didn't bother actually looking whether it was valid C#

It's invalid conceptually, not syntactically. You can't loop over nodes before you have nodes as a result of XML parsing. Which you can't do because the XML is not valid in the first place.

Maciejasjmj

@Weng said in HTML entities in XML:

WtfCorp internal system. The lead programmer is their fucking director. Therefore they have the power to resist change.

If it's internal, don't other people complain about the same thing? I agree that yelling at whoever fucked it up is probably the best idea and anything else should only be tried when that fails.

lucas1

@Maciejasjmj Fair point. You are right.

Maciejasjmj

@Maciejasjmj said in HTML entities in XML:

@Weng said in HTML entities in XML:

Except I really don't want to use fucking regex and I really don't want to have to build a giant entity table, handle numeric entities, and handle hex entities.

System.Net.WebUtility.HtmlDecode?

Hm. Although that wouldn't take care of <s and >s correctly, I suppose.

Actually that could be workable...

Step 1: search-and-replace the 5 actual XML entities (quot, amp, apos, lt, gt) into their double-escaped form (&quot; becomes &amp;quot; and so on), so they survive the next step
Step 2: run a good old HTML decode
Step 3: pass that to the XML parser.

Should work I guess?
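A rough Python sketch of those three steps, with `html.unescape` standing in for `HtmlDecode`. It reads Step 1 as "double-escape the XML entities so the decode leaves them intact", which is an interpretation. The function name `decode_html_entities` is made up, and protecting numeric references as well is an addition of this sketch, since `&#60;` would otherwise decode to a bare `<`.

```python
import html
import re
import xml.etree.ElementTree as ET

# Step 1 targets: the five XML entities, plus decimal and hex
# character references, all of which XML already understands.
XML_REFERENCE = re.compile(
    r"&(quot|amp|apos|lt|gt|#[0-9]+|#[xX][0-9A-Fa-f]+);"
)

def decode_html_entities(raw):
    # Step 1: &lt; becomes &amp;lt; so the HTML decode cannot eat it.
    protected = XML_REFERENCE.sub(r"&amp;\1;", raw)
    # Step 2: one decode pass. &amp;lt; comes back out as &lt;,
    # while HTML-only entities such as &pound; become real characters.
    return html.unescape(protected)

raw = "<order><note>1 &lt; 2 &amp; &pound;3 &#60;</note></order>"
fixed = decode_html_entities(raw)

# Step 3: hand the result to the XML parser as usual.
note = ET.fromstring(fixed).find("note").text
print(fixed)  # <order><note>1 &lt; 2 &amp; £3 &#60;</note></order>
print(note)   # 1 < 2 & £3 <
```

Edge cases remain. HTML decoders accept some things XML does not, such as certain entities with the semicolon missing, so this is a starting point and not a guarantee.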

masonwheeler

@lucas1 said in HTML entities in XML:

My other option is to tell them to send the right stuff over.

This. If someone is sending bad input, reject it. If they're big and internal, reject it anyway.

Weng

@Maciejasjmj said in HTML entities in XML:

@Weng said in HTML entities in XML:

WtfCorp internal system. The lead programmer is their fucking director. Therefore they have the power to resist change.

If it's internal, don't other people complain about the same thing? I agree that yelling at whoever fucked it up is probably the best idea and anything else should only be tried when that fails.

This particular integration is only used by WtfFramework. No alternative integration points with the same functionality are available.

Yes. This interface exists only as a link between them and me, and they don't give a fuck about fixing it because nobody noticed until it had been "working fine" for six years.

anonymous234

@Weng said in HTML entities in XML:

Their stance on the matter is that it's our problem: We shouldn't be sending them those characters in the first place

This makes me angry.

If you have the chance to do so, please insult those idiots on my behalf.

And be sure to put in a formal complaint in writing to their boss or something, because that ain't right.

Weng

@masonwheeler said in HTML entities in XML:

@lucas1 said in HTML entities in XML:

My other option is to tell them to send the right stuff over.

This. If someone is sending bad input, reject it. If they're big and internal, reject it anyway.

If we reject it, we drop orders on the floor. This means we eat the penalty when customers notice. Unacceptable to management.

Weng

@anonymous234 said in HTML entities in XML:

@Weng said in HTML entities in XML:

Their stance on the matter is that it's our problem: We shouldn't be sending them those characters in the first place

This makes me angry.

If you have the chance to do so, please insult those idiots on my behalf.

And be sure to put in a formal complaint in writing to their boss or something, because that ain't right.

Btdt, nobody cares what I say. The other system handles 50% of our corporate revenues. Therefore they are infallible.

lucas1

@anonymous234 Stuff like that is difficult politically. You can't just shoot your mouth off; sometimes it is simply taboo to criticise, even if the criticism is valid and polite.

I learned the hard way because I got sacked mid last year because I tend to say exactly what I think. For me it was a good thing, because it pushed me to run my own company properly. It isn't for everyone.

@Weng may not care for that option.

EDIT: Clarity.

Weng

@lucas1 It doesn't help that my predecessor never grokked the system and his predecessor was literally random tinkering by whatever developer felt a change or enhancement was needed all the way back to the origins of WtfFramework.

Hilariously, I have a copy of the original interface agreement governing this integration. What it describes bears no resemblance to what exists.

lucas1

@Weng Working in other large orgs I can sympathise.

djls45

@Weng said in HTML entities in XML:

Hilariously, I have a copy of the original interface agreement governing this integration. What it describes bears no resemblance to what exists.

Has the agreement changed since then? Can it be re-negotiated (to be more in your favor) based on the differences?

boomzilla

@lucas1 said in HTML entities in XML:

I gave them a different endpoint and just mudged it from there via User Agent or IP.

It's the "mudging" that he's trying to figure out how to do, if I read this right.

anotherusername

@Weng XML has built-in support for entities. It just has 5 that are predefined:

&quot;
&amp;
&apos;
&lt;
&gt;

So, if they're giving you invalid entity names -- any named HTML entity that's not in that list -- they ought to massage it by adding a corresponding <!ENTITY> declaration in the DOCTYPE. Then the XML file will be valid and you shouldn't have to do anything special to it before sending it through your parser.

lucas1

@boomzilla said in HTML entities in XML:

Yes I get that.

loopback0

@lucas1 said in HTML entities in XML:

@boomzilla said in HTML entities in XML:

Yes I get that.

lucas1 is a boomzilla alt CONFIRMED.

RaceProUK

@anotherusername But that requires them to fix things, which it sounds like they won't.

boomzilla

@lucas1 said in HTML entities in XML:

@boomzilla said in HTML entities in XML:

Yes I get that.

OK, I guess I just didn't understand why you were hand waving that part away and talking about user agents or IPs or whatever.

lucas1

@boomzilla That is a perfectly valid solution in some circumstances.

anotherusername

@RaceProUK well, if all else failed, he'd still have the option of massaging the DOCTYPE himself. Which, I think, is a somewhat less WTF-y way of dealing with this than doing global search-and-replaces on it.
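A hedged Python sketch of what massaging the DOCTYPE could look like: build an internal DTD subset from the standard library's HTML 4 entity table (`html.entities.name2codepoint`) and splice it in front of the root element. The helper names are invented, and it assumes the incoming document has no DOCTYPE of its own.

```python
from html.entities import name2codepoint  # "pound" -> 163, etc.
import xml.etree.ElementTree as ET

# XML predefines these five, so they are not redeclared.
XML_PREDEFINED = {"quot", "amp", "apos", "lt", "gt"}

def build_doctype(root_name):
    """Return a DOCTYPE declaring every HTML 4 named entity."""
    decls = "".join(
        f'<!ENTITY {name} "&#{codepoint};">'
        for name, codepoint in sorted(name2codepoint.items())
        if name not in XML_PREDEFINED
    )
    return f"<!DOCTYPE {root_name} [{decls}]>"

def inject_doctype(xml_text, root_name):
    """Insert the DOCTYPE after the XML declaration, if there is one."""
    doctype = build_doctype(root_name)
    if xml_text.startswith("<?xml"):
        end = xml_text.index("?>") + 2
        return xml_text[:end] + doctype + xml_text[end:]
    return doctype + xml_text

raw = '<?xml version="1.0"?><order><note>&pound;3 &amp; &eacute;</note></order>'
note = ET.fromstring(inject_doctype(raw, "order")).find("note").text
print(note)  # £3 & é
```

The payload itself is never rewritten; only a prolog is added. Whether the .NET deserializer will accept a document carrying a DTD depends on how its reader is configured, so that part would need checking on the real system.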

JBert

@Weng said in HTML entities in XML:

Take, for instance, the regex-based approach of "find all instances of &.....; that aren't on the short list of XML entities, look them up in a table somewhere and replace them with their plaintext equivalent"

Except I really don't want to use fucking regex and I really don't want to have to build a giant entity table, handle numeric entities, and handle hex entities.

Numeric entities and hex entities are perfectly acceptable in XML.

My best guess right now is to somehow "inject" a reference to the XHTML DTD into your XML stream and then let the XML parser figure it out. Since that DTD declares all entities in a list, the parser should be able to look them up in there.

You might also need an XmlResolver to actually find the DTD, but that should also be doable.

Buddy

I'm no xml expert, but I think you should be able to insert the entity definitions after the xml header, like

<?xml version="1.0"?>
<!DOCTYPE whatever [
 <!ENTITY % HTMLlat1 PUBLIC "-//W3C//ENTITIES Latin 1 for XHTML//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent">
%HTMLlat1;
 <!ENTITY % HTMLspecial PUBLIC "-//W3C//ENTITIES Special for XHTML//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml-special.ent">
%HTMLspecial;
 <!ENTITY % HTMLsymbol PUBLIC "-//W3C//ENTITIES Symbols for XHTML//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml-symbol.ent">
%HTMLsymbol;
] >

Or if you prefer you can copy the individual entities themselves from https://www.w3.org/TR/xhtml1/DTD.html section 2.

If there is already a DOCTYPE, it should be possible to merge them http://mailman.ic.ac.uk/pipermail/xml-dev/1999-November/015894.html

boomzilla

@lucas1 said in HTML entities in XML:

@boomzilla That is a perfectly valid solution in some circumstances.

There you go again.

You're assuming that he figured out a way to transform the data. You're solving the wrong problem. He knows exactly when he needs to do that (all the time). But he doesn't have a solution for actually doing the transformation.

lucas1

@boomzilla No I understood the problem.

My solutions are:

Make the other side fix it (not possible, going by previous comments)
Turn it into XHTML (which other people are suggesting now)
Make a separate endpoint based on IP or UA to deal with their specific fuck ups.

All are valid solutions.

loopback0

@lucas1 said in HTML entities in XML:

All are valid solutions.

No because this:

@lucas1 said in HTML entities in XML:

make a separate endpoint based on IP or UA to deal with their specific fuck ups.

Doesn't specify how to deal with them, which is the exact bit Weng needs.

lucas1

@loopback0 said in HTML entities in XML:

Doesn't specify how to deal with them, which is the exact bit Weng needs.

I said in the previous reply how he should deal with it, and make a separate endpoint. It is trivial in .net to redirect to another method.

You aren't reading the thread.

boomzilla

@lucas1 said in HTML entities in XML:

make a separate endpoint based on IP or UA to deal with their specific fuck ups.

All are valid solutions.

So please explain how that solves anything for Weng.

lucas1

@boomzilla I said it in the first reply to him. Read the fucking thread.

@Buddy and others are saying similar things.

lucas1

@loopback0 said in HTML entities in XML:

Doesn't specify how to deal with them, which is the exact bit Weng needs.

I already told him how to deal with it. @accalia agreed.

anotherusername

@lucas1 said in HTML entities in XML:

@loopback0 said in HTML entities in XML:

Doesn't specify how to deal with them, which is the exact bit Weng needs.

I said in the previous reply how he should deal with it, and make a separate endpoint. It is trivial in .net to redirect to another method.

You aren't reading the thread.

It sounds like you're suggesting that one option is to build a whole new XML parser that'll accept the invalid XML. And the answer to that is no. Just no.

It's eventually going into their existing XML parser, so it obviously needs to either be valid XML when he gets it (unlikely to happen) or he needs to massage it so that it's valid XML. And still, since he's very likely going to be having to massage it, the question is how he should do that, so we've been floating some ideas around.