C# Unicode To ASCII transliteration

Weng

Continuing the discussion from It's 2015, can we as a company please achieve basic late 90s technical competency?:

You know it's tragic when this is a one-line solution in PHP with iconv to TRANSLIT to ASCII. It wrecks it, sure, but it would be compliant with their bullshit.

Given I'm probably going to need to implement this on Monday (see background in the Lounge thread)....

In C#, what is the least-incorrect (the entire problem class is incorrect) way to take an arbitrary string of Unicode, replace any diacritically modified characters with their unadorned plain-ASCII relatives and, I dunno, I guess drop everything else above codepoint xFF?

Trying to Google this leads to all sorts of idiocy on Stack Overflow and Expert sexchange that broadly fall into two categories:

1. Maintain your own custom translation table. FUCK THAT. That resembles work.
2. Convert the unicode-encoded text to a byte array. Create an ASCII-encoded string from that. This "works" in the same way that the idiot solution I outlined in the Lounge "works". You end up with ASCII, but it totally borks characters with diacritics.

I suspect there's some variation on #2 whereby first you normalize the characters with diacritics into base characters followed by combining diacritics, and then remove the combining diacritics, but I'll be fucked if I know how to do that.
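A minimal sketch of that variation, assuming the stock .NET calls `String.Normalize` and `CharUnicodeInfo.GetUnicodeCategory`. The class and method names (`AsciiFolder`, `RemoveCombiningMarks`, `ToAscii`) are made up for illustration. Characters with no decomposition (Ø, ß, anything non-Latin) survive the first step and simply vanish in the second.

```csharp
using System.Globalization;
using System.Linq;
using System.Text;

public static class AsciiFolder
{
    // Hypothetical helper: decompose, then drop only the combining marks.
    public static string RemoveCombiningMarks(string input)
    {
        // Form D turns "é" (U+00E9) into "e" followed by U+0301 (combining acute).
        // Normalize throws ArgumentException on invalid Unicode such as lone surrogates.
        string decomposed = input.Normalize(NormalizationForm.FormD);
        var result = new StringBuilder(decomposed.Length);

        foreach (char c in decomposed)
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
                result.Append(c);
        }
        return result.ToString();
    }

    // Hypothetical helper: after the marks are gone, drop whatever is still not ASCII.
    public static string ToAscii(string input)
    {
        return new string(RemoveCombiningMarks(input).Where(c => c <= 0x7F).ToArray());
    }
}
```

With the test string from later in the thread, `RemoveCombiningMarks("éåäöíØ")` leaves the Ø in place, and `ToAscii` drops it.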

lolwhat

This post is deleted!

Weng

@lolwhat said:

(post withdrawn by author, will be automatically deleted in 24 hours unless flagged)

I saw your answer, and yeah, that's probably involved in my vague suspicion.

Followed by a lot of unholy byte-bashing. You can tell the thing you're trying to do is crap when you type "byte[]"

lolwhat

Well... there's this one, but I've not tested this at all:

How to convert from unicode to ASCII

Is there any way to convert unicode values to ASCII?

ben_lubar

Here's an incorrect solution that works:

Slugify and Character Transliteration in C#

I'm trying to translate the following slugify method from PHP to C#: http://snipplr.com/view/22741/slugify-a-string-in-php/ Edit: For the sake of convenience, here the code from above: /** * Mo...

Weng

[code]var str = "éåäöíØ";
var noApostrophes = Encoding.ASCII.GetString(Encoding.GetEncoding("Cyrillic").GetBytes(str)); [/code]
............. Yeah no. That's the cargo-cultiest, wrongest thing in the entire universe. In fact, I am almost certain that won't work.

ben_lubar

If it works, I'm concerned about converting to ASCII not doing that on its own.

riking

@Weng said:

var noApostrophes

ben_lubar

C# Online Compiler | .NET Fiddle

Apparently Microsoft decided that eaaoiO made sense in Cyrillic, but in ASCII, we needed ??????? instead.
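The question marks are the default replacement fallback of `Encoding.ASCII`: anything it cannot encode becomes `?`. A small sketch, with a made-up class name, that shows the default next to an ASCII encoding configured to drop unencodable characters instead. It assumes `EncoderReplacementFallback` is given an empty replacement string, which is a common way to strip rather than substitute.

```csharp
using System.Text;

public static class AsciiFallbackDemo
{
    // Default behaviour: each unencodable UTF-16 code unit becomes '?'.
    public static string WithQuestionMarks(string s)
    {
        return Encoding.ASCII.GetString(Encoding.ASCII.GetBytes(s));
    }

    // Same encoding, but with an empty replacement so unencodable characters are dropped.
    public static string WithDrops(string s)
    {
        Encoding dropping = Encoding.GetEncoding(
            "us-ascii",
            new EncoderReplacementFallback(string.Empty),
            new DecoderReplacementFallback(string.Empty));
        return dropping.GetString(dropping.GetBytes(s));
    }
}
```

Neither variant knows anything about diacritics: "naïve" comes out as "na?ve" or "nave", never "naive".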

Weng

Yep. Just arrived at the same conclusion. I'm using BLNS as test data.

big-list-of-naughty-strings/blns.txt at master · minimaxir/big-list-of-naughty-strings

The Big List of Naughty Strings is a list of strings which have a high probability of causing issues when used as user-input data. - minimaxir/big-list-of-naughty-strings

Because I demand that my enterprise system be able to handle Zalgo.

Weng

I give you the destroyer of worlds:
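[code]// Requires: using System; using System.Collections.Generic; using System.Text;
public static String PretendItIs1970AndPlacesThatAreNotTheMidwesternUnitedStatesDoNotExist(String s)
{
    // Form D splits "é" into "e" plus a separate combining accent.
    String st = s.Normalize(NormalizationForm.FormD);
    byte[] bitbash = Encoding.UTF8.GetBytes(st);
    List<byte> thisIsAListOfBytesAndIHateMyselfGiveMeABeer = new List<byte>();
    foreach (byte b in bitbash)
    {
        // Every byte of a multi-byte UTF-8 sequence is >= 0x80, so non-ASCII characters go away whole.
        // Note that 127 (DEL) is thrown away too.
        if (b < 127) thisIsAListOfBytesAndIHateMyselfGiveMeABeer.Add(b);
    }
    return Encoding.UTF8.GetString(thisIsAListOfBytesAndIHateMyselfGiveMeABeer.ToArray());
}
[/code]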
Yeah. That's right. Normalize to Form D (this decomposes decomposable characters) and THROW AWAY every byte that isn't also identical to a low-order ASCII character. And then interpret it as a UTF8 string again. For reasons I didn't bother to understand it doesn't play nicely when you reinterpret it as an ASCII string.

HardwareGeek

@Weng said:

THROW AWAY every byte that isn't also identical to a low-order ASCII character.

Are any of the strings possibly in non-Latin scripts, as opposed to merely Latin with diacritics? If so, you have a rather serious problem.

Weng

Oh, yeah. Absolutely. They're names and addresses and I've read those articles*. But this is for the event where the downstream system between me and the mainframes refuses to do the string sanitization the mainframes require. Since I can't transmit both unsanitized and sanitized data, if sanitization involves data loss, the downstream system doesn't get the data.

IOW, that's a feature. Not a useful one for the end user, but a useful one for my organizational politics.

* http://www.kalzumeus.com/2010/06/17/falsehoods-programmers-believe-about-names/ and 
https://www.mjt.me.uk/posts/falsehoods-programmers-believe-about-addresses/
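A rough sketch of the "if sanitization involves data loss, the downstream system doesn't get the data" rule, assuming that a stripped accent does not count as loss but a dropped letter does. `LossCheck` and `TryToAscii` are hypothetical names, not anything from the real system.

```csharp
using System.Globalization;
using System.Text;

public static class LossCheck
{
    // Hypothetical helper. Returns true only if nothing but combining marks was discarded.
    public static bool TryToAscii(string input, out string ascii)
    {
        var kept = new StringBuilder();
        bool lossless = true;

        foreach (char c in input.Normalize(NormalizationForm.FormD))
        {
            if (c <= 0x7F)
            {
                kept.Append(c);
            }
            else if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            {
                // A real character (Greek letter, Ø, half of an emoji, ...) is about to vanish.
                lossless = false;
            }
        }

        ascii = kept.ToString();
        return lossless;
    }
}
```

A caller would send `ascii` downstream only when the method returns true, and withhold the record otherwise.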

HardwareGeek

So, how many users do you have named " "?

Weng

Users? Probably none. People users want us to handle names and addresses for? Who knows.

blakeyrat

I hate this.

ben_lubar

Can't you just percent-escape the names and then unescape them when you get them back?

Khudzlin

@Weng said:

For reasons I didn't bother to understand it doesn't play nicely when you reinterpret it as an ASCII string.

The WTF is strong with this one. Code points up to 127 correspond exactly with ASCII, so that a UTF-8 string with no byte over 127 is an ASCII string.
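A quick way to check that claim, as a sketch with a made-up class name: for a byte array with nothing at or above 0x80, decoding as ASCII and decoding as UTF-8 should give the same string.

```csharp
using System.Linq;
using System.Text;

public static class AsciiIsUtf8
{
    // True when every byte is below 0x80 and both decoders agree on the result.
    public static bool DecodesIdentically(byte[] bytes)
    {
        return bytes.All(b => b < 0x80)
            && Encoding.ASCII.GetString(bytes) == Encoding.UTF8.GetString(bytes);
    }
}
```

If the two ever disagree, a byte at or above 0x80 got through the filter, which the ASCII decoder turns into `?`.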

Weng

Just looked at it this morning and it turns out ASCII works on sane cases. There might be something odd about one of the BLNS strings. I'll debug it later.

Weng

I not only hate the solution, and I not only hate the problem. I hate the entire problem class.

blakeyrat

So don't do it.

boomzilla

LOL. This reminds me of people telling you to stop using git. Though I expect @Weng's response won't be as ragey.

HardwareGeek

@Weng said:

Users? Probably none. People users want us to handle names and addresses for? Who knows.

Ok, so not your users. But I rather doubt your users are going to be too happy with, say

Προκοπις Παυλοπουλος
Προεδρικό Μέγαρο
Αθήνα 106 74
Greece

turning into

 
 
 106 74
Greece

That's even worse than it turning into mojibake. With mojibake, at least it's obvious something is getting mangled. Disappearing entirely is just, " ?"

blakeyrat

Fair enough, I don't know what prompted this since it's in a lounge topic (apparently?) but you could just generate and send "user name keys" or something to the other system and keep the names in your own DB which apparently supports Unicode correctly.

Weng

You see, someone has to do it. The mainframe apps can't do Unicode. Sanity and best practices both state that the system interfacing with the shitty mainframe apps should do all the string fuckery, particularly if its I/O channels support Unicode.

In this case, the system that acts as my gateway to the mainframe apps accepts and transmits Unicode, but does not clean it up for the mainframes and has heinous bugs where it treats Unicode as ASCII even when it isn't going to the mainframe.

However, there is no bug reporting system other than shouting at each other via email, and this system's owner is effectively immune to outside scrutiny and criticism. They have chosen to deny that there is a problem beyond "you sent extended characters. The mainframes don't like those".

Therefore my system needs to do the sanitization to ensure we don't trigger heinous bugs downstream which cause it to produce invalid return data which we cannot ingest because it's invalid.

This isn't a technical problem I'm coding around. It's a dick wolf management problem.

blakeyrat

Which of the systems interfaces with a user? Because it's clearly wrong to do this transform in a way that's invisible to the user involved-- HardwareGeek has a good example of why. You need to have the user type in their name or address or whatever, then do the transform, then have them confirm the transform makes sense.

So if that's your system, then yes, you put the fix in the right place. But it doesn't sound like it is.

Weng

The only case where the blanks really matter is when the mainframe apps produce shipping labels. Which is fortunately an edge case - labels usually come out of my system. The mainframes don't support unicode anyway, so it's not a regression if those are fucked.

Kian

What does the mainframe do with the strings anyway, that passing random blanks doesn't affect it?

Weng

The UIs are all completely different systems, but I have the canonical datastore used for everything but the raw UIs for the various legacy systems.

Let me be clear: At no point am I ever destroying characters in the canonical representation. They only get destroyed going downstream to the jerks who refuse to support unicode properly. So users directly in those systems get fucked up data and outputs from those systems are potentially fucked up. But everything else gets the unfucked version.

Weng

Oh, little things. Invoicing. Inventory control. Warehouse management. Shipping labels. Taxation. Only the ship labels and taxation care about these fields. And taxation doesn't care about foreign countries much. So that leaves ship labels.

And yeah. That's a huge problem. But nobody is going to change a 30 year old behemoth.

blakeyrat

Right; but if the fucked-up data is actually used for anything, you need to have a human pair of eyes verify it before it gets fucked-up. Which algorithm you use to fuck-up the text is kind of secondary to the actual problem here.

dkf

@Weng said:

It's a dick wolf management problem.

@blakeyrat said:

So if that's your system, then yes, you put the fix in the right place. But it doesn't sound like it is.

The right solution does appear to involve something that's probably illegal. Or at least some political clout that @Weng doesn't have…

Weng

Ah yes. That. Unfortunately from my company's perspective the data comes in bulk feeds from our clients (usually dumps from their existing IT systems). Presenting it back to the user is effectively impossible.

We could present it to a human somewhere, but the question that follows from that is "what human and why do they care?"

None of the available humans care until it shows up in the warehouse with a blank address label. They can then cross-reference that label back to the canonical data and copypasta that into the shipping company's label software to produce a corrected label.

It's dumb as fuck, but that's the solution senior management is discussing right now.

blakeyrat

Presumably the company that sent you this data did it for some actual reason, and it's in their best interest to know you are fudging it and (since you are fudging it) to verify that the fudged data is rational.

ben_lubar

se& - vlasisku

PA1 - vlasisku

PA2 - vlasisku

Or if you don't want to lojban it up, replace unicode characters with U+0001F4A9 and the like.

Gąska

U+1F4A9 is outside ASCII.

Gąska

Also... why the fuck does Lojban have character escapes?

ben_lubar

I meant replace the string " " with the string "U+0001F4A9"

Weng

Yep. One would expect that. We'll give them the option, certainly, but none of them will take it.

We already provide data correction services for domestic addresses. Virtually none of our customers are in any way interested in receiving the corrections we make, and even fewer want to action those corrections in their own systems.

This leads to amusing situations where somebody moves, their address gets updated, we apply the correction, but the company they actually do business with has no idea what the hell the new address is.

Maciejasjmj

@Weng said:

So that leaves ship labels.

Well, someone has to provide for the Friday Error'ds...

Anyway, I don't think you have a sensible solution at all, save for the mapping table. And even then it's a tossup on whether the munged name will actually make sense in the target language.

Weng

Is there a dataset out there for that? Because I'll totally use it if so.

Because it's more dramatic and has a better sense of theatre to it than just dropping characters on the floor.

ben_lubar

You could make something out of http://unicode.org/charts/charindex.html

HardwareGeek

That won't help much with transliterating Greek, Cyrillic, Hebrew, Japanese, Arabic, etc. to their ASCII (phonetic) equivalents, which I think is what would really be the ideal thing to do here. That might be sorta reasonable for alphabets like Greek or Cyrillic, where there is a (more-or-less) 1-to-1 correspondence with the basic Latin alphabet (although I don't know if any pre-made mapping tables exist). Syllabaries such as Kana might also be reasonable. Ideographic and pictographic scripts might be rather more challenging; I know diddly-squat about any of the languages that use those scripts, so I don't know if there is any systematic way to map them to phonetic alphabets, but I suspect not. I'm pretty sure you'd wind up with huge mapping tables just for CJK, much less the less widely used ones.

dkf

@HardwareGeek said:

I don't know if there is any systematic way to map them to phonetic alphabets, but I suspect not

Languages/alphabets that mostly omit vowels would be challenging to do a good job with.

HardwareGeek

The two scripts that I know of that do that have diacritics ("points") that are, or can be, used to indicate vowels. I don't know, though, whether they are typically used in everyday writing. That is to say, I know they exist, but I don't actually know any of the languages that use those scripts, so I don't know whether you would expect the input text to have vowels or not. You'd probably have to handle both cases.

dkf

@HardwareGeek said:

I don't know, though, whether they are typically used in everyday writing.

What is or isn't in everyday writing isn't important in this case. What is or isn't in addresses is. :)

flabdablet

@Weng said:

The mainframe apps can't do Unicode. Sanity and best practices both state that the system interfacing with the shitty mainframe apps should do all the string fuckery, particularly if its I/O channels support Unicode.

If the mainframe apps need pure ASCII, then why not do as Ben suggests, and make sure everything that goes in or out of a mainframe is percent-escaped UTF-8? It's still fucked up, but at least it's unfuckable on the way back.
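A minimal sketch of that round trip using `Uri.EscapeDataString` and `Uri.UnescapeDataString`. The wrapper names (`PercentEscape`, `ToMainframe`, `FromMainframe`) are invented for illustration, and the sketch assumes the mainframe passes the escaped text back untouched and has room for it: each Greek letter, for example, grows from one character to six.

```csharp
using System;

public static class PercentEscape
{
    // Outbound: anything outside the unreserved ASCII set becomes %XX per UTF-8 byte,
    // so the result is pure ASCII. A literal '%' becomes "%25", which keeps it reversible.
    public static string ToMainframe(string s)
    {
        return Uri.EscapeDataString(s);
    }

    // Inbound: turn the %XX sequences back into the original text.
    public static string FromMainframe(string s)
    {
        return Uri.UnescapeDataString(s);
    }
}
```

Fixed-width fields and anything that truncates mid-escape would break this, so it only helps if the downstream systems treat the value as opaque.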