C# Unicode To ASCII transliteration


  • Discourse touched me in a no-no place

    Continuing the discussion from It's 2015, can we as a company please achieve basic late 90s technical competency?:

    @Arantor said:

    You know it's tragic when this is a one-line solution in PHP with iconv to TRANSLIT to ASCII. It wrecks it, sure, but it would be compliant with their bullshit.

    Given I'm probably going to need to implement this on Monday (see background in the Lounge thread)....

    In C#, what is the least-incorrect (the entire problem class is incorrect) way to take an arbitrary string of Unicode, replace any diacritically modified characters with their unadorned plain-ASCII relatives and, I dunno, I guess drop everything else above codepoint xFF?

    Trying to Google this leads to all sorts of idiocy on Stack Overflow and Expert sexchange that broadly fall into two categories:

    1. Maintain your own custom translation table. FUCK THAT. That resembles work.
    2. Convert the unicode-encoded text to a byte array. Create an ASCII-encoded string from that. This "works" in the same way that the idiot solution I outlined in the Lounge "works". You end up with ASCII, but it totally borks characters with diacritics.

    I suspect there's some variation on #2 whereby first you normalize the diacritics-included characters to combining-diacritics multi-byte characters and remove the combining diacritics, but I'll be fucked if I know how to do that.


  • Fake News

    This post is deleted!

  • Discourse touched me in a no-no place

    @lolwhat said:

    (post withdrawn by author, will be automatically deleted in 24 hours unless flagged)
    I saw your answer, and yeah, that's probably involved in my vague suspicion.

    Followed by a lot of unholy byte-bashing. You can tell the thing you're trying to do is crap when you type "byte[]"


  • Fake News

    Well... there's this one, but I've not tested this at all:



  • Here's an incorrect solution that works:


  • Discourse touched me in a no-no place

    [code]var str = "éåäöíØ";
    var noApostrophes = Encoding.ASCII.GetString(Encoding.GetEncoding("Cyrillic").GetBytes(str)); [/code]
    ............. Yeah no. That's the cargo-cultiest, wrongest thing in the entire universe. In fact, I am almost certain that won't work.



  • If it works, I'm concerned about converting to ASCII not doing that on its own.



  • @Weng said:

    var noApostrophes




  • Apparently Microsoft decided that eaaoiO made sense in Cyrillic, but in ASCII, we needed ??????? instead.


  • Discourse touched me in a no-no place

    Yep. Just arrived at the same conclusion. I'm using BLNS as test data.

    Because I demand that my enterprise system be able to handle Zalgo.


  • Discourse touched me in a no-no place

    I give you the destroyer of worlds:
    [code]public static String PretendItIs1970AndPlacesThatAreNotTheMidwesternUnitedStatesDoNotExist(String s)
    {
        String st = s.Normalize(NormalizationForm.FormD);
        byte[] bitbash = Encoding.UTF8.GetBytes(st);
        List<byte> thisIsAListOfBytesAndIHateMyselfGiveMeABeer = new List<byte>();
        foreach (byte b in bitbash)
        {
            if (b < 127) thisIsAListOfBytesAndIHateMyselfGiveMeABeer.Add(b);
        }
        return Encoding.UTF8.GetString(thisIsAListOfBytesAndIHateMyselfGiveMeABeer.ToArray());
    }
    [/code]
    Yeah. That's right. Normalize to Form D (this decomposes decomposable characters) and THROW AWAY every byte that isn't also identical to a low-order ASCII character. And then interpret it as a UTF8 string again. For reasons I didn't bother to understand it doesn't play nicely when you reinterpret it as an ASCII string.
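
The same normalize-then-discard idea can be sketched at the character level, without the byte array. This is a minimal sketch rather than a vetted implementation; `AsciiFold` and `Fold` are hypothetical names, not anything from the thread.

```csharp
using System.Globalization;
using System.Text;

public static class AsciiFold
{
    // Decompose to Form D, drop the combining marks, then drop anything
    // that is still outside 7-bit ASCII. Lossy by design.
    public static string Fold(string input)
    {
        string decomposed = input.Normalize(NormalizationForm.FormD);
        var sb = new StringBuilder(decomposed.Length);
        foreach (char c in decomposed)
        {
            // Combining accents produced by the decomposition. The ASCII
            // check below would drop them anyway; this just states the intent.
            if (CharUnicodeInfo.GetUnicodeCategory(c) == UnicodeCategory.NonSpacingMark)
                continue;

            // Keep only 7-bit ASCII; every other character is silently lost.
            if (c < 128)
                sb.Append(c);
        }
        return sb.ToString();
    }
}
```

With this sketch, `AsciiFold.Fold("éåäöíØ")` yields `eaaoi`: `Ø` has no canonical decomposition, so it is dropped rather than turned into `O`, and text in a non-Latin script disappears entirely.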



  • @Weng said:

    THROW AWAY every byte that isn't also identical to a low-order ASCII character.

    Are any of the strings possibly in non-Latin scripts, as opposed to merely Latin with diacritics? If so, you have a rather serious problem.


  • Discourse touched me in a no-no place

    Oh, yeah. Absolutely. They're names and addresses and I've read those articles*. But this is for the event where the downstream system between me and the mainframes refuses to do the string sanitization the mainframes require. Since I can't transmit both unsanitized and sanitized data, if sanitization involves data loss, the downstream system doesn't get the data.

    IOW, that's a feature. Not a useful one for the end user, but a useful one for my organizational politics.

    * http://www.kalzumeus.com/2010/06/17/falsehoods-programmers-believe-about-names/ and 
    https://www.mjt.me.uk/posts/falsehoods-programmers-believe-about-addresses/


  • So, how many users do you have named " "?


  • Discourse touched me in a no-no place

    Users? Probably none. People users want us to handle names and addresses for? Who knows.



  • I hate this.



  • Can't you just percent-escape the names and then unescape them when you get them back?



  • @Weng said:

    For reasons I didn't bother to understand it doesn't play nicely when you reinterpret it as an ASCII string.

    The WTF is strong with this one. Code points up to 127 correspond exactly with ASCII, so that a UTF-8 string with no byte over 127 is an ASCII string.
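
A quick way to check that claim, as a minimal sketch (`AsciiCheck` and its methods are hypothetical helper names): if every byte is below 0x80, decoding the buffer as ASCII and as UTF-8 should give the same string.

```csharp
using System.Linq;
using System.Text;

public static class AsciiCheck
{
    // True when every byte is below 0x80, i.e. the buffer is valid ASCII
    // and valid UTF-8 at the same time.
    public static bool IsPureAscii(byte[] bytes) => bytes.All(b => b < 0x80);

    // Decodes the same bytes both ways and compares the resulting strings.
    public static bool DecodesIdentically(byte[] bytes) =>
        Encoding.ASCII.GetString(bytes) == Encoding.UTF8.GetString(bytes);
}
```

For a buffer where `IsPureAscii` is true, `DecodesIdentically` is true as well. For the UTF-8 bytes of something like `é`, the two decoders disagree, because the ASCII decoder replaces each byte above 127 with `?`.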


  • Discourse touched me in a no-no place

    Just looked at it this morning and it turns out ASCII works on sane cases. There might be something odd about one of the BLNS strings. I'll debug it later.


  • Discourse touched me in a no-no place

    I not only hate the solution, and I not only hate the problem. I hate the entire problem class.



  • So don't do it.



  • LOL. This reminds me of people telling you to stop using git. Though I expect @Weng's response won't be as ragey.



  • @Weng said:

    Users? Probably none. People users want us to handle names and addresses for? Who knows.

    Ok, so not your users. But I rather doubt your users are going to be too happy with, say

    Προκοπις Παυλοπουλος
    Προεδρικό Μέγαρο
    Αθήνα 106 74
    Greece
    

    turning into

     
     
     106 74
    Greece
    

    That's even worse than it turning into mojibake. With mojibake, at least it's obvious something is getting mangled. Disappearing entirely is just, "?"



  • Fair enough, I don't know what prompted this since it's in a lounge topic (apparently?) but you could just generate and send "user name keys" or something to the other system and keep the names in your own DB which apparently supports Unicode correctly.


  • Discourse touched me in a no-no place

    You see, someone has to do it. The mainframe apps can't do Unicode. Sanity and best practices both state that the system interfacing with the shitty mainframe apps should do all the string fuckery, particularly if its I/O channels support Unicode.

    In this case, the system that acts as my gateway to the mainframe apps accepts and transmits Unicode, but does not clean it up for the mainframes and has heinous bugs where it treats Unicode as ASCII even when it isn't going to the mainframe.

    However, there is no bug reporting system other than shouting at each other via email. This system owner is, however, effectively immune to outside scrutiny and criticism. And has chosen to deny that there is a problem beyond "you sent extended characters. The mainframes don't like those".

    Therefore my system needs to do the sanitization to ensure we don't trigger heinous bugs downstream which cause it to produce invalid return data which we cannot ingest because it's invalid.

    This isn't a technical problem I'm coding around. It's a dick wolf management problem.



  • Which of the systems interfaces with a user? Because it's clearly wrong to do this transform in a way that's invisible to the user involved-- HardwareGeek has a good example of why. You need to have the user type in their name or address or whatever, then do the transform, then have them confirm the transform makes sense.

    So if that's your system, then yes, you put the fix in the right place. But it doesn't sound like it is.


  • Discourse touched me in a no-no place

    The only case where the blanks really matter is when the mainframe apps produce shipping labels. Which is fortunately an edge case - labels usually come out of my system. The mainframes don't support unicode anyway, so it's not a regression if those are fucked.



  • What does the mainframe do with the strings anyway, that passing random blanks doesn't affect it?


  • Discourse touched me in a no-no place

    The UIs are all completely different systems, but I have the canonical datastore used for everything but the raw UIs for the various legacy systems.

    Let me be clear: At no point am I ever destroying characters in the canonical representation. They only get destroyed going downstream to the jerks who refuse to support unicode properly. So users directly in those systems get fucked up data and outputs from those systems are potentially fucked up. But everything else gets the unfucked version.


  • Discourse touched me in a no-no place

    Oh, little things. Invoicing. Inventory control. Warehouse management. Shipping labels. Taxation. Only the ship labels and taxation care about these fields. And taxation doesn't care about foreign countries much. So that leaves ship labels.

    And yeah. That's a huge problem. But nobody is going to change a 30 year old behemoth.



  • Right; but if the fucked-up data is actually used for anything, you need to have a human pair of eyes verify it before it gets fucked-up. Which algorithm you use to fuck-up the text is kind of secondary to the actual problem here.


  • Discourse touched me in a no-no place

    @Weng said:

    It's a dick wolf management problem.

    @blakeyrat said:

    So if that's your system, then yes, you put the fix in the right place. But it doesn't sound like it is.

    The right solution does appear to involve something that's probably illegal. Or at least some political clout that @Weng doesn't have…


  • Discourse touched me in a no-no place

    Ah yes. That. Unfortunately from my company's perspective the data comes in bulk feeds from our clients (usually dumps from their existing IT systems). Presenting it back to the user is effectively impossible.

    We could present it to a human somewhere, but the question that follows from that is "what human and why do they care?"

    None of the available humans care until it shows up in the warehouse with a blank address label. They can then cross reference that label back to the canonical data and copypasta that into the shipping company's label software to produce a corrected label.

    It's dumb as fuck, but that's the solution senior management is discussing right now.



  • Presumably the company that sent you this data did it for some actual reason, and it's in their best interest to know you are fudging it and (since you are fudging it) to verify that the fudged data is rational.



  • Or if you don't want to lojban it up, replace unicode characters with U+0001F4A9 and the like.



  • U+1F4A9 is outside ASCII.



  • Also... why the fuck does Lojban have character escapes?



  • I meant replace the string " 💩 " with the string "U+0001F4A9"
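
That substitution could look something like the sketch below. It is an illustration only: `CodePointEscaper` is a hypothetical name, and the eight-digit `U+XXXXXXXX` form is simply taken from the example in the post.

```csharp
using System.Text;

public static class CodePointEscaper
{
    // Replaces every non-ASCII code point with the text "U+XXXXXXXX".
    public static string Escape(string input)
    {
        var sb = new StringBuilder();
        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];
            if (c < 128)
            {
                sb.Append(c);
                continue;
            }

            int codePoint;
            if (char.IsHighSurrogate(c) && i + 1 < input.Length
                && char.IsLowSurrogate(input[i + 1]))
            {
                // A surrogate pair encodes one code point above U+FFFF.
                codePoint = char.ConvertToUtf32(c, input[i + 1]);
                i++; // the low surrogate has been consumed
            }
            else
            {
                // A BMP character, or a lone surrogate escaped as-is.
                codePoint = c;
            }

            sb.Append("U+").Append(codePoint.ToString("X8"));
        }
        return sb.ToString();
    }
}
```

The output is plain ASCII, but it cannot be reversed unambiguously: input that already contains literal text such as `U+0001F4A9` looks the same as an escaped character.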


  • Discourse touched me in a no-no place

    Yep. One would expect that. We'll give them the option, certainly, but none of them will take it.

    We already provide data correction services for domestic addresses. Virtually none of our customers are in any way interested in receiving the corrections we make, and even fewer want to action those corrections in their own systems.

    This leads to amusing situations where somebody moves, their address gets updated, we apply the correction, but the company they actually do business with has no idea what the hell the new address is.



  • @Weng said:

    So that leaves ship labels.

    Well, someone has to provide for the Friday Error'ds...

    Anyway, I don't think you have a sensible solution at all, save for the mapping table. And even then it's a tossup on whether the munged name will actually make sense in the target language.


  • Discourse touched me in a no-no place

    Is there a dataset out there for that? Because I'll totally use it if so.

    Because it's more dramatic and has a better sense of theatre to it than just dropping characters on the floor.



  • You could make something out of http://unicode.org/charts/charindex.html



  • That won't help much with transliterating Greek, Cyrillic, Hebrew, Japanese, Arabic, etc. to their ASCII (phonetic) equivalents, which I think is what would really be the ideal thing to do here. That might be sorta reasonable for alphabets like Greek or Cyrillic, where there is a (more-or-less) 1-to-1 correspondence with the basic Latin alphabet (although I don't know if any pre-made mapping tables exist). Syllabaries such as Kana might also be reasonable. Ideographic and pictographic scripts might be rather more challenging; I know diddly-squat about any of the languages that use those scripts, so I don't know if there is any systematic way to map them to phonetic alphabets, but I suspect not. I'm pretty sure you'd wind up with huge mapping tables just for CJK, much less the less widely used ones.
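
For the alphabetic scripts, the mapping-table approach might be sketched as follows. The table is hypothetical and deliberately incomplete: it covers only the letters needed for one word and is not a vetted transliteration standard, and `TableTransliterator` is an invented name.

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.Text;

public static class TableTransliterator
{
    // Illustrative entries only; a real table would cover the whole script
    // and would need review by someone who knows the language.
    private static readonly Dictionary<char, string> GreekToLatin =
        new Dictionary<char, string>
        {
            { '\u0391', "A" },  // GREEK CAPITAL LETTER ALPHA
            { '\u03B1', "a" },  // GREEK SMALL LETTER ALPHA
            { '\u03B8', "th" }, // GREEK SMALL LETTER THETA
            { '\u03B7', "i" },  // GREEK SMALL LETTER ETA
            { '\u03BD', "n" },  // GREEK SMALL LETTER NU
        };

    public static string Transliterate(string input)
    {
        var sb = new StringBuilder();
        // Decompose first so accented letters reduce to base letter + mark.
        foreach (char c in input.Normalize(NormalizationForm.FormD))
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) == UnicodeCategory.NonSpacingMark)
                continue; // drop the accent itself

            if (c < 128)
                sb.Append(c);
            else if (GreekToLatin.TryGetValue(c, out string mapped))
                sb.Append(mapped);
            else
                sb.Append('?'); // unmapped: leave a visible marker, not a blank
        }
        return sb.ToString();
    }
}
```

With this table, the `Αθήνα 106 74` line from the earlier address example comes out as `Athina 106 74`. Appending `?` for unmapped characters is a design choice in this sketch, so that lost data stays visible instead of vanishing.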


  • Discourse touched me in a no-no place

    @HardwareGeek said:

    I don't know if there is any systematic way to map them to phonetic alphabets, but I suspect not

    Languages/alphabets that mostly omit vowels would be challenging to do a good job with.



  • The two scripts that I know of that do that have diacritics ("points") that are, or can be, used to indicate vowels. I don't know, though, whether they are typically used in everyday writing. That is to say, I know they exist, but I don't actually know any of the languages that use those scripts, so I don't know whether you would expect the input text to have vowels or not. You'd probably have to handle both cases.


  • Discourse touched me in a no-no place

    @HardwareGeek said:

    I don't know, though, whether they are typically used in everyday writing.

    What is or isn't in everyday writing isn't important in this case. What is or isn't in addresses is. 🙂



  • @Weng said:

    The mainframe apps can't do Unicode. Sanity and best practices both state that the system interfacing with the shitty mainframe apps should do all the string fuckery, particularly if its I/O channels support Unicode.

    If the mainframe apps need pure ASCII, then why not do as Ben suggests, and make sure everything that goes in or out of a mainframe is percent-escaped UTF-8? It's still fucked up, but at least it's unfuckable on the way back.
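
A minimal sketch of that round trip, assuming the built-in `Uri.EscapeDataString` and `Uri.UnescapeDataString` are acceptable for the job (`PercentCodec` is a hypothetical wrapper name):

```csharp
using System;

public static class PercentCodec
{
    // Encodes the string as UTF-8 and percent-escapes everything except
    // unreserved ASCII characters, so the result is pure ASCII.
    public static string ToAscii(string input) => Uri.EscapeDataString(input);

    // Reverses ToAscii, recovering the original Unicode string.
    public static string FromAscii(string escaped) => Uri.UnescapeDataString(escaped);
}
```

The catch is size: each non-ASCII character expands to at least six ASCII characters, and even a space becomes `%20`, so this only helps if the mainframe fields have room for the longer values.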

