Unicode Collation



  • I have generally been a proponent of Unicode. At least, I tend to use "wchar_t" as much as possible in my C++ code, avoiding "char" as obsolete. I used TCHAR back when it made a difference, and have generally tried to stay up-to-date about string representation. 

    However, the more I actually learn about Unicode - particularly about use of the Unicode "code points" beyond traditional ASCII - the more skepticism it inspires in me.

    Consider, for example, the category of ambiguity that exists between U+212B, the "Angstrom sign" and U+00C5, "Latin A with ring above." Both of these look like the Norwegian letter Å. The problem is that according to the standard, the Norwegian letter can be encoded with either sign. Consider the Unicode "Collation Algorithm":

    For collation, sequences that are canonically equivalent must sort the same. In the table below are some examples. For example, the angstrom symbol was encoded for compatibility, and is canonically equivalent to an A-ring.

    The standard goes on to give a list of similar "examples" (which is presumably not even comprehensive).

    To me, this is as if ASCII had stated that either 0 or O could be used to encode the letter O, and that any code written to use ASCII ought to be able to detect and deal with any ambiguities. This is no small task, and Unicode extends this absurdity to a long list of ambiguities, of which the standard can only provide examples. 

    In fact, it's also possible to encode Å at least one more way: as a plain "Latin A" followed by a "Combining ring above" code point. Similarly, there is a combining diaeresis (umlaut), combining acute and grave accents, etc. This complicates collation further, and one nasty side effect is that Unicode strings are difficult or impossible (depending on how one reads the standard) to reverse.

    I think a better encoding scheme would separate alphabets entirely. The French alphabet, complete with Ç, Ô, etc. would have its own range of numbers in the encoding, and would fall naturally in alphabetical order (as ASCII does for English). This would be independent of the English alphabet, the German alphabet, etc. Collation of a document written in a single language would be almost effortless. If someone (erroneously) wrote, say, French using the English portion of the encoding, then things would sort, quite reasonably, as English.

    Such an encoding might still include the concept of combining accents. But these should not be "optional"; they should be used only in languages that have markings that do not affect collation. Spanish, for instance, places a diaeresis over the letter U whenever it is used to make a G hard. But this does not affect collation, since the letter is still considered a U. On the other hand, in Hungarian Ü really is a distinct letter, which is not collated as U.

    To me, the best approach is to make Ü a first-class letter in the Hungarian portion of the encoding, and to provide a diaeresis (double-dot) modifier in the Spanish portion. To confuse the two letters simply because they happen to look the same (as Unicode does) seems like a recipe for disaster.

    It seems to me that the world is ready for a clean slate approach to the problem. Unicode seems like a real quagmire. I think that if more people were actually trying to use it as anything more than "ASCII with a hat on," we would be seeing some big problems and some nasty "holy wars." For one thing, many of the problems I've cited are in some sense just rehashes of the tab-versus-space dilemma: there are two ways to make the same "letter" or "character" show up on the screen. But the Unicode problems are apocalyptic in scale compared to tab-versus-space.

    Am I wrong about this?



  •  Gaah!  Separating by language would be a bad idea.  What happens if you have one French person with an accented character in their name?  How do you sort it?

    There is a Unicode table that can be used to "decompose" any composite character.  So the first thing to do is decompose anything that comes in.  Assuming the table is complete, A with a ring and the Angstrom symbol should both decompose to the letter A followed by the combining ring thingy.

    If you know you're decomposing for the purpose of comparing two strings, just throw away the combining symbol, leaving the A.

    Now you can just compare the strings (see the sketch below).

    Sorting gets messier when dealing with Kanji and Hangul, but this method takes care of 99% of the Latin alphabet variants.
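
    A rough sketch of that decompose-and-strip comparison, assuming ICU4C is available (the Normalizer2 and u_charType calls are real ICU APIs, but treat this as an illustration, not a finished collation routine):

    #include <unicode/normalizer2.h>
    #include <unicode/uchar.h>
    #include <unicode/unistr.h>

    // Decompose to NFD, then drop non-spacing (combining) marks, keeping the base letters.
    static icu::UnicodeString baseLetters(const icu::UnicodeString& s, UErrorCode& status) {
        const icu::Normalizer2* nfd = icu::Normalizer2::getNFDInstance(status);
        if (U_FAILURE(status)) return icu::UnicodeString();

        icu::UnicodeString decomposed = nfd->normalize(s, status);
        icu::UnicodeString out;
        for (int32_t i = 0; i < decomposed.length(); i = decomposed.moveIndex32(i, 1)) {
            UChar32 c = decomposed.char32At(i);
            if (u_charType(c) != U_NON_SPACING_MARK)   // throw away the "combining ring thingy" etc.
                out.append(c);
        }
        return out;
    }

    // U+212B (angstrom sign) and U+00C5 (A with ring) both reduce to a plain 'A' here,
    // so the different spellings of Å compare equal.
    bool looselyEqual(const icu::UnicodeString& a, const icu::UnicodeString& b) {
        UErrorCode status = U_ZERO_ERROR;
        return baseLetters(a, status) == baseLetters(b, status) && U_SUCCESS(status);
    }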



  • @bridget99 said:

    I have generally been a proponent of Unicode. At least, I tend to use "wchar_t" as much as possible in my C++ code, avoiding "char" as obsolete. I used TCHAR back when it made a difference, and have generally tried to stay up-to-date about string representation. 


    wchar_t is TRWTF. Unicode is still young, but it already has a lot of historical quirks, and UTF-16 is one of the bigger ones. Initially it was thought that 16 bits would be enough (currently Unicode uses 21 bits), so a 16-bit encoding was proposed, companies jumped on the bandwagon, and they started creating parallel APIs using 16-bit characters. Meanwhile somebody invented UTF-8, which didn't require any of that (plus it is shorter on average, except for Chinese/Japanese/Korean text), and then the surrogate ranges were defined that make UTF-16 a variable-length encoding too, so it no longer has any advantage. But now, instead of one encoding for everything, we have UTF-8, UTF-16LE, UTF-16BE and UCS-4 (which can be LE or BE too, of course), plus a few backward-compatibility quirks like UTF-7, though those are not such a big problem.
    @bridget99 said:

    However, the more I actually learn about Unicode - particularly about use of the Unicode "code points" beyond traditional ASCII - the more skepticism it inspires in me.

    Consider, for example, the category of ambiguity that exists between U+212B, the "Angstrom sign" and U+00C5, "Latin A with ring above." Both of these look like the Norwegian letter Å. The problem is that according to the standard, the Norwegian letter can be encoded with either sign. Consider the Unicode "Collation Algorithm":

    For collation, sequences that are canonically equivalent must sort the same. In the table below are some examples. For example, the angstrom symbol was encoded for compatibility, and is canonically equivalent to an A-ring.

    The standard goes on to give a list of similar "examples" (which is presumably not even comprehensive).

    To me, this is as if ASCII had stated that either 0 or O could be used to encode the letter O, and that any code written to use ASCII ought to be able to detect and deal with any ambiguities. This is no small task, and Unicode extends this absurdity to a long list of ambiguities, of which the standard can only provide examples. 

    In fact, it's also possible to encode Å at least one more way: as a plain "Latin A" followed by a "Combining ring above" code point. Similarly, there is a combining diaeresis (umlaut), combining acute and grave accents, etc. This complicates collation further, and one nasty side effect is that Unicode strings are difficult or impossible (depending on how one reads the standard) to reverse.


    All of that is there for backward-compatibility reasons. Whenever a legacy encoding made a difference between two characters, Unicode makes a difference between them too, so you can convert to Unicode and back to the original encoding and get the original data back.

    There is of course an exhaustive list of the equivalent characters: the tables for converting between the normal forms. That's how you normally do comparison, collation and such: convert to a common normal form.
    @bridget99 said:

    I think a better encoding scheme would separate alphabets entirely. The French alphabet, complete with Ç, Ô, etc. would have its own range of numbers in the encoding, and would fall naturally in alphabetical order (as ASCII does for English). This would be independent of the English alphabet, the German alphabet, etc. Collation of a document written in a single language would be almost effortless. If someone (erroneously) wrote, say, French using the English portion of the encoding, then things would sort, quite reasonably, as English.

    Such an encoding might still include the concept of combining accents. But these should not be "optional"; they should be used only in languages that have markings that do not affect collation. Spanish, for instance, places a diaeresis over the letter U whenever it is used to make a G hard. But this does not affect collation, since the letter is still considered a U. On the other hand, in Hungarian Ü really is a distinct letter, which is not collated as U.

    To me, the best approach is to make Ü a first-class letter in the Hungarian portion of the encoding, and to provide a diaeresis (double-dot) modifier in the Spanish portion. To confuse the two letters simply because they happen to look the same (as Unicode does) seems like a recipe for disaster.

    It seems to me that the world is ready for a clean slate approach to the problem. Unicode seems like a real quagmire. I think that if more people were actually trying to use it as anything more than "ASCII with a hat on," we would be seeing some big problems and some nasty "holy wars." For one thing, many of the problems I've cited are in some sense just rehashes of the tab-versus-space dilemma: there are two ways to make the same "letter" or "character" show up on the screen. But the Unicode problems are apocalyptic in scale compared to tab-versus-space.

    Am I wrong about this?


    Your suggestion would only make all the problems you mention much, much worse. You would have a different "English A" and "French A" and "German A" etc. But they always look the same, so the user rightfully assumes they are the same.

    The Unicode standard does in fact provide a solution to these problems using the normal forms. You simply have to choose a normal form (usually NFKC, which collapses the compatibility characters and has all composable characters composed) and do things like collation and comparison in that form. Just make sure you return inputs unconverted (a big grief with the Mac OS X filesystem is that it converts all filenames to decomposed form before storing them, so when you create a file and read the directory listing back, you get a different string).
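
    A minimal sketch of that approach, assuming ICU4C is at hand (Normalizer2 is the real ICU API; this is illustrative only): normalize throwaway copies to NFKC for the comparison and leave the caller's original strings untouched.

    #include <unicode/normalizer2.h>
    #include <unicode/unistr.h>

    bool equivalentNFKC(const icu::UnicodeString& a, const icu::UnicodeString& b) {
        UErrorCode status = U_ZERO_ERROR;
        const icu::Normalizer2* nfkc = icu::Normalizer2::getNFKCInstance(status);
        if (U_FAILURE(status)) return false;

        // Only these temporary copies are normalized; 'a' and 'b' stay exactly as the
        // caller wrote them, so you never hand back a silently rewritten string
        // (the Mac OS X filename problem mentioned above).
        icu::UnicodeString na = nfkc->normalize(a, status);
        icu::UnicodeString nb = nfkc->normalize(b, status);
        return U_SUCCESS(status) && na == nb;
    }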



  • @bridget99 said:

    Consider, for example, the category of ambiguity that exists between U+212B, the "Angstrom sign" and U+00C5, "Latin A with ring above." Both of these look like the Norwegian letter Å. The problem is that according to the standard, the Norwegian letter can be encoded with either sign.

     Consider all the phishing scams these ambiguities will lead to when Unicode domain names become popular...

     



  • You may drop the "When". See IDN homograph attack.



  • @Bulb said:

    Meanwhile somebody invented UTF-8

    That 'somebody' was Ken Thompson and Rob Pike. You might have heard of them. Designed on a placemat in a diner, implemented over a couple of days throughout Plan 9. The stuff of wonders. http://en.wikipedia.org/wiki/Utf-8



  • @bridget99 said:

    I think a better encoding scheme would separate alphabets entirely. The French alphabet, complete with Ç, Ô, etc. would have its own range of numbers in the encoding, and would fall naturally in alphabetical order (as ASCII does for English). This would be independent of the English alphabet, the German alphabet, etc. Collation of a document written in a single language would be almost effortless. If someone (erroneously) wrote, say, French using the English portion of the encoding, then things would sort, quite reasonably, as English.

    Am I wrong about this?

    Yes, this approach will not fit in under the new world government, where everyone must speak Newspeak.



    Collation is difficult and needs to be done for a specific language. There is no point to generic Unicode collation; if your program is used by English speakers, you should collate using English rules, etc. If you need a quick solution you can just sort by character code.

    Giving individual languages different characters is a very bad idea, and it does not solve your collation problem. It would require constant administration from the Unicode Consortium to add new languages, add new characters to languages, and so on. There are about 6,900 languages in the world, plus dialects and regional differences (British, American). One language can have different collations for different purposes (German standard and phone-book collation). One language can also be spoken in different countries (German in Germany, Austria and Switzerland) where the collation may or may not be different. And some languages are written with several different scripts (Serbian in Latin and Cyrillic).

    Even if you had single-purpose character sets, it would still be impossible to collate using simple character-code comparison. German phone-book collation is a good example (Ä = AE, Ö = OE, Ü = UE, etc.). It also does not take into account the possibility that you may have to sort names from different languages in the same list. For example, in Finnish sorting Y = Ü, even though Ü is not a Finnish letter.

    Unicode is complex, but they make all the data available for your own use. There are always a number of language-specific operations that you have to code yourself, though. I don't know how much support the Win32 API gives you in this regard, but modern frameworks like .NET and Java already have different locales, collations and Unicode normalization built in.
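
    To illustrate the point, here is a sketch using ICU4C (an assumption on the library; .NET's CultureInfo and Java's java.text.Collator expose the same idea). The same pair of strings can order differently under German standard rules and German phone-book rules:

    #include <memory>
    #include <unicode/coll.h>
    #include <unicode/unistr.h>

    bool lessThan(const icu::Locale& locale,
                  const icu::UnicodeString& a, const icu::UnicodeString& b) {
        UErrorCode status = U_ZERO_ERROR;
        std::unique_ptr<icu::Collator> coll(icu::Collator::createInstance(locale, status));
        if (U_FAILURE(status)) return false;
        return coll->compare(a, b, status) == UCOL_LESS;
    }

    // With icu::Locale("de", "DE") (standard rules), "Apfel" sorts before "Äpfel";
    // with icu::Locale("de", "DE", nullptr, "collation=phonebook") (Ä = AE),
    // "Äpfel" sorts before "Apfel".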



  • Don't worry, any day now Esperanto will catch on and make all your concerns obsolete.

    Any day now...


  • Discourse touched me in a no-no place

    @blakeyrat said:

    Don't worry, any day now Esperanto will catch on and make all your concerns obsolete.

    Any day now...

      ...  -.-.  ---  .-.  -.-.  ....  ..  ---

    !!!

    Oh, thought you said espania, not esperanto. Soz!!one!

    (and I hope to $ghod some people get those!)



  • "Collation"? Heck, I live in Thailand and even Thai people don't even know how to collate the Thai language. Admittedly, it's a bitch. Sort a syllable first by the consonants spoken before the vowel, then by the pieces that make up the vowel (even if they are written to the left of the consonants), then by the trailing consonants. And heaven help you if you can't tell where one syllable ends and the next one begins. Ask a Thai to look up a word in a dictionary. He'll find the section for the first consonant, then search that SEQUENTIALLY looking for the word.

    Me, I just sort strings in binary. Makes similar words land near each other; good enough.



  • @Bulb said:

    wchar_t is TRWTF. Unicode is still young, but it already has a lot of historical quirks, and UTF-16 is one of the bigger ones. Initially it was thought that 16 bits would be enough (currently Unicode uses 21 bits), so a 16-bit encoding was proposed, companies jumped on the bandwagon, and they started creating parallel APIs using 16-bit characters. Meanwhile somebody invented UTF-8, which didn't require any of that (plus it is shorter on average, except for Chinese/Japanese/Korean text), and then the surrogate ranges were defined that make UTF-16 a variable-length encoding too, so it no longer has any advantage. But now, instead of one encoding for everything, we have UTF-8, UTF-16LE, UTF-16BE and UCS-4 (which can be LE or BE too, of course), plus a few backward-compatibility quirks like UTF-7, though those are not such a big problem.

    Attempting to paraphrase your response, which I basically agree with:

    "Fixed-length 16-bit encoding is TRWTF. Not only is it bloated compared to UTF-8 (which is a variable-length encoding using 8 or more bits per character), it ultimately proved to be insufficient, and had to be bastardized into UTF-16LE, UTF-16BE, etc."

    If I got all of that right, then I agree. But isn't support for UTF-8 in Windows somewhat limited? I think I've read that the only UTF-8-compatible operations in the Win32 API are conversion to/from wchar_t. And I think wchar_t is used, and implemented, at a very low level in Windows. Is that, perhaps, The Real WTF® ?
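
    For reference, the round trip I mean looks roughly like this (a sketch with minimal error handling; MultiByteToWideChar and WideCharToMultiByte are the actual Win32 functions):

    #include <windows.h>
    #include <string>

    std::wstring utf8ToWide(const std::string& utf8) {
        if (utf8.empty()) return std::wstring();
        int len = MultiByteToWideChar(CP_UTF8, 0, utf8.data(), static_cast<int>(utf8.size()),
                                      nullptr, 0);
        std::wstring wide(len, L'\0');
        MultiByteToWideChar(CP_UTF8, 0, utf8.data(), static_cast<int>(utf8.size()),
                            &wide[0], len);
        return wide;
    }

    std::string wideToUtf8(const std::wstring& wide) {
        if (wide.empty()) return std::string();
        int len = WideCharToMultiByte(CP_UTF8, 0, wide.data(), static_cast<int>(wide.size()),
                                      nullptr, 0, nullptr, nullptr);
        std::string utf8(len, '\0');
        WideCharToMultiByte(CP_UTF8, 0, wide.data(), static_cast<int>(wide.size()),
                            &utf8[0], len, nullptr, nullptr);
        return utf8;
    }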

     



  • @PSWorx said:

    You may drop the "When". See IDN homograph attack.
     

    Lulz. I move for summary judgment in favor of me. 



  • @bridget99 said:

    @Bulb said:

    wchar_t is TRWTF. Unicode is still young, but it already has a lot of historical quirks, and UTF-16 is one of the bigger ones. Initially it was thought that 16 bits would be enough (currently Unicode uses 21 bits), so a 16-bit encoding was proposed, companies jumped on the bandwagon, and they started creating parallel APIs using 16-bit characters. Meanwhile somebody invented UTF-8, which didn't require any of that (plus it is shorter on average, except for Chinese/Japanese/Korean text), and then the surrogate ranges were defined that make UTF-16 a variable-length encoding too, so it no longer has any advantage. But now, instead of one encoding for everything, we have UTF-8, UTF-16LE, UTF-16BE and UCS-4 (which can be LE or BE too, of course), plus a few backward-compatibility quirks like UTF-7, though those are not such a big problem.

    Attempting to paraphrase your response, which I basically agree with:

    "Fixed-length 16-bit encoding is TRWTF. Not only is it bloated compared to UTF-8 (which is a variable-length encoding using 8 or more bits per character), it ultimately proved to be insufficient, and had to be bastardized into UTF-16LE, UTF-16BE, etc."

    If I got all of that right, then I agree. But isn't support for UTF-8 in Windows somewhat limited? I think I've read that the only UTF-8-compatible operations in the Win32 API are conversion to/from wchar_t. And I think wchar_t is used, and implemented, at a very low level in Windows. Is that, perhaps, The Real WTF® ?

     

     

    The counterpoint to this argument is that random access into an array of variable-size objects (a very common string task) is going to be much slower. You can't count on a correct byte offset; you have to read the array from the beginning every time.
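
    To illustrate (a rough sketch that assumes well-formed UTF-8, with a hypothetical helper name): even "give me the byte offset of the Nth character" turns into a scan from the start of the string.

    #include <cstddef>
    #include <string>

    // Returns std::string::npos if the string has fewer than n code points.
    // Code points are 1-4 bytes in UTF-8; continuation bytes look like 10xxxxxx,
    // so any byte that is NOT a continuation byte starts a new code point.
    std::size_t byteOffsetOfCodePoint(const std::string& utf8, std::size_t n) {
        std::size_t count = 0;
        for (std::size_t i = 0; i < utf8.size(); ++i) {
            if ((static_cast<unsigned char>(utf8[i]) & 0xC0) != 0x80) {
                if (count == n) return i;
                ++count;
            }
        }
        return std::string::npos;
    }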



  • @DescentJS said:

    The counterpoint to this argument is that random access into an array of variable-size objects (a very common string task) is going to be much slower. You can't count on a correct byte offset; you have to read the array from the beginning every time.

    Except the whole point is that (at least if you want to be really correct) there is no fixed-width 16-bit encoding for Unicode. There can't be, given that the code points go up to 0x10FFFF with a hole between 0xD800 and 0xDFFF!

    Yeah, that's right. Those funny numbers. That comes from how they've made UTF-16 into a variable-length encoding: the 16-bit value is the code point, except when it is between 0xD800 and 0xDBFF, in which case the next one must be between 0xDC00 and 0xDFFF, and you take the low 10 bits of the first, shift them left by 10, add the low 10 bits of the second, and add 0x10000. Now if you call that a sane encoding, I certainly don't.

    Not to mention that Unicode itself is a variable-length encoding due to the various combining stuff, and even if it weren't, some languages are effectively variable-length anyway. So you have the concepts of code units (bytes or 16-bit words, depending on the encoding used), code points, characters and letters, and each can be composed of more than one of the previous.

    So it's actually never correct to just do random access into a Unicode string, even if it is encoded in UCS-4 (4-byte code units, so one code point is exactly one code unit): you always need to find the thing you are interested in in the string first, and then direct access to that is OK.
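
    Spelled out in code, that surrogate rule looks roughly like this (a sketch; it assumes the input is well-formed UTF-16 and skips validation):

    #include <cstddef>

    // Decodes the code point starting at utf16[i] and advances i past it.
    char32_t decodeUtf16(const char16_t* utf16, std::size_t& i) {
        char16_t unit = utf16[i++];
        if (unit >= 0xD800 && unit <= 0xDBFF) {        // lead surrogate
            char16_t trail = utf16[i++];               // must be 0xDC00..0xDFFF
            return 0x10000
                 + ((char32_t(unit) & 0x3FF) << 10)    // low 10 bits of the first, shifted left by 10...
                 + (char32_t(trail) & 0x3FF);          // ...plus the low 10 bits of the second
        }
        return unit;                                   // every other 16-bit value is the code point itself
    }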



  •  @Bulb said:

    @DescentJS said:

    The counterpoint to this argument is that random access into an array of variable-size objects (a very common string task) is going to be much slower. You can't count on a correct byte offset; you have to read the array from the beginning every time.

    Except the whole point is that (at least if you want to be really correct) there is no fixed-width 16-bit encoding for Unicode. There can't be, given that the code points go up to 0x10FFFF with a hole between 0xD800 and 0xDFFF!

    Yeah, that's right. Those funny numbers. That comes from how they've made UTF-16 into a variable-length encoding: the 16-bit value is the code point, except when it is between 0xD800 and 0xDBFF, in which case the next one must be between 0xDC00 and 0xDFFF, and you take the low 10 bits of the first, shift them left by 10, add the low 10 bits of the second, and add 0x10000. Now if you call that a sane encoding, I certainly don't.

    Not to mention that Unicode itself is a variable-length encoding due to the various combining stuff, and even if it weren't, some languages are effectively variable-length anyway. So you have the concepts of code units (bytes or 16-bit words, depending on the encoding used), code points, characters and letters, and each can be composed of more than one of the previous.

    So it's actually never correct to just do random access into a Unicode string, even if it is encoded in UCS-4 (4-byte code units, so one code point is exactly one code unit): you always need to find the thing you are interested in in the string first, and then direct access to that is OK.

    That's correct for Unicode, but not for wchar_t, which is what the post I was replying to was talking about.

