More badly-named functions
-
What does this function do? Hint: the output isn't UTF-8.
char caUTF8[128] = {
    0xC4, 0xC5, 0xC7, 0xC9, 0xD1, 0xD6, 0xDC, 0xE1, 0xE0, 0xE2, 0xE4, 0xE3, 0xE5, 0xE7, 0xE9, 0xE8, //0x8#
    0xEA, 0xEB, 0xED, 0xEC, 0xEE, 0xEF, 0xF1, 0xF3, 0xF2, 0xF4, 0xF6, 0xF5, 0xFA, 0xF9, 0xFB, 0xFC, //0x9#
    0x3F, 0xB0, 0xA2, 0xA3, 0xA7, 0x3F, 0xB6, 0xDF, 0xAE, 0xA9, 0x3F, 0xB4, 0xA8, 0x3F, 0xC6, 0xD8, //0xA#
    0x3F, 0xB1, 0x3F, 0x3F, 0xA5, 0xB5, 0x3F, 0x3F, 0x3F, 0x3F, 0x3F, 0xAA, 0xBA, 0x3F, 0xE6, 0xF8, //0xB#
    0xBF, 0xA1, 0xAC, 0x3F, 0x3F, 0x3F, 0x3F, 0xAB, 0xBB, 0x3F, 0xA0, 0xC0, 0xC3, 0xD5, 0x3F, 0x3F, //0xC#
    0x3F, 0x3F, 0x3F, 0x3F, 0x3F, 0x3F, 0xF7, 0x3F, 0xFF, 0x3F, 0x3F, 0xA4, 0x3F, 0x3F, 0x3F, 0x3F, //0xD#
    0x3F, 0xB7, 0x3F, 0x3F, 0x3F, 0xC2, 0xCA, 0xC1, 0xCB, 0xC8, 0xCD, 0xCE, 0xCF, 0xCC, 0xD3, 0xD4, //0xE#
    0x3F, 0xD2, 0xDA, 0xDB, 0xD9, 0x3F, 0x3F, 0x3F, 0xAF, 0x3F, 0x3F, 0x3F, 0xB8, 0x3F, 0x3F, 0x3F }; //0xF#

static void S_TranslateToUTF8( char *text, long length )
{
    long ctr;

    for (ctr = 0; ctr < length; ++ctr) {
        if (text[ctr] >= 0x80) {
            text[ctr] = caUTF8[text[ctr] - 0x80];
        }
    }
}
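(Incidental bug worth flagging, separate from the naming: on compilers where plain char is signed, which is most of them, `text[ctr] >= 0x80` is never true and `text[ctr] - 0x80` would index the table with a negative value, so the function only does anything where char happens to be unsigned. A sketch of the same lookup done with explicit unsigned bytes; the table is copied verbatim from the post, the function name is mine:)

```c
#include <stddef.h>

/* Same 128-entry remap table as in the post, stored as unsigned bytes. */
static const unsigned char high_map[128] = {
    0xC4, 0xC5, 0xC7, 0xC9, 0xD1, 0xD6, 0xDC, 0xE1, 0xE0, 0xE2, 0xE4, 0xE3, 0xE5, 0xE7, 0xE9, 0xE8, /* 0x8# */
    0xEA, 0xEB, 0xED, 0xEC, 0xEE, 0xEF, 0xF1, 0xF3, 0xF2, 0xF4, 0xF6, 0xF5, 0xFA, 0xF9, 0xFB, 0xFC, /* 0x9# */
    0x3F, 0xB0, 0xA2, 0xA3, 0xA7, 0x3F, 0xB6, 0xDF, 0xAE, 0xA9, 0x3F, 0xB4, 0xA8, 0x3F, 0xC6, 0xD8, /* 0xA# */
    0x3F, 0xB1, 0x3F, 0x3F, 0xA5, 0xB5, 0x3F, 0x3F, 0x3F, 0x3F, 0x3F, 0xAA, 0xBA, 0x3F, 0xE6, 0xF8, /* 0xB# */
    0xBF, 0xA1, 0xAC, 0x3F, 0x3F, 0x3F, 0x3F, 0xAB, 0xBB, 0x3F, 0xA0, 0xC0, 0xC3, 0xD5, 0x3F, 0x3F, /* 0xC# */
    0x3F, 0x3F, 0x3F, 0x3F, 0x3F, 0x3F, 0xF7, 0x3F, 0xFF, 0x3F, 0x3F, 0xA4, 0x3F, 0x3F, 0x3F, 0x3F, /* 0xD# */
    0x3F, 0xB7, 0x3F, 0x3F, 0x3F, 0xC2, 0xCA, 0xC1, 0xCB, 0xC8, 0xCD, 0xCE, 0xCF, 0xCC, 0xD3, 0xD4, /* 0xE# */
    0x3F, 0xD2, 0xDA, 0xDB, 0xD9, 0x3F, 0x3F, 0x3F, 0xAF, 0x3F, 0x3F, 0x3F, 0xB8, 0x3F, 0x3F, 0x3F  /* 0xF# */
};

/* Remap bytes >= 0x80 in place; 7-bit ASCII passes through untouched. */
static void remap_high_bytes(char *text, size_t length)
{
    for (size_t i = 0; i < length; ++i) {
        unsigned char c = (unsigned char)text[i]; /* dodge signed-char sign extension */
        if (c >= 0x80)
            text[i] = (char)high_map[c - 0x80];
    }
}
```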
-
I'm going to say that the net effect is pissing you off.
As for the side effect, I would guess (offhand) that it replaced special characters or something.
-
I gotta say, you're a bastard for posting this. After looking at an ASCII table for a few minutes and skimming the Wikipedia article on UTF-8, I'm tempted to stay up late trying to figure this out...
But instead I'm going to beg you to tell me the answer.
As best as I can tell, if you interpret the output as UTF-8, the string will be completely fucked up, because it's translating extended ASCII characters into multi-byte Unicode code points.
If you interpret the output as ASCII, it looks like it's scrambling the extended ASCII set, except that a bunch of the replacement values are question marks (0x3F).
At the moment, I can't even fathom what the intent of this was.
-
Looks like a Mac OS Roman (before OS 8.5) to ISO/IEC 8859-1 converter. Tell-tale signs:
- It converts between single-byte encodings with the same lower 128 characters (ASCII).
- No characters in the source map to 0x80-0x9f in the destination (ISO/IEC 8859-1 leaves that range undefined). For example, the dagger (0xa0 in the source) is converted to a question mark even though Windows-1252 has it; ISO/IEC 8859-1 lacks it.
- 0xdb is converted to 0xa4 ('¤'), not 0x80 ('€' in Windows-1252, missing in ISO/IEC 8859-1), in line with pre-Euro Mac OS.
text[ctr] = caUTF8[text[ctr] - 0x80];
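(Those tell-tale signs are easy to spot-check mechanically. A sketch that verifies the second and third bullets; the table is copied verbatim from the function under discussion, the helper name is made up:)

```c
/* The 128-entry table from the function under discussion. */
static const unsigned char tbl[128] = {
    0xC4, 0xC5, 0xC7, 0xC9, 0xD1, 0xD6, 0xDC, 0xE1, 0xE0, 0xE2, 0xE4, 0xE3, 0xE5, 0xE7, 0xE9, 0xE8, /* 0x8# */
    0xEA, 0xEB, 0xED, 0xEC, 0xEE, 0xEF, 0xF1, 0xF3, 0xF2, 0xF4, 0xF6, 0xF5, 0xFA, 0xF9, 0xFB, 0xFC, /* 0x9# */
    0x3F, 0xB0, 0xA2, 0xA3, 0xA7, 0x3F, 0xB6, 0xDF, 0xAE, 0xA9, 0x3F, 0xB4, 0xA8, 0x3F, 0xC6, 0xD8, /* 0xA# */
    0x3F, 0xB1, 0x3F, 0x3F, 0xA5, 0xB5, 0x3F, 0x3F, 0x3F, 0x3F, 0x3F, 0xAA, 0xBA, 0x3F, 0xE6, 0xF8, /* 0xB# */
    0xBF, 0xA1, 0xAC, 0x3F, 0x3F, 0x3F, 0x3F, 0xAB, 0xBB, 0x3F, 0xA0, 0xC0, 0xC3, 0xD5, 0x3F, 0x3F, /* 0xC# */
    0x3F, 0x3F, 0x3F, 0x3F, 0x3F, 0x3F, 0xF7, 0x3F, 0xFF, 0x3F, 0x3F, 0xA4, 0x3F, 0x3F, 0x3F, 0x3F, /* 0xD# */
    0x3F, 0xB7, 0x3F, 0x3F, 0x3F, 0xC2, 0xCA, 0xC1, 0xCB, 0xC8, 0xCD, 0xCE, 0xCF, 0xCC, 0xD3, 0xD4, /* 0xE# */
    0x3F, 0xD2, 0xDA, 0xDB, 0xD9, 0x3F, 0x3F, 0x3F, 0xAF, 0x3F, 0x3F, 0x3F, 0xB8, 0x3F, 0x3F, 0x3F  /* 0xF# */
};

/* Count destination bytes that land in ISO/IEC 8859-1's unassigned 0x80-0x9F gap.
 * If the target were Windows-1252, characters like the dagger could have mapped
 * into that range instead of becoming '?'. */
static int count_gap_hits(void)
{
    int hits = 0;
    for (int i = 0; i < 128; ++i)
        if (tbl[i] >= 0x80 && tbl[i] <= 0x9F)
            ++hits;
    return hits;
}
```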
-
I believe the intent was to convert MacRoman into a sort of "truncated Unicode" containing only codepoints that can fit into a single byte, but as you noted, the net result is ISO 8859-1.
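(That framing works because ISO 8859-1 byte values coincide with the first 256 Unicode code points, so the table really is "Unicode truncated to one byte". To produce actual UTF-8 from that, every byte >= 0x80 needs the standard two-byte encoding instead; a minimal sketch, function name mine:)

```c
#include <stddef.h>

/* Encode one Latin-1 byte (= Unicode code point U+0000..U+00FF) as UTF-8.
 * Writes 1 or 2 bytes into out and returns how many were written. */
static size_t latin1_to_utf8(unsigned char c, unsigned char out[2])
{
    if (c < 0x80) {               /* ASCII is identical in UTF-8 */
        out[0] = c;
        return 1;
    }
    out[0] = 0xC0 | (c >> 6);     /* lead byte: 110xxxxx */
    out[1] = 0x80 | (c & 0x3F);   /* continuation: 10xxxxxx */
    return 2;
}
```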
-
And, confirmed - interpreting the byte values as ISO-8859-1, the table ends up being:
0x8#: Ä Å Ç É Ñ Ö Ü á à â ä ã å ç é è
0x9#: ê ë í ì î ï ñ ó ò ô ö õ ú ù û ü
0xA#: ? ° ¢ £ § ? ¶ ß ® © ? ´ ¨ ? Æ Ø
0xB#: ? ± ? ? ¥ µ ? ? ? ? ? ª º ? æ ø
0xC#: ¿ ¡ ¬ ? ? ? ? « » ? (nbsp) À Ã Õ ? ?
0xD#: ? ? ? ? ? ? ÷ ? ÿ ? ? ¤ ? ? ? ?
0xE#: ? · ? ? ? Â Ê Á Ë È Í Î Ï Ì Ó Ô
0xF#: ? Ò Ú Û Ù ? ? ? ¯ ? ? ? ¸ ? ? ?