More badly-named functions



  • What does this function do? Hint: the output isn't UTF-8.


    char caUTF8[128] = {0xC4, 0xC5, 0xC7, 0xC9, 0xD1, 0xD6, 0xDC, 0xE1, 0xE0, 0xE2, 0xE4, 0xE3, 0xE5, 0xE7, 0xE9, 0xE8, 	//0x8#
    		    0xEA, 0xEB, 0xED, 0xEC, 0xEE, 0xEF, 0xF1, 0xF3, 0xF2, 0xF4, 0xF6, 0xF5, 0xFA, 0xF9, 0xFB, 0xFC,	//0x9#
    		    0x3F, 0xB0, 0xA2, 0xA3, 0xA7, 0x3F, 0xB6, 0xDF, 0xAE, 0xA9, 0x3F, 0xB4, 0xA8, 0x3F, 0xC6, 0xD8, 	//0xA#
    		    0x3F, 0xB1, 0x3F, 0x3F, 0xA5, 0xB5, 0x3F, 0x3F, 0x3F, 0x3F, 0x3F, 0xAA, 0xBA, 0x3F, 0xE6, 0xF8, 	//0xB#
    		    0xBF, 0xA1, 0xAC, 0x3F, 0x3F, 0x3F, 0x3F, 0xAB, 0xBB, 0x3F, 0xA0, 0xC0, 0xC3, 0xD5, 0x3F, 0x3F, 	//0xC#
    		    0x3F, 0x3F, 0x3F, 0x3F, 0x3F, 0x3F, 0xF7, 0x3F, 0xFF, 0x3F, 0x3F, 0xA4, 0x3F, 0x3F, 0x3F, 0x3F, 	//0xD#
    		    0x3F, 0xB7, 0x3F, 0x3F, 0x3F, 0xC2, 0xCA, 0xC1, 0xCB, 0xC8, 0xCD, 0xCE, 0xCF, 0xCC, 0xD3, 0xD4, 	//0xE#
    		    0x3F, 0xD2, 0xDA, 0xDB, 0xD9, 0x3F, 0x3F, 0x3F, 0xAF, 0x3F, 0x3F, 0x3F, 0xB8, 0x3F, 0x3F, 0x3F };	//0xF#
    
    static void	S_TranslateToUTF8( char *text, long length )
    {
    	long ctr;
    	
    	for (ctr = 0; ctr < length; ++ctr)
    	{
    		if (text[ctr] >= 0x80)
    		{
    			text[ctr] = caUTF8[text[ctr] - 0x80];
    		}
    	}
    }


  • I'm going to say that the net effect is pissing you off.

     

    As for the side effect, I would guess (offhand) that it replaced special characters or something.



  • I gotta say, you're a bastard for posting this. After looking at an ASCII table for a few minutes and skimming the Wikipedia article on UTF-8, I'm tempted to stay up late trying to figure this out...

    But instead I'm going to beg you to tell me the answer.

    As best as I can tell, if you interpret the output as UTF-8, the string will be completely fucked up because it's translating extended ASCII characters into multi-byte Unicode code points.

    If you interpret the output as ASCII, it looks like it's scrambling the extended ASCII set, except a bunch of the replacement values are question marks (0x3f).

    At the moment, I can't even fathom what the intent of this was.



  • Looks like a Mac OS Roman (before OS 8.5) to ISO/IEC 8859-1 converter. Tell-tale signs:

    • It converts between single-byte encodings with the same lower 128 characters (ASCII).
    • No characters in the source map to 0x80-0x9f in the destination (ISO/IEC 8859-1 leaves these undefined). For example, daggers (0xa0) are converted to question marks despite being in Windows-1252. ISO/IEC 8859-1 lacks these.
    • 0xdb is converted to 0xa4 ('¤'), not 0x80 ('€' in Windows-1252, missing in ISO/IEC 8859-1), in line with pre-Euro Mac OS.

    I really can't be bothered to check the entire table, but it does convert the stuff I'm used to seeing break (Scandinavian characters and the Euro sign). Working with plain text files in Scandinavian languages leaves you with a certain awareness of character encodings. At least it's not ISO-646-SF, which would have shown the code as something like:

    textÄctrÅ = caUTF8ÄtextÄctrÅ - 0x80Å;


  • I believe the intent was to convert MacRoman into a sort of "truncated Unicode" containing only codepoints that can fit into a single byte, but as you noted, the net result is ISO 8859-1.



  • And, confirmed - interpreting the byte values as ISO-8859-1, the table ends up being:

    ÄÅÇÉÑÖÜáàâäãåçéè
    êëíìîïñóòôöõúùûü
    ?°¢£§?¶ß®©?´¨?ÆØ
    ?±??¥µ?????ªº?æø
    ¿¡¬????«»? ÀÃÕ??
    ??????÷?ÿ??¤????
    ?·???ÂÊÁËÈÍÎÏÌÓÔ
    ?ÒÚÛÙ???¯???¸???

