Not that internationalization matters



  • One of the things Microsoft likes to accolaid themselves for is correct handling of internationalization.  All of their development tools are oriented toward making i18n reasonably straightforward: some can't even handle non-Unicode strings, others let you build for particular locales.

    With that line of thought, you might make the mistake of thinking that MS's distributed applications can handle Unicode correctly (hey, didn't Unicode start out as MS's proprietary version of ISO-10646?).  Sadly, that's not quite true.

     

    While dealing with text files in 'far-eastern languages', I noticed first that I needed the language pack installed.  No WTF yet: the package is 200 megs (not counting Thai, which is a 'complex left-to-right' language), so it's reasonable that it's not installed by default on my 120 gig hard drive.  So I installed the far-eastern language package to view my Japanese text.

     

    Now, the assumption here is partly my fault -- thinking that WordPad is the more capable of the two basic text editors, I went to open my Japanese files.  I noticed two things:

     

    1) My Japanese isn't Japanese, and I don't know what the fuck it is.

    2) My POSIX newlines aren't newlines, in a program that properly handles POSIX newlines.  Apparently 00 0A doesn't get grokked as "newline" in Unicode.

     

    Starting to wonder WTF, I checked it out in Notepad.  Now my Japanese is Japanese, but I still have no newlines (that wasn't surprising; Notepad can't get POSIX newlines right in ANSI/ISO format either).  That makes the remaining problem easy to diagnose: the file is big-endian.
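
    For anyone who wants to see the effect without hunting down a big-endian file, here is a minimal Python sketch. The sample string and both helper names are made up for illustration, and it only assumes a reader that treats UTF-16 as little-endian regardless of the BOM; it is not a claim about how WordPad is actually implemented.

```python
# Two lines of Japanese separated by a POSIX newline, stored as UTF-16BE with a BOM.
text = "日本語\nテキスト"
data = b"\xfe\xff" + text.encode("utf-16-be")


def decode_ignoring_bom(raw: bytes) -> str:
    """Decode as little-endian no matter what the BOM says (the wrong approach)."""
    return raw.decode("utf-16-le")


def decode_respecting_bom(raw: bytes) -> str:
    """Let the generic 'utf-16' codec read the BOM and pick the byte order."""
    return raw.decode("utf-16")


wrong = decode_ignoring_bom(data)
right = decode_respecting_bom(data)

# The byte pair 00 0A read little-endian is U+0A00, not U+000A, so the
# line break vanishes along with the Japanese.
print("newline survived (BOM ignored):  ", "\n" in wrong)
print("newline survived (BOM respected):", "\n" in right)
```

    Every 16-bit unit gets its bytes swapped, which is why both the characters and the line breaks go wrong at once.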

     

    Slightly annoyed, but not completely let down, I checked the Unicode handling in my development environments.  VC6 doesn't even grok Unicode -- the files are considered binary.  VC8 was a bit better, at least showing my Japanese in Japanese (still no line feeds... but my hopes weren't that high).

     

    Step back for a second, and ask what the WTF is here.  Obviously it's not that WordPad can't handle UTF-16BE properly; that's just asking too much.  It's not that VC6 can't handle Unicode at all (it can't really handle anything else properly either).  It's not that VC8 can't handle POSIX line feeds; proprietary requirements for a mandatory carriage return that don't conform to the rest of the world are par for the course up in Redmond.  And it's only slightly that Notepad, the most lightweight of the group, handles it the best.

     

    In fact, it sort of makes sense that Notepad works the best: it's just an edit window slapped on top of a sizeable frame.  It's kind of heartening to know that the base controls in Win32 handle both UTF-16BE and -16LE properly (or is it just that some grommet thought it was a Good Idea to actually respect the BOM before calling SetText?).  Though it's rather surprising that such a dinky 'tool' can save in either format (and UTF-8!).

     

    So I was slightly disheartened when I got a support call from somebody reporting that all of their UTF-16BE documents had become unusable.  Bar none.  Popping open the handy hex editor, it was easy to see that all of the BOMs had been reversed.

     

    Their trainee, not knowing better, had opened the documents in WordPad (obviously the first logical thing to do with a Unicode document), made a list of 'correct' corrections to an incorrectly displayed file, then correctly saved the file as Unicode text.  It's not enough that WordPad ignores the BOM when reading a file and treats it as always being in the native byte order.  No, we have to compound that with the incredible ability of a tool that also ignores the BOM when writing the file.  WordPad's definition of changing UTF-16BE to UTF-16LE?  Change the BOM, then all will be OK.

     

    Somewhere between user ignorance and program inability, you need to distinguish between being unable to cope properly and just doing the Wrong Thing -- in this case the difference amounted to about 100 man hours, as the client had to revert all of the BOMs and replace the changes to however many files they had corrupted.
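
    To make the difference concrete, here is a rough Python sketch of "change the BOM" versus an actual byte-order conversion. The function names and the sample text are hypothetical, and `swap_bom_only` only mimics the damage described above; it is not WordPad's real code.

```python
BOM_BE = b"\xfe\xff"
BOM_LE = b"\xff\xfe"


def swap_bom_only(raw: bytes) -> bytes:
    """Relabel the byte order but leave the payload bytes untouched."""
    if raw.startswith(BOM_BE):
        return BOM_LE + raw[2:]
    if raw.startswith(BOM_LE):
        return BOM_BE + raw[2:]
    return raw


def convert_be_to_le(raw: bytes) -> bytes:
    """A real conversion: decode per the BOM, then re-encode little-endian."""
    text = raw.decode("utf-16")  # the 'utf-16' codec consumes the BOM
    return BOM_LE + text.encode("utf-16-le")


original = BOM_BE + "日本語\n".encode("utf-16-be")

damaged = swap_bom_only(original)   # now claims LE, payload is still BE
repaired = swap_bom_only(damaged)   # reverting the BOM restores the bytes
converted = convert_be_to_le(original)

print("damaged still reads correctly:", damaged.decode("utf-16") == "日本語\n")
print("repaired matches original:    ", repaired == original)
print("converted reads correctly:    ", converted.decode("utf-16") == "日本語\n")
```

    Flipping the BOM back only fixes the byte-order label. Any edits made while the text was being displayed wrongly still have to be redone by hand, which is presumably where most of those man hours went.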



  • @Corwinoid said:

    ...With that line of thought, you might make
    the mistake of thinking that MS's distributed applications can handle
    Unicode correctly

    [...]

    Somewhere between user ignorance and program inability, you need to
    distinguish between being unable to cope properly, and just doing the
    Wrong Thing -- in this case the difference amounted to about 100
    man hours, as the client had to revert all of the BOMs, and replace the
    changes to however many files they had corrupted.

    You should not be using Notepad or WordPad to edit Unicode files.
    They are "toy" applets which were not designed for mission-critical
    editing.  I doubt that either has been updated since Windows
    98.  Microsoft wants you to shell out the big bucks for Office, so
    I recommend that you download OpenOffice, for free, and give it a spin.




  • @triso said:

    You should not be using Notepad or WordPad to edit Unicode files.
    They are "toy" applets which were not designed for mission-critical
    editing.  I doubt that either has been updated since Windows
    98.  Microsoft wants you to shell out the big bucks for Office, so
    I recommend that you download OpenOffice, for free, and give it a spin.


    Actually, they have been improved somewhat - Windows 98 Notepad doesn't support Unicode at all (whereas newer versions have OKish support). There are various other fixes as well - IIRC, Ctrl-S in Notepad now actually saves the document (about time!)



  • and Win98 Notepad had a 64KB filesize limit which has been upped to something like 2GB with the result that the unwary trying to load a logfile which grows faster than Windows can read it is going to have to wait a LONG time before it fails ;)



  • Even if you had spelled "accolade" correctly, it still wouldn't be a verb.



  • @makomk said:

    @triso said:
    You should not be using Notepad or WordPad to edit Unicode files.  They are "toy" applets which were not designed for mission-critical editing.  I doubt that either has been updated since Windows 98.  Microsoft wants you to shell out the big bucks for Office, so I recommend that you download OpenOffice, for free, and give it a spin.

    Actually, they have been improved somewhat - Windows 98 Notepad doesn't support Unicode at all (whereas newer versions have OKish support). There are various other fixes as well - IIRC, Ctrl-S in Notepad now actually saves the document (about time!)

    Actually, it hasn't improved.  Notepad in Windows NT 3.1 (circa 1992) supported Unicode and 2GB file sizes.  This version was used in the whole NT product line -- NT 3.1, NT 3.5, NT 3.51, NT4, Windows 2000, Windows XP, and Windows 2003.  Windows 95, 98, and ME used Windows 3.1's 16-bit version of Notepad, with little or no enhancements.



  • @Corwinoid said:

    Now, the assumption here is partly my fault -- thinking that WordPad is the more capable of the two basic text editors, I went to open my Japanese files.  I noticed two things:

    1) My Japanese isn't Japanese, and I don't know what the fuck it is.

    2) My POSIX newlines aren't newlines, in a program that properly handles POSIX newlines.  Apparently 00 0A doesn't get grokked as "newline" in Unicode.

     

    Starting to wonder WTF, I checked it out in notepad.



    OK, but how did you get either Notepad or Wordpad to know which encoding you were using?  Unicode is not a complete character description by itself - you need the encoding too.  This is usually specified out-of-band in MIME types.  There are hacked-up unreliable ways to guess the encoding, but I'm not at all surprised that either of those two applications can't do that very well.  You should not be expecting too much of ASCII applications that were extended to support Unicode without any real attempt to allow a proper encoding specification, never mind a still non-existent endianness specification.

    I'd say the real WTF here is expecting Microsoft to actually get anything other than U.S. ASCII correct, no matter what they say about themselves.



  • @stevekj said:


    OK, but how did you get either Notepad or Wordpad to know which encoding you were using?  Unicode is not a complete character description by itself - you need the encoding too.



    OK, that was a little too hasty - if the text files did contain BOMs in the first place, they are supposed to indicate the UTF-16 encoding and also of course the byte order.  So that's supposed to be a complete and accurate representation of Unicode... so you're right, WP and NP are definitely WTFs for ignoring (or mishandling) the BOMs, despite knowing at least something about Unicode internally.
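
    As a rough illustration of how little it takes to honour a BOM, here is a minimal Python sketch. The `sniff_bom` helper is hypothetical; the BOM constants come from the standard `codecs` module, and files without any BOM are deliberately left undecided rather than guessed at.

```python
import codecs


def sniff_bom(raw: bytes):
    """Return a codec name implied by a leading BOM, or None if there is none."""
    # UTF-32 is checked before UTF-16 because the UTF-32LE BOM (FF FE 00 00)
    # begins with the UTF-16LE BOM (FF FE).
    candidates = (
        (codecs.BOM_UTF8, "utf-8-sig"),
        (codecs.BOM_UTF32_LE, "utf-32"),
        (codecs.BOM_UTF32_BE, "utf-32"),
        (codecs.BOM_UTF16_LE, "utf-16"),
        (codecs.BOM_UTF16_BE, "utf-16"),
    )
    for bom, name in candidates:
        if raw.startswith(bom):
            return name
    return None


sample = codecs.BOM_UTF16_BE + "日本語".encode("utf-16-be")
encoding = sniff_bom(sample)
print(encoding, sample.decode(encoding))
```

    The codec names returned here are the BOM-aware ones, so decoding with them both picks the right byte order and strips the mark from the text.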

    Side WTF: BOMs??



  • @jsmith said:

    @makomk said:

    @triso said:
    You should not be using Notepad or WordPad to edit Unicode files.  They are "toy" applets which were not designed for mission-critical editing.  I doubt that either has been updated since Windows 98.  Microsoft wants you to shell out the big bucks for Office, so I recommend that you download OpenOffice, for free, and give it a spin.

    Actually, they have been improved somewhat - Windows 98 Notepad doesn't support Unicode at all (whereas newer versions have OKish support). There are various other fixes as well - IIRC, Ctrl-S in Notepad now actually saves the document (about time!)

    Actually, it hasn't improved.  Notepad in Windows NT 3.1 (circa 1992) supported Unicode and 2GB file sizes.  This version was used in the whole NT product line -- NT 3.1, NT 3.5, NT 3.51, NT4, Windows 2000, Windows XP, and Windows 2003.  Windows 95, 98, and ME used Windows 3.1's 16-bit version of Notepad, with little or no enhancements.


    Actually, it has improved - though IMO the improvements are what jsmith regards as fixes.  In Windows 2000, the open file dialog had the options "File name" and "File Type"; in XP the additional item "Encoding" was added - very helpful when a BOM is not supplied.  The option for selecting the encoding when saving was in both.  The other difference between 2000 and XP that is immediately noticeable is the (optional) addition of a status bar for ln,col info.  Past improvements I liked were Ctrl+A (the option for select all did exist but didn't have a shortcut, if I recall) and Ctrl+G (go to line).
    I wouldn't mind the next version to have Ctrl+arrow support for word boundaries. Just using white space boundaries annoys me, but it may be asking too much.
    BOM goodiness: http://www.unicode.org/faq/utf_bom.html#BOM



  • @triso said:

    You should not be using Notepad or WordPad to edit Unicode files.
    They are "toy" applets which were not designed for mission-critical
    editing.  I doubt that either has been updated since Windows
    98.  Microsoft wants you to shell out the big bucks for Office, so
    I recommend that you download OpenOffice, for free, and give it a spin.

    "I'm upset: my screwdriver set looks like it can handle torx screws, but it stripped them all."

    "You shouldn't be using a toy like a handheld screwdriver for that. Get a 24-volt cordless drill/driver."

    WTF? A text editor should edit files without mangling them, or it should complain that it can't handle the file's encoding. I shouldn't have to use a heavyweight office suite for editing plain text. Text editing is not word processing, in any language.
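
    The "complain instead of mangling" option could look something like this minimal Python sketch. The function name and the default of UTF-8 are assumptions for illustration only; the point is simply that a strict decode fails loudly instead of showing garbage.

```python
def load_text_or_refuse(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode strictly; raise a readable error rather than display garbage."""
    try:
        return raw.decode(encoding, errors="strict")
    except UnicodeDecodeError as exc:
        raise ValueError(
            f"refusing to open: not valid {encoding} ({exc.reason})"
        ) from exc


# A UTF-16BE file with a BOM is not valid UTF-8, so this is rejected
# instead of being shown (and later saved) as something else.
utf16_file = b"\xfe\xff" + "日本語".encode("utf-16-be")
try:
    load_text_or_refuse(utf16_file)
except ValueError as err:
    print(err)
```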



  • that's why VI was invented...

