Is Skeet wrong in the below presentation?



  • Check my fiddle.

    Continuing the discussion from Unicode (of course):

    @Nagesh said:

    Summary:

    Numbers are bad.
    Text is bad.
    DateTime is bad.

    https://www.youtube.com/watch?v=l3nPJ-yK-LU



  • Absolutely not. All of those things are totally full of WTFery.

    By the way, http://utf8everywhere.org/



  • Did you even click on my .NET fiddle link? Can you prove that the code places an accent on the r letter instead of the e letter when I reverse it? Am I using the wrong character?



  • There are two ways to represent an “é” in Unicode: either with the precomposed “é” character proper, or with an “e” character followed by a combining acute accent character (U+0301).
    If you use the former representation, the accent stays on the “e” when the string is reversed. But if you use the latter, the character sequence “e ́ r” becomes “r ́ e”, so the accent ends up on the “r”.
    It is possible to normalize the string so that every e + combining acute sequence becomes é, every a + combining grave sequence becomes à, etc.
    Anyway, why are you complaining that reversing a string of text gives strange results? Reversed text usually does not make any sense...
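    A minimal Python sketch of the above (the characters are exactly the ones described: U+00E9 versus e followed by U+0301):

```python
import unicodedata

precomposed = "\u00e9"   # é as a single LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

# The two strings render identically but differ code-point-wise.
assert precomposed != decomposed

# Naively reversing the decomposed "e ́ r" moves the accent onto the r.
assert "e\u0301r"[::-1] == "r\u0301e"

# NFC normalization recombines e + U+0301 into the single é character.
assert unicodedata.normalize("NFC", decomposed) == precomposed
```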



  • $ echo "é" | hd -c
    00000000 c3 a9 0a |...|
    0000000 303 251 \n
    0000003
    $ echo "é" | hd -c
    00000000 65 cc 81 0a |e...|
    0000000 e 314 201 \n
    0000004

    Your fiddle uses the top one, but the presentation uses the bottom one.
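    The same byte sequences fall out of Python's UTF-8 encoder, if you want to check without a shell:

```python
# Precomposed é (U+00E9) encodes to two UTF-8 bytes: c3 a9.
assert "\u00e9".encode("utf-8") == b"\xc3\xa9"

# e + combining acute (U+0065 U+0301) encodes to three: 65 cc 81.
assert "e\u0301".encode("utf-8") == b"e\xcc\x81"
```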



  • Thanks for the simple and effective explanation. I also have another gripe with Skeet's presentation. He starts off with a double d = 3.0;

    Then asks the audience to guess the value of d. Here's my experience with the same.

    I input 3 and I get a 3 back.



  • @Nagesh said:

    Check my fiddle.

    Is there supposed to be something wrong with that fiddle? The accent is on the e in the reversed string for me.



  • @ufmace said:

    Is there supposed to be something wrong with that fiddle? The accent is on the e in the reversed string for me.

    Go through Skeet's presentation. In his example, the accent goes on the "r" when reversed. @VinDuv has given an adequate explanation for that.



  • @Nagesh said:

    Did you even click on my .NET fiddle link? Can you prove that code place an accent on the r letter instead of the e letter when I reverse it? Am I using wrong character?

    That .NET Fiddle reverses it correctly. It prints out "selbarésiM seL".

    The .NET framework uses UTF-16 encoding internally. If I remember correctly, it converts to UTF-8 when rendering an ASP.NET page. Anyway, the thing is that 'é' is a unique character, and Skeet used "e\u0301" in his example. "é" != "e\u0301", which is why you don't get the result that Skeet showed.
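    The contrast is easy to reproduce in Python (a sketch; .NET's naive reversal over UTF-16 code units behaves the same way for these BMP characters):

```python
s_precomposed = "Les Mis\u00e9rables"   # é as one code point
s_decomposed = "Les Mise\u0301rables"   # e + combining acute

# Precomposed: the accent survives naive reversal on the e.
assert s_precomposed[::-1] == "selbar\u00e9siM seL"

# Decomposed: the combining mark now follows the r, as in the presentation.
assert s_decomposed[::-1] == "selbar\u0301esiM seL"
```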



  • @Nagesh said:

    Check my fiddle. https://dotnetfiddle.net/rRtUXi

    This version totally does as the presentation claims. Prints

    selbaŕesiM seL
    

    Yes, I did change the string. No, vgrep won't detect that difference.



  • @Bulb said:

    This version totally does as the presentation claims. Prints

    selbaŕesiM seL
    

    Yes, I did change the string. No, vgrep won't detect that difference.

    Now this is definitely witchcraft.



  • @Nagesh said:

    Now this is definitely witchcraft.

    The presentation really doesn't do it justice. Unicode is a HUGE pile of WTF, partly because some writing systems are really insane (the most complicated for a computer (for handwriting it's conveniently compact) is probably Korean, where the letters forming a syllable are composed together to form a box), partly because it contains heaps of backward-compatibility kludges to allow recoding from older systems without loss of information (no matter how useless that information is), and not least because it has already managed to accumulate a lot of legacy kludges of its own.

    Basically the whole UTF-16 thing is a gigantic pile of fail. Initially it was thought that 16 bits would be enough, and so a 16-bit encoding was designed. It required the "Han" unification, which caused a lot of complaints from people claiming that Kanji is totally not the same as Hanzi, not to mention Hanja, but they managed to squeeze the scripts of all living languages in.

    Of course reality does not give up so easily, so requests promptly started coming in for various hieroglyphs and other ancient scripts, and Unicode gave way and extended beyond the 16-bit range.

    But by that time The Operating System That Shall Not Be Named (and some others, possibly) had changed all its interfaces to use 16-bit characters. So a hole in the encoding was found to be used for two-unit codes, allowing a total of 1112064 code points in the discontinuous range 0-55295 and 57344-1114111.
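    The two-unit (surrogate pair) scheme can be sketched like this:

```python
def to_surrogate_pair(cp):
    # Encode a supplementary-plane code point (U+10000..U+10FFFF)
    # as a UTF-16 high/low surrogate pair.
    assert 0x10000 <= cp <= 0x10FFFF
    cp -= 0x10000
    high = 0xD800 + (cp >> 10)     # top 10 bits into the D800 hole
    low = 0xDC00 + (cp & 0x3FF)    # bottom 10 bits into the DC00 hole
    return high, low

# U+1F4A9 (PILE OF POO) becomes the pair D83D DCA9.
assert to_surrogate_pair(0x1F4A9) == (0xD83D, 0xDCA9)
assert "\U0001F4A9".encode("utf-16-be") == b"\xd8\x3d\xdc\xa9"
```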

    Meanwhile some smart people who didn't want to duplicate all their APIs came up with the nicely uniform and backward-compatible UTF-8 encoding. And, not having hardcoded sizeof(wchar_t) all over their APIs, they started using a 32-bit wchar_t for ease of operation. But The Operating System That Shall Not Be Named for whatever reason chose to ignore this saner approach.

    So that, my dear children, is how we replaced a legacy mess with a legacy mess.



  • Of course the presentation does not do justice to the other two topics either.

    Speaking as a guy who does not have to deal with relativity, but whose day job involves technology that definitely does. And whose application does a horrible job with timezones due to a lack of round tuits.


  • Considered Harmful

    @Bulb said:

    But The Operating System That Shall Not Be Named for whatever reason chose to ignore this saner approach.

    {{POV}}



  • @Bulb said:

    This version totally does as the presentation claims. Prints

    selbaŕesiM seL
    

    Yes, I did change the string. No, vgrep won't detect that difference.

    OK, how did you get e\u0301 input from the keyboard? I am looking at charmap in Windows 7.
    Do I need to use Linux to get that character?



  • @Nagesh said:

    Ok. how did you get e\u0301 to input from the keyboard. I am looking at the charmap in windows 7. Do I need to use Linux to get that character?

    I keyed the escape sequence into a script, had it printed, and copy&pasted ;-).

    I also often use the great unicode Python script. Just be warned that while on Linux it prints the properties of 💩 just fine, it only handles the basic plane on Windows, because CPython uses wchar_t the way it was intended and can't treat it as UTF-16.
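    A rough sketch of the kind of lookup that script does, using nothing but the stdlib unicodedata module:

```python
import unicodedata

# Name and category of the astral-plane character mentioned above.
assert unicodedata.name("\U0001F4A9") == "PILE OF POO"
assert unicodedata.category("\U0001F4A9") == "So"   # Symbol, other

# The combining acute accent is category Mn (Mark, nonspacing).
assert unicodedata.category("\u0301") == "Mn"
```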

    @error said:

    {{POV}}

    UTF-8 works with most old ASCII- and ISO-8859-x-based code just fine. It is a variable-length encoding. UTF-16 requires significant changes to everything and is also a variable-length encoding now. UTF-8 is saner. Period.


  • Discourse touched me in a no-no place

    @Bulb said:

    UTF-16 requires significant changes to everything and is also a variable-length encoding now. UTF-8 is saner.

    But not sane, because the underlying Unicode system is also not sane.

    The biggest problem with UTF-8 itself is that it is variable-width, forcing the use of more complex data structures to get high performance general string operations. I wouldn't change UTF-8, as it is a decent compromise on ever so many fronts, but I'm also not going to claim that it is without consequence. (Anyone claiming that general indexing into a string by character count is unnecessary is Plain Wrong; your code might not need it, but other people do need it.)
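    The indexing problem is easy to demonstrate: in UTF-8 the byte offset of the n-th character depends on everything before it, so byte indexing and character indexing disagree.

```python
s = "na\u00efve"

# Five characters, but six UTF-8 bytes: the ï takes two bytes.
assert len(s) == 5
assert len(s.encode("utf-8")) == 6

# Cutting the bytes at offset 3 lands in the middle of the
# two-byte ï sequence, splitting a character.
assert s.encode("utf-8")[:3] == b"na\xc3"
```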



  • @dkf said:

    The biggest problem with UTF-8 itself is that it is variable-width

    The only way to avoid this with Unicode is to use UCS-4/UTF-32 (and even then not completely - there are combining characters). UTF-16 is variable-width.


  • Discourse touched me in a no-no place

    I know. It's just a pain. Much better than the Bad Old Days though (and UTF-16 has much more fundamental problems than just its variable width).

    Most people are told to never write their own string handling library; it's a good thing to tell people because it's hard, but someone has to maintain the string libraries that exist and I'm one of those people. Sometimes I get grouchy about it, but I get double grouchy with people who think that they can just change the definition of wchar_t and have everything be peachy. (ABI stability is good for users, damnit!)


  • ♿ (Parody)

    @dkf said:

    UTF-8

    Filed Under: Alphabet privilege


  • Discourse touched me in a no-no place

  • Yeah, I like to discriminate against those who use the Egyptian hieroglyphs and Sumerian cuneiform localisation settings for their stock tracker apps.


    Filed under: just think how many ticker symbols you could have!


  • ♿ (Parody)

    Piker. I discriminate against even the funny business Europeans make with the squigglies and dots. It's like ASCII never left around here.



  • @boomzilla said:

    It's like ASCII never left around here.

    I miss the times when a simple mode con cp select=852 solved everything.


    Filed under: for certain definitions of "solved"



  • @Maciejasjmj said:

    I miss the times when a simple mode con cp select=852 solved everything.

    I support one client that still uses accounting software written in Clipper ... which doesn't work with CP852, but with the older YUASCII "standard" (national characters mapped to {[]}|@`). I had to help them get this to work in DOSBox when they got computers with 64-bit Windows...



  • @ender said:

    (national characters mapped to {[]}|@`).

    *shudder*

    I just thought about what my whole codebase would look like after one meeting with this standard.


    Filed under: Y U NO ASCII



  • @Maciejasjmj said:

    I just thought about how my whole codebase would look like after one meeting with this standard.

    My first computer (actually, it was my father's computer) had a switch in the back to switch between šŠčČžŽćĆđĐ and {[~^`@}]|. Some time later (when my father bought a VGA card), there was a bootup menu to choose between the old standard and CP852.

    As for code, the company I work for also still supports some software that was originally written in Clipper (but was ported to Windows), and there are still translation routines in there (even though AFAIK, the program has been changed to use 852 internally around the time Windows 95 was released).

    Character encodings are hell. I remember seeing URLs printed as http://some.site/čuser/, and I regularly get my last name mangled in mail...
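    A YUASCII-style transliteration is just a character translation table. This sketch assumes the letter/symbol pairing implied by the lists quoted above (plus Đ→\ from the C:Đ> prompt anecdote); the exact ordering is my reconstruction, so treat it as hypothetical:

```python
# Assumed pairing: š{ Š[ č~ Č^ ž` Ž@ ć} Ć] đ| Đ\
yuascii = str.maketrans("šŠčČžŽćĆđĐ", "{[~^`@}]|\\")

# National characters come out as ASCII punctuation...
assert "čć".translate(yuascii) == "~}"
# ...which is why code written under this standard looks so alarming.
```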


  • Discourse touched me in a no-no place

    @ender said:

    national characters mapped to {[]}|@`

    I remember thinking that Unix shell scripts started with:

    £!/bin/sh
    

    Hey, the £ is a pound symbol, right? Right?



  • @ender said:

    a switch in the back to switch between šŠčČžŽćĆđĐ and {[~^`@}]|.

    Best office prank idea ever.

    @ender said:

    Character encodings are hell.

    True, that. I actually used to know by heart how 852 maps to 437 for Polish characters, being able to read things like:

    za╛óêå g⌐ÿlÑ ja½Σ

    with no particular problem...
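    That particular garbling is exactly what you get by writing CP852 bytes and reading them back as CP437; Python reproduces it in one line:

```python
# The Polish pangram, encoded as CP852 but decoded as CP437,
# yields the garbled text quoted above.
garbled = "zażółć gęślą jaźń".encode("cp852").decode("cp437")
assert garbled == "za╛óêå g⌐ÿlÑ ja½Σ"
```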



  • @dkf said:

    I remember thinking that Unix shell scripts started with:

    £!/bin/sh

    Hey, the £ is a pound symbol, right? Right?

    No, lbs is the pound symbol.


  • BINNED

    @ender said:

    My first computer (actually, it was my father's computer) had a switch in the back to switch between šŠčČžŽćĆđĐ and {[~^`@}]|.

    Back in elementary school we used to work in Logo and I always wondered why the hell do we need to start and end loops with š and đ. But not Š and Đ, nope, lowercase, when all other commands were capitalized. Confused the hell out of me.



  • @dkf said:

    £!/bin/sh

    I still sometimes think of \ as Đ, because that was the directory separator for the first few years I used a computer (and the DOS prompt was C:Đ>).
    @Maciejasjmj said:
    Best office prank idea ever.

    The switch was quite common on Hercules cards, and I remember that my father had to take the computer to somebody that modified the ROM to support this (first few weeks the computer didn't have national characters).

    @Onyx said:

    Back in elementary school we used to work in Logo and I always wondered why the hell do we need to start and end loops with š and đ. But not Š and Đ, nope, lowercase, when all other commands were capitalized. Confused the hell out of me.

    Don't you mean š and ć? đ was |.


  • BINNED

    @ender said:

    Don't you mean š and ć? đ was |.

    Quite possible. It was a long time ago and I was going half by memory, half by looking at my keyboard now. Now that you mentioned it, I think you're right, since C:Đ for DOS prompt does ring a bell.

    I didn't have my own computer at the time, and even if I did it was Windows era already anyway, we just had obsolete technology at school. I didn't exactly spend days looking at it like I would if it was my own computer.

    So yeah, vague memory, nearly completely forgot about it until I saw your post about the switch.



  • @Onyx said:

    I always wondered why the hell do we need to start and end loops with š and đ. But not Š and Đ, nope
    @ender said:
    Don't you mean š and ć? đ was |.

    Holy hell. I think I'm finally beginning to understand why offshore development has a quality problem. It was over before it even started.



  • @dkf said:

    Anyone claiming that general indexing into a string by character count is unnecessary is Plain Wrong; your code might not need it, but other people do need it.

    It does not work with UTF-16 either, though, and while it is supposed to work with wchar_t, it does not on systems that are now stuck with a 16-bit wchar_t due to overeager adoption.

    The other thing, of course, is that it does not work in Unicode in general. With all that combining-character stuff, you just can't consider the code units in isolation.
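    One way to reverse correctly despite combining characters is to keep each base character together with its trailing marks. A simplified sketch (full grapheme clustering per UAX #29 handles more cases than this):

```python
import unicodedata

def reverse_graphemes(s):
    # Group each base character with the combining marks that follow it,
    # then reverse the groups rather than the individual code points.
    clusters = []
    for ch in s:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return "".join(reversed(clusters))

# The accent stays on the e even in the decomposed representation.
assert reverse_graphemes("Mise\u0301rables") == "selbare\u0301siM"
```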

    @dkf said:

    people who think that they can just change the definition of wchar_t and have everything be peachy.

    The fact is that wchar_t is 32-bit on some systems and 16-bit on others. The specification (like for any other fundamental C/C++ type) does not say either way. The specification does say that all code points the system is able to handle have to fit in wchar_t, and the standard library has some definitions that work on a per-code-unit basis. So on systems with a 16-bit wchar_t the standard library will have trouble with the extended planes.
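    The platform difference is easy to observe; for instance, from Python, ctypes reports the width of the platform's wchar_t (2 bytes on Windows, 4 on most Unix-likes):

```python
import ctypes

# sizeof(wchar_t) on the current platform: 2 where it is a UTF-16
# code unit, 4 where it holds a whole code point.
width = ctypes.sizeof(ctypes.c_wchar)
assert width in (2, 4)
```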

    Well, as a result, can anybody tell me what a clinically sane string library for C++ is? The standard library only gets one so far (in C++11 they added the explicit char16_t and char32_t, so it's a bit better there, but our project is unfortunately stuck with some ancient compilers), and while ICU is quite comprehensive, its C++ interface is butt-ugly.


  • Discourse touched me in a no-no place

    @Bulb said:

    The fact is that wchar_t is 32-bit on some systems and 16-bit on others.

    And the fact is also that C++ just isn't a language designed to support portable ABIs, let alone doing so while also supporting Unicode characters. (Nor is C, but it's a heck of a lot easier to actually do so anyway there; you do have to avoid a lot of fancy types though.)

    But then the majority of users of both of those languages still really think that characters are bytes.



  • @dkf said:

    And the fact is also that C++ just isn't a language designed to support portable ABIs

    … and despite this, C++ is the only language in which it is actually possible to write a portable program if you are so demanding as to want mobile platforms. Microsoft doesn't do Java, others don't do C#. Yes, it is a pain to maintain the portability layer, but at least it is possible.



  • @Bulb Have you seen Scott Hanselman's video on C#?

    http://youtu.be/g_jiMxkLK7s


  • Discourse touched me in a no-no place

    Why would I want to target the MS mobile platform? Because I need to expand my potential number of users by 5? (C# doesn't run on 8-bit systems either.)



  • @dkf said:

    Why would I want to target the MS mobile platform? Because I need to expand my potential number of users by 5? (C# doesn't run on 8-bit systems either.)

    Because most of the single-purpose boxes that are sold with some preinstalled software are still based on Windows CE. Sometimes even 6.0, often 5.0, and it's not that long since I've seen 4.2. And we still have customers who want those things. We actually started with that platform some years ago.



  • @dkf said:

    Hey, the £ is a pound symbol, right? Right?

    Depends. Did you mean £, £, or maybe even 💷? ben_lubar thinks it's ℔, but that's ambiguous, what with avoirdupois, troy, tower, London, Jersey and friends, and it's probably expressed better as "{count} ℥ ago".



  • @tufty said:

    as "{count} ℥ ago".

    https://meta.discourse.org/t/post-time-should-say-1-hour-ago-not-1-hour/7175

    106 posts and 108 likes over 6 days. Ago

    ago ago ago ago ago ago ago ago ago ago ago ago ago ago ago ago ago ago ago ago ago ago ago ago ago ago ago ago ago ago ago ago ago ago ago ago ago ago ago ago ago ago ago ago ago ago ago ago


    Filed under: ago 3 seconds ago

