Is Skeet wrong in the below presentation?
-
Check my fiddle.
Continuing the discussion from Unicode (of course):
Summary:
Numbers are bad.
Text is bad.
DateTime is bad.
-
Absolutely not. All of those things are totally full of WTFery.
By the way, http://utf8everywhere.org/
-
Did you even click on my .NET fiddle link? Can you prove that this code places an accent on the "r" instead of the "e" when I reverse it? Am I using the wrong character?
-
There are two ways to represent an “é” in Unicode: either with the “é” character proper, or with an “e” character followed by a combining acute accent character ( ́).
If you use the former representation, the accent stays on the “e” when the string is reversed. But if you use the latter, the character sequence “e ́r” becomes “r ́e”, so the accent ends up on the “r”.
It is possible to normalize the string so that every e + combining acute sequence becomes é, every a + combining grave sequence becomes à, etc.
Anyway, why are you complaining that reversing a string of text gives strange results? Reversed text usually does not make any sense...
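The accent migration is easy to reproduce; here is a minimal Python sketch of the same naive code-point reversal (the variable names are mine):

```python
import unicodedata

precomposed = "Les Mis\u00e9rables"   # "é" as the single code point U+00E9
decomposed = "Les Mise\u0301rables"   # "e" followed by combining acute U+0301

# Naive reversal walks code points one by one, so the combining accent
# ends up after the "r" that now precedes it.
print(precomposed[::-1])   # selbarésiM seL  (accent stays put)
print(decomposed[::-1])    # selbaŕesiM seL  (accent jumps onto the r)

# Normalizing to NFC first composes e + accent into é, avoiding the surprise.
print(unicodedata.normalize("NFC", decomposed)[::-1])   # selbarésiM seL
```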
-
$ echo "é" | hd -c
00000000  c3 a9 0a                                          |...|
0000000 303 251  \n
0000003
$ echo "é" | hd -c
00000000  65 cc 81 0a                                       |e...|
0000000   e 314 201  \n
0000004
Your fiddle uses the top one, but the presentation uses the bottom one.
-
Thanks for the simple and effective explanation. I also have another gripe with Skeet's presentation. He starts off with a double d = 3.0;
then asks the audience to guess the value of d. Here's my experience with the same:
I input 3 and I get a 3 back.
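For what it's worth, 3.0 is one of the values a binary double stores exactly, which is why this particular input shows nothing odd; the surprise needs a value like 0.3 (my choice here, not necessarily the one from the talk). A quick Python check:

```python
from decimal import Decimal

# Decimal(x) reveals the exact value the double actually stores.
print(Decimal(3.0))   # 3 -- exactly representable in binary
print(Decimal(0.3))   # 0.299999999999999988897769753748434595763683319091796875
```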
-
Check my fiddle.
Is there supposed to be something wrong with that fiddle? The accent is on the e in the reversed string for me.
-
Is there supposed to be something wrong with that fiddle? The accent is on the e in the reversed string for me.
Go through Skeet's presentation. In his example, the accent goes on the "r" when reversed. @VinDuv has given an adequate explanation for that.
-
Did you even click on my .NET fiddle link? Can you prove that this code places an accent on the "r" instead of the "e" when I reverse it? Am I using the wrong character?
That .NET Fiddle reverses it correctly. It prints out "selbarésiM seL".
The .NET framework uses UTF-16 encoding. If I remember correctly, it converts to UTF-8 when rendering an ASP.NET page. Anyway, the thing is, 'é' is a unique character, and Skeet used "e\u0301" in his example. "é" != "e\u0301", which is why you don't get the result that Skeet showed.
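The inequality can be checked directly; a small Python sketch (Python's == compares code points, much as C#'s == compares UTF-16 code units):

```python
import unicodedata

print("\u00e9" == "e\u0301")   # False: one code point vs. an e plus a combining mark
# NFC normalization composes the pair back into the single code point.
print(unicodedata.normalize("NFC", "e\u0301") == "\u00e9")   # True
```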
-
Check my fiddle. https://dotnetfiddle.net/rRtUXi
This version totally does as the presentation claims. Prints
selbaŕesiM seL
Yes, I did change the string. No, vgrep won't detect that difference.
-
This version totally does as the presentation claims. Prints
selbaŕesiM seL
Yes, I did change the string. No, vgrep won't detect that difference.
Now this is definitely witchcraft.
-
Now this is definitely witchcraft.
The presentation really doesn't do it justice. Unicode is a HUGE pile of WTF, partly because some writing systems are really insane (the most complicated for a computer is probably Korean, where the letters forming a syllable are composed together into a box; for handwriting it's conveniently compact), partly because it contains heaps of backward-compatibility kludges to allow recoding from older systems without loss of information (no matter how useless that information is), and not least because it has already managed to accumulate a lot of legacy kludges of its own.
Basically the whole UTF-16 thing is a gigantic pile of fail. Initially it was thought that 16 bits would be enough, and so a 16-bit encoding was adopted. It required the "Han" unification, which caused a lot of complaints from people claiming that Kanji is totally not the same as Hanzi, not to mention Hanja, but they managed to squeeze the scripts of all living languages in.
Of course reality does not give up so easily, so requests promptly started coming in for various hieroglyphs and other ancient scripts, and Unicode gave way and extended beyond the 16-bit range.
But by that time The Operating System That Shall Not Be Named (and some others, possibly) had changed all its interfaces to use 16-bit characters. So a hole in the encoding was found and used for two-unit codes, allowing a total of 1112064 code points in the discontinuous range 0-55295 and 57344-1114111.
Meanwhile, some smart people who didn't want to duplicate all their APIs came up with the nicely uniform and backward-compatible UTF-8 encoding. And, not having hardcoded sizeof(wchar_t) all over their APIs, they started using a 32-bit wchar_t for ease of operation. But The Operating System That Shall Not Be Named for whatever reason chose to ignore this saner approach.
So that, my dear children, is how we replaced a legacy mess with a legacy mess.
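The "hole in the encoding" is the surrogate range U+D800..U+DFFF: a code point above U+FFFF gets split into a high and a low surrogate. A sketch of the arithmetic in Python (the function name is mine):

```python
def to_surrogates(cp):
    """Split a supplementary-plane code point into a UTF-16 surrogate pair."""
    assert 0x10000 <= cp <= 0x10FFFF
    cp -= 0x10000                 # 20 bits remain
    high = 0xD800 + (cp >> 10)    # top 10 bits select the high surrogate
    low = 0xDC00 + (cp & 0x3FF)   # bottom 10 bits select the low surrogate
    return high, low

# U+1F4A9 (PILE OF POO) becomes the pair D83D DCA9.
print([hex(u) for u in to_surrogates(0x1F4A9)])   # ['0xd83d', '0xdca9']
```

Python's own encoder agrees: `"\U0001F4A9".encode("utf-16-be").hex()` gives `d83ddca9`.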
-
Of course the presentation does not do justice to the other two topics either.
Speaking as a guy who does not have to deal with relativity, but whose day job involves technology that definitely does. And whose application does a horrible job of time zones due to a lack of round tuits.
-
But The Operating System That Shall Not Be Named for whatever reason chose to ignore this saner approach.
-
This version totally does as the presentation claims. Prints
selbaŕesiM seL
Yes, I did change the string. No, vgrep won't detect that difference.
OK, how did you get e\u0301 input from the keyboard? I am looking at charmap in Windows 7.
Do I need to use Linux to get that character?
-
OK, how did you get e\u0301 input from the keyboard? I am looking at charmap in Windows 7. Do I need to use Linux to get that character?
I keyed the escape sequence to a script, had it printed and copy&pasted ;-).
I also often use the great unicode python script. Just be warned that while on Linux it prints the properties of 💩 just fine, it only handles the basic plane on Windows, because CPython uses wchar_t the way it was intended and can't treat it as UTF-16.
{{POV}}
UTF-8 works with most old ASCII- and ISO-8859-x-based code just fine. It is a variable-length encoding. UTF-16 requires significant changes to everything and is also a variable-length encoding now. UTF-8 is saner. Period.
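The ASCII compatibility and the variable width are both visible by just encoding a few characters; a quick Python check:

```python
# One character from each width class: ASCII, Latin accent, BMP symbol, astral plane.
for ch in ["A", "é", "€", "💩"]:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex()}")
```

ASCII stays a single byte (bit-identical to plain ASCII, which is why old byte-oriented code keeps working), accented Latin letters take two, the euro sign three, and anything outside the BMP four.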
-
UTF-16 requires significant changes to everything and is also a variable-length encoding now. UTF-8 is saner.
But not sane, because the underlying Unicode system is also not sane.
The biggest problem with UTF-8 itself is that it is variable-width, forcing the use of more complex data structures to get high-performance general string operations. I wouldn't change UTF-8, as it is a decent compromise on ever so many fronts, but I'm also not going to claim that it is without consequence. (Anyone claiming that general indexing into a string by character count is unnecessary is Plain Wrong; your code might not need it, but other people do need it.)
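Concretely: byte offsets and character offsets diverge as soon as one multi-byte character appears, so indexing by character count needs either an O(n) scan or an auxiliary index structure. A Python sketch (str indexes code points; the bytes view is what a raw UTF-8 buffer sees):

```python
s = "café au lait"
b = s.encode("utf-8")

print(s[3])            # é  -- character index 3
print(b[3:5])          # b'\xc3\xa9' -- the same character occupies two bytes
print(len(s), len(b))  # 12 characters but 13 bytes: offsets diverge after the é
```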
-
The biggest problem with UTF-8 itself is that it is variable-width
The only way to avoid this with Unicode is to use UCS-4/UTF-32 (and even then not completely - there are combining characters). UTF-16 is variable-width.
-
I know. It's just a pain. Much better than the Bad Old Days though (and UTF-16 has much more fundamental problems than just its variable width).
Most people are told to never write their own string handling library; it's a good thing to tell people because it's hard, but someone has to maintain the string libraries that exist and I'm one of those people. Sometimes I get grouchy about it, but I get double grouchy with people who think that they can just change the definition of wchar_t and have everything be peachy. (ABI stability is good for users, damnit!)
-
Yeah, I like to discriminate against those who use the Egyptian hieroglyphs and Sumerian cuneiform localisation settings for their stock-tracker apps.
Filed under: just think how many ticker symbols you could have!
-
Piker. I discriminate against even the funny business Europeans make with the squigglies and dots. It's like ASCII never left around here.
-
It's like ASCII never left around here.
I miss the times when a simple mode con cp select=852 solved everything.
Filed under: for certain definitions of "solved"
-
I miss the times when a simple mode con cp select=852 solved everything.
I support one client that still uses accounting software written in Clipper ... which doesn't work with CP852, but the older YUASCII "standard" (national characters mapped to {[]}|@`). I had to help them get this to work in DOSBox when they got computers with 64-bit Windows...
-
(national characters mapped to {[]}|@`).
*shudder*
I just thought about how my whole codebase would look after one meeting with this standard.
Filed under: Y U NO ASCII
-
I just thought about how my whole codebase would look after one meeting with this standard.
My first computer (actually, it was my father's computer) had a switch in the back to switch between šŠčČžŽćĆđĐ and {[~^`@}]|. Some time later (when my father bought a VGA card), there was a bootup menu to choose between the old standard and CP852.
As for code, the company I work for also still supports some software that was originally written in Clipper (but was ported to Windows), and there are still translation routines in there (even though AFAIK, the program has been changed to use 852 internally around the time Windows 95 was released).
Character encodings are hell. I remember seeing URLs printed as http://some.site/čuser/, I regularly get my last name mangled on mail...
-
national characters mapped to {[]}|@`
I remember thinking that Unix shell scripts started with:
£!/bin/sh
Hey, the £ is a pound symbol, right? Right?
-
a switch in the back to switch between šŠčČžŽćĆđĐ and {[~^`@}]|.
Best office prank idea ever.
Character encodings are hell.
True to that. I actually used to know by heart how 852 maps to 437 on Polish characters, being able to read things like:
za╛óêå g⌐ÿlÑ ja½Σ
with no particular problem...
-
I remember thinking that Unix shell scripts started with:
£!/bin/sh
Hey, the £ is a pound symbol, right? Right?
No, lbs is the pound symbol.
-
My first computer (actually, it was my father's computer) had a switch in the back to switch between šŠčČžŽćĆđĐ and {[~^`@}]|.
Back in elementary school we used to work in Logo, and I always wondered why the hell we needed to start and end loops with š and đ. But not Š and Đ, nope, lowercase, when all other commands were capitalized. Confused the hell out of me.
-
£!/bin/sh
I still sometimes think of \ as Đ, because that was the directory separator for the first few years I used the computer (and the DOS prompt was C:Đ>).
@Maciejasjmj said: Best office prank idea ever.
The switch was quite common on Hercules cards, and I remember that my father had to take the computer to somebody who modified the ROM to support this (for the first few weeks the computer didn't have national characters).
Back in elementary school we used to work in Logo, and I always wondered why the hell we needed to start and end loops with š and đ. But not Š and Đ, nope, lowercase, when all other commands were capitalized. Confused the hell out of me.
Don't you mean š and ć? đ was |.
-
Don't you mean š and ć? đ was |.
Quite possible. It was a long time ago and I was going half by memory, half by looking at my keyboard now. Now that you mentioned it, I think you're right, since C:Đ for DOS prompt does ring a bell.
I didn't have my own computer at the time, and even if I did it was Windows era already anyway, we just had obsolete technology at school. I didn't exactly spend days looking at it like I would if it was my own computer.
So yeah, vague memory, nearly completely forgot about it until I saw your post about the switch.
-
I always wondered why the hell do we need to start and end loops with š and đ. But not Š and Đ, nope
@ender said:Don't you mean š and ć? đ was |.
Holy hell. I think I'm finally beginning to understand why offshore development has a quality problem. It was over before it even started.
-
Anyone claiming that general indexing into a string by character count is unnecessary is Plain Wrong; your code might not need it, but other people do need it.
It does not work with UTF-16 either, though, and while it is supposed to work with wchar_t, it does not on systems that are now stuck with a 16-bit wchar_t due to overeager adoption. The other thing, of course, is that it does not work in Unicode in general. With all that combining-character stuff, you just can't consider the code units in isolation.
people who think that they can just change the definition of wchar_t and have everything be peachy.
The fact is that wchar_t is 32-bit on some systems and 16-bit on others. The specification (like for any other fundamental C/C++ type) does not say either way. The specification does say that all code points the system is able to handle have to fit in wchar_t, and the standard library has some definitions that work on a per-code-unit basis. So on systems with a 16-bit wchar_t the standard library will have trouble with the extended planes.
Well, as a result, can anybody tell me what a clinically sane string library for C++ is? The standard library only gets one so far (in C++11 they added the explicit char16_t and char32_t, so it's a bit better there, but our project is unfortunately stuck with some ancient compilers), and while ICU is quite comprehensive, its C++ interface is butt-ugly.
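The miscounting a 16-bit wchar_t produces can be simulated from Python by counting UTF-16 code units explicitly (a sketch; CPython itself indexes by code point):

```python
s = "I \u2764 \U0001F4A9"   # a BMP heart plus an astral pile of poo

code_points = len(s)                            # what a 32-bit wchar_t library counts
utf16_units = len(s.encode("utf-16-le")) // 2   # what a 16-bit one counts

print(code_points, utf16_units)   # 5 6 -- the astral character takes two units
```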
-
The fact is that wchar_t is 32-bit on some systems and 16-bit on others.
And the fact is also that C++ just isn't a language designed to support portable ABIs, let alone doing so while also supporting Unicode characters. (Nor is C, but it's a heck of a lot easier to actually do so anyway there; you do have to avoid a lot of fancy types though.)
But then the majority of users of both of those languages still really think that characters are bytes.
-
And the fact is also that C++ just isn't a language designed to support portable ABIs
… and despite this, C++ is the only language in which it is actually possible to write a portable program if you are demanding enough to want mobile platforms. Microsoft doesn't do Java, others don't do C#. Yes, it is a pain to maintain the portability layer, but at least it is possible.
-
@Bulb Have you seen Scott Hanselman's video on C#?
-
Why would I want to target the MS mobile platform? Because I need to expand my potential number of users by 5? (C# doesn't run on 8-bit systems either.)
-
Why would I want to target the MS mobile platform? Because I need to expand my potential number of users by 5? (C# doesn't run on 8-bit systems either.)
Because most of the single-purpose boxes that are sold with some preinstalled software are still based on Windows CE. Sometimes even 6.0, often 5.0 and it's not that long since I've seen 4.2. And we still have customers who want those things. We actually started with that platform some years ago.
-
Hey, the £ is a pound symbol, right? Right?
Depends. Did you mean £, £, or maybe even 💷? ben_lubar thinks it's ℔, but that's ambiguous, what with avoirdupois, troy, tower, London, Jersey and friends, and it's probably expressed better as "{count} ℥ ago".
-
as "{count} ℥ ago".
https://meta.discourse.org/t/post-time-should-say-1-hour-ago-not-1-hour/7175
106 posts and 108 likes over 6 days. Ago
ago ago ago ago ago ago ago ago ago ago ago ago ago ago ago ago ago ago ago ago ago ago ago ago ago ago ago ago ago ago ago ago ago ago ago ago ago ago ago ago ago ago ago ago ago ago ago ago
Filed under: ago 3 seconds ago
-