Clang supports Unicode characters in identifiers

robbak

OH, so much fun. So very much fun to be had!

Ronald

including backspace?

Hatshepsut

OH, so much fun. So very much fun to be had!

@C++11 Standard said:

Annex E (normative)

Universal character names for identifier characters [charname]

E.1 Ranges of characters allowed [charname.allowed]

00A8, 00AA, 00AD, 00AF, 00B2-00B5, 00B7-00BA, 00BC-00BE, 00C0-00D6, 00D8-00F6, 00F8-00FF

0100-167F, 1681-180D, 180F-1FFF

200B-200D, 202A-202E, 203F-2040, 2054, 2060-206F

2070-218F, 2460-24FF, 2776-2793, 2C00-2DFF, 2E80-2FFF

3004-3007, 3021-302F, 3031-303F

3040-D7FF

F900-FD3D, FD40-FDCF, FDF0-FE44, FE47-FFFD

10000-1FFFD, 20000-2FFFD, 30000-3FFFD, 40000-4FFFD, 50000-5FFFD,

60000-6FFFD, 70000-7FFFD, 80000-8FFFD, 90000-9FFFD, A0000-AFFFD,

B0000-BFFFD, C0000-CFFFD, D0000-DFFFD, E0000-EFFFD

E.2 Ranges of characters disallowed initially [charname.disallowed]

0300-036F, 1DC0-1DFF, 20D0-20FF, FE20-FE2F

TGV

What, no Linear B? Or perhaps more importantly, no CJK?

devjoe

E.2 Ranges of characters disallowed initially [charname.disallowed]
0300-036F, 1DC0-1DFF, 20D0-20FF, FE20-FE2F

These are four separate blocks of combining diacritical marks. It makes sense to disallow them, since it might be nearly impossible to distinguish, say, e+0308 from 00EB.

But there's a load of WTF in what they allowed.

E.1 Ranges of characters allowed [charname.allowed]
00A8, 00AA, 00AD, 00AF, 00B2-00B5, 00B7-00BA, 00BC-00BE, 00C0-00D6, 00D8-00F6, 00F8-00FF

00AD is a "soft hyphen," i.e. a character that is completely invisible unless some software that performs word-wrap decides to break a line in the word that it appears in, in which case it appears completely indistinguishable from the regular ASCII hyphen. Because of course my source code editor wraps lines for me. And keep in mind, this isn't just characters allowed in the document (perhaps in string literals or comments), this is for identifier names! For those people whose idea of self-documenting code is to use identifiers that are 120 characters long.

00B4 is an acute accent. Lots of fun to be had there. ´WTF´ = "WTF"

For some reason 00B7 middle dot (·) is allowed, but 00D7 multiplication sign (×) and 00F7 division sign (÷) are excluded. Can I use the latter two as operators, or are they just disallowed?

The remainder of this group of allowed regions encompasses most every language in the world, which makes a sort of sense if you want to write variable names in Japanese or Korean or whatever. So perhaps it's more informative here to look at the characters they omitted.

1680 is omitted, Ogham space mark ( ). Because it's a space?

180E is a Mongolian vowel separator, a nonprinting character. But the nonprinting 180B-180D, Mongolian free variation selectors one, two, and three, are allowed.

2000-200A are spaces of various widths. But 200B, zero width space, is allowed, and 200C and 200D are also zero width characters. Then 200E and 200F, left-to-right and right-to-left marks, are omitted. What, I can't have a variable that reads partly left-to-right and partly right-to-left? But I can put zero-width characters into it?!

The 201x block is full of variations on the hyphen and quotation marks. And the omitted 202x characters are punctuation like the dagger and bullet. But 202A-202E are other variations on left-to-right and right-to-left indicators. So I CAN write variable names that are read partly left to right and partly right to left! (I admit to not knowing the difference between these different characters nor how they are meant to be used in mixed LTR/RTL text, but I find the exclusion of one set of these and inclusion of others interesting.)

The next group of excluded characters 202F-203E includes another space, per-mille and per-ten-thousand signs (similar to percent), prime and double-prime and such (wait, I can't make a variable named a prime? I guess I'd have to use 00B4 acute accent instead), and some other miscellaneous punctuation including the interrobang (‽ - I could have really used this two paragraphs ago!). They allowed 203F, 2040, and 2054 which are some sort of character ties (‿ ⁀ ⁔, fun if you want to parenthesize text written vertically, I guess) but omitted a bunch of other weird punctuation here.

The next group of allowed characters 2060-206F is entirely nonprinting, and includes 2062 invisible times and 2064 invisible plus. So I can't put a × in my variable name but I can put an invisible one? WTF?

Then 2190-245F includes a bunch of arrows, mathematical symbols, and miscellaneous symbols, including, starting at 2400, the printable substitutes for basic control characters like ␀. They allowed a bunch of circled and parenthesized letters and numbers like ⒄. [Yes, that's a 17 in parentheses, all as a single Unicode character. Unicode is TRWTF.] Then they skipped a bunch of box drawing characters and miscellaneous symbols, but allowed more circled numbers at 2776-2793 - the ones from dingbat fonts.

Basically it looks like they tried to include everything that was supposed to represent a letter or number, and excluded most punctuation and spaces, though there are some weird exceptions. And when they got up to 10000 they gave up and just allowed everything. Like this. And this.
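
For anyone who wants to poke at these ranges rather than squint at them, here is a rough C++ sketch of a lookup over the Annex E tables quoted above. It is not taken from the standard or from Clang; the names `Range`, `allowed_in_identifier`, `allowed_as_first_char`, and `check_thread_examples` are made up for illustration. It covers only the universal-character ranges (not plain ASCII letters, digits, or underscore), and it reads "disallowed initially" as "not allowed as the first character of an identifier".

```cpp
#include <cassert>
#include <cstddef>

// Inclusive range of code points, as listed in C++11 Annex E.
struct Range { char32_t lo, hi; };

// E.1 ranges below U+10000, copied from the quoted standard text.
static const Range kAllowedBmp[] = {
    {0x00A8, 0x00A8}, {0x00AA, 0x00AA}, {0x00AD, 0x00AD}, {0x00AF, 0x00AF},
    {0x00B2, 0x00B5}, {0x00B7, 0x00BA}, {0x00BC, 0x00BE}, {0x00C0, 0x00D6},
    {0x00D8, 0x00F6}, {0x00F8, 0x00FF},
    {0x0100, 0x167F}, {0x1681, 0x180D}, {0x180F, 0x1FFF},
    {0x200B, 0x200D}, {0x202A, 0x202E}, {0x203F, 0x2040}, {0x2054, 0x2054},
    {0x2060, 0x206F},
    {0x2070, 0x218F}, {0x2460, 0x24FF}, {0x2776, 0x2793}, {0x2C00, 0x2DFF},
    {0x2E80, 0x2FFF},
    {0x3004, 0x3007}, {0x3021, 0x302F}, {0x3031, 0x303F},
    {0x3040, 0xD7FF},
    {0xF900, 0xFD3D}, {0xFD40, 0xFDCF}, {0xFDF0, 0xFE44}, {0xFE47, 0xFFFD},
};

// E.2 ranges: combining marks, disallowed as the initial character.
static const Range kDisallowedInitially[] = {
    {0x0300, 0x036F}, {0x1DC0, 0x1DFF}, {0x20D0, 0x20FF}, {0xFE20, 0xFE2F},
};

static bool in_ranges(char32_t c, const Range* ranges, std::size_t count) {
    for (std::size_t i = 0; i < count; ++i) {
        if (c >= ranges[i].lo && c <= ranges[i].hi) return true;
    }
    return false;
}

// True if c falls in one of the E.1 ranges.
bool allowed_in_identifier(char32_t c) {
    // E.1 lists 10000-1FFFD through E0000-EFFFD: every plane from 1 to 14,
    // minus the last two code points of each plane.
    if (c >= 0x10000 && c <= 0xEFFFD) return (c & 0xFFFF) <= 0xFFFD;
    return in_ranges(c, kAllowedBmp,
                     sizeof(kAllowedBmp) / sizeof(kAllowedBmp[0]));
}

// True if c is in E.1 and not in E.2.
bool allowed_as_first_char(char32_t c) {
    return allowed_in_identifier(c) &&
           !in_ranges(c, kDisallowedInitially,
                      sizeof(kDisallowedInitially) /
                          sizeof(kDisallowedInitially[0]));
}

// A few of the cases discussed in this post.
void check_thread_examples() {
    assert(allowed_in_identifier(0x00AD));   // soft hyphen
    assert(!allowed_in_identifier(0x00D7));  // multiplication sign
    assert(allowed_in_identifier(0x200B));   // zero width space
    assert(!allowed_in_identifier(0x200E));  // left-to-right mark
    assert(allowed_in_identifier(0x2062));   // invisible times
    assert(!allowed_as_first_char(0x0306));  // combining breve
}
```

Note that the E.2 ranges all sit inside E.1 ranges, so under this reading a combining mark is only rejected at the start of an identifier.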

devjoe

@TGV said:

What, no Linear B? Or perhaps more importantly, no CJK?

Linear B is at 10000-1007F (syllabary) and 10080-100FF (ideograms). And CJK is at 2E80-9FFF (too many sections to link, but see a list). Both included, though they appear to have taken care to exclude CJK spaces and punctuation.

error

@devjoe said:

2000-200A are spaces of various widths. But 200B, zero width space, is allowed, and 200C and 200D are also zero width characters. Then 200E and 200F, left-to-right and right-to-left marks, are omitted. What, I can't have a variable that reads partly left-to-right and partly right-to-left? But I can put zero-width characters into it?!

Now write a program where every type, function, and variable have the same name, with different combinations of zero-width characters interspersed.
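
And if you ever have to review such a program, something like the following C++ sketch could flag the collisions. It is only an illustration: `is_invisible`, `visible_skeleton`, and `look_alike` are hypothetical helpers, and the set of "invisible" code points is just the ones called out in this thread as allowed by Annex E.1, not a complete list of zero-width characters.

```cpp
#include <cassert>
#include <string>

// Code points described in this thread as invisible yet allowed in identifiers.
bool is_invisible(char32_t c) {
    return c == 0x00AD                       // soft hyphen
        || (c >= 0x180B && c <= 0x180D)      // Mongolian free variation selectors
        || (c >= 0x200B && c <= 0x200D)      // zero width space, non-joiner, joiner
        || (c >= 0x2060 && c <= 0x206F);     // block described above as nonprinting
}

// The name as a reader would see it: invisible code points removed.
std::u32string visible_skeleton(const std::u32string& name) {
    std::u32string out;
    for (char32_t c : name) {
        if (!is_invisible(c)) out.push_back(c);
    }
    return out;
}

// Two names are a trap if they are different identifiers that look the same.
bool look_alike(const std::u32string& a, const std::u32string& b) {
    return a != b && visible_skeleton(a) == visible_skeleton(b);
}

// Example: "foo" versus "foo" with a zero width space after the f.
void check_zero_width_example() {
    assert(look_alike(U"foo", U"f\u200Boo"));
    assert(!look_alike(U"foo", U"foo"));
}
```

The escapes live in string literals here on purpose, so the sketch itself does not depend on how any particular compiler treats these characters in real identifiers.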

Maciejasjmj

FE47-FFFD

[code]
//put this section in some obscure header 
#define � { 
#define �� } 
#define �� int 
#define �� main 
#define �� () 
#define �� cout 
#define �� << 
#define �� "Hello world!\n"; 
 
#include <iostream> 
using namespace std;

��
[/code]

Also you can implement Whitespace by the sole use of soft hyphens.

MiffTheFox

So unicode support is TRWTF now? Next you'll be telling me that you can't set the code page of an application.

boomzilla

@MiffTheFox said:

So unicode support is TRWTF now? Next you'll be telling me that you can't set the code page of an application.

Can you explain why you would say this in the context of this thread, because AFAICT, TDEMSYR.

devjoe

When I said Unicode was TRWTF I meant because it has characters like ⒄. What, you can't just type (17) like the rest of us? Of course, in this instance, you really can't, because the ( and ) are not allowed in variable names, while ⒄ is.

MiffTheFox

@boomzilla said:

@MiffTheFox said:
So unicode support is TRWTF now? Next you'll be telling me that you can't set the code page of an application.

Can you explain why you would say this in the context of this thread, because AFAICT, TDEMSYR.

In an age where Unicode has existed for over two decades, it should be commonplace by now. Indeed, many popular programming languages, such as C#, Java, and now C++ (as mentioned above) support it. The OP of this thread is complaining that the Clang compiler supports the use of Unicode in the program itself, instead of forcing the user to use ASCII.

My first sentence is a question, asking nobody in particular if supporting Unicode characters should be considered, as they say here, a "WTF". However, the way it is phrased implies that I don't believe Unicode support is one. "TRWTF" refers to the core issue of the WTF (although often it's something completely tangential to the OP's intent), and it seems that there is no component to the WTF, as the OP posted it, other than the support of Unicode in Clang.

My second sentence is what is known as irony. Code pages are an obsolete technology that predates widespread Unicode support. Each application would run in its own "code page", a set of mappings for the characters represented by 0x80 to 0xFF. Each code page had a numerical code. For example, in code page 437, 0xEA was the Greek letter Omega, whereas in 819, it was a lowercase letter e with circumflex. Code pages were problematic in computing as misinterpreting a file as the wrong code page led to a problem known as mojibake, where characters in a foreign language became a garbled sequence of characters in the user's own language. Code page support has many problems that Unicode solves, and clearly should not be used in any new application except in the decoding of text files from a legacy system.
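
To make the 0xEA example concrete, here is a small C++ sketch of the same byte decoded two ways. The function names are invented for illustration, and the code page 437 table is deliberately reduced to the single entry mentioned above (plus ASCII letters and digits passed through); every other byte comes back as U+FFFD rather than guessing at the real mapping.

```cpp
#include <cassert>
#include <cstdint>

// Code page 819 is ISO 8859-1: each byte maps to the code point of the same value.
char32_t decode_cp819(std::uint8_t byte) {
    return byte;
}

// Partial stand-in for code page 437. Only 0xEA (Greek capital omega) is
// filled in; bytes not modelled here return U+FFFD.
char32_t decode_cp437_partial(std::uint8_t byte) {
    if (byte == 0xEA) return 0x03A9;  // GREEK CAPITAL LETTER OMEGA
    bool ascii_alnum = (byte >= '0' && byte <= '9') ||
                       (byte >= 'A' && byte <= 'Z') ||
                       (byte >= 'a' && byte <= 'z');
    return ascii_alnum ? static_cast<char32_t>(byte) : 0xFFFD;
}

// The same byte read under the wrong code page yields a different character.
void check_code_page_example() {
    assert(decode_cp437_partial(0xEA) == 0x03A9);  // omega
    assert(decode_cp819(0xEA) == 0x00EA);          // e with circumflex
}
```

A file written as code page 437 and read as 819 turns every omega into an e with circumflex, which is the garbling described above in miniature.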

The whole point of my post was to point out that the OP apparently considers Unicode to be a WTF, and ironically suggest concern that the code page system is being (was) deprecated. Next time, I'll just make my posts straightforward without any rhetorical questions or irony. (The previous sentence was a lie for the purpose of irony.)

boomzilla

@MiffTheFox said:

In an age where Unicode has existed for over two decades, it should be commonplace by now. Indeed, many popular programming languages, such as C#, Java, and now C++ (as mentioned above) support it.

Did you read the rest of the thread discussing the characters allowed to see why it really is a WTF? I guess it seems reasonable to support other encodings, but allowing weird characters in identifiers is just asking for trouble. I wouldn't accept any code like that in any project I had a say over. Allowing weird characters in string literals makes a ton of sense. Allowing weird characters in identifiers is trouble.

@MiffTheFox said:

The whole point of my post was to point out that the OP apparently considers Unicode to be a WTF, and ironically suggest concern that the code page system is being (was) deprecated.

I think his point was that using ambiguous code points in code is stupid, although blakeyrat would say that I'm making shit up at this point.

error

@boomzilla said:

I think his point was that using ambiguous code points in code is stupid, although blakeyrat would say that I'm making shit up at this point.

How do you know what he would say? Did you infer it? Stop making shit up!

boomzilla

@joe.edwards said:

How do you know what he would say? Did you infer it? Stop making shit up!

Of course not. That's way too much work. I just type whatever my shoulder aliens tell me to type. Get out of your timepod already and join the Century of the Anchovy.

devjoe

@boomzilla said:

I wouldn't accept any code like that in any project I had a say over. Allowing weird characters in string literals makes a ton of sense. Allowing weird characters in identifiers is trouble.

I hope somebody writes a C++11 program for the OMGWTF2 contest where all the identifiers are named using parenthesized number characters, Futhark runes, emoticon symbols, or playing card symbols, etc.

Zecc

I feel morally obligated to post something on﹡ this thread, considering my signature …

﹡ in? to?

da_Doctah

@Zecc said:

I feel morally obligated to post something on﹡ this thread, considering my signature …

Maybe you could have a word with boomzilla about his cool woodpecker medicine sig.

Hatshepsut

@devjoe said:

Of course, in this instance, you really can't, because the ( and ) are not allowed in variable names, while ⒄ is.

Unfortunately we've had to put my latest project on hold until the Unicode Consortium pull their collective finger out and define the character (178472).

silverpie

@devjoe said:

When I said Unicode was TRWTF I meant because it has characters like ⒄. What, you can't just type (17) like the rest of us? Of course, in this instance, you really can't, because the ( and ) are not allowed in variable names, while ⒄ is.

Like a lot of seeming strangenesses in Unicode, this one is explained by East Asian vertical text, where it may be necessary to have the entire ⒄ in one cell (vertical text follows a grid-like structure).

ekolis

So are they implying that East Asians can't count higher than 20?