Why my friend had to retake his C++ course with a different professor


  • Banned

    @PleegWat said:

    The nice thing of UTF-8 in an ASCII-oriented environment like C is that everything keeps working. Only truncation and character-oriented operations on non-ASCII characters are a problem.

    In other words, everything works, except for everything.


  • Banned

    @PleegWat said:

    Notably, text parsing and tokenizing operations are typically safe as long as your syntax characters are in ASCII.

    Safe, yes. Working, not always. For example, calculating the position of every character that comes after a non-ASCII one on the same line will be off, and substring operations can produce invalid byte sequences. Actually, I can think of some scenarios in which these lead to horrible crashes or security vulnerabilities, so no, it's not even safe.
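
    A minimal sketch of the kind of thing that goes wrong (the string literal and the 3-byte cut are made up for illustration):

    #include <cstdio>
    #include <cstring>

    int main() {
        // "zażółć" encoded as UTF-8 -- each non-ASCII letter takes two bytes
        const char *s = "za\xC5\xBC\xC3\xB3\xC5\x82\xC4\x87";

        // Naive byte-based "substring": take the first 3 bytes.
        // This cuts the two-byte sequence for 'ż' in half and yields an
        // invalid UTF-8 string: "za" followed by a lone lead byte.
        char prefix[4];
        std::memcpy(prefix, s, 3);
        prefix[3] = '\0';
        std::printf("%s\n", prefix);
    }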


  • area_pol

    That's why people shouldn't use C for these things. In fact, people shouldn't use C at all. In C++, you can use std::u16string or std::u32string (or good old std::wstring). Hell, you can even make/use std::basic_string<some_utf8_codepoint>, but anyone processing UTF-8 directly in their code is pretty much clueless (hint: utf8 is good for storage/compatibility, not for processing).


  • Banned

    @NeighborhoodButcher said:

    In C++, you can use std::u16string or std::u32string (or good old std::wstring).

    None of which handles UTF-8.

    @NeighborhoodButcher said:

    (hint: utf8 is good for storage/compatibility, not for processing).

    Hint: UTF-8 is the standard, so you should use UTF-8 everywhere. And UTF-8-aware procedures work with UTF-8 flawlessly - it's only ASCII-based stuff that has any problems.


  • Discourse touched me in a no-no place

    @Gaska said:

    Hint: UTF-8 is the standard, so you should use UTF-8 everywhere.

    No. Life is not as simple and convenient as that. Why? Indexing to a specific character is O(N) when the string is UTF-8 unless you do some considerably more complex processing to build an auxiliary index structure. If you're using 32-bit characters, it's O(1); that used to be true with 16-bit characters, but isn't nowadays because the Unicode standard was extended and you end up with surrogate pairs if you do that, which cause other problems. (There's no point at all in trying to use anything between 16 and 32 bits per character when in memory; the cost of all the unaligned bit shuffling required will slay any benefits you get from using less space. Alas.) That said, it's definitely pretty convenient to support UTF-8 for string constants in your program; I wouldn't criticise anyone for that at all.

    UTF-8 absolutely should be used for data at rest or data sent over a communications channel (where compatible with the protocol), but there are lots of reasons not to automatically go with it in memory. And yes, I have written high-performance string code: I do know what I'm talking about here.
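
    For illustration, a rough sketch of the trade-off (utf8_index is a hypothetical helper, not anything standard, and the literals are made up):

    #include <cstddef>
    #include <string>

    // Hypothetical helper: byte offset of the n-th code point in a UTF-8
    // string, found by scanning from the start -- O(n) per lookup.
    std::size_t utf8_index(const std::string &s, std::size_t n) {
        std::size_t i = 0;
        while (n > 0 && i < s.size()) {
            // Step over one code point: skip its continuation bytes (10xxxxxx).
            do { ++i; } while (i < s.size() &&
                               (static_cast<unsigned char>(s[i]) & 0xC0) == 0x80);
            --n;
        }
        return i;
    }

    int main() {
        std::string    utf8  = "za\xC5\xBC\xC3\xB3\xC5\x82";  // "zażół" in UTF-8
        std::u32string utf32 = U"za\u017C\u00F3\u0142";       // same text, fixed width

        std::size_t offset = utf8_index(utf8, 3);  // linear scan to reach 'ó'
        char32_t    fourth = utf32[3];             // O(1) subscript
        (void)offset; (void)fourth;
    }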


  • Banned

    @dkf said:

    No. Life is not as simple and convenient as that. Why? Indexing to a specific character is O(N) when the string is UTF-8 unless you do some considerably more complex processing to build an auxiliary index structure.

    Either byte offset suffices, or you're :doing_it_wrong:. It's kinda like hardcoding path to Program Files - it works, but still.


  • Discourse touched me in a no-no place

    @Gaska said:

    Either byte offset suffices, or you're :doing_it_wrong:.

    No. You've just completely misunderstood. Because you're not half as competent as you think.


  • Banned

    In what case do you need character indexing that has to respect character boundaries and your data isn't structured in a way that allows for better data representation than string?


  • area_pol

    @Gaska said:

    Hint: UTF-8 is the standard, so you should use UTF-8 everywhere. And UTF-8-aware procedures work with UTF-8 flawlessly - it's only ASCII-based stuff that has any problems.

    Damn, those arguments are bad. UTF-8 should be used for I/O, absolutely not for processing. Most operations on UTF-8 require going linearly from the beginning in order to find the exact code point you're looking for. Searching, replacing, computing length, inserting, etc. - everything requires going through the whole string. You can't even index a specific code point without going through all that hassle. Not to mention you'd trash your CPU cache lines just to get to one character.
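
    For example, even computing the length in code points means touching every byte; a minimal sketch with a hypothetical utf8_length helper:

    #include <cstddef>
    #include <string>

    // Hypothetical helper: number of code points in a UTF-8 string.
    // There is no way around walking the whole thing.
    std::size_t utf8_length(const std::string &s) {
        std::size_t count = 0;
        for (unsigned char byte : s)
            if ((byte & 0xC0) != 0x80)  // continuation bytes don't start a code point
                ++count;
        return count;
    }

    int main() {
        std::string s = "za\xC5\xBC\xC3\xB3";  // "zażó": 4 code points, 6 bytes
        std::size_t n = utf8_length(s);        // 4
        (void)n;
    }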


  • Banned

    OK, fair enough.


  • Discourse touched me in a no-no place

    @Gaska said:

    In what case do you need character indexing that has to respect character boundaries and your data isn't structured in a way that allows for better data representation than string?

    You use the UTF-8 data to build a sequence of Unicode characters of fixed width. Probably 32-bits per character (as that's supposed to be enough, and is in any case lots). You might also normalise the string in the process, depending on what you are doing. Then you can index by character as easily as you can by byte, enabling a lot of other algorithms to be implemented using efficient code. When writing out, you use the sequence of Unicode characters to direct what stream of UTF-8 data is written. (It is usually possible to do the conversion from UTF-8 to Unicode and back lazily, but you need a very good grasp on the difference between storage units — bytes, wchar_ts, etc. — and characters, or rather more formally, codepoints chosen from an abstract space of potential characters as defined by the Unicode standard. Strings are sequences of Unicode codepoints, and yes, this says nothing about how they are stored in memory.)
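
    A minimal sketch of that round trip, using the standard (though deprecated since C++17) codecvt facilities; the literal is only an example:

    #include <codecvt>
    #include <locale>
    #include <string>

    int main() {
        std::string utf8 = "za\xC5\xBC\xC3\xB3\xC5\x82\xC4\x87";  // "zażółć" in UTF-8

        // Decode into a fixed-width (32-bit code point) representation.
        std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
        std::u32string codepoints = conv.from_bytes(utf8);

        char32_t third = codepoints[2];                // O(1) indexing by code point
        std::string back = conv.to_bytes(codepoints);  // re-encode on the way out
        (void)third; (void)back;
    }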

    The main problem with UTF-8 and Unicode is that developers think they understand it without actually trying to understand it for real. It's very much non-trivial… ;)



  • @NeighborhoodButcher said:

    Oh yeah, especially strlen().

    Yes, in fact.

    strlen() is not actually for determining the number of glyphs that will be drawn on the screen when you output the string. It's for determining how many chars (not counting the terminating null character) a string contains, and how big a buffer must be to hold it.

    That's why people call strlen(): to know how many chars they are working with. This has no bearing on how many symbols are drawn, not only because of multi-byte code points, but also because of combining characters (diacritics etc.). Oh, and control characters too (CR screws up the count anyway, BEL usually doesn't output any glyph, etc.)

    Which means, strlen() still works just fine with UTF-8.


  • Java Dev

    @dkf said:

    conversion from UTF-8 to Unicode and back

    ITYM UTF-8 to UTF-32 and back. It's not much code, but it's unwieldy enough that I wouldn't write it inline.

    Most of the operations we need to do on utf-8 strings are strstr() (which is safe) and our own str_replace_once()/str_replace_all() functions which also don't need character-level considerations.
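
    A small illustration of why that's safe: UTF-8 continuation bytes are always 0x80 or above, so a search for an ASCII syntax character can never match in the middle of a multi-byte sequence. The key/value string here is made up:

    #include <cstdio>
    #include <cstring>

    int main() {
        // "naïve=true" -- the 'ï' is two bytes, but the '=' we search for is
        // ASCII, so strstr() cannot land inside the multi-byte sequence.
        const char *line = "na\xC3\xAFve=true";
        const char *eq = std::strstr(line, "=");
        std::printf("key length in bytes: %td\n", eq - line);  // 6
    }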



  • @Medinoc said:

    It's for determining how many chars (not counting the terminating null character) a string contains, and how big a buffer must be to hold it.

    Using strlen to determine your buffer size sounds like a recipe for exploitation with unterminated strings.


  • Discourse touched me in a no-no place

    @Salamander said:

    Using strlen to determine your buffer size sounds like a recipe for exploitation with unterminated strings.

    That depends on whether it is a constant in your program, or whether it comes from outside. For constants in your program, it's your own damn fault if they're unterminated. For stuff from outside, don't use strlen() as you should already know the length.


  • Banned

    @Medinoc said:

    strlen() is not actually for determining the number of glyphs that will be drawn on the screen when you output the string. It's for determining how many chars (not counting the terminating null character) a string contains, and how big a buffer must be to hold it.

    Because it's so obvious that function named "string length" returns size in bytes, not the, um, you know, string length.


  • Discourse touched me in a no-no place

    @Gaska said:

    Because it's so obvious that function named "string length" returns size in bytes, not the, um, you know, string length.

    Because history and because “number of bytes needed” is still a useful value.



  • Even PHP knows the difference, which is why we have strlen and mb_strlen



  • And C# does this shit all properly, including having a separate byte and char type.

    Then they snatched defeat from the hands of victory by making char get interpreted as integer if you use it in a numeric operation. Just like good ol' C. Because it's not like the designers of C# were going for strong typing or anything, guyz.

    And that's the real tragedy. C's confusion over this, treating characters and integers interchangeably even though they're completely different types, has spread into all kinds of otherwise-good other languages. Not because it makes sense, but because "we've always done it that way!" Ugh.



  • Ugh, that sucks. Way to fuck it up, people.



  • @Arantor said:

    Ugh, that sucks.

    Like a Hoover.


  • Banned

    @dkf said:

    Because history

    I love how you side-stepped the whole issue of actually answering the question by replacing it with a single word that carries no information whatsoever other than that this function's name was chosen before I wrote my post.

    @dkf said:

    and because “number of bytes needed” is still a useful value.

    You know what else is useful? Trimming whitespace from the beginning and end of a string. Why isn't the trimming function named strlen too?


  • Banned

    @blakeyrat said:

    And C# does this shit all properly, including having a separate byte and char type.

    And then fails horribly by making surrogate pairs two separate chars.


  • BINNED

    @blakeyrat said:

    And that's the real tragedy. C's confusion over this, treating characters and integers interchangeably even though they're completely different types, has spread into all kinds of otherwise-good other languages. Not because it makes sense, but because "we've always done it that way!" Ugh.

    You could try Haskell or Ada, neither of which has that problem. :trollface:



  • @Gaska said:

    @blakeyrat said:
    And C# does this shit all properly, including having a separate byte and char type.

    And then fails horribly by making surrogate pairs two separate chars.

    And allowing strings to contain unpaired surrogates (which are invalid). In other words, pretending UCS-2 (the obsolete 16-bit codepoint set) is Unicode. And, unlike Java, C# has no string functions that take codepoints as arguments (Java uses int to represent those).
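
    The same pitfall is easy to demonstrate with C++'s UTF-16 string type (the emoji is just an example of a non-BMP code point):

    #include <string>

    int main() {
        // U+1F600 is outside the Basic Multilingual Plane, so UTF-16 stores it
        // as a surrogate pair: one code point, two 16-bit code units.
        std::u16string s = u"\U0001F600";
        bool looks_like_two_chars = (s.size() == 2);  // "length" 2 for one character
        (void)looks_like_two_chars;
    }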



  • I know; I've adopted some "good practices" to avoid this. I name my variables length when the count doesn't include the terminating null character, and size when it does.
    ...Or when I'm lazy, I simply use strlen()+1 as a reflex.



  • @Gaska said:

    Because it's so obvious that function named "string length" returns size in bytes, not the, um, you know, string length.

    How is the number of squiggles on screen more of a length than the number of chars it's made of?

    Edit: Especially from software's point of view?

    Edit2: And what is the length of a non-printable character?


  • Discourse touched me in a no-no place

    @Medinoc said:

    How is the number of squiggles on screen more of a length than the number of chars it's made of?

    What are you measuring? If you are not exactly certain what that is, you'll get confused.


  • area_pol

    @Medinoc said:

    How is the number of squiggles on screen more of a length than the number of chars it's made of?

    By string we usually mean the characters we input or we get as output. If you put "ążź" into a program, you expect it to return the length of 3, not whatever the unicode byte count is. The problem goes all the way back to the dark ages of C, when people thought a character was a byte. And because C is a horrible language and C programmers are horrible people, that mistake was never corrected, and spread further.
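
    Concretely, a tiny sketch with that same "ążź":

    #include <cstdio>
    #include <cstring>

    int main() {
        const char *s = "\xC4\x85\xC5\xBC\xC5\xBA";  // "ążź": 3 characters, 6 bytes in UTF-8
        std::printf("%zu\n", std::strlen(s));        // prints 6, not the 3 a user expects
    }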



  • @Khudzlin said:

    And, unlike Java, C# has no string functions that take codepoints as arguments (Java uses int to represent those).

    No; but it has string functions to tell the language how to interpret a byte array as a string.


  • Discourse touched me in a no-no place

    @NeighborhoodButcher said:

    If you put "ążź" into a program, you expect it to return the length of 3, not whatever the unicode byte count is.

    That depends on what you are doing with it. There are at least three different concepts here: the number of bytes, the number of characters and the number of glyphs. Except there may be more because of decomposed forms. It gets complicated. It gets even more complicated when you get into non-European scripts.
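
    For instance, the same glyph can be one code point or two, and two bytes or three; a small sketch with "é":

    #include <cstdio>
    #include <cstring>

    int main() {
        const char *precomposed = "\xC3\xA9";   // U+00E9: 1 code point, 2 bytes, 1 glyph
        const char *decomposed  = "e\xCC\x81";  // U+0065 + U+0301: 2 code points, 3 bytes, 1 glyph
        std::printf("%zu %zu\n", std::strlen(precomposed), std::strlen(decomposed));  // 2 3
    }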


  • area_pol

    True. That's why it's important to realize first what we want to measure. In Tizen, those wondrous Korean programmers always use strlen() to count glyphs, so we get all the fun with Unicode character properties.


  • Discourse touched me in a no-no place

    @NeighborhoodButcher said:

    all the fun with Unicode character properties

    😖



  • Out of interest, how good does polymorphism work for that? My gut says juggling 8 and 16 could get painful, but I figure if you had a different subclass of string that only got used when you had surrogate pairs in it, most people could go their whole lives without ever activating that code path. Isn't that range mostly for like linear-b or some shit?



  • Signed chars? What the fuck were they smoking?



  • That's surprising, considering the nature of the Korean alphabet. 남대문, for example, consists of 8 letters, stacked together into 3 glyphs, occupying certainly neither 8 nor 3 bytes. I'd expect that they would at least support their own native language properly?



  • @NeighborhoodButcher said:

    By string we usually mean the characters we input or we get as output. If you put "ążź" into a program, you expect it to return the length of 3, not whatever the unicode byte count is.

    Why? It's obvious from the name that strlen needs to return some kind of length, but bytes and chars are both units of length and it's not obvious which one it should return.


  • Discourse touched me in a no-no place

    @Buddy said:

    Isn't that range mostly for like linear-b or some shit?

    And emoji. And some of the less common parts of Chinese. And other weird and wonderful stuff.


  • Banned

    @Buddy said:

    Out of interest, how good does polymorphism work for that? My gut says juggling 8 and 16 could get painful, but I figure if you had a different subclass of string that only got used when you had surrogate pairs in it, most people could go their whole lives without ever activating that code path. Isn't that range mostly for like linear-b or some shit?

    The problem is, how do you check whether there are surrogates or not? And how do you insert surrogates into non-surrogate-able string?

    @anotherusername said:

    Signed chars? What the fuck were they smoking?

    Remember that there's no 1-byte integer type other than char in C/C++. Without signed char, you couldn't have signed one-byte integers.



  • @Gaska said:

    there's no 1-byte integer type other than char in C/C++.

    That's not a compelling argument for their sanity.


  • area_pol

    Score another one for C compatibility.


  • Banned

    @anotherusername said:

    That's not a compelling argument for their sanity.

    I wasn't arguing they're sane.


  • Java Dev

    @Salamander said:

    Using strlen to determine your buffer size sounds like a recipe for exploitation with unterminated strings.

    Determine what buffer size you need. If for some reason you have a non-null-terminated string (why and how did you get yourself into that mess?) but you do know the input buffer size, there is always strnlen(). Or memchr().



  • Buffers are actually a really good example of why you'd want to know the length of the string in bytes. If you reserve a 100-byte buffer because the null-terminated string was 100 characters, you end up with a buffer overflow if there were any multi-byte characters.


  • area_pol

    Or you can use a sane language, like C++, not give a shit about the size of some buffer, and just use the string like you want to.



  • What Salamander meant was that if you do this, you lost the game:

    char const * someString = "abcd";
    size_t length = strlen(someString);
    char * copyOfString = malloc(length); //Oops!
    //Now if I use strcpy(), I overflow the buffer.
    //and if I do this instead:
    strncpy(copyOfString, someString, length);
    //I get a non-null-terminated string!
    
    //And finally, Microsoft's "secure" version
    strncpy_s(copyOfString, length, someString, _TRUNCATE);
    //Gives me a null-terminated string that's missing its last character.
    

    In C, what you need to do is this:

    char const * someString = "abcd";
    size_t length = strlen(someString);
    size_t size = length+1;
    char * copyOfString = malloc(size);
    //And now you can use strcpy() or use 'size' as argument to the safer functions.
    

    Quite the pitfall, isn't it?


  • Java Dev

    If that's all you need, what you need to do is this:

    char const * someString = "abcd";
    char * copyOfString = strdup(someString);
    

  • area_pol

    That's actually a good example of why this language should die. Isn't it better to write:

    std::string someString = "abcd";
    std::string someCopy = someString;
    

    Or even better:

    using namespace std::string_literals; // needed for the "s" suffix (C++14)

    auto someString = "abcd"s;
    auto someCopy = someString;
    

    No buffer anywhere, no memory corruption or leak possible, total safety. All with native performance, not some C# or Java StringBuilder bullshit.



  • But you can find yourself on a non-POSIX platform and make this mistake while coding strdup() in the first place!

    Or you can make this mistake while doing something strdup() can't do, such as concatenating strings.



  • When C/C++ maintain backwards compatibility at the cost of something, it's a tragedy.
    When *nix programs don't maintain backwards compatibility at the cost of something, it's a tragedy.
    When something Microsoft develops maintains backwards compatibility at the cost of something (for ex. UCS-2, the awkwardness of properties vs fancy WPF properties in C#, little use of alternate data streams because code written before NTFS ain't gonna handle it right), it's an unfortunate tradeoff but hey backwards compatibility is really important.
    When something Microsoft develops breaks backwards compatibility to avoid a cost (ex. various programs that don't run correctly in more recent versions of windows because they use ancient terrible APIs or because they are horrible programs that break the API contract), it's an unfortunate tradeoff, but hey we can't have an entire OS held hostage to ancient code.

    Consistency is the hobgoblin of small minds, I guess.

    (Yes it would be lovely if Unicode was nicer to handle in C++, this is definitely something C# does better than C++, probably because C++ predates Unicode by four years, whereas Unicode predates C# by ~13 years)

