Why my friend had to retake his C++ course with a different professor

Gąska

Buffers are actually a really good example of why you'd want to know the length of the string in bytes. If you reserve a 100-byte buffer because the null-terminated string is 100 characters long, you get a buffer overflow as soon as it contains any multi-byte characters.
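A minimal sketch of the difference, assuming the string is valid UTF-8 held in a std::string. The helper names count_code_points, c_buffer_size and demo are made up for illustration and do not come from any library.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts Unicode code points in a UTF-8 string by skipping continuation
// bytes (those of the form 10xxxxxx). Assumes the input is valid UTF-8.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}

// Bytes needed to hold the string as a C string: every byte plus the '\0'.
std::size_t c_buffer_size(const std::string& utf8) {
    return utf8.size() + 1;
}

// Small self-check: "Gąska" has 5 characters, but the 'ą' takes 2 bytes.
void demo() {
    const std::string name = "G\xC4\x85ska";
    assert(count_code_points(name) == 5);
    assert(name.size() == 6);
    assert(c_buffer_size(name) == 7);
}
```

Sizing the buffer from count_code_points instead of c_buffer_size is exactly the overflow described above.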

That, or you just allocate 256 bytes and cut off everything after the 255th byte. This way, your app doesn't crash even if there's no null terminator at all!

@NeighborhoodButcher said:

Or you can use a sane language, like C++

jmp

@Gaska said:

That, or you just allocate 256 bytes and cut off everything after the 255th byte. This way, your app doesn't crash even if there's no null terminator at all!

Some of the legacy code I've been updating (read: ripping out the useful bits and rewriting everything else) used exactly this approach to serialise some structures to send over the wire. It statically declared an array of size C (chosen to be larger than the serialised size of the struct) on the stack, dumped the struct into it using the serialising function (which takes a char* to dump bytes into and returns the resulting size!), then sent it off once the actual size was known. There was a comment pointing out that maybe they should do something else, so somebody who touched this had half a clue at some point...
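Roughly what that pattern looks like, as a sketch rather than the actual legacy code. The Sample struct, the 64-byte buffer and both function names are invented for illustration, and unlike the original this version at least passes the capacity along so the serialiser can refuse to overrun.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical wire struct, for illustration only.
struct Sample {
    std::uint32_t id;
    std::uint16_t flags;
};

// Writes the fields into out and returns the number of bytes written,
// or 0 if the buffer is missing or too small. Fields are copied in host
// byte order to keep the sketch short; a real protocol would fix the
// endianness.
std::size_t serialize(const Sample& s, char* out, std::size_t capacity) {
    const std::size_t needed = sizeof s.id + sizeof s.flags;
    if (out == nullptr || capacity < needed) {
        return 0;
    }
    std::memcpy(out, &s.id, sizeof s.id);
    std::memcpy(out + sizeof s.id, &s.flags, sizeof s.flags);
    return needed;
}

// Mirrors the legacy pattern: oversized stack buffer, serialise into it,
// then use the returned size.
std::size_t serialized_size_via_stack_buffer(const Sample& s) {
    char buffer[64];  // the "size C" array; 64 is an arbitrary choice here
    const std::size_t written = serialize(s, buffer, sizeof buffer);
    assert(written <= sizeof buffer);
    return written;  // a real sender would hand (buffer, written) to the socket
}
```

The weak spot is the magic constant: if the struct ever grows past it, a serialiser with no capacity argument writes straight past the end of the stack array.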

dkf

@NeighborhoodButcher said:

All with native performance, not some C# or Java StringBuilder bullshit.

You do realise that StringBuilder is really just a wrapper round an array and some extra info (length, capacity) so that concatenation is amortised-constant time? Probably almost the identical algorithm that std::string uses. And that it will be compiled into native code (albeit usually at runtime)? If you're going to pick on them, do it for the way bad programmers turn an efficient language into an inefficient one by making it easy to use String concatenation badly.

C strings though… well, you can make them go fast and be safe, but you do it by ignoring the standard library. There is no good way to use the majority of it.

jmp

We can all agree, at least, that Pascal strings would be a disaster with Unicode.

dkf

@jmp said:

Pascal strings

In what sense do you mean this phrase? It affects how much I disagree or agree…

marczellm

Most probably a string whose first byte holds its length. Exercise for the reader: combine this idea with Unicode, then list 10 possible problems with that...

jmp

struct pascalString {
    unsigned char length;   // one-byte length prefix, so at most 255
    unsigned char* string;  // the character data
};

EDIT: The problem, of course, is that computing length for UTF8 or UTF16 Unicode strings is nontrivial.

anotherusername

@jmp said:

We can all agree, at least, that Pascal strings would be a disaster with Unicode.

It could have been AWK...

NeighborhoodButcher

@dkf said:

You do realise that StringBuilder is really just a wrapper round an array and some extra info (length, capacity) so that concatenation is amortised-constant time?

Yeah - I don't know the details, but most likely they use ropes underneath. Which is all fine and dandy when you do a bazillion concats, but for a small number, you'll probably be better off using simple realloc. Which is usually not possible in such languages, because let's pretend strings are immutable, even though it's bullshit on native level.

@dkf said:

Probably almost the identical algorithm that std::string uses.

That one usually uses another approach - grow by X with some extra space. Of course that's not a requirement, but I can't remember a rope approach anywhere. In most cases you don't do that many additions to warrant rope usage; it actually slows things down. But, of course, it all depends on what you want to do. You can always make a list of strings and boost::join them.

@dkf said:

C strings though… well, you can make them go fast and be safe

Or use a language which is both fast and safe by default. Like pretty much any other language.

@jmp said:

We can all agree, at least, that Pascal strings would be a disaster with Unicode.

Goddamn I remember those. The 255 limit - good old times.

Gąska

@jmp said:

Some of the legacy code I've been updating (read: ripping out the useful bits and rewriting everything else) used exactly this approach to serialise some structures to send over the wire.

Well, I was speaking of reading, or deserialization. For serialization, this approach obviously sucks.

@jmp said:

We can all agree, at least, that Pascal strings would be a disaster with Unicode.

Well, FWIW, Rust uses kind of Pascal strings, and it does the job rather well.

Medinoc

COM, old-school Visual Basic, etc. do too (BSTRs are length-prefixed strings). Only this time the length is an actual 32-bit integer rather than a single char (and well, there's also a null character appended to them, to avoid disasters when mishandled by code expecting C-style strings).

@jmp said:

We can all agree, at least, that Pascal strings would be a disaster with Unicode.

Nope, because here too, the length would be the "storage" length rather than the number of glyphs.
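A sketch of that idea with the classic one-byte prefix, assuming the payload is UTF-8. The functions make_pascal_string and read_pascal_string are hypothetical names for illustration. The prefix counts bytes, so the reader never has to understand the encoding.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Builds a one-byte-length-prefixed ("Pascal") string into out. The prefix
// stores the number of bytes, not the number of characters. Returns false
// and leaves out untouched if the payload does not fit in one length byte.
bool make_pascal_string(const std::string& utf8, std::vector<std::uint8_t>& out) {
    if (utf8.size() > 255) {
        return false;
    }
    out.clear();
    out.reserve(utf8.size() + 1);
    out.push_back(static_cast<std::uint8_t>(utf8.size()));
    out.insert(out.end(), utf8.begin(), utf8.end());
    return true;
}

// Reads the payload back using only the stored byte length.
std::string read_pascal_string(const std::vector<std::uint8_t>& data) {
    assert(!data.empty() && data.size() >= 1u + data[0]);
    return std::string(data.begin() + 1, data.begin() + 1 + data[0]);
}
```

For "Gąska" in UTF-8 the prefix is 6, not 5. What still hurts is the 255-byte ceiling, which multi-byte characters reach sooner.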

Salamander

@NeighborhoodButcher said:

Yeah - I don't know the details, but most likely they use ropes underneath.

No, StringBuilder in both C# and Java is a dynamic array that doubles when it is full.

NeighborhoodButcher

@Salamander said:

No, StringBuilder in both C# and Java is a dynamic array that doubles when it is full.

So they do a realloc like a std::string. Funny how such a simple thing like a string, in those languages had to be a "builder" for max efficiency.

dkf

@NeighborhoodButcher said:

Of course that's not a requirement, but I can't remember a rope approach anywhere. In most cases you don't do that many additions to warrant rope usage; it actually slows things down.

Ropes are problematic precisely because they have more complexity inside, which tends to hurt cache locality. This sort of thing is why it's important to measure whether your changes have made a positive difference and not just calculate it from algorithmic principles. Some of the things I've done in string processing have looked like they were a massive back-step in algorithmic terms, but were big gains because they worked better with caches.

Medinoc

Yeah, the best C# has to offer is the single-codepoint Char.ConvertToUtf32/Char.ConvertFromUtf32.

To apply it on a whole string, you have the choice between doing it all manually with this function, or encoding it as bytes with Encoding.UTF32 then deserializing these bytes as integers.

dkf

@NeighborhoodButcher said:

So they do a realloc like a std::string. Funny how such a simple thing like a string, in those languages had to be a "builder" for max efficiency.

The difference is that Java and C# both make their basic string type immutable; the builders are needed because you get quadratic behaviour when building a string in a loop otherwise, even with perfect implementation of concatenation.
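To put numbers on it, here is a sketch that only counts how many characters get copied when appending one character at a time. It is a cost model, not a real string class, and all three function names are made up for illustration. The doubling buffer starting at capacity 1 is an assumption; real implementations pick their own starting size and growth factor.

```cpp
#include <cassert>
#include <cstddef>

// Characters copied when appending one character n times to an immutable
// string: each step builds a new string from the old contents plus one.
std::size_t copies_with_immutable_concat(std::size_t n) {
    std::size_t copied = 0;
    for (std::size_t len = 0; len < n; ++len) {
        copied += len + 1;
    }
    return copied;
}

// Characters copied when appending into a buffer that doubles when full.
std::size_t copies_with_doubling_buffer(std::size_t n) {
    std::size_t copied = 0;
    std::size_t capacity = 1;
    for (std::size_t len = 0; len < n; ++len) {
        if (len == capacity) {
            copied += len;  // move existing contents to the bigger buffer
            capacity *= 2;
        }
        ++copied;           // write the new character
    }
    return copied;
}

// Doubling never copies more than 3n characters in total, which is where
// the amortised-constant cost per append comes from.
void check_growth_bound(std::size_t n) {
    assert(copies_with_doubling_buffer(n) <= 3 * n);
}
```

For 1000 appends that is 500500 copied characters the immutable way against 2023 with the doubling buffer.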

Khudzlin

@NeighborhoodButcher said:

such a simple thing like a string

A string is actually not that simple, since a character is not actually a simple thing either.

NeighborhoodButcher

I must dig into the logic of immutable strings someday.

NeighborhoodButcher

@Khudzlin said:

A string is actually not that simple, since a character is not actually a simple thing either.

It is simple when done right. Qt nailed it down almost perfectly.

Medinoc

Immutability on objects such as strings allows them to have full value semantics despite being reference types.

Khudzlin

@NeighborhoodButcher said:

It is simple when done right.

How do you define a character?

NeighborhoodButcher

http://doc.qt.io/qt-5/qchar.html#details that is sufficient for most cases.

LB_

I forgot this thread existed, I can't believe it is still going.

@Khudzlin said:

How do you define a character?

Has anyone brought up http://site.icu-project.org/ yet?

Khudzlin

In Qt, Unicode characters are 16-bit entities without any markup or structure.

It fails right from the start by making the obsolete assumption that UCS-2 is synonymous with Unicode. The Unicode codepoint space goes up to U+10FFFF (which takes 21 bits) and that limit is only due to legacy limitations (surrogates in UTF-16), otherwise the limit would be at least U+FFFFFFFF (32 bits).
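For the record, this is what happens to anything above U+FFFF in UTF-16. It is a sketch of the standard surrogate-pair arithmetic; encode_utf16 is a hypothetical helper name, not a Qt or standard library function.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Encodes one Unicode code point as UTF-16 code units. Code points up to
// U+FFFF fit in a single 16-bit unit; everything above needs a surrogate pair.
std::vector<std::uint16_t> encode_utf16(std::uint32_t cp) {
    assert(cp <= 0x10FFFF);              // top of the code point space
    assert(cp < 0xD800 || cp > 0xDFFF);  // surrogates are not characters
    if (cp < 0x10000) {
        return {static_cast<std::uint16_t>(cp)};
    }
    const std::uint32_t v = cp - 0x10000;  // 20 bits left to split
    return {static_cast<std::uint16_t>(0xD800 + (v >> 10)),     // high surrogate
            static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF))};  // low surrogate
}
```

U+1F600 comes out as the two units 0xD83D 0xDE00, so a single 16-bit "character" can only ever hold half of it.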

NeighborhoodButcher

@Khudzlin said:

It fails right from the start by making the obsolete assumption that UCS-2 is synonymous with Unicode.

That's why I said "almost". QChar has some legacy baggage to it, and they should go with 32. But in most cases, it's good enough. Good thing is the support for all the wacky character properties etc. All in all, it's an example of something done right.

Khudzlin

Support for character properties is very good, but using only 16 bits for characters means limiting the character set to 1/17th of Unicode (the most used, admittedly, but still), so I can't consider it almost perfect.

NeighborhoodButcher

I guess it depends on your use cases then.

Medinoc

What do you mean by "character properties" exactly?

Gąska

Upper/lowercase, script, direction, etc.

Medinoc

Thanks.

I think .NET supports a good part of such properties too, then. Though I don't see script or direction in the list.