Why my friend had to retake his C++ course with a different professor
-
Buffers are actually a really good example of why you'd want to know the length of the string in bytes. If you reserve a 100-byte buffer because the null-terminated string was 100 characters long, you get a buffer overflow if there were any multi-byte characters (and the terminator needs a byte of its own either way).
That, or you just allocate 256 bytes and cut off everything after the 255th byte. This way, your app doesn't crash even if there's no null terminator at all!
Or you can use a sane language, like C++.
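To make the bytes-versus-characters point concrete, here is a minimal C++ sketch. It assumes the text is valid UTF-8, and both function names are made up for illustration.

```cpp
#include <cstddef>
#include <cstring>

// Counts Unicode code points in a NUL-terminated UTF-8 string by skipping
// continuation bytes (those of the form 10xxxxxx). Assumes valid UTF-8.
std::size_t utf8_code_points(const char* s) {
    std::size_t n = 0;
    for (; *s != '\0'; ++s) {
        if ((static_cast<unsigned char>(*s) & 0xC0) != 0x80) ++n;
    }
    return n;
}

// Bytes a buffer must hold to store the string: payload plus terminator.
std::size_t buffer_bytes_needed(const char* s) {
    return std::strlen(s) + 1;
}
```

For "naïve" written as "na\xC3\xAFve", the first function reports 5 code points while the second asks for 7 bytes, so sizing the buffer by character count would come up short.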
-
That, or you just allocate 256 bytes and cut off everything after the 255th byte. This way, your app doesn't crash even if there's no null terminator at all!
Some of the legacy code I've been updating (read: ripping out the useful bits and rewriting everything else) used exactly this approach to serialise some structures to send over the wire. Statically declared an array of size C (chosen to be larger than the serialized size of the struct) to serialise into (on the stack!), dump the struct into it using the serializing function (which takes a char* to dump bytes into and returns the resulting size!), send it off now that we know how large it actually is. There was a comment pointing out that maybe they should do something else, so somebody who touched this had half a clue at some point...
-
All with native performance, not some C# or Java StringBuilder bullshit.
You do realise that StringBuilder is really just a wrapper round an array and some extra info (length, capacity) so that concatenation is amortised-constant time? Probably almost the identical algorithm that std::string uses. And that it will be compiled into native code (albeit usually at runtime)? If you're going to pick on them, do it for the way bad programmers turn an efficient language into an inefficient one by making it easy to use String concatenation badly.
C strings though… well, you can make them go fast and be safe, but you do it by ignoring the standard library. There is no good way to use the majority of it.
-
We can all agree, at least, that Pascal strings would be a disaster with Unicode.
-
Pascal strings
In what sense do you mean this phrase? It affects how much I disagree or agree…
-
Most probably a string whose first byte holds its length. Exercise for the reader: combine this idea with Unicode, then list 10 possible problems with that...
-
struct pascalString {
    unsigned char length;
    unsigned char* string;
};
EDIT: The problem, of course, is that computing length for UTF-8 or UTF-16 Unicode strings is nontrivial.
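For what it's worth, here is a rough sketch of the classic one-length-byte layout, with the payload stored inline after the length byte rather than behind a pointer. The helper name `to_pascal` is hypothetical and the input is assumed to be UTF-8. The length byte counts storage bytes, so the 255 limit is hit sooner the more multi-byte characters there are.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical helper: packs UTF-8 text into a Pascal-style layout, one
// length byte followed by the payload. The length counts bytes, not
// characters, and cannot exceed 255.
std::optional<std::vector<unsigned char>> to_pascal(const std::string& utf8) {
    if (utf8.size() > 255) return std::nullopt;  // does not fit in one length byte
    std::vector<unsigned char> out;
    out.reserve(utf8.size() + 1);
    out.push_back(static_cast<unsigned char>(utf8.size()));
    for (char c : utf8) out.push_back(static_cast<unsigned char>(c));
    return out;
}
```

Getting the number of characters back out still means walking the payload, which is the nontrivial part mentioned above.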
-
We can all agree, at least, that Pascal strings would be a disaster with Unicode.
It could have been AWK...
-
You do realise that StringBuilder is really just a wrapper round an array and some extra info (length, capacity) so that concatenation is amortised-constant time?
Yeah - I don't know the details, but most likely they use ropes underneath. Which is all fine and dandy when you do a bazillion concats, but for a small number, you'll probably be better off using a simple realloc. Which is usually not possible in such languages, because let's pretend strings are immutable, even though it's bullshit at the native level.
Probably almost the identical algorithm that std::string uses.
That one usually uses another approach - grow by X with some extra space. Of course that's not a requirement, but I can't remember a rope approach anywhere. In most cases you don't do that many additions to warrant rope usage; it actually slows things down. But, of course, it all depends on what you want to do. You can always make a list of strings and boost::join them.
C strings though… well, you can make them go fast and be safe
Or use a language which is both fast and safe by default. Like pretty much any other language.
We can all agree, at least, that Pascal strings would be a disaster with Unicode.
Goddamn I remember those. The 255 limit - good old times.
-
Some of the legacy code I've been updating (read: ripping out the useful bits and rewriting everything else) used exactly this approach to serialise some structures to send over the wire.
Well, I was speaking of reading, or deserialization. For serialization, this approach obviously sucks.
We can all agree, at least, that Pascal strings would be a disaster with Unicode.
Well, FWIW, Rust uses kind of Pascal strings, and it does the job rather well.
-
COM, old-school Visual Basic, etc. do too (BSTRs are length-prefixed strings). Only this time the length is an actual 32-bit integer rather than a single char (and well, there's also a null character appended to them, to avoid disasters when mishandled by code expecting C-style strings).
We can all agree, at least, that Pascal strings would be a disaster with Unicode.
Nope, because here too, the length would be the "storage" length rather than the number of glyphs.
-
Yeah - I don't know the details, but most likely they use ropes underneath.
No, StringBuilder in both C# and Java are dynamic arrays that double when they are full.
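As an illustration of the "array plus length and capacity, doubling when full" idea, here is a toy C++ sketch. It is not the actual Java or C# implementation; the class name, the starting capacity of 16 and the reallocation counter are all assumptions made for the example.

```cpp
#include <cstddef>
#include <cstring>
#include <memory>
#include <string>
#include <utility>

// Toy builder, not the real Java or C# class: a char array plus length and
// capacity, doubling the capacity whenever an append does not fit.
class ToyStringBuilder {
public:
    void append(const std::string& s) {
        if (s.empty()) return;
        if (len_ + s.size() > cap_) grow(len_ + s.size());
        std::memcpy(buf_.get() + len_, s.data(), s.size());
        len_ += s.size();
    }
    std::string str() const {
        return len_ ? std::string(buf_.get(), len_) : std::string();
    }
    std::size_t capacity() const { return cap_; }
    std::size_t reallocations() const { return reallocs_; }

private:
    void grow(std::size_t needed) {
        std::size_t new_cap = cap_ ? cap_ : 16;  // assumed starting capacity
        while (new_cap < needed) new_cap *= 2;   // double until it fits
        std::unique_ptr<char[]> bigger(new char[new_cap]);
        if (len_ != 0) std::memcpy(bigger.get(), buf_.get(), len_);
        buf_ = std::move(bigger);
        cap_ = new_cap;
        ++reallocs_;
    }
    std::unique_ptr<char[]> buf_;
    std::size_t len_ = 0;
    std::size_t cap_ = 0;
    std::size_t reallocs_ = 0;
};
```

Appending one character 1000 times ends with a capacity of 1024 after only 7 reallocations, which is where the amortised-constant cost per append comes from.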
-
No, StringBuilder in both C# and Java are dynamic arrays that double when they are full.
So they do a realloc like a std::string. Funny how such a simple thing like a string, in those languages had to be a "builder" for max efficiency.
-
Of course that's not a requirement, but I can't remember a rope approach anywhere. In most cases you don't do that many additions to warrant rope usage; it actually slows things down.
Ropes are problematic precisely because they have more complexity inside, which tends to hurt cache locality. This sort of thing is why it's important to measure whether your changes have made a positive difference and not just calculate it from algorithmic principles. Some of the things I've done in string processing have looked like a massive step backwards in algorithmic terms, but were big gains because they worked better with caches.
-
Yeah, the best C# has to offer is the single-codepoint Char.ConvertToUtf32/Char.ConvertFromUtf32. To apply it on a whole string, you have the choice between doing it all manually with this function, or encoding it as bytes with Encoding.UTF32 then deserializing these bytes as integers.
-
So they do a realloc like a std::string. Funny how such a simple thing like a string, in those languages had to be a "builder" for max efficiency.
The difference is that Java and C# both make their basic string type immutable; the builders are needed because otherwise you get quadratic behaviour when building a string in a loop, even with a perfect implementation of concatenation.
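A small sketch of where the quadratic behaviour comes from, written in C++ but mimicking immutable concatenation: every step makes a brand-new string and copies everything built so far. The function name is invented for the example, and it only counts copied characters rather than timing anything.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Builds a string from n pieces the "immutable" way: each step allocates a
// fresh string and copies the old contents plus the new piece. Returns the
// total number of characters copied along the way.
std::size_t copies_with_immutable_concat(std::size_t n, const std::string& piece) {
    std::string result;
    std::size_t copied = 0;
    for (std::size_t i = 0; i < n; ++i) {
        std::string next;  // brand-new object each iteration
        next.reserve(result.size() + piece.size());
        next += result;    // copies everything built so far
        next += piece;
        copied += next.size();
        result = std::move(next);
    }
    return copied;
}
```

With 1000 one-character pieces this copies 1 + 2 + ... + 1000 = 500500 characters to produce a 1000-character string, and the total grows with the square of the number of pieces no matter how fast each individual concatenation is.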
-
such a simple thing like a string
A string is actually not that simple, since a character is not actually a simple thing either.
-
I must dig into the logic of immutable strings someday.
-
A string is actually not that simple, since a character is not actually a simple thing either.
It is simple when done right. Qt nailed it down almost perfectly.
-
Immutability on objects such as strings allows them to have full value semantics despite being reference types.
-
http://doc.qt.io/qt-5/qchar.html#details that is sufficient for most cases.
-
I forgot this thread existed, I can't believe it is still going.
How do you define a character?
Has anyone brought up http://site.icu-project.org/ yet?
-
In Qt, Unicode characters are 16-bit entities without any markup or structure.
It fails right from the start by making the obsolete assumption that UCS-2 is synonymous with Unicode. The Unicode codepoint space goes up to U+10FFFF (which takes 21 bits), and that limit exists only because of a legacy limitation (surrogates in UTF-16); otherwise it could have stayed at the original ISO 10646 range of up to U+7FFFFFFF (31 bits).
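To show why a single 16-bit unit isn't enough, here is a short C++ sketch of UTF-16 encoding for one code point. The function name is made up; the arithmetic is the standard surrogate-pair scheme.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Encodes one Unicode code point as UTF-16 code units. Code points above
// U+FFFF do not fit in 16 bits and become a surrogate pair.
std::vector<std::uint16_t> to_utf16(std::uint32_t cp) {
    // Precondition: a valid scalar value (in range, not itself a surrogate).
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    if (cp < 0x10000) return {static_cast<std::uint16_t>(cp)};
    cp -= 0x10000;  // leaves 20 bits, split 10/10 across the pair
    return {static_cast<std::uint16_t>(0xD800 + (cp >> 10)),
            static_cast<std::uint16_t>(0xDC00 + (cp & 0x3FF))};
}
```

U+1F600 comes out as the two units 0xD83D 0xDE00, so anything that treats one 16-bit unit as one character sees two "characters" there. The top of the range, U+10FFFF, maps to 0xDBFF 0xDFFF, which is where the 21-bit ceiling comes from.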
-
It fails right from the start by making the obsolete assumption that UCS-2 is synonymous with Unicode.
That's why I said "almost". QChar has some legacy baggage to it, and they should have gone with 32 bits. But in most cases, it's good enough. The good thing is the support for all the wacky character properties etc. All in all, it's an example of something done right.
-
Support for character properties is very good, but using only 16 bits for characters means limiting the character set to 1/17th of Unicode (the most used, admittedly, but still), so I can't consider it almost perfect.
-
I guess it depends on your use cases then.
-
What do you mean by "character properties" exactly?
-
Upper/lowercase, script, direction, etc.
-
Thanks.
I think .NET supports a good part of such properties too, then. Though I don't see script or direction in the list.