Zed Shaw gets schooled on C undefined behavior
-
Calling wcslen
You mentioned it first. And in this case, my string is 5 characters, not 3, including the embedded null.
But imagine your buffer was {'a','b','c','d'}, and SysAlloc didn't add a null terminator. Then SysStringLen would give you 4, and wcslen would give you a seg fault.
If you only used BSTRs with functions that expect BSTRs, you wouldn't need the terminator. That's a convenience, not the canonical marker-of-length, is all I was ever saying.
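To make the two lengths concrete, here's a minimal sketch. It assumes a made-up counted_wstr struct (not the real BSTR layout) and made-up helper names, with an example payload of a, b, NUL, c, d:

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical stand-in for a BSTR: an explicit count plus character data.
   This is NOT the real COM layout, just enough to show the two "lengths". */
typedef struct {
    size_t count;        /* canonical length, in characters */
    wchar_t data[8];     /* payload followed by a trailing L'\0' */
} counted_wstr;

/* Builds the 5-character string a, b, NUL, c, d plus a trailing terminator. */
static counted_wstr make_example(void) {
    counted_wstr s = { 5, { L'a', L'b', L'\0', L'c', L'd', L'\0' } };
    return s;
}

/* What a count-aware caller would report: the stored count. */
static size_t counted_length(const counted_wstr *s) {
    assert(s != NULL);
    return s->count;
}

/* What a terminator-scanning function reports: the offset of the first NUL. */
static size_t scanned_length(const counted_wstr *s) {
    assert(s != NULL);
    return wcslen(s->data);
}
```

With that example, counted_length reports 5 while scanned_length reports 2, because wcslen stops at the embedded null.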
-
thumb != finger
Also, what if I held up the index and ring fingers? Under @kian's analogy I'm holding up one, not two.
-
I have brand new microcontrollers designed in 2014 with just 1KB of ram :D
Hint: It costs 20 cents for a reason. If you want your cheap toys, you get cheap silicon. You want 128KB of RAM? Then you pay $10 per chip. 32KB chips are in the $2-3 price range.
Mind you this seems weird because you can get 4GB DDR for like $20 right? Well micros do not use DDR or high frequency buses until you hit the high end $12+ chips.
-
If you only used BSTRs with functions that expect BSTRs, you wouldn't need the terminator. That's a convenience, not the canonical marker-of-length, is all I was ever saying.
Everything is a convenience. The null terminator is there because it serves a function, just as the length field is there because it serves a function. COM itself is meant to make calling other languages and applications convenient. Saying "It's a convenience" is devoid of meaning. You don't like calling it a null-terminated string? Fine, it's not a null-terminated string. It's just a structure containing a length, a string, and a null terminator field.
Also, what if I held up the index and ring fingers? Under @kian's analogy I'm holding up one, not two.
You're not very good at analogies.... :P
-
You're not very good at analogies....
It made sense to me.
If you provide this set of chars 0x20, 0x00, 0x20, you'll get 1 for the length, even though there are two valid characters.
But, null-termination is how most "strings" are defined, so 1 is the correct answer for string length.
It's not however the correct value for how many valid characters exist within the space of the buffer.
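A quick sketch of that, assuming a 4-byte buffer (the three bytes above plus a final 0) and two made-up helper names:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The three bytes from the post, plus a final 0 so the buffer is also a
   valid C string from any starting offset. */
static const char example_buf[] = { 0x20, 0x00, 0x20, 0x00 };

/* Length as a C string: stops at the first 0 byte. */
static size_t c_string_length(const char *buf) {
    assert(buf != NULL);
    return strlen(buf);
}

/* Count of non-zero bytes among the first n bytes of the buffer. */
static size_t nonzero_bytes(const char *buf, size_t n) {
    size_t count = 0;
    assert(buf != NULL);
    for (size_t i = 0; i < n; i++) {
        if (buf[i] != 0) {
            count++;
        }
    }
    return count;
}
```

c_string_length(example_buf) gives 1, and nonzero_bytes(example_buf, 3) gives 2.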
-
But, null-termination is how most "strings" are defined
No, that's how C-strings are defined; a lot of other languages define them differently.
-
yeah, ok.
I thought C was the context.
-
The null terminator is there because it serves a function, just as the length field is there because it serves a function
We need a "you're talking past me" emoticon, apparently.
Pay attention, I'll say it one more time: the null terminator is a convenience function in a BSTR. A BSTR, as long as you work with BSTR-aware code, doesn't need it. The way you find out how long a BSTR is is to look at the count.
The null terminator is not, properly speaking, any such thing, in BSTR semantics, because there's no way to distinguish an embedded two-byte null from the "terminator".
If you use non-BSTR-aware functions, they'll break with embedded nulls.
and a null terminator field.
Ugh. No, it's something that can, under some circumstances, be used as one, but formally it isn't.
-
It made sense to me.
The analogy was "question to function", not "fingers to characters". If you ask a question, the correct answer is the one that answers the question. If you call a function, the correct response is whatever the function is specified to do with the input. If you want to know how many fingers are in your hand, ask how many fingers are in your hand, not how many you are holding up. If you want to check the length of a string, ask for the length of the string, not the location of the first null.
Ugh. No, it's something that can, under some circumstances, be used as one, but formally it isn't.
Really? You should tell the people that wrote the reference, which I linked when I mentioned BSTR in the first place, because they say:
-
If you use non-BSTR-aware functions, they'll break with embedded nulls.
They won't break, they'll do what they're supposed to do. It may not be what you want them to do, but if you don't want them to do what they're designed to do, DON'T CALL THEM.
-
You're not very good at analogies
Sure I am. Some things are hard to shoehorn, and I was just extending the one you used instead of making another one up.
-
If you provide this set of chars 0x20, 0x00, 0x20, you'll get 1 for the length, even though there are two valid characters.
Under some circumstances, there are three. As I said upthread, it was common in older computers to use a string variable to hold assembly code and execute it dynamically. IIRC that's more or less specifically one reason BSTRs allow internal nulls.
But, null-termination is how most "strings" are defined, so 1 is the correct answer for string length.
Yeah, most strings. Specifically not a BSTR, as my program demonstrates!
-
If you call a function, the correct response is whatever the function is specified to do with the input.
Well, I'd argue using wcslen on a BSTR is generally the wrong thing to do. :) Unless you know the BSTR in question will never have internal nulls. Because it will give the correct answer for its contract, but it's not actually the length of the BSTR.
-
Really? You should tell the people that wrote the reference, which I linked when I mentioned BSTR in the first place, because they say:
The "terminator" CANNOT canonically be considered "the thing that tells you where the end of the string is", even though they use the word.
What does SysStringLen do? "The returned value may be different from strlen(bstr) if the BSTR contains embedded Null characters. This function always returns the number of characters specified in the cch parameter of the SysAllocStringLen function used to allocate the BSTR."
Hey, I guess that means that if you want to know how long a BSTR is, you use the character count, not the "terminator".
-
They won't break, they'll do what they're supposed to do.
Nitpickery accepted, because it argues my point that the null "terminator" is not the canonical way to find the end of a BSTR.
-
It's a huge WTF that BSTR, a string type with additional information, is ultimately typedefed from wchar_t in the first place. (For reference:

    #if !defined(_NATIVE_WCHAR_T_DEFINED)
    typedef unsigned short WCHAR;
    #else
    typedef wchar_t WCHAR;
    #endif
    typedef WCHAR OLECHAR;
    typedef OLECHAR* BSTR;

)
-
Only if you ignore the fact that the entire COM API is designed to be used from C
-
That doesn't explain why it isn't a struct of some sort.
-
Because the string length field is not to be exposed to the developer; it's an internal-use-only field
-
That doesn't explain why it isn't a struct of some sort.
Can you make variable width structs? In C and C++, you can't. The way to hack it is to make a struct whose last member is a zero-width array. I imagine the guy who had to design BSTR didn't like it.
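For what it's worth, C99 did standardize that hack as a flexible array member (C++ still has no standard equivalent). A rough sketch of a counted string built that way; counted_str and counted_str_new are made-up names, not anything from COM:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical counted string: the header and the characters share one
   allocation, with the characters in a C99 flexible array member. */
typedef struct {
    size_t length;   /* number of chars, not counting the trailing 0 */
    char data[];     /* flexible array member, sized at allocation time */
} counted_str;

/* Allocates a counted_str holding a copy of the first n bytes of src,
   followed by a 0 byte. Returns NULL if the allocation fails. */
static counted_str *counted_str_new(const char *src, size_t n) {
    assert(src != NULL);
    counted_str *s = malloc(sizeof *s + n + 1);
    if (s == NULL) {
        return NULL;
    }
    s->length = n;
    memcpy(s->data, src, n);
    s->data[n] = '\0';
    return s;
}
```

The caller releases the whole thing with a single free(), since header and characters live in one block.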
-
It's a huge WTF that BSTR, a string type with additional information, is ultimately typedefed from wchar_t in the first place.
wchar_t was almost certainly not widespread, if extant at all, when BSTR was invented, so the use of wchar_t must've been retrofitted in.
-
Because the string length field is not to be exposed to the developer; it's an internal-use-only field
Which point is highlighted by the fact that a BSTR is a pointer not to the beginning of the struct, but to the wchar_t pointer inside it.
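A sketch of that pointer trick in portable C. The real BSTR keeps a 4-byte length prefix in front of the data; this version assumes a size_t header instead, and every name in it (hidden_header, hstr_alloc, hstr_len, hstr_free) is made up for illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical header that sits in front of the characters. */
typedef struct {
    size_t length;   /* canonical character count */
    char data[];     /* the caller only ever sees a pointer to this */
} hidden_header;

/* Allocates header + n chars + terminator, returns a pointer to the chars.
   Returns NULL if the allocation fails. */
static char *hstr_alloc(const char *src, size_t n) {
    assert(src != NULL);
    hidden_header *h = malloc(sizeof *h + n + 1);
    if (h == NULL) {
        return NULL;
    }
    h->length = n;
    memcpy(h->data, src, n);
    h->data[n] = '\0';
    return h->data;
}

/* Steps back from the character pointer to the header in front of it. */
static hidden_header *hstr_header(char *s) {
    assert(s != NULL);
    return (hidden_header *)(s - offsetof(hidden_header, data));
}

/* Reads the count from the hidden header instead of scanning for a 0. */
static size_t hstr_len(char *s) {
    return hstr_header(s)->length;
}

/* Frees the whole allocation, which starts at the header. */
static void hstr_free(char *s) {
    if (s != NULL) {
        free(hstr_header(s));
    }
}
```

Because the caller holds a plain char pointer, it can be handed to strlen and friends, while count-aware code steps back to the header.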
-
The "terminator" CANNOT canonically be considered "the thing that tells you where the end of the string is", even though they use the word.
The terminator is not what determines where the end is, it's the thing that is at the end. You don't like calling it the terminator? You can propose another term for "last thing that always has to look like this".
The terminator in astronomy, for example, is the line where day turns to night. It's where the shadowed part of the planet starts. However, not every shadow in the daylight part marks the terminator.
-
@blakeyrat said:
Writing your OWN code to copy strings is a stupid concept, you're right that it's not an ASCII thing necessarily, but that doesn't change that it's still fucking idiotic.
Agreed. If you decide to write your own string copy code, you're doing it wrong.
But it's still a useful exercise in a programming book.
I'm writing an OS. I needed to do it, as those functions aren't even available (or any function for that matter).
Can't think of any other reason though...
Edit: Yes, I AM inspired by TempleOS's pretty marquees and stuff.
-
The terminator is not what determines where the end is
That's what I've been saying! Thanks for coming around to my POV.
My point is that in a real null-terminated string, like the C ones, the null character IS the terminator. It's the canonical way you find the end of the string. And one more time, that's not true for a BSTR, because the count is canonical.
-
We seem to have had a failure of communication then, because I never claimed otherwise. Which is why I qualified in one of my first responses that BSTRs may not be null-terminated, but they have a null terminator. The fact that you don't use the null terminator to determine the length has no bearing on the fact that they have one regardless.
-
I rather like CosmOS, myself.
-
He didn't say "string functions." He referred specifically to the function that @FrostCat mentioned.
What are you going on about? The first function that FrostCat mentioned specifically was wcslen, which doesn't work properly (for any reasonable definition of properly) if you have a string with embedded NULs. He has also referred to functions that operate on NUL-terminated strings in general multiple times.
So when you call a function that specifically operates until the first null, the first null is what determines how far into the string the function will operate. It's a tautology, it's not that hard to grasp. It doesn't mean the function is doing the wrong thing. It's simply wrong to call it if you didn't want that behavior.
By that logic, size_t strlen(const char * s) { return 5; } isn't doing the wrong thing. If you call it with a string that's not 5 bytes, that was just your mistake for using it if you didn't want that behavior.
Let me try to articulate precisely what I'm trying to say:
- C is the outlier in terms of counted vs. terminated strings being the modus operandi
- Functions, like strlen, that are designed for terminated strings do not produce reasonable answers for counted strings
If your string is both counted AND has a null terminator, however, you can trust that if you pass it to a function that requires a null terminator, it will do the right thing
No! I will not agree that stupid semantics are right! I will agree that it won't directly provoke undefined behavior and crash your program. That is worth putting a terminator on the end of your counted strings for, but it's not an excuse for using functions designed for terminated strings in the first place.
Everything is a convenience. The null terminator is there because it serves a function, just as the length field is there because it serves a function. COM itself is meant to make calling other languages and applications convenient. Saying "It's a convenience" is devoid of meaning.
If BSTRs did not have terminators, there would be no loss of information. If you wanted to (incorrectly) call a function that expects a terminated string, you could allocate some new space, copy the BSTR to it, and tack a terminator on the end. That is why it's a convenience, so you don't have to do that.
The terminator on a C string is not a convenience because otherwise you don't know where the string ends. The count field of a counted string is not a convenience because otherwise you don't know where the string ends.
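The allocate, copy, and terminate step for a counted string without a terminator could look roughly like this. It's a sketch with a made-up helper name (to_terminated), using plain wchar_t buffers rather than real BSTRs:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <wchar.h>

/* Copies `count` wide characters from a counted (not necessarily
   terminated) buffer into fresh storage and appends a terminator.
   The caller frees the result. Returns NULL if allocation fails. */
static wchar_t *to_terminated(const wchar_t *chars, size_t count) {
    assert(chars != NULL);
    wchar_t *copy = malloc((count + 1) * sizeof *copy);
    if (copy == NULL) {
        return NULL;
    }
    wmemcpy(copy, chars, count);
    copy[count] = L'\0';
    return copy;
}
```

Any embedded nulls survive the copy, but a terminator-scanning function called on the result will still stop at the first one.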
Really? You should tell the people that wrote the reference, which I linked when I mentioned BSTR in the first place, because they say:
It's defined that way because it's convenient to be able to use a BSTR where a terminated string is expected, not because it's necessary!
Well, I'd argue using wcslen on a BSTR is generally the wrong thing to do.
I'd say it is effectively always the wrong thing to do, even if you know there are no embedded NULs. In fact, the only exception I can think of is if you think it might have embedded NULs, and you want to know where the first one is (e.g. because you want to copy it to some other thing that is a terminated string and you are OK with the potential loss of information).
-
The first function that FrostCat mentioned specifically was wcslen, which doesn't work properly (for any reasonable definition of properly) if you have a string with embedded NULs.
Ok, suppose you have a string with embedded nulls, and you want to know the position of the various embedded nulls. Perhaps you want to split the string into various sub-strings, for example. Unicode doesn't allow embedded nulls in text, so you could pack various text strings into one BSTR, and delimit them with nulls. What do you do? Do you write your own function, or call the function specifically designed to find the first null in a char array?

    #include <cwchar>
    #include <vector>
    #include <oleauto.h>  // BSTR, SysStringLen

    // Splits a BSTR on embedded nulls; the pointers alias the BSTR's buffer.
    std::vector<wchar_t*> GetSubstrings(BSTR input) {
        std::vector<wchar_t*> result;
        const size_t inputLength = SysStringLen(input);
        for (size_t pos = 0; pos <= inputLength;) {
            result.push_back(input + pos);
            pos += wcslen(input + pos) + 1;  // skip this substring and its null
        }
        return result;
    }

There you go, a reasonable use case for functions that handle c-strings being fed a BSTR. Not to mention all the code that simply isn't aware of BSTR and treats any pointers to chars as c-strings.
By that logic, size_t strlen(const char * s) { return 5; } isn't doing the wrong thing. If you call it with a string that's not 5 bytes, that was just your mistake for using it if you didn't want that behavior.
What does this function do?
int KillAllHumans(int a, int b) { return a+b;}
Does it kill all humans, or does it return the result of adding its two inputs? If the documentation says "The function KillAllHumans returns the result of adding the two parameters", is it wrong that it doesn't kill all humans? Legacy functions have unfortunate names. They were written when the conventions were different. So don't call them if you don't want what they do. If your function is documented as always returning 5, then it would be pretty stupid of me to call it for anything other than asking for a 5. It's not the function's fault if I'm an idiot that doesn't read documentation, regardless of the name.
No! I will not agree that stupid semantics are right!
I'm not asking you to agree that the semantics are right! Who the fuck cares if they're right or wrong? They are what they are. They're already in your system whether you want to use them or not. A lot of legacy code uses them, so when designing your system you have to understand that these things exist. It doesn't mean you have to use them yourself if you don't want to.
But if you use them without understanding what they do, or expecting them to do what you want them to do because that is what you think they should do, you're a terrible programmer. Functions do what they do. If they meet the spec, they're not buggy. I don't care about morality, I care about the spec. You don't like the spec? I don't care. I code to spec, not to opinions.
If BSTRs did not have terminators, there would be no loss of information. If you wanted to (incorrectly) call a function that expects a terminated string, you could allocate some new space, copy the BSTR to it, and tack a terminator on the end. That is why it's a convenience, so you don't have to do that.
Ok, that has some substance. And yes, I know it's more handy, when you have memory to spare, to embed the length of the string with the string. I'm not arguing null terminators are awesome, I'm saying they serve a function, and legacy code expects them. Having to make a copy whenever you want to call one of those functions would be error prone and a pain in the ass. That's reason enough to have it.
It's defined that way because it's convenient
I don't care why it's defined that way. It's enough that it is, and that I have to work with it. My job is to understand how it works, and use it correctly. Not to critique it.
I'd say it is effectively always the wrong thing to do, even if you know there are no embedded NULs.
I gave a reasonable reason you might want to above.
-
Zed Shaw used to work for Bear Sterns, which I thought was hilarious. Up until they stopped existing.
-
Bear Sterns
https://c1.staticflickr.com/7/6205/6071571505_8d0bec66fe_n.jpg
? (Yes, I know it was a typo :) )
-
What are you going on about?
What I remember from the conversation.
The first function that FrostCat mentioned specifically was wcslen, which doesn't work properly (for any reasonable definition of properly) if you have a string with embedded NULs.
Yes, exactly. In the context of that function, there's no embedded null. You just have a shorter string than you thought you had.
-
incidentally, there's not UB listed in ANSI for 'alter the variable of a for-loop'
incidentally, there's not UB listed in ANSI
not UB listed in ANSI
not UB listed
UB listed
My sides
-
Yes, exactly. In the context of that function, there's no embedded null. You just have a shorter string than you thought you had.
Just for the record I mentioned wcslen because someone else, maybe @Kian, mentioned "wstrlen" above that, and the latter doesn't exist in MSVC (if at all? When I googled it, I wound up on the MSDN page for wcslen and family).
So I wasn't even the one to introduce "using functions that aren't appropriate for BSTRs", although I don't recall if we were talking about BSTRs per se yet.
-
All I know is once that was the context, even if you pass it a BSTR, it's not a BSTR any longer. Those sorts of context switches seem like exercises in pedantic dickweedery, but so is feeding source code to a compiler.
-
It's not the function's fault if I'm an idiot that doesn't read documentation, regardless of the name.
Typically, if I find software that's that poorly named, it's indicative of the quality and I use something else.
I'm sure having completely unintuitive interfaces is great for maintenance.
-
All I know is once that was the context, even if you pass it a BSTR, it's not a BSTR any longer.
That's some kind of lemma to my point.
-
I'm sure having completely unintuitive interfaces is great for maintenance.
The real functions in question were intuitive back when they were designed. Then "it" changed and now they're not intuitive anymore. Intuitive or not, however, there's no excuse for not knowing what functions you call do.
The hypothetical examples simply highlight the fact that you can't let auto-complete code for you. You have to read the documentation for every function you call, not just guess. And while it would be nice to be able to change libraries when you don't like one, if you inherit something it's rare that you get the choice at all.
-
there's no excuse for not knowing what functions you call do.
Yeah, it's called bad documentation.
The real functions in question were intuitive back when they were designed.
Oh really?
strstr vs. strpbrk?
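For anyone who doesn't have those two memorized, a small sketch of the difference. The wrapper names substring_offset and any_char_offset are made up; strstr and strpbrk are the standard <string.h> functions:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Offset of the first occurrence of the whole substring `needle`,
   or -1 if it does not occur. */
static ptrdiff_t substring_offset(const char *haystack, const char *needle) {
    assert(haystack != NULL && needle != NULL);
    const char *hit = strstr(haystack, needle);
    return hit ? hit - haystack : -1;
}

/* Offset of the first character that matches ANY character in `set`,
   or -1 if none does. */
static ptrdiff_t any_char_offset(const char *haystack, const char *set) {
    assert(haystack != NULL && set != NULL);
    const char *hit = strpbrk(haystack, set);
    return hit ? hit - haystack : -1;
}
```

Given "hello world" and "wo", substring_offset returns 6 (the start of "wo"), while any_char_offset returns 4 (the 'o' in "hello").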
-
Yeah, it's called bad documentation.
I wouldn't say that's an excuse. It may be a reason, but I can't say "my code is good, it's the framework's fault if it doesn't do what I want it to". It just means your work is going to be that much harder.
To define my terms, I understand an excuse to be something that releases you of responsibility. A reason is why you do something. Looked it up, and google at least agrees:
ex·cuse (verb) /ikˈskyo͞oz/
1. attempt to lessen the blame attaching to (a fault or offense); seek to defend or justify. "he did nothing to hide or excuse Jacob's cruelty"
   synonyms: justify, defend, condone, vindicate
2. release (someone) from a duty or requirement.
-
there's no excuse for not knowing what functions you call do.
Disagree
This isn't an excuse for having broken code.
Agreed.
Often I use another interface. If I can't use another interface, I consider using an adapter. So that my code makes sense and is maintainable.
-
How is it moving the goal post? I made a statement and I stand by it. I clarified my definition in case my point wasn't clear enough, but considering my definition is the dictionary definition, you can't even claim I redefined the words to suit me. If you don't know what words mean, that's not my fault.
To reiterate, if you choose to type the name of a function into your editor, compile that code, and run it, without any clue as to what that code is supposed to do, that's wrong. It may be necessary, but you are still responsible for whatever happens next. You are not excused just because you had no way of knowing.
-
You are not excused just because you had no way of knowing.
Ok, look, if the doc says it returns 5, and it returns 6, I'm excused from the code failing. Am I excused if I leave it that way? No.
-
Of course I am! I wouldn't have to correct them if I didn't!
@xaade said: Ok, look, if the doc says it returns 5, and it returns 6, I'm excused from the code failing. Am I excused if I leave it that way? No.
Sure, we agree on that.
-
Depends how you do it.
-
Doesn't matter, Discourse tries to moralise anyway, because it doesn't know if it is legitimately right or not. Just one of the many things we have laughed at as "basically broken by design".
Even the crusty 1990s toxic hell stew forums knew this one, because they tried it and removed it again... But that learning experience doesn't count because 1990s toxic hell stew, right?
-
Interesting guy. Obviously talented in many ways. But at the same time, seems to be his own greatest enemy.
He's a ranter.
He's not my favorite ranter.
-
FUCK OFF ALL DISCOURSE TOASTERS
JUST FUCK OFF
YOU ARE ALL BROKEN
EVERY ONE OF YOU
NONE OF YOU IS WORTH A FART IN A HIGH WIND
-