C strings
-
Some people like to be able to glom any old shit into any old container or whatever and have the computer attempt to make some sense of what it's being asked to do rather than recoiling in horror. Call it the PHP approach to software.
-
When I am serializing / deserializing stuff, such as XML or JSON. Especially XML and I don't have a schema
-
But seriously. Why would anyone want dynamic types in a statically typed language? That kills the whole purpose of having compile-time type checks.
The C application I work on needs to be able to receive, store, and transmit data from various types depending on a configuration file, so using dynamic types is the only option.
To store the actual data, a union is used instead of void* and casting; it's cleaner and avoids memory alignment issues.
-
Having done quite a bit of COM development or DLLs callable from .Net, I'm partial to VARIANT and SAFEARRAY.
-
When I am serializing / deserializing stuff, such as XML or JSON. Especially XML and I don't have a schema
If you don't have a schema (not necessarily an XML schema file, but an abstract concept of what data you accept), how can you even serialize it other than building a stringly-typed element tree? And if you do have a schema, what's the problem with making it a static type?
The C application I work on needs to be able to receive, store, and transmit data from various types depending on a configuration file, so using dynamic types is the only option.
I would use char arrays. Because bytes are bytes are bytes.
To store the actual data, a union is used instead of void* and casting; it's cleaner and avoids memory alignment issues.
Unions aren't exactly the same thing as dynamic types. Personally, I like unions - but I treat them more as stack-allocable hierarchical polymorphism tool than whatever-the-fuck-I-want tool.
-
I would use char arrays. Because bytes are bytes are bytes.
The problem with char arrays is that they may not be properly aligned. If you use something like this:
struct my_variant { uint8_t type; char data[4]; };
you won’t be able to read/write an int32_t from/into data directly; you’ll have to use memcpy every time. (I know *(int32_t*)v.data will work on most common architectures, but it doesn’t work on the particular ARM architecture I’m developing for -- reading/writing at an odd address will write at the previous even address.)
If you use a union
struct my_variant { uint8_t type; union { int8_t data_int8; int32_t data_int32; }; };
you’ll be able to use v.data_int32 without issues.
-
The problem with char arrays is that they may not be properly aligned. If you use something like this you won’t be able to read/write an int32_t from/into data directly
Wait wait wait, when did we get to int32_t's? Your original code was about passing around stuff. Stuff you don't know shit about, nor do you care. And if you care, use real data types. And write proper serialization functions. And if you really, really need to manipulate char[] as anything other than char[] without proper deserialization, there's always __attribute__ ((aligned (4))).
If you use a union you’ll be able to use v.data_int32 without issues.
And that's why unions are awesome. But notice how you have only two variants in your union, not countably infinite like boost::any.
-
Wait wait wait, when did we get to int32_t's? Your original code was about passing around stuff.
Yeah, I forgot to mention that the application also does some work on the data (the configuration file allows things like “Value 3 (int32) is the sum of value 2 (int32) and value 5 (int32)”). Sorry.
I agree that a char buffer is enough if the data only needs to be stored somewhere.
But notice how you have only two variants in your union, not countably infinite like boost::any.
Unfortunately, plain old C still seems to be the reference language for embedded systems programming around here...
-
Yeah, I forgot to mention that the application also does some work on the data (the configuration file allows things like “Value 3 (int32) is the sum of value 2 (int32) and value 5 (int32)”).
You effectively have a scripting language embedded in your app. Why the fuck would you need to specify types in a scripting language? Or does the configuration file specify the data format you get from external sources? That would be a WTF on its own.
Unfortunately, plain old C still seems to be the reference language for embedded systems programming around here...
¿Que? I totally didn't expect a non sequitur in a technical discussion.
-
-
I would use char arrays. Because bytes are bytes are bytes.
But chars aren't bytes, chars are characters. Those are different.
I thought you liked strong typing, and now we find you're taking something labeled "char" and putting in arbitrary bytes? Feh.
-
-
Implementation detail. They're conceptually different no matter what the language
-
Not in C.
If you BELIEVE in strong typing, then you should ALSO believe in treating byte arrays and char arrays as two different types. Even if the underlying language's implementation (in this case) is wrong. A char is not a byte. Even if your app is exclusively ASCII, a char is still not a byte.
C# has the exact same issue. Bugs the heck out of me.
Normally it's not a big deal, but it's really annoying to see it from someone who just went on a tirade about how great strong typing is-- then we find out he doesn't even do it himself! Ugh.
-
Implementation detail. They're conceptually different no matter what the language
If you BELIEVE in strong typing, then you should ALSO believe in treating byte arrays and char arrays as two different types.
Whatever you want to believe, a char in C is defined as a byte.
I'm not saying that there shouldn't be a conceptual difference, and that's where stuff like wchar comes in in C.
-
Whatever you want to believe, a char in C is defined as a byte.
Correct; but C is wrong. That's my point.
He doesn't believe in "strong typing" as a general concept. He believes in "do whatever C does". Which is fine, but it's not strong typing.
-
Correct; but C is wrong. That's my point.
Fine, but at least having the appropriate target of your bitching doesn't make you look like you're wrong.
He doesn't believe in "strong typing" as a general concept. He believes in "do whatever C does". Which is fine, but it's not strong typing.
I'm not sure that's an accurate summary, but I'm also not positive who is and isn't trolling whom in this thread.
-
Yeah, C's angle of 'all integers of the same width and signedness are equal' can be annoying. As a workaround you'd have to wrap the variable in a struct.
-
Correct; but C is wrong. That's my point.
Restating: C, for historical reasons, calls its byte type char. Or unsigned char. It sometimes pretends that it is the type of characters too. Ho ho ho. ;-)
He doesn't believe in "strong typing" as a general concept. He believes in "do whatever C does". Which is fine, but it's not strong typing.
C's type system is “minimum over assembler to get rid of the worst of the pain”. Heck, that's C all over; it has occupied that particular niche for decades now and isn't about to change.
-
and that's where stuff like wchar comes in in C.
To make things even more fun, C++ now has char16_t and char32_t.
Granted, if you're using C++ you should be using std::string anyway...
-
C++
Yeah, I didn't want to go there, but the stuff you mentioned are still different from "char"s.
-
Restating: C, for historical reasons, calls its byte type char. Or unsigned char. It sometimes pretends that it is the type of characters too. Ho ho ho. ;-)
And for extra fun, char is distinct from both signed char and unsigned char, making three distinct char types. Ho ho ho indeed.
-
To make things even more fun, C++ now has char16_t and char32_t.
Granted, if you're using C++ you should be using std::string anyway...
Or perhaps std::u16string or std::u32string.
-
And for extra fun, char is distinct from both signed char and unsigned char, making three distinct char types. Ho ho ho indeed.
I know my C-fu isn't strong, but given that char is already unsigned, how would char and unsigned char be different?
-
I know my C-fu isn't strong, but given that char is already unsigned, how would char and unsigned char be different?
TIL.
http://stackoverflow.com/questions/4337217/difference-between-signed-unsigned-char mentions it (without citations, grmbl), but http://stackoverflow.com/questions/4337217/difference-between-signed-unsigned-char#comment37080635_4337249 suggests that it's only true in C++.
Care to elaborate, @powerlord?
-
I know my C-fu isn't strong, but given that char is already unsigned, how would char and unsigned char be different?
char isn't necessarily unsigned; that's up to the compiler/platform (IIRC, it's signed by default in GCC, but unsigned in MSVC).
As for the second part, the standard says so. It affects stuff like function overloads and templates (again IIRC).
-
http://stackoverflow.com/questions/4337217/difference-between-signed-unsigned-char mentions it (without citations, grmbl), but http://stackoverflow.com/questions/4337217/difference-between-signed-unsigned-char#comment37080635_4337249 suggests that it's only true in C++.
Hmm, http://stackoverflow.com/a/2054941 suggests that it might be true for C as well, but I'm more familiar with C++, so I can't really confirm or deny that. Alas, <limits.h> (the plain old C header) does indeed have all of SCHAR_MIN, SCHAR_MAX, and UCHAR_MAX, and additionally CHAR_MIN and CHAR_MAX.
Edit: So the following three functions f are distinct overloads:

#include <stdio.h>

void f(char)          { printf("char\n");  }
void f(signed char)   { printf("schar\n"); }
void f(unsigned char) { printf("uchar\n"); }

int main() {
    char a = 0;
    signed char b = 0;
    unsigned char c = 0;
    f(a);
    f(b);
    f(c);
    return 0;
}
-
See http://www.trilithium.com/johan/2005/01/char-types/, especially the little PoC at the end and the remark that gcc complains (when invoked with -pedantic), whereas g++ errors out.
-
Well, TIL that you can't make wchar_t signed or unsigned.
Edit: and the same applies for the C++11 char16_t and char32_t.
-
I know my C-fu isn't strong, but given that char is already unsigned, how would char and unsigned char be different?
It isn't necessarily. It is explicitly stated in The Standard™ that the sign of char is implementation defined.
Which is why, when doing cross-platform stuff, there are three types of char: signed, unsigned, and do-you-feel-lucky-punk? (if you don't check)
Of course, if you're single-platform, and you know what your compiler does, option #3 isn't (necessarily) an issue for you.
6.2.5
15) The three types char, signed char, and unsigned char are collectively called the character types. The implementation shall define char to have the same range, representation, and behavior as either signed char or unsigned char. [35]
[35] CHAR_MIN, defined in <limits.h>, will have one of the values 0 or SCHAR_MIN, and this can be used to distinguish the two options. Irrespective of the choice made, char is a separate type from the other two and is not compatible with either.
-
So, as I understand it:
char is for (text) characters. This type is more useful as an array or char * to represent text. Multibyte text functions use char * as the MB string type, so that's consistent with this interpretation.
unsigned char should have been named byte instead. It's guaranteed to support values between 0 and 255 (it might hold larger values, but that's implementation defined).
signed char can be used to hold integer values between -128 and 127. It might support smaller and/or larger values, but that is also implementation defined.
-
char is for (text) characters.
Hahaha. char is a byte. If you're using something like ASCII, then happily, bytes are the same as text characters. Code in the wild does all sorts of crazy stuff. YMMV
-
@OffByOne said:
char is for (text) characters.
Hahaha. char is a byte. If you're using something like ASCII, then happily, bytes are the same as text characters. Code in the wild does all sorts of crazy stuff. YMMV
Hence my remark about being more useful in quantities > 1 (array or pointer to). I could have expressed that better though.
Handling encoding conversions and using string functions that are aware of the specific encoding of your strings is left as an exercise for the programmer.
-
a char in C is defined as a byte
This. The C char type is very poorly named. You probably wouldn't expect to use it to hold character data in any code written since 1999.
-
I'm also not positive who is and isn't trolling whom in this thread
...on this site in general.
-
-
Well, TIL that you can't make wchar_t signed or unsigned.
Edit: and the same applies for the C++11 char16_t and char32_t.
How about an unsigned struct?
-
Code in the wild does all sorts of crazy stuff.
The joys of undefined behaviour writ large.
-
char isn't necessarily unsigned, that's up to the compiler/platform (IIRC, it's signed by default in GCC, but unsigned in MSVC).
As for the second part, the standard says so. It affects stuff like function overloads and templates (again IIRC).
That just makes me want to smack my forehead. Since ASCII characters use code points 0-255, you'd expect the data type intended to store that to also be 0-255.
I should cross-post this to Things that Dennis Ritchie got wrong.
-
Since ASCII characters use code points 0-255,
I think that at the time C was born, ASCII only used points 0-127, hence it not mattering if char was signed or unsigned at the time. Because, assumptions remain valid for all of time, right?
-
The joys of undefined behaviour writ large.
#undefined behaviour
No, not much joy there. Maybe not large enough?
-
I think that at the time C was born, ASCII only used points 0-127, hence it not mattering if char was signed or unsigned at the time. Because, assumptions remain valid for all of time, right?
ASCII is still only using points 0-127, anything else is an extension of ASCII. :-)
-
But chars aren't bytes, chars are characters. Those are different.
I thought you liked strong typing, and now we find you're taking something labeled "char" and putting in arbitrary bytes? Feh.
It's not my fault they named the byte type char. Also, if you store a Unicode character in the char type, you're doing it wrong.
Implementation detail. They're conceptually different no matter what the language
Except in C they made char literally the byte type. They even defined all the other types as multiples of the char type. I hate C so much.
If you BELIEVE in strong typing, then you should ALSO believe in treating byte arrays and char arrays as two different types. Even if the underlying language's implementation (in this case) is wrong. A char is not a byte. Even if your app is exclusively ASCII, a char is still not a byte.
I almost agree with you. Almost, because YOU'VE GOT IT BACKWARDS. char ISN'T MEANT TO STORE ARBITRARY CHARACTERS - it's meant to store ARBITRARY BYTES. RTFM from time to time, please.
I know it's counterintuitive, but in C, char means byte, not character. Even if it was named your_mom, it would still be a byte.
Normally it's not a big deal, but it's really annoying to see it from someone who just went on a tirade about how great strong typing is-- then we find out he doesn't even do it himself! Ugh.
There is the ideal world where everything is strongly typed and type inference takes care of everything being safe and how you meant it and where you don't have to ever check for null because API contract guarantees it won't happen and where operating system vendors don't assume a filepath is the file identifier and where all windows automatically align on screen in the most optimal way, and there's real life where you need void-casts to make your stuff work with the library.
He doesn't believe in "strong typing" as a general concept. He believes in "do whatever C does". Which is fine, but it's not strong typing.
Not at all. C is a horrible language. I hate it, I hate it with all my life, I hate every single function in the standard library, every single function in the POSIX headers, every single WinAPI function, every single character of C code I've ever written. It's one of the worst still-maintained languages there is. And we fucking must use it either directly or through wrappers because almost all low-level libraries are written in it, most significantly the APIs of all major operating systems. But it's still better than Java.
-
There is the ideal world where everything is strongly typed and type inference takes care of everything being safe
INB4 @antiquarian Haskell (or whatever)
-
I almost agree with you. Almost, because YOU'VE GOT IT BACKWARDS. char ISN'T MEANT TO STORE ARBITRARY CHARACTERS - it's meant to store ARBITRARY BYTES. RTFM from time to time, please.
That might be true now; that was not true when C was designed.
-
That might be true now; that was not true when C was designed.
There's a timepod rant in this, somewhere.
-
There is the ideal world where everything is strongly typed and type inference takes care of everything being safe and how you meant it
If I may be allowed a bit of pedantry: the strong typing takes care of everything being safe; type inference is meant to save you typing.
-
ASCII is still only using points 0-127, anything else is an extension of ASCII. :-)
Yes, but even back then Extended ASCII was around.
Oh, and before I forget, EBCDIC was also around when C was created and has always used code points past 127... heck, lowercase letters start at code point 129.
-
That might be true now; that was not true when C was designed.
There was no Unicode when C was designed, and char can actually store then-standard ASCII just fine, if that's what you mean. But still, they directly tied byte size to character size from day one.
If I may be allowed a bit of pedantry: the strong typing takes care of everything being safe; type inference is meant to save you typing.
Being safe, as opposed to not compiling, I mean.
-