C strings



  • Some people like to be able to glom any old shit into any old container or whatever and have the computer attempt to make some sense of what it's being asked to do rather than recoiling in horror. Call it the PHP approach to software.



  • When I am serializing / deserializing stuff, such as XML or JSON. Especially XML and I don't have a schema



  • @Gaska said:

    But seriously. Why would anyone want dynamic types in a statically typed language? That kills the whole purpose of having compile-time type checks.

    The C application I work on needs to be able to receive, store, and transmit data from various types depending on a configuration file, so using dynamic types is the only option.
    To store the actual data, a union is used instead of void* and casting; it’s cleaner and avoids memory alignment issues.
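    The general shape is a tagged union -- here is a minimal sketch (type and field names are invented for illustration, not the ones from the actual application):

    #include <stdio.h>
    #include <stdint.h>
    #include <inttypes.h>
    
    enum value_type { VT_INT8, VT_INT32, VT_FLOAT };
    
    struct value {
        enum value_type type;           /* tag, chosen from the configuration file */
        union {
            int8_t  as_int8;
            int32_t as_int32;
            float   as_float;
        } u;
    };
    
    /* Every access goes through the tag, never straight into the union. */
    static void value_print(const struct value *v)
    {
        switch (v->type) {
        case VT_INT8:  printf("%d\n", v->u.as_int8);            break;
        case VT_INT32: printf("%" PRId32 "\n", v->u.as_int32);  break;
        case VT_FLOAT: printf("%f\n", (double)v->u.as_float);   break;
        }
    }
    
    int main(void)
    {
        struct value v = { .type = VT_INT32, .u.as_int32 = 42 };
        value_print(&v);    /* prints 42 */
        return 0;
    }
    

    The point is that the set of possible types is still closed and known at compile time; only the choice among them is deferred to runtime.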



  • Having done quite a bit of COM development or DLLs callable from .Net, I'm partial to VARIANT and SAFEARRAY.


  • Banned

    @lucas said:

    When I am serializing / deserializing stuff, such as XML or JSON. Especially XML and I don't have a schema

    If you don't have a schema (not necessarily an XML schema file, but at least an abstract notion of what data you accept), how can you even serialize it other than by building a stringly-typed element tree? And if you do have a schema, what's the problem with making it a static type?

    @VinDuv said:

    The C application I work on needs to be able to receive, store, and transmit data from various types depending on a configuration file, so using dynamic types is the only option.

    I would use char arrays. Because bytes are bytes are bytes.

    @VinDuv said:

    To store the actual data, a union is used instead of void* and casting; it’s cleaner and avoids memory alignment issues.

    Unions aren't exactly the same thing as dynamic types. Personally, I like unions - but I treat them more as a stack-allocable hierarchical polymorphism tool than a whatever-the-fuck-I-want tool.



  • @Gaska said:

    I would use char arrays. Because bytes are bytes are bytes.

    The problem with char arrays is that they may not be properly aligned. If you use something like this:

    struct my_variant {
        uint8_t type;
        char data[4];
    };
    

    you won’t be able to read/write an int32_t from/into data directly; you’ll have to use memcpy every time. (I know *(int32_t*)v.data will work on most common architectures, but it doesn’t work on the particular ARM architecture I’m developing for -- reading/writing at an odd address will write at the previous even address)
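    Concretely, the memcpy version looks something like this (just a sketch; helper names are made up, and it assumes the my_variant struct above plus <stdint.h> and <string.h>):

    /* Read/write the payload without assuming data[] is suitably aligned. */
    int32_t variant_read_i32(const struct my_variant *v)
    {
        int32_t out;
        memcpy(&out, v->data, sizeof out);
        return out;
    }
    
    void variant_write_i32(struct my_variant *v, int32_t in)
    {
        memcpy(v->data, &in, sizeof in);
    }
    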

    If you use a union

    struct my_variant {
        uint8_t type;
        union {
            int8_t data_int8;
            int32_t data_int32;
        };
    };
    

    you’ll be able to use v.data_int32 without issues.


  • Banned

    @VinDuv said:

    The problem with char arrays is that they may not be properly aligned. If you use something like this you won’t be able to read/write an int32_t from/into data directly

    Wait wait wait, when did we get to int32_t's? Your original code was about passing around stuff. Stuff you don't know shit about, nor do you care. And if you care, use real data types. And write proper serialization functions. And if you really, really need to manipulate char[] as anything other than char[] without proper deserialization, there's always __attribute__ ((aligned (4))).
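    For reference, that GCC/Clang extension applied to the earlier struct would look roughly like this; it only fixes the alignment -- the pointer cast still skirts strict aliasing, so memcpy remains the portable route:

    struct my_variant {
        uint8_t type;
        char data[4] __attribute__ ((aligned (4)));  /* data now starts on a 4-byte boundary */
    };
    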

    @VinDuv said:

    If you use a union you’ll be able to use v.data_int32 without issues.

    And that's why unions are awesome. But notice how you have only two variants in your union, not countably infinite like boost::any.



  • @Gaska said:

    Wait wait wait, when did we get to int32_t's? Your original code was about passing around stuff.

    Yeah, I forgot to mention that the application also does some work on the data (the configuration file allows things like “Value 3 (int32) is the sum of value 2 (int32) and value 5 (int32)”). Sorry.
    I agree that a char buffer is enough if the data only needs to be stored somewhere.
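    A rule like that boils down to something along these lines (a sketch reusing the my_variant union from earlier; the TYPE_INT32 tag constant and the index arguments are made up):

    /* "Value 3 is the sum of value 2 and value 5": values[dst] = values[a] + values[b]. */
    void apply_sum_rule(struct my_variant *values, int dst, int a, int b)
    {
        values[dst].type = TYPE_INT32;   /* hypothetical tag constant */
        values[dst].data_int32 = values[a].data_int32 + values[b].data_int32;
    }
    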

    @Gaska said:

    But notice how you have only two variants in your union, not countably infinite like boost::any.

    Unfortunately, plain old C still seems to be the reference language for embedded systems programming around here...


  • Banned

    @VinDuv said:

    Yeah, I forgot to mention that the application also does some work on the data (the configuration file allows things like “Value 3 (int32) is the sum of value 2 (int32) and value 5 (int32)”).

    You effectively have a scripting language embedded in your app. Why the fuck would you need to specify types in a scripting language? Or does the configuration file specify the data format you get from external sources? That would be a WTF on its own.

    @VinDuv said:

    Unfortunately, plain old C still seem to be the reference language for embedded systems programming around here...

    ¿Que? I totally didn't expect a non sequitur in a technical discussion.


  • ♿ (Parody)

    @Medinoc said:

    SAFEARRAY

    :shudder:



  • @Gaska said:

    I would use char arrays. Because bytes are bytes are bytes.

    But chars aren't bytes, chars are characters. Those are different.

    I thought you liked strong typing, and now we find you're taking something labeled "char" and putting in arbitrary bytes? Feh.


  • ♿ (Parody)

    @blakeyrat said:

    But chars aren't bytes, chars are characters. Those are different.

    Not in C.


  • kills Dumbledore

    Implementation detail. They're conceptually different no matter what the language



  • @boomzilla said:

    Not in C.

    If you BELIEVE in strong typing, then you should ALSO believe in treating byte arrays and char arrays as two different types. Even if the underlying language's implementation (in this case) is wrong. A char is not a byte. Even if your app is exclusively ASCII, a char is still not a byte.

    C# has the exact same issue. Bugs the heck out of me.

    Normally it's not a big deal, but it's really annoying to see it from someone who just went on a tirade about how great strong typing is-- then we find out he doesn't even do it himself! Ugh.


  • ♿ (Parody)

    @Jaloopa said:

    Implementation detail. They're conceptually different no matter what the language

    @blakeyrat said:

    If you BELIEVE in strong typing, then you should ALSO believe in treating byte arrays and char arrays as two different types.

    Whatever you want to believe, a char in C is defined as a byte.

    I'm not saying that there shouldn't be a conceptual difference, and that's where stuff like wchar comes in in C.



  • @boomzilla said:

    Whatever you want to believe, a char in C is defined as a byte.

    Correct; but C is wrong. That's my point.

    He doesn't believe in "strong typing" as a general concept. He believes in "do whatever C does". Which is fine, but it's not strong typing.


  • ♿ (Parody)

    @blakeyrat said:

    Correct; but C is wrong. That's my point.

    Fine, but at least having the appropriate target of your bitching doesn't make you look like you're wrong.

    @blakeyrat said:

    He doesn't believe in "strong typing" as a general concept. He believes in "do whatever C does". Which is fine, but it's not strong typing.

    I'm not sure that's an accurate summary, but I'm also not positive who is and isn't trolling whom in this thread.


  • Java Dev

    Yeah, C's angle of 'all integers of the same width and signedness are equal' can be annoying. As a workaround you'd have to wrap the variable in a struct.
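    The wrapping trick looks like this (names invented); both wrappers hold a single unsigned char, but to the compiler they are distinct, incompatible types:

    #include <stdio.h>
    
    typedef struct { unsigned char v; } byte_t;   /* a raw octet */
    typedef struct { unsigned char v; } char_t;   /* a text character */
    
    static void send_octet(byte_t b) { printf("0x%02x\n", (unsigned)b.v); }
    
    int main(void)
    {
        byte_t raw    = { 0x41 };
        char_t letter = { 'A' };
    
        send_octet(raw);            /* fine */
        /* send_octet(letter); */   /* error: incompatible type char_t */
        (void)letter;
        return 0;
    }
    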


  • Discourse touched me in a no-no place

    @blakeyrat said:

    Correct; but C is wrong. That's my point.

    Restating: C, for historical reasons, calls its byte type char. Or unsigned char. It sometimes pretends that it is the type of characters too. Ho ho ho. ;-)

    @blakeyrat said:

    He doesn't believe in "strong typing" as a general concept. He believes in "do whatever C does". Which is fine, but it's not strong typing.

    C's type system is “minimum over assembler to get rid of the worst of the pain”. Heck, that's C all over; it has occupied that particular niche for decades now and isn't about to change.



  • @boomzilla said:

    and that's where stuff like wchar comes in in C.

    To make things even more fun, C++ now has char16_t and char32_t.

    Granted, if you're using C++ you should be using std::string anyway...


  • ♿ (Parody)

    @powerlord said:

    C++

    Yeah, I didn't want to go there, but the stuff you mentioned are still different from "char"s. 😛



  • @dkf said:

    Restating: C, for historical reasons, calls its byte type char. Or unsigned char. It sometimes pretends that it is the type of characters too. Ho ho ho. ;-)

    And for extra fun, char is distinct from both signed char and unsigned char, making three distinct char types. Ho ho ho indeed. 😕



  • @powerlord said:

    To make things even more fun, C++ now has char16_t and char32_t.

    Granted, if you're using C++ you should be using std::string anyway...

    Or perhaps std::u16string or std::u32string.



  • @cvi said:

    And for extra fun, char is distinct from both signed char and unsigned char, making three distinct char types. Ho ho ho indeed. 😕

    I know my C-fu isn't strong, but given that char is already unsigned, how would char and unsigned char be different?



  • @powerlord said:

    I know my C-fu isn't strong, but given that char is already unsigned, how would char and unsigned char be different?

    TIL.

    http://stackoverflow.com/questions/4337217/difference-between-signed-unsigned-char mentions it (without citations, grmbl), but http://stackoverflow.com/questions/4337217/difference-between-signed-unsigned-char#comment37080635_4337249 suggests that it's only true in C++.

    Care to elaborate, @powerlord?



  • @powerlord said:

    I know my C-fu isn't strong, but given that char is already unsigned, how would char and unsigned char be different?

    char isn't necessarily unsigned, that's up to the compiler/platform (IIRC, it's signed by default in GCC, but unsigned in MSVC).

    As for the second part, the standard says so. It affects stuff like function overloads and templates (again IIRC).



  • @OffByOne said:

    http://stackoverflow.com/questions/4337217/difference-between-signed-unsigned-char mentions it (withouth citations, grmbl), but http://stackoverflow.com/questions/4337217/difference-between-signed-unsigned-char#comment37080635_4337249 suggests that it's only true in C++.

    Hmm, http://stackoverflow.com/a/2054941 suggests that it might be true for C as well, but I'm more familiar with C++, so I can't really confirm or deny that. Alas, <limits.h> (the plain old C header) does indeed have all of

    • SCHAR_MIN, SCHAR_MAX
    • UCHAR_MAX
    • and additionally CHAR_MIN and CHAR_MAX.

    Edit: So the following three functions f are distinct overloads:

    #include <stdio.h>
    
    void f(char) { printf( "char\n" ); }
    void f(signed char) { printf( "schar\n" ); }
    void f(unsigned char) { printf( "uchar\n" ); }
    
    int main()
    {
    	char a = 0;
    	signed char b = 0;
    	unsigned char c = 0;
    
    	f(a); // prints "char"
    	f(b); // prints "schar"
    	f(c); // prints "uchar"
    
    	return 0;
    }
    


  • See http://www.trilithium.com/johan/2005/01/char-types/, especially the little PoC at the end and the remark that gcc complains (when invoked with -pedantic), whereas g++ errors out.



  • Well, TIL that you can't make wchar_t signed or unsigned.

    Edit: and the same applies for the C++11 char16_t and char32_t.


  • Discourse touched me in a no-no place

    @powerlord said:

    I know my C-fu isn't strong, but given that char is already unsigned, how would char and unsigned char be different?

    It isn't necessarily.

    It is explicitly stated in The Standard™ that the sign of char is implementation defined.

    Which is why, when doing cross-platform stuff, there are three types of char:

    • signed,
    • unsigned, and
    • do-you-feel-lucky-punk? (if you don't check)

    Of course, if you're single-platform, and you know what your compiler does, option #3 isn't (necessarily) an issue for you.


    http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1124.pdf

    6.2.5

    15) The three types char, signed char, and unsigned char are collectively called the character types. The implementation shall define char to have the same range, representation, and behavior as either signed char or unsigned char.[35]

    [35] CHAR_MIN, defined in <limits.h>, will have one of the values 0 or SCHAR_MIN, and this can be used to distinguish the two options. Irrespective of the choice made, char is a separate type from the other two and is not compatible with either.


  • So, as I understand it:

    • char is for (text) characters. This type is more useful as an array or char * to represent text. Multibyte text functions use char * as the MB string type, so that's consistent with this interpretation.
    • unsigned char should have been named byte instead. It's guaranteed to support values between 0 and 255 (might hold larger values, but that's implementation defined).
    • signed char can be used to hold small integer values between -128 and 127. It might support smaller and/or larger ranges, but that is also implementation defined.

  • ♿ (Parody)

    @OffByOne said:

    char is for (text) characters.

    Hahaha. char is a byte. If you're using something like ASCII, then happily, bytes are the same as text characters. Code in the wild does all sorts of crazy stuff. YMMV



  • @boomzilla said:

    @OffByOne said:
    char is for (text) characters.

    Hahaha. char is a byte. If you're using something like ASCII, then happily, bytes are the same as text characters. Code in the wild does all sorts of crazy stuff. YMMV

    Hence my remark about being more useful in quantities > 1 (array or pointer to). I could have expressed that better though.

    Handling encoding conversions and using string functions that are aware of the specific encoding of your strings is left as an exercise for the programmer 😄



  • @boomzilla said:

    a char in C is defined as a byte

    This. The C char type is very poorly named. You probably wouldn't expect to use it to hold character data in any code written since 1999.



  • @boomzilla said:

    I'm also not positive who is and isn't trolling whom ~~in this thread~~ on this site in general.



  • @powerlord said:

    given that char is already unsigned

    Says who?



  • @cvi said:

    Well, TIL that you can't make wchar_t signed or unsigned.

    Edit: and the same applies for the C++11 char16_t and char32_t.

    How about an unsigned struct?



  • @boomzilla said:

    Code in the wild does all sorts of crazy stuff.

    The joys of undefined behaviour writ large.



  • @cvi said:

    char isn't necessarily unsigned, that's up to the compiler/platform (IIRC, it's signed by default in GCC, but unsigned in MSVC).

    As for the second part, the standard says so. It affects stuff like function overloads and templates (again IIRC).

    That just makes me want to smack my forehead. Since ASCII characters use code points 0-255, you'd expect the data type intended to store that to also be 0-255.

    I should cross-post this to Things that Dennis Ritchie got wrong.



  • @powerlord said:

    Since ASCII characters use code points 0-255,

    I think that at the time C was born, ASCII only used points 0-127, hence it not mattering if char was signed or unsigned at the time. Because, assumptions remain valid for all of time, right?



  • @tar said:

    The joys of undefined behaviour writ large.

    #undefined behaviour

    No, not much joy there. Maybe not large enough?



  • @tar said:

    I think that at the time C was born, ASCII only used points 0-127, hence it not mattering if char was signed or unsigned at the time. Because, assumptions remain valid for all of time, right?

    ASCII is still only using points 0-127, anything else is an extension of ASCII. :-)


  • Banned

    @blakeyrat said:

    But chars aren't bytes, chars are characters. Those are different.

    I thought you liked strong typing, and now we find you're taking something labeled "char" and putting in arbitrary bytes? Feh.


    It's not my fault they named the byte type char. Also, if you store a Unicode character in a char, you're doing it wrong.

    @Jaloopa said:

    Implementation detail. They're conceptually different no matter what the language

    Except in C they made char literally the byte type. They even defined the sizes of all the other types as multiples of char. I hate C so much.

    @blakeyrat said:

    If you BELIEVE in strong typing, then you should ALSO believe in treating byte arrays and char arrays as two different types. Even if the underlying language's implementation (in this case) is wrong. A char is not a byte. Even if your app is exclusively ASCII, a char is still not a byte.

    I almost agree with you. Almost, because YOU'VE GOT IT BACKWARDS. char ISN'T MEANT TO STORE ARBITRARY CHARACTERS - it's meant to store ARBITRARY BYTES. RTFM from time to time, please.

    I know it's counterintuitive, but in C, char means byte, not character. Even if it was named your_mom, it would still be a byte.

    @blakeyrat said:

    Normally it's not a big deal, but it's really annoying to see it from someone who just went on a tirade about how great strong typing is-- then we find out he doesn't even do it himself! Ugh.

    There is the ideal world where everything is strongly typed and type inference takes care of everything being safe and how you meant it and where you don't have to ever check for null because API contract guarantees it won't happen and where operating system vendors don't assume a filepath is the file identifier and where all windows automatically align on screen in the most optimal way, and there's real life where you need void-casts to make your stuff work with the library.

    @blakeyrat said:

    He doesn't believe in "strong typing" as a general concept. He believes in "do whatever C does". Which is fine, but it's not strong typing.

    Not at all. C is a horrible language. I hate it, I hate it with all my life, I hate every single function in standard library, every single function in POSIX headers, every single WinAPI function, every single character of C code I've ever written. It's one of the worst still-maintained languages there is. And we fucking must use it either directly or through wrappers because almost all low-level libraries are written in it, most significantly the APIs of all major operating systems. But it's still better than Java.


  • ♿ (Parody)

    @Gaska said:

    There is the ideal world where everything is strongly typed and type inference takes care of everything being safe

    INB4 @antiquarian Haskell (or whatever)



  • @Gaska said:

    I almost agree with you. Almost, because YOU'VE GOT IT BACKWARDS. char ISN'T MEANT TO STORE ARBITRARY CHARACTERS - it's meant to store ARBITRARY BYTES. RTFM from time to time, please.

    That might be true now; that was not true when C was designed.


  • ♿ (Parody)

    @blakeyrat said:

    That might be true now; that was not true when C was designed

    There's a timepod rant in this, somewhere.


  • BINNED

    @Gaska said:

    There is the ideal world where everything is strongly typed and type inference takes care of everything being safe and how you meant it

    If I may be allowed a bit of pedantry: the strong typing takes care of everything being safe; type inference is meant to save you typing.



  • @cvi said:

    ASCII is still only using points 0-127, anything else is an extension of ASCII. :-)

    Yes, but even back then Extended ASCII was around.

    Oh, and before I forget, EBCDIC was also around when C was created and has always used code points past 127... heck, lowercase letters start at code point 129.


  • Banned

    @blakeyrat said:

    That might be true now; that was not true when C was designed.

    There was no Unicode when C was designed and char can actually store then-standard ASCII just fine, if that's what you mean. But still, they directly tied byte size to character size from day one.

    @antiquarian said:

    If I may be allowed a bit of pendantry: the strong typing takes care of everything being safe; type inference is meant to save you typing.

    Being safe, as opposed to not compiling, I mean.


  • ♿ (Parody)

    @Gaska said:

    There was no Unicode

    Definitely a mistake to let all those funny characters in.

