C strings


  • Banned

    @tar said:

    Can I do stuff like this with these format strings?

    Yes.

    @PleegWat said:

    I'd hope it can do everything printf can? Placeholder width, precision, truncation. Does it include time formatting?

    Sadly, no. Also, the bigger problem is that format string must be provided at compile-time.


  • Discourse touched me in a no-no place

    @TwelveBaud said:

    It's an OS limitation.

    Hmmm. I thought some of those limits aren't true under NTFS but am not an expert.

    http://technet.microsoft.com/en-us/library/cc781134(v=WS.10).aspx says file names (not paths) can be 255 Unicode characters, FWIW, and I thought that the native functions can get around the limits, but admittedly most people won't use those.



  • I wish Tuxera hadn't taken down their excellent NTFS internals page as it makes finding this information difficult.

    ...

    According to some other guys, NTFS stores the length of the name in a single byte as a count of characters, not bytes as I'd originally believed. So yes, 255 Unicode characters per path element.


  • Discourse touched me in a no-no place

    Like I said, I thought I had read years ago that if you use the native API instead--remembering that Windows is, at this level, a personality over NT, like Posix, or the old OS/2 subsystem--the length and character restrictions go away, but I couldn't find anything to back that up on a quick search just now, so it could well be wrong.


  • Discourse touched me in a no-no place

    @FrostCat said:

    Like I said, I thought I had read years ago that if you use the native API instead--remembering that Windows is, at this level, a personality over NT, like Posix, or the old OS/2 subsystem--the length and character restrictions go away, but I couldn't find anything to back that up on a quick search just now, so it could well be wrong.

    Precede the filename with \\?\ to make the length restrictions go away. It says so right here
    [spoiler]Top hit for “windows long file names prefix”…[/spoiler]


  • Discourse touched me in a no-no place

    @dkf said:

    Precede the filename with \\?\ to make the length restrictions go away.

    It wasn't just that. Like I said, I had thought you could get rid of the restricted characters limit, too.



  • \\?\ allows you to have files that start/end with dots/spaces and have reserved device names (NUL, etc.).
    But you still can't use the reserved characters (such as *) via win32 apis.


  • Discourse touched me in a no-no place

    @CreatedToDislikeThis said:

    \\?\ allows you to have files that start/end with dots/spaces and have reserved device names (NUL, etc.). But you still can't use the reserved characters via win32 apis.

    If it mattered that much, to unreserve the characters you could use the \\?\ (or like I said, I thought the Native API could do it, although of course that's mostly undocumented) instead of the win32 apis.

    Time to switch to the Bad Ideas thread? 😄



  • Better to use ContainsKey for that anyway. Nulls make better sense as an optional extra than as default functionality; 99% of variables never need to hold a null, and having to wrap the ones that do in Nullable<T> serves the extra purpose of pretty much forcing a null check before anyone can even get at the value.



  • @Gaska said:

    NAME
    strlen - calculate the length of a string

    SYNOPSIS
    #include <string.h>

       size_t strlen(const char *s);
    

    DESCRIPTION
    The strlen() function calculates the length of the string s, excluding the terminating null byte ('\0').

    RETURN VALUE
    The strlen() function returns the number of bytes in the string s.


    Everything works according to spec. Again, it's the programmer's fault for assuming the length of a string is the number of characters in it.

    I've emboldened the ambiguous parts. "length of string" can mean quite a few things:

    • number of characters (counting combining characters as separate characters)
    • number of screen spaces occupied (combining characters don't contribute to the result)
    • number of "screen cells" occupied (result += 1 for most characters, but result += 2 for wide characters)
    • number of bytes used in memory for the string in the particular encoding/normalization it is represented in (this seems to be what strlen() calculates)
    • ... ?

    I also wouldn't assume anything wrt how strlen() behaves with strings where \0 is a valid part of the encoding of some characters; counting bytes until the first \0 is incorrect for encodings that have those.

    Of course the programmer shouldn't use strlen() unless what he's interested in is what strlen() calculates, especially when dealing with strings that are not entirely composed of US-ASCII characters.
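To make the gap between "bytes" and "characters" concrete, here is a minimal sketch, assuming the input is valid UTF-8. `utf8_codepoints` is a made-up helper for illustration, not a standard function, and it counts code points, so combining characters still count separately.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: counts Unicode code points in a NUL-terminated
 * string, assuming the bytes are valid UTF-8. Continuation bytes have
 * the bit pattern 10xxxxxx, so every byte that does not match that
 * pattern starts a new code point. */
size_t utf8_codepoints(const char *s)
{
    assert(s != NULL);
    size_t count = 0;
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

For the UTF-8 bytes of "naïve" (`"na" "\xC3\xAF" "ve"`), strlen() gives 6 while `utf8_codepoints` gives 5. Neither number says anything about screen cells.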


  • Banned

    @OffByOne said:

    I've emboldened the ambiguous parts. "length of string" can mean quite a few things:

    Don't look at the description but at the return value, because your code doesn't deal with the description but with the return value. And the return value is unambiguously documented as a number of bytes.

    @OffByOne said:

    I also wouldn't assume anything wrt how strlen() behaves with strings where \0 is a valid part of the encoding of some characters; counting bytes until the first \0 is incorrect for encodings that have those.

    Except ISO C forbids \0 in text strings.

    @OffByOne said:

    Of course the programmer shouldn't use strlen() unless what he's interested in is what strlen() calculates, especially when dealing with strings that are not entirely composed of US-ASCII characters.

    strlen() is usually used to determine how far you can iterate from the pointer or how much memory you need to allocate when copying rather than how much screen space will be occupied.



  • @OffByOne said:

    I also wouldn't assume anything wrt how strlen() behaves with strings where \0 is a valid part of the encoding of some characters; counting bytes until the first \0 is incorrect for encodings that have those.
    Which is why you use wcslen() or mb_strlen() instead. Right tool for the right job.



  • @OffByOne said:

    strlen()... dealing with strings that are not entirely composed of US-ASCII characters

    That sounds like UB to me.



  • Maybe. Depends on whether there is a null terminator at all. So long as you have the null terminator inside valid memory, it's not.

    First, one must understand what a string means in C. A string is a null terminated sequence of bytes. It is not a sentence, or text, or anything else. It's not whatever the programmer thinks he's passing to the function. If you use a multibyte encoding that allows '\0' characters, the first c-string in the memory will go from the first memory address you gave the function to the first '\0' it encounters. If the first byte is '\0', it's considered to be an empty string of length 0. So, if you have the array:

    { 'a', 'b', 'c', 0, '1', '2', '3', 0 }
    

    You have two c strings, one starts at 'a', and is of length 3 (exclude the null terminator in the length), and the second one starts at '1' and is also length 3.

    Similarly, if you hand it a string in a multibyte encoding, strlen will return the number of bytes before the first null byte in the array.
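The two-strings-in-one-array case can be sketched like this. `next_packed_string` is a hypothetical helper, and it only makes sense when the caller already knows another terminated string follows in the same buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: given a string inside a buffer that holds several
 * NUL-terminated strings back to back, return the start of the next one.
 * The caller must guarantee that another string really follows. */
const char *next_packed_string(const char *s)
{
    assert(s != NULL);
    /* strlen() stops at the first '\0'; the next string starts one byte later. */
    return s + strlen(s) + 1;
}
```

With `const char buf[] = { 'a', 'b', 'c', 0, '1', '2', '3', 0 };`, strlen(buf) is 3, and `next_packed_string(buf)` points at the '1', where strlen() gives 3 again.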



  • I understand what strlen() does with a char*, I'm just not really sure why anyone would be interested in the result it gives when the pointed-to string is a multibyte encoding.


  • Java Dev

    @Kian said:

    You have two c arrays, one starts at 'a', and is of length 3 (exclude the null terminator in the length), and the second one starts at '1' and is also length 3.

    Nitpick: 2 strings. One array.



  • Just pointing out it's not UB, unless there is no null terminating character. It may not be useful, and it may result in a bug in the program, but it's a well specified bug.

    @PleegWat said:

    Nitpick: 2 strings. One array

    Corrected. Good catch. I meant to say string but got it mixed up.



  • @Kian said:

    a well specified bug

    Those are the best kinds of bugs.


  • Java Dev

    @tar said:

    I understand what strlen() does with a char*, I'm just not really sure why anyone would be interested in the result it gives when the pointed-to string is a multibyte encoding.

    That all depends on what your program intends to do. If you are interested more in the strings as a whole than in the individual characters, strlen() is the function you want because it tells you how much memory to allocate.

    Character counts are mainly interesting when doing position-based substring operations, determining length limits (EG when inserting into a database column with character-based length limit), etc.
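A rough sketch of the "how much memory to allocate" use, assuming the string has no embedded null bytes. `copy_string` is a made-up name (POSIX already has strdup() for this).

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a heap-allocated copy of s, or NULL if
 * the allocation fails. The caller owns the result and must free() it. */
char *copy_string(const char *s)
{
    assert(s != NULL);
    size_t bytes = strlen(s) + 1;   /* byte count plus the terminator */
    char *copy = malloc(bytes);
    if (copy != NULL)
        memcpy(copy, s, bytes);
    return copy;
}
```

This works the same for ASCII and UTF-8 text, because only the byte count matters here, not the number of characters.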



  • @PleegWat said:

    That all depends on what your program intends to do. If you are interested more in the strings as a whole than in the individual characters, strlen() is the function you want because it tells you how much memory to allocate.

    Well, that's the whole point of what's being discussed. If you give strlen a sequence of characters with a multibyte encoding, such as UTF-16, some of your bytes are going to be null because you are supposed to read characters many bytes at a time. So you're not going to receive the size you need to allocate, which is going to lead to bugs. tar wondered if that was UB, I explained it was not. It is well defined, but wrong.
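A small sketch of that failure, using raw UTF-16 bytes so that endianness of the machine doesn't matter. `utf16_byte_length` is a hypothetical helper that assumes the buffer ends with a 16-bit zero terminator.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: byte length of UTF-16 text stored as raw bytes,
 * excluding the two-byte terminator. It looks at bytes in pairs and
 * stops only when both bytes of a pair are zero. */
size_t utf16_byte_length(const unsigned char *bytes)
{
    assert(bytes != NULL);
    size_t i = 0;
    while (bytes[i] != 0 || bytes[i + 1] != 0)
        i += 2;
    return i;
}
```

For `const unsigned char ab[] = { 'a', 0, 'b', 0, 0, 0 };` (the UTF-16LE bytes of "ab"), `strlen((const char *)ab)` is 1 because it stops at the high byte of 'a', while `utf16_byte_length(ab)` is 4.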



  • I think the fact that it is wrong is probably more significant than the fact that it is defined behaviour though.


  • ♿ (Parody)

    @tar said:

    I think the fact that it is wrong is probably more significant than the fact that it is defined behaviour though.

    Allow me to rephrase...

    It's FUCKING WRONG you asshole. Defining behavior to be WRONG is still wrong and you're a terrible fucking person for liking stuff being wrong. This is why everything is shit!



  • I feel strangely aroused now...



  • @boomzilla said:

    It's FUCKING WRONG you asshole. Defining behavior to be WRONG is still wrong and you're a terrible fucking person for liking stuff being wrong.

    I'm not sure who you're addressing this to.


  • ♿ (Parody)

    @Kian said:

    I'm not sure who you're addressing this to.

    If you don't know then it's not you.


  • BINNED

    Nice blakeyrant.


  • ♿ (Parody)

    Thanks. 😄


  • Discourse touched me in a no-no place

    @OffByOne said:

    number of "screen cells" occupied

    That one doesn't even work for ASCII.


  • Banned

    @Kian said:

    If you give strlen a sequence of characters with a multibyte encoding, such as UTF-16

    Then it means you've cast a char16_t* to char* and are treating it as a char string instead of a char16_t string. THAT'S your problem, not strlen().



  • No one said strlen was the problem. The person that came closest to that was OffByOne, and even they clarified:
    @OffByOne said:

    Of course the programmer shouldn't use strlen() unless what he's interested in is what strlen() calculates, especially when dealing with strings that are not entirely composed of US-ASCII characters.

    Aside from that, the problem is not just that the type is different. The problem[1] is that a c-string has specific rules that are not necessarily enforced even by every pointer to char16_t. I could pack several c-strings of valid UTF-16 text into a single array of char16_t, one after the other, and if I use strlen thinking it will walk the array until the end of the array I would get surprising and unexpected behavior, which is wrong for your program but well defined according to the language. Which is what I meant to highlight before.

    [1] Problem meaning "the tricky bit that catches beginners unaware". It's not a problem in itself.


  • @Kian said:

    I could pack several c-strings of valid UTF-16 text into a single array of char16_t, one after the other, and if I use strlen

    Sorry to cut you off mid-sentence, but why would you want to use strlen() on a char16_t*?



  • Assuming there's a version of strlen for char16_t that behaves similarly, giving you the number of words instead of bytes. Should have clarified that.
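Standard C has no such function as far as I know, so here is a sketch of what one might look like. `u16len` is a made-up name, and it assumes a C11 compiler with <uchar.h>.

```c
#include <assert.h>
#include <stddef.h>
#include <uchar.h>

/* Hypothetical strlen() analogue for char16_t strings: returns the
 * number of 16-bit code units before the first zero code unit. A
 * character outside the BMP is a surrogate pair and counts as 2. */
size_t u16len(const char16_t *s)
{
    assert(s != NULL);
    size_t n = 0;
    while (s[n] != 0)
        ++n;
    return n;
}
```

With `const char16_t packed[] = { 'a', 'b', 0, 'c', 0 };`, `u16len(packed)` is 2, not 4. Like strlen(), it stops at the first terminator rather than at the end of the array.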


  • Banned

    @Kian said:

    words

    Always loved that term. It's so wrong on almost every machine.



  • Naming things is hard.


  • Banned

    Yes it is. So what?



  • I meant modulo control characters; just characters with values >31 and <128.



  • Wait, are we talking about C or C++ here?

    C++ seems to have std::char_traits<char16_t>::length(const char16_t*) for returning the number of char16_t code units instead of bytes.

    (and of course std::char_traits<char32_t>::length(char32_t*) for char32_t since it's actually a template)



  • Great another thread about stuff that only C++ programmers understand ... ;-)



  • It's a very elite club...



  • (Not to imply that knowledge of dark C++ secrets is intrinsically of any value...)


  • ♿ (Parody)

    @lucas said:

    Great another thread about stuff that only C++ programmers understand ... ;-)

    Better than the PHP threads.


  • FoxDev

    @boomzilla said:

    Better than the PHP threads.

    we need more of those....


  • ♿ (Parody)

    @accalia said:

    we need more of those...

    I never understand those. Some of the C/C++ goes over my head, but I learn from them. MEGO during PHP time.


  • FoxDev

    @boomzilla said:

    MEGO during PHP time.

    mine too.

    but i want to see the person who would be posting those topics more than i want the topics.

    he's been absent too long. :-(



  • @tar said:

    It's a very elite club...

    Pays pretty well too 😄



  • It reminds me of old gentlemen talking about arbitrary classifications on things that nobody else in the pub understands.



  • Old men can be pretty racist when they've had a few...


  • Discourse touched me in a no-no place

    @OffByOne said:

    characters with values >31 and <128

    Awesome! We've still got a DEL (127)…

    (The term you were looking for is printable ASCII characters.)
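For what it's worth, a locale-independent check for that range could be sketched as below. Both function names are made up, and the range 0x20 to 0x7E deliberately leaves out DEL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true for printable ASCII, i.e. space (0x20)
 * through tilde (0x7E). DEL (0x7F) and control characters are excluded. */
bool is_printable_ascii(unsigned char c)
{
    return c >= 0x20 && c <= 0x7E;
}

/* Hypothetical helper: counts the printable ASCII bytes in a string. */
size_t count_printable_ascii(const char *s)
{
    assert(s != NULL);
    size_t count = 0;
    for (; *s != '\0'; ++s) {
        if (is_printable_ascii((unsigned char)*s))
            ++count;
    }
    return count;
}
```

Only for strings made entirely of such bytes does the strlen() result also match the number of characters.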


  • Banned

    @lucas said:

    Great another thread about stuff that only C++ programmers understand ...

    Because only C++ has those problems.

