C strings

Gąska

@tar said:

Can I do stuff like this with these format strings?

Yes.

@PleegWat said:

I'd hope it can do everything printf can? Placeholder width, precision, truncation. Does it include time formatting?

Sadly, no. Also, the bigger problem is that the format string must be provided at compile time.
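For comparison, here is a rough sketch of what the printf features named in the question (width, precision, truncation) look like in C itself. The helper name `format_row` and the particular format string are made up for illustration. In C, time formatting is handled by `strftime()` rather than by printf.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: formats one row into buf. Returns what snprintf
 * returns, i.e. the length the full text would have had without the
 * size limit of the buffer. */
int format_row(char *buf, size_t cap, const char *name, double value)
{
    assert(buf != NULL && cap > 0 && name != NULL);
    /* %-8.3s : at most 3 bytes of name, left-justified in 8 columns
     * %8.2f  : 2 digits after the point, right-justified in 8 columns */
    return snprintf(buf, cap, "[%-8.3s|%8.2f]", name, value);
}
```

Calling `format_row(buf, sizeof buf, "abcdef", 3.14159)` with a 32-byte buffer writes `[abc     |    3.14]` and returns 19. With a 5-byte buffer the return value is still 19, but only `[abc` plus the terminator is stored.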

FrostCat

@TwelveBaud said:

It's an OS limitation.

Hmmm. I thought some of those limits aren't true under NTFS but am not an expert.

http://technet.microsoft.com/en-us/library/cc781134(v=WS.10).aspx says file names (not paths) can be 255 Unicode characters, FWIW, and I thought that the native functions can get around the limits, but admittedly most people won't use those.

TwelveBaud

I wish Tuxera hadn't taken down their excellent NTFS internals page as it makes finding this information difficult.

...

According to some other guys, NTFS stores the name length in one byte as a count of characters, not of bytes as I'd originally believed. So yes, 255 Unicode characters per path element.

FrostCat

Like I said, I thought I had read years ago that if you use the native API instead--remembering that Windows is, at this level, a personality over NT, like Posix, or the old OS/2 subsystem--the length and character restrictions go away, but I couldn't find anything to back that up on a quick search just now, so it could well be wrong.

dkf

@FrostCat said:

Like I said, I thought I had read years ago that if you use the native API instead--remembering that Windows is, at this level, a personality over NT, like Posix, or the old OS/2 subsystem--the length and character restrictions go away, but I couldn't find anything to back that up on a quick search just now, so it could well be wrong.

Precede the filename with \\?\ to make the length restrictions go away. It says so right here…
[spoiler]Top hit for “windows long file names prefix”…[/spoiler]
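A minimal sketch of building such a path, for anyone who wants to try it. The helper name `with_long_path_prefix` is hypothetical, and the sketch assumes the input is already an absolute, backslash-separated, non-UNC path. It only builds the string; the result would then go to a wide-character Win32 function such as `CreateFileW`.

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: returns a newly allocated copy of abs_path with
 * the \\?\ prefix in front, or NULL if allocation fails.
 * ASSUMPTION: abs_path is absolute, uses backslashes, and is not a UNC path.
 * The caller frees the result. */
wchar_t *with_long_path_prefix(const wchar_t *abs_path)
{
    static const wchar_t prefix[] = L"\\\\?\\";   /* the four characters \\?\ */
    assert(abs_path != NULL);

    size_t prefix_len = wcslen(prefix);
    size_t path_len = wcslen(abs_path);
    wchar_t *out = malloc((prefix_len + path_len + 1) * sizeof *out);
    if (out == NULL)
        return NULL;

    wmemcpy(out, prefix, prefix_len);
    wmemcpy(out + prefix_len, abs_path, path_len + 1);  /* copies the terminator too */
    return out;
}
```

For `L"C:\\dir\\file.txt"` this yields the 19-character string `\\?\C:\dir\file.txt`.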

FrostCat

@dkf said:

Precede the filename with \\?\ to make the length restrictions go away.

It wasn't just that. Like I said, I had thought you could get rid of the restricted characters limit, too.

CreatedToDislikeThis

\\?\ allows you to have files that start/end with dots/spaces and have reserved device names (NUL, etc.).
But you still can't use the reserved characters (such as *) via win32 apis.

FrostCat

@CreatedToDislikeThis said:

\\?\ allows you to have files that start/end with dots/spaces and have reserved device names (NUL, etc.). But you still can't use the reserved characters via win32 apis.

If it mattered that much, to unreserve the characters you could use the \?\ (or like I said, I thought the Native API could do it, although of course that's mostly undocumented) instead of the win32 apis.

Time to switch to the Bad Ideas thread?

Buddy

Better to use ContainsKey for that anyway. Nulls make better sense as an optional extra than as default functionality; 99% of variables never need to hold a null, and having to wrap the ones that do in Nullable<T> serves the extra purpose of pretty much forcing a null check before anyone can even get at the value.

OffByOne

@Gaska said:

NAME
strlen - calculate the length of a string

SYNOPSIS
#include <string.h>
   size_t strlen(const char *s);

DESCRIPTION
The strlen() function calculates the length of the string s, excluding the terminating null byte ('\0').

RETURN VALUE
The strlen() function returns the number of bytes in the string s.

Everything works according to spec. Again, it's the programmer's fault to assume the length of a string is the number of characters in it.

I've emboldened the ambiguous parts. "length of string" can mean quite a few things:

number of characters (counting combining characters as separate characters)
number of screen spaces occupied (combining characters don't contribute to the result)
number of "screen cells" occupied (result += 1 for most characters, but result += 2 for wide characters)
number of bytes used in memory for the string in the particular encoding/normalization it is represented in (this seems to be what strlen() calculates)
... ?

I also wouldn't assume anything wrt how strlen() behaves with strings where \0 is a valid part of the encoding of some characters; counting bytes until the first \0 is incorrect for encodings that have those.

Of course the programmer shouldn't use strlen() unless what he's interested in is what strlen() calculates, especially when dealing with strings that are not entirely composed of US-ASCII characters.
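To make the first and last meanings concrete, here is a small sketch that compares the byte count from strlen() with a code point count. It assumes the input is valid UTF-8, which nothing in the thread guarantees, and the function names are made up for illustration. Screen cells and combining characters are not handled at all.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Counts code points in a NUL-terminated string, ASSUMING valid UTF-8:
 * every byte that is not a continuation byte (10xxxxxx) starts one. */
size_t utf8_code_points(const char *s)
{
    assert(s != NULL);
    size_t count = 0;
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++count;
    }
    return count;
}

/* Number of bytes strlen() reports beyond the code point count. */
size_t utf8_continuation_bytes(const char *s)
{
    return strlen(s) - utf8_code_points(s);
}
```

For `"G\xC4\x85ska"` (Gąska in UTF-8), strlen() gives 6 and `utf8_code_points()` gives 5. For `"a\xCC\x81"` (a followed by a combining acute accent), the results are 3 bytes and 2 code points, although it displays as one character.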

Gąska

@OffByOne said:

I've emboldened the ambiguous parts. "length of string" can mean quite a few things:

Don't look at the description but at the return value, because your code doesn't deal with the description but with the return value. And the return value is unambiguously documented as the number of bytes.

@OffByOne said:

I also wouldn't assume anything wrt how strlen() behaves with strings where \0 is a valid part of the encoding of some characters; counting bytes until the first \0 is incorrect for encodings that have those.

Except ISO C forbids \0 in text strings.

@OffByOne said:

Of course the programmer shouldn't use strlen() unless what he's interested in is what strlen() calculates, especially when dealing with strings that are not entirely composed of US-ASCII characters.

strlen() is usually used to determine how far you can iterate from the pointer or how much memory you need to allocate when copying rather than how much screen space will be occupied.
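The copying case can be sketched like this. The function name `copy_string` is just for illustration; the point is that the byte count is exactly what the allocation needs, whatever encoding the bytes happen to be in.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Returns a heap copy of s, or NULL if allocation fails.
 * The caller frees the result. */
char *copy_string(const char *s)
{
    assert(s != NULL);
    size_t n = strlen(s);          /* bytes before the terminator */
    char *copy = malloc(n + 1);    /* +1 for the terminating '\0' */
    if (copy == NULL)
        return NULL;
    memcpy(copy, s, n + 1);        /* copies the terminator as well */
    return copy;
}
```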

TwelveBaud

@OffByOne said:

I also wouldn't assume anything wrt how strlen() behaves with strings where \0 is a valid part of the encoding of some characters; counting bytes until the first \0 is incorrect for encodings that have those.

Which is why you use wcslen() or mb_strlen() instead. Right tool for the right job.
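A short sketch of the wcslen() side. It counts wchar_t elements rather than bytes, so the storage size has to be computed separately; the helper name `wide_storage_bytes` is hypothetical. As far as I know, mb_strlen() is a PHP function rather than part of standard C, so it is left out here.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Bytes needed to store a wide string, including its terminator. */
size_t wide_storage_bytes(const wchar_t *ws)
{
    assert(ws != NULL);
    return (wcslen(ws) + 1) * sizeof(wchar_t);
}
```

`wcslen(L"abc")` is 3 on every platform, while `wide_storage_bytes(L"abc")` depends on `sizeof(wchar_t)`.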

tar

@OffByOne said:

strlen()... dealing with strings that are not entirely composed of US-ASCII characters

That sounds like UB to me.

Kian

Maybe. Depends on whether there is a null terminator at all. So long as you have the null terminator inside valid memory, it's not.

First, one must understand what a string means in C. A string is a null terminated sequence of bytes. It is not a sentence, or text, or anything else. It's not whatever the programmer thinks he's passing to the function. If you use a multibyte encoding that allows '\0' characters, the first c-string in the memory will go from the first memory address you gave the function to the first '\0' it encounters. If the first byte is '\0', it's considered to be an empty string of length 0. So, if you have the array:

{ 'a', 'b', 'c', 0, '1', '2', '3', 0 }

You have two c strings, one starts at 'a', and is of length 3 (exclude the null terminator in the length), and the second one starts at '1' and is also length 3.

Similarly, if you hand it a multibyte encoding, strlen will return the number of bytes until the first null in the array.
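The array above can be checked directly. This is a small sketch; the helper `next_packed_string` is made up for illustration and relies on the caller knowing that a second terminated string really follows the first.

```c
#include <assert.h>
#include <string.h>

/* The array from the post: one array holding two NUL-terminated strings. */
static const char packed[] = { 'a', 'b', 'c', 0, '1', '2', '3', 0 };

/* Returns a pointer to the string that follows s inside the same array.
 * ASSUMPTION: another terminated string is actually stored there. */
const char *next_packed_string(const char *s)
{
    assert(s != NULL);
    return s + strlen(s) + 1;   /* skip the bytes and the terminator */
}
```

`strlen(packed)` is 3, `next_packed_string(packed)` points at `'1'`, and strlen() of that pointer is also 3, even though the array itself is 8 bytes long.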

tar

I understand what strlen() does with a char*, I'm just not really sure why anyone would be interested in the result it gives when the pointed-to string is a multibyte encoding.

PleegWat

@Kian said:

You have two c arrays, one starts at 'a', and is of length 3 (exclude the null terminator in the length), and the second one starts at '1' and is also length 3.

Nitpick: 2 strings. One array.

Kian

Just pointing out it's not UB, unless there is no null terminating character. It may not be useful, and it may result in a bug in the program, but it's a well specified bug.

@PleegWat said:

Nitpick: 2 strings. One array

Corrected. Good catch. I meant to say string but got it mixed up.

tar

@Kian said:

a well specified bug

Those are the best kinds of bugs.

PleegWat

@tar said:

I understand what strlen() does with a char*, I'm just not really sure why anyone would be interested in the result it gives when the pointed-to string is a multibyte encoding.

That all depends on what your program intends to do. If you are interested more in the strings as a whole than in the individual characters, strlen() is the function you want because it tells you how much memory to allocate.

Character counts are mainly interesting when doing position-based substring operations, determining length limits (EG when inserting into a database column with character-based length limit), etc.
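A sketch of the length-limit case: finding how many bytes of a string fit within a limit expressed in characters. It assumes valid UTF-8 and treats one code point as one character, which may not match what a given database means by "character". The function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the number of leading bytes of s that hold at most
 * max_code_points code points. ASSUMES s is valid UTF-8. */
size_t utf8_prefix_bytes(const char *s, size_t max_code_points)
{
    assert(s != NULL);
    size_t bytes = 0;
    size_t seen = 0;
    while (s[bytes] != '\0') {
        /* Any byte that is not 10xxxxxx begins a new code point. */
        int starts_code_point = ((unsigned char)s[bytes] & 0xC0) != 0x80;
        if (starts_code_point) {
            if (seen == max_code_points)
                break;          /* the next code point would exceed the limit */
            ++seen;
        }
        ++bytes;
    }
    return bytes;
}
```

With `"G\xC4\x85ska"` and a limit of 2, the result is 3 bytes: one for G and two for ą. Cutting at 2 bytes instead would split ą in half.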

Kian

@PleegWat said:

That all depends on what your program intends to do. If you are interested more in the strings as a whole than in the individual characters, strlen() is the function you want because it tells you how much memory to allocate.

Well, that's the whole point of what's being discussed. If you give strlen a sequence of characters with a multibyte encoding, such as UTF16, some of your bytes are going to be null because you are supposed to read characters many bytes at a time. So you're not going to receive the size you need to allocate, which is going to lead to bugs. tar wondered if that was UB, I explained it was not. It is well defined, but wrong.
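A sketch of that situation, with `uint16_t` standing in for a UTF-16 code unit type (an assumption made to keep the example portable) and a made-up function name. Reading any object through a char pointer is allowed, and a terminated UTF-16 buffer always contains a zero byte, so the call is defined; the answer is simply not the size of the text.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* "abc" as UTF-16 code units plus a terminating zero unit.
 * ASSUMPTION: uint16_t stands in for char16_t here. */
static const uint16_t utf16_abc[] = { 0x0061, 0x0062, 0x0063, 0x0000 };

/* Runs strlen over the raw bytes of a zero-terminated UTF-16 buffer.
 * It stops at the first zero BYTE, not the first zero code unit. */
size_t strlen_over_utf16_bytes(const uint16_t *units)
{
    assert(units != NULL);
    return strlen((const char *)units);
}
```

For `utf16_abc` the result is 1 on a little-endian machine (bytes 61 00 ...) and 0 on a big-endian one (bytes 00 61 ...), while the buffer actually occupies 8 bytes.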

tar

I think the fact that it is wrong is probably more significant than the fact that it is defined behaviour though.

boomzilla

@tar said:

I think the fact that it is wrong is probably more significant than the fact that it is defined behaviour though.

Allow me to rephrase...

It's FUCKING WRONG you asshole. Defining behavior to be WRONG is still wrong and you're a terrible fucking person for liking stuff being wrong. This is why everything is shit!

tar

I feel strangely aroused now...

Kian

@boomzilla said:

It's FUCKING WRONG you asshole. Defining behavior to be WRONG is still wrong and you're a terrible fucking person for liking stuff being wrong.

I'm not sure who you're addressing this to.

boomzilla

@Kian said:

I'm not sure who you're addressing this to.

If you don't know then it's not you.

antiquarian

Nice blakeyrant.

boomzilla

Thanks.

dkf

@OffByOne said:

number of "screen cells" occupied

That one doesn't even work for ASCII.

Gąska

@Kian said:

If you give strlen a sequence of characters with a multibyte encoding, such as UTF16

Then it means you've cast char16_t into char and are treating it as a char string instead of a char16_t string. THAT'S your problem, not strlen().

Kian

No one said strlen was the problem. The person that came closest to that was OffByOne, and even they clarified:

@OffByOne said:

Of course the programmer shouldn't use strlen() unless what he's interested in is what strlen() calculates, especially when dealing with strings that are not entirely composed of US-ASCII characters.

Aside from that, the problem is not just that the type is different. The problem¹ is that a c-string has specific rules that are not necessarily enforced by every pointer to char16_t even. I could pack several c-strings of valid UTF-16 text into a single array of char16_t, one after the other, and if I use strlen thinking it will walk the array until the end of the array I would get surprising and unexpected behavior, which is wrong for your program but well defined according to the language. Which is what I meant to highlight before.

¹ Problem meaning "the tricky bit that catches beginners unaware". It's not a problem in itself.

tar

@Kian said:

I could pack several c-strings of valid UTF-16 text into a single array of char16_t, one after the other, and if I use strlen

Sorry to cut you off mid-sentence, but why would you want to use strlen() on a char16_t*?

Kian

Assuming there's a version of strlen for char16_t that behaves similarly, giving you the number of words instead of bytes. Should have clarified that.
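Such a function might look like the sketch below. The name `u16len` is hypothetical, and `uint16_t` stands in for `char16_t` as an assumption. It counts 16-bit code units up to the first zero unit, so it stops at the end of the first packed string rather than at the end of the array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical counterpart to strlen for 16-bit code units: counts the
 * units before the first zero unit. It counts units, not characters,
 * so a surrogate pair contributes 2. */
size_t u16len(const uint16_t *s)
{
    assert(s != NULL);
    size_t n = 0;
    while (s[n] != 0)
        ++n;
    return n;
}
```

For an array holding `{ 'h', 'i', 0, 'y', 'o', 'u', 0 }`, `u16len()` of the start is 2, not 6, and `u16len()` of the element after the first terminator is 3.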

Gąska

@Kian said:

words

Always loved that term. It's so wrong on almost every machine.

Kian

Naming things is hard.

Gąska

Yes it is. So what?

OffByOne

I meant modulo control characters; just characters with values >31 and <128.

powerlord

Wait, are we talking about C or C++ here?

C++ seems to have std::char_traits<char16_t>::length(char16_t*) for returning the number of characters instead of bytes.

(and of course std::char_traits<char32_t>::length(char32_t*) for char32_t since it's actually a template)

lucas

Great another thread about stuff that only C++ programmers understand ... ;-)

tar

It's a very elite club...

tar

(Not to imply that knowledge of dark C++ secrets is intrinsically of any value...)

boomzilla

@lucas said:

Great another thread about stuff that only C++ programmers understand ... ;-)

Better than the PHP threads.

accalia

@boomzilla said:

Better than the PHP threads.

we need more of those....

boomzilla

@accalia said:

we need more of those...

I never understand those. Some of the C/C++ goes over my head, but I learn from them. MEGO during PHP time.

accalia

@boomzilla said:

MEGO during PHP time.

mine too.

but i want to see the person who would be posting those topics more than i want the topics.

he's been absent too long. :-(

dcon

@tar said:

It's a very elite club...

Pays pretty well too

lucas

It reminds me of old gentlemen talking about arbitrary classifications on things that nobody else in the pub understands.

tar

Old men can be pretty racist when they've had a few...

dkf

@OffByOne said:

characters with values >31 and <128

Awesome! We've still got a DEL (127)…

(The term you were looking for is printable ASCII characters.)
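As a sketch, the printable ASCII range runs from space (32) through tilde (126), which leaves out DEL. The function names are made up for illustration, and the character literals in the usage assume an ASCII-compatible execution character set.

```c
#include <assert.h>
#include <stddef.h>

/* True for space (0x20) through tilde (0x7E); DEL (0x7F) is excluded. */
int is_printable_ascii(unsigned char c)
{
    return c >= 0x20 && c <= 0x7E;
}

/* True if every byte of the NUL-terminated string is printable ASCII. */
int all_printable_ascii(const char *s)
{
    assert(s != NULL);
    for (; *s != '\0'; ++s) {
        if (!is_printable_ascii((unsigned char)*s))
            return 0;
    }
    return 1;
}
```

`all_printable_ascii("hello, world")` is true, while a string containing a tab, a DEL, or any UTF-8 multibyte sequence is rejected.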

Gąska

@lucas said:

Great another thread about stuff that only C++ programmers understand ...

Because only C++ has those problems.