Windows, Unicode file names and C++
-
I have an old C++ command-line application, originally POSIX-only, which I want to port to Windows. The application:
- Uses char* and std::string everywhere, not wchar_t* or std::wstring. It also uses a regular main (main(int argc, char** argv)).
- Reads argv directly instead of calling GetCommandLineW().
- Uses the standard C++ APIs (ifstream, ofstream) to access files, sometimes also the C APIs. I've already made sure that files are always opened in binary mode. This is an application which reads from a file or stdin, transforms the contents and writes to a file or stdout.
- Will not have to use any Windows-specific APIs except _spawn* as a replacement for fork()+exec*().
Will my application be able to handle Unicode in file names correctly? Or will it behave differently than expected?
Or to put it another way: If I'm Doing It Wrong™ consistently, will Windows magically do the right thing?
-
@asdf As long as you don't have any non-English characters anywhere, you should be good.
Serious answer: either use WinAPI, or change all strings and streams to wstrings and wstreams and pray for it to work for you.
-
@asdf Unlikely without switching away from char * for anything that handles a filename.
You may well get away with it but any unicode character with a zero byte in it will get treated as the EOS marker.
You mentioned that the file is processed as binary data - do you read text from the file at all?
If not then hopefully you just need to worry about getting the strings from the command line to the open statement, unmolested.
-
@skotl said in Windows, Unicode file names and C++:
You may well get away with it but any unicode character with a zero byte in it will get treated as the EOS marker.
Hm, are you sure Windows doesn't convert the argument encoding for argv? Because that's what some SO answer I read yesterday suggested.
-
@skotl said in Windows, Unicode file names and C++:
You mentioned that the file is processed as binary data - do you read text from the file at all?
Not really. Everything in the file that's relevant to the processing is ASCII, the rest gets copied verbatim.
-
@asdf said in Windows, Unicode file names and C++:
@skotl said in Windows, Unicode file names and C++:
You may well get away with it but any unicode character with a zero byte in it will get treated as the EOS marker.
Hm, are you sure Windows doesn't convert the argument encoding for argv? Because that's what some SO answer I read yesterday suggested.
Example of a WTF-16 character that contains a zero byte: every character in this post.
-
@ben_lubar However there won't be WTF-16 being passed around by those char* variables; even a pretty dumb developer would spot that problem. What sort of mangling they use instead… well, that's the real question.
-
@dkf said in Windows, Unicode file names and C++:
What sort of mangling they use instead… well, that's the real question.
And especially: Is the conversion consistent everywhere as long as I stick to the C/C++ API? Is the reverse transformation performed when I open a file and pass a char*, or does the filename change when it's converted back to UTF-16?
-
@asdf said in Windows, Unicode file names and C++:
are you sure Windows doesn't convert the argument encoding for argv?
If it does, it's highly unlikely to do so using anything as sane as UTF-8. It's almost certain to do it in a way that depends on the current code page setting.
-
Use this. A really nice library, and cross-platform. I can hardly think of writing a C++ application without the CppRest library, and not just for ifstream_t.
-
@dse said in Windows, Unicode file names and C++:
Microsoft
@dse said in Windows, Unicode file names and C++:
cross-platform
Just five years ago, that would have been unthinkable.
-
If you use the standard library that comes with Visual Studio, the .open() member and the constructors for file streams have non-standard overloads for wide characters, which you can use to open files that have Unicode characters in their filenames. And that's all I know.
-
@skotl said in Windows, Unicode file names and C++:
You mentioned that the file is processed as binary data - do you read text from the file at all?
In C++, "binary file" means "don't fuck up line endings".
-
@RaceProUK said in Windows, Unicode file names and C++:
Just five years ago, that would have been unthinkable.
It was unthinkable exactly while Ballmer was in charge. It was obvious to me that that was where the policy came from, and I've since had it confirmed directly by MS employees. Neither Gates nor Nadella is nearly so committed to using a single platform without regard for what's good for the long-term health of the company.
-
@Gąska said in Windows, Unicode file names and C++:
In C++, "binary file" means "don't fuck up line endings".
Probably. These days, it might do other things too (such as disabling encoding conversion and clearing the default locale) but with binary data, you're not really supposed to do anything with it where the non-binary operations would matter all that much. The exact meaning of binary in the C++ spec isn't very clear (as on some platforms it's pretty close to a no-op) but it does convey the intention to manipulate binary data clearly to the runtime; the order of other operations wouldn't necessarily do that until it is far too late.
-
@dkf said in Windows, Unicode file names and C++:
@RaceProUK said in Windows, Unicode file names and C++:
Just five years ago, that would have been unthinkable.
It was unthinkable exactly while Ballmer was in charge. It was obvious to me that that was where the policy came from, and I've since had it confirmed directly by MS employees. Neither Gates nor Nadella is nearly so committed to using a single platform without regard for what's good for the long-term health of the company.
I never really saw Gates as a fan of open source, but then I never saw him as against it either. To me, he just kinda ignored it and let it be. Nadella though, he's definitely for it.
-
@dkf Last time I checked, which was a year or two ago, the text/binary setting worked such that on Windows, text files had "\r\n" converted to (or from?) "\n", and on *NIX systems it was ignored.
Anyway, everyone should always use binary mode under all and any circumstances, because it means reading exactly what you have in the file.
-
@Gąska said in Windows, Unicode file names and C++:
Anyway, everyone should always use binary mode under all and any circumstances, because it means reading exactly what you have in the file.
So you're processing some text content and you've decided you care a lot whether the file used newline or carriage-return-newline?
-
@dkf if you use text mode, you can't control the EOL characters in output. Which means copying a file byte by byte might result in altered content. It's 2016 and we shouldn't care about this, but there are many Linux programs that don't work if you have \r\n line endings in your file, and many Windows programs that don't work with just \n. So if your program doesn't preserve line endings unless explicitly asked to, I'd call it a bug. Especially important if your program works well in pipe chains.
-
@Gąska said in Windows, Unicode file names and C++:
byte by byte
If you wanted byte by byte, you'd use binary. If you wanted character by character, you'd use text. They are different, sometimes enormously so. Unfortunately, C's idea of “character” is from 1970 — a time when men were real men, real programmers didn't eat quiche, and the big debate was still over ASCII vs EBCDIC — but it's still a nice try.
-
@Gąska said in Windows, Unicode file names and C++:
So if your program doesn't preserve line endings unless explicitly asked to, I'd call it a bug. Especially important if your program works well in pipe chains.
Then there's weirdness like this one from Windows cmd script:

for /f "tokens=1,2* delims== " %%A in (
    'wmic cdrom get Name /value'
) do if "%%A" == "Name" (
    set manuf=%%B
    set model=%%C
)

Turns out that model will always end up with a spurious CR character stuck to the end. This happens as the wmic command's output in WTF-16 ("Unicode" in Windows parlance) gets converted to ASCII before being consumed by the for command: turns out that the conversion fails to remove the CR from the CR LF line terminators under these conditions. You can work around this by forcing for to re-consume its own excreta, like this:

for /f "delims=" %%L in (
    'wmic cdrom get Name /value'
) do for /f "tokens=1,2* delims== " %%A in (
    "%%L"
) do if "%%A" == "Name" (
    set manuf=%%B
    set model=%%C
)
-
@flabdablet said in Windows, Unicode file names and C++:
are you sure Windows doesn't convert the argument encoding for argv?
If it does, it's highly unlikely to do so using anything as sane as UTF-8. It's almost certain to do it in a way that depends on the current code page setting.
That's exactly what I've read. Which is why I'm asking if anyone knows further details.
@Gąska said in Windows, Unicode file names and C++:
Last time I checked, which was a year or two ago, the text/binary setting work such that on Windows, text files had "\r\n" converted to (or from?) "\n", and *NIX systems it was ignored.
It does other stuff too. Not sure what exactly, but my unit tests magically started working instead of producing gibberish when I started putting std::ios_base::binary everywhere.
-
@asdf said in Windows, Unicode file names and C++:
It does other stuff too. Not sure what exactly
Most of it, again, is code page stuff. There's a lot of internal faffage devoted to converting between the top 128 codes in assorted code pages and their nearest Unicode equivalents when I/O happens in text mode.