Case (in)?sensitive filesystems are :doing_it_wrong:
-
@asdf said in Case (in)?sensitive filesystems are :
BTW: I still don't get why ß is written as "SS" when it's capitalized. "SZ" would make much more sense.
INB4 the obvious.
-
@Rhywden said in Case (in)?sensitive filesystems are :
INB4 the obvious.
You already Godwin-ized the thread, so there's no need to be shy.
-
@asdf said in Case (in)?sensitive filesystems are :
@Rhywden said in Case (in)?sensitive filesystems are :
INB4 the obvious.
You already Godwin-ized the thread, so there's no need to be shy.
As this is a trivial case, I leave the proof to interested readers.
-
@anotherusername said in Case (in)?sensitive filesystems are :
I don't know what you're talking about. Windows doesn't hide the file extensions.
DOWNVOTE DOWNVOTE DOWNVOTE
-
@Gurth said in Case (in)?sensitive filesystems are :
@Magus Basically, yes — but that’s more a problem for writing than for pronunciation, once you know that, for instance, eau is /o/, but /o/ could be o, au, eau, and
~~whatever else~~ ot, ots, os, ocs, aux, aud, auds, eaux, ho or ö.
FTFY
-
@asdf said in Case (in)?sensitive filesystems are :
Annoyingly, tab-completion is case-sensitive, so ~/imptab wouldn't autocomplete ~/Important.txt.
Excuse me, Sir, do you have a moment to talk about ~~Jesus~~ zsh?
Thanks, ~~Buddha~~ bash is just fine :p
$ touch this
$ touch THIS
produces only a single file, named this.
I'd think the appropriate error message on OSX would include an MC Hammer video.
-
@Rhywden said in Case (in)?sensitive filesystems are :
Rrrrrrr! Gleich wird zurrrrückgeschossen! [Rrrrrrr! We're about to shoot back!]
Bitte nicht scheißen! [Please don't shit!]
-
@Yamikuronue said in Case (in)?sensitive filesystems are :
@asdf said in Case (in)?sensitive filesystems are :
You can argue ~~th~~þat it doesn't conform to ~~th~~þe current official rules, but ~~th~~þe letter definitely exists.
F~~T~~ÞFY
FÞFY
-
@Mikael_Svahnberg said in Case (in)?sensitive filesystems are :
files that I can't check in to my CVS because they were created by an idiot who put an ä in the filename, sent it to a Windows user who checked it in, and when I pulled it, it got translated to some mac-vernacular.
@Mikael_Svahnberg said in Case (in)?sensitive filesystems are :
I think OSX stores the files with the capitalisation you gave them, but treats all upper/lowercase-mutations as duplicates.
MacOS does preserve case, but it does not preserve normalization. And it chooses the other way compared to anybody else: MacOS always returns decomposed normal form where everything else uses composed most of the time. So when the file with ä is checked in on Windows, the character is encoded as ä (composed, U+00E4), but then the MacOS filesystem sees that and stores ä (decomposed, a followed by U+0308). And returns it. But CVS predates Unicode by millennia and does not have the slightest clue that the composed ä and the decomposed ä are supposed to be equal. Nor does the filesystem on the server, because neither Windows (that use UCS-2, but don't really understand it), nor Linux (which does not want to have anything to do with this and uses byte strings) consider them equal:
a) You thought A.txt and a.txt in the same directory is confusing? Then behold this!
b) This would be actually easier to handle, because Unicode defines locale-independent normalization rules that cover these cases.
c) That does not make MacOS changing the normalization any less ; most software would work just fine if the filesystem at least returned exactly the same string it got even if it does not know unicode normalizations itself.
d) It also shows that doing these things in the kernel does not actually solve the issue, but instead creates a lot of confusion if the programs don't go through the same trouble too.
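For readers who want to see the composed/decomposed mismatch concretely, here is a minimal Python sketch. It only uses the standard `unicodedata` module; the helper name `same_name` is made up for illustration and is not something any filesystem or CVS provides.

```python
import unicodedata

composed = "\u00e4"      # ä as a single precomposed code point (what Windows typically stores)
decomposed = "a\u0308"   # a followed by COMBINING DIAERESIS (what the post says MacOS returns)

# Both render as ä, but compared as raw strings or bytes they differ.
print(composed == decomposed)        # False
print(composed.encode("utf-8"))      # b'\xc3\xa4'
print(decomposed.encode("utf-8"))    # b'a\xcc\x88'

def same_name(a: str, b: str) -> bool:
    """Hypothetical helper: compare two names after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(same_name(composed, decomposed))                        # True
print(unicodedata.normalize("NFD", composed) == decomposed)   # True
```

A tool that treats names as opaque bytes, as CVS does here, never performs the `same_name` step, so it sees two unrelated files.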
-
@Gurth said in Case (in)?sensitive filesystems are :
@Khudzlin said in Case (in)?sensitive filesystems are :
English is among the worst languages for its spelling-pronunciation relationship (right up there with French).
English is far worse than French — at least with French, if you see a word written down you can probably pronounce it correctly (given a basic knowledge of the rules, of course) but you can be hard-pressed to write a word correctly if you only know its pronunciation.
I used to think so, too. But when you dig just a bit, you get stuff like "les poules du couvent couvent", in which the last 2 words are pronounced differently (the second last has "en" pronounced as [̃ɑ], and the last not pronounced at all - the final "t" isn't pronounced in either case).
Edit: fuck that shit, can't place the tilde correctly...
-
@Maciejasjmj said in Case (in)?sensitive filesystems are :
Bitte nicht scheißen!
Was that a late-night @accalia or intentional?
-
@asdf said in Case (in)?sensitive filesystems are :
Bitte nicht scheißen!
Was that a late-night @accalia or intentional?
ITYM @caccalalia
-
@asdf said in Case (in)?sensitive filesystems are :
Bitte nicht scheißen!
Was that a late-night @accalia or intentional?
Looks more like a shitty joke.
-
@Bulb said in Case (in)?sensitive filesystems are :
neither Windows (that use UCS-2, but don't really understand it),
Not since Windows 2000, now it's UTF-16. Worst of both worlds: variable length and always at least twice the memory.
-
@LaoC said in Case (in)?sensitive filesystems are :
@Bulb said in Case (in)?sensitive filesystems are :
neither Windows (that use UCS-2, but don't really understand it),
Not since Windows 2000, now it's UTF-16. Worst of both worlds: variable length and always at least twice the memory.
Many old Windows programs ignore the fact that UTF-16 is variable-length and don't handle surrogates. In effect, they're using UCS-2 instead. I agree that UTF-16 is the worst Unicode encoding: it has variable length (though only UTF-32 doesn't), endianness (only UTF-8 doesn't) and it takes more space than UTF-8 (though less than UTF-32, obviously).
-
@asdf said in Case (in)?sensitive filesystems are :
@Maciejasjmj said in Case (in)?sensitive filesystems are :
Bitte nicht scheißen!
Was that a late-night @accalia or intentional?
I guess I'm more used to German curse words than regular ones.
Also I might have been a bit, uhm, inebriated yesterday. As evidenced by drunk me apparently thinking it's a funny joke.
-
@Maciejasjmj said in Case (in)?sensitive filesystems are :
Also I might have been a bit, uhm, inebriated yesterday. As evidenced by drunk me apparently thinking it's a funny joke.
Silly pole.
... The rest of us found it amusing without alcohol.
-
@Mikael_Svahnberg said in Case (in)?sensitive filesystems are :
Silly pole.
While we're at semantically significant capitalization ...
-
@Khudzlin said in Case (in)?sensitive filesystems are :
[UTF-16] takes more space than UTF-8.
Actually, that's not true for pure text in Chinese or Korean (or any language using mostly characters in the range U+0800 to U+FFFF - those take up 3 bytes in UTF-8 versus 2 in UTF-16). But a webpage in such a language still takes fewer bytes in UTF-8, because the HTML markup uses only ASCII characters (which take up only 1 byte each in UTF-8 versus 2 in UTF-16), and the gains outweigh the losses. Characters outside the BMP or in the range U+0080 to U+07FF take up as much space in both encodings (4 or 2, respectively).
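The byte counts above are easy to check. The following is a small Python sketch, not a benchmark; the helper `sizes` and the sample strings are chosen here purely for illustration, and the UTF-16 figure is taken without a byte order mark.

```python
def sizes(text: str) -> tuple[int, int]:
    """Return (UTF-8 bytes, UTF-16 bytes) for text, without a BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

print(sizes("A"))                    # (1, 2)  ASCII
print(sizes("\u00e4"))               # (2, 2)  U+0080 to U+07FF
print(sizes("\u4e2d"))               # (3, 2)  U+0800 to U+FFFF (a CJK ideograph)
print(sizes("\U0001f4a9"))           # (4, 4)  outside the BMP (surrogate pair in UTF-16)
print(sizes("<p>\u4e2d\u6587</p>"))  # (13, 18) markup tips the balance back to UTF-8
```

Even this tiny HTML fragment shows the effect described: two CJK characters cost one extra byte each in UTF-8, but the seven ASCII markup characters save one byte each.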
-
@Bulb said in Case (in)?sensitive filesystems are :
@ixvedeusi said in WTF Bites:
What's lowercase AE? ä or ae? Could be either. Same goes for SS -> ss or ß.
This is not really a big problem. A "case-insensitive" filesystem that does not treat ä as equivalent to ae and ß as equivalent to ss won't really offend most Germans. However a filesystem that treats I and i as equivalent will offend Turks, because for them, i is only equivalent to İ and I is equivalent to ı.
Case insensitivity is inherently localized. So either the filesystem or the filename (or both) store the locale, or it cannot be done in a meaningful way.
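A short Python sketch of the Turkish problem follows. Python's built-in `str.lower()` and `str.upper()` apply the locale-independent Unicode mappings; `turkish_lower` is a hypothetical helper written here only to show what a Turkish-aware comparison would need to do, not a standard library function.

```python
def turkish_lower(s: str) -> str:
    """Hypothetical helper: lower-case using the Turkish pairs I/ı and İ/i."""
    return s.replace("I", "ı").replace("İ", "i").lower()

# Locale-independent rules pair I with i ...
print("Istanbul".lower() == "istanbul")          # True
# ... whereas Turkish rules pair I with dotless ı and İ with i.
print(turkish_lower("Istanbul") == "ıstanbul")   # True
print(turkish_lower("İstanbul") == "istanbul")   # True

# The default mapping of İ is i plus a combining dot: two code points.
print(len("İ".lower()))                          # 2

# German: full case folding maps ß to ss, but nothing maps ä to ae.
print("ß".casefold())                            # ss
print("ä".casefold() == "ae")                    # False
```

Whether "Istanbul" collides with "istanbul" or with "ıstanbul" therefore depends entirely on which set of rules the comparison uses.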
-
@Martijn said in Case (in)?sensitive filesystems are :
Case insensitivity is inherently localized. So either the filesystem or the filename (or both) store the locale, or it cannot be done in a meaningful way.
And then you end up with those stupid __MACOSX files or whatever they're called in every archive that an Apple user has touched.
I wonder what happens if you unpack such an archive containing "Istanbul" and "ıstanbul" next to each other on a Turkish mac.
-
@LaoC said in Case (in)?sensitive filesystems are :
@Martijn said in Case (in)?sensitive filesystems are :
Case insensitivity is inherently localized. So either the filesystem or the filename (or both) store the locale, or it cannot be done in a meaningful way.
And then you end up with those stupid __MACOSX files or whatever they're called in every archive that an Apple user has touched.
I wonder what happens if you unpack such an archive containing "Istanbul" and "ıstanbul" next to each other on a Turkish mac.
The machine turns itself into a Transformer and commits suicide.
-
@e4tmyl33t said in Case (in)?sensitive filesystems are :
The machine turns itself into a ~~Transformer~~ Samsung and ~~commits suicide~~ blows up.
-
@LaoC said in Case (in)?sensitive filesystems are :
$ touch this $ touch THIS
:giggi... Oh, hell no.
-
@LaoC said in Case (in)?sensitive filesystems are :
Not since Windows 2000, now it's UTF-16. Worst of both worlds: variable length and always at least twice the memory.
No, it isn't UTF-16, because the system will happily accept invalid encoding. And I mean I just slapped
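#include <stdio.h>
#include <tchar.h>
#include "windows.h"

int _tmain(int argc, _TCHAR* argv[])
{
    // File name: a valid surrogate pair (D83D DCA9) followed by the same two
    // code units reversed (DCA9 D83D), which is not well-formed UTF-16.
    // Share mode 7 = FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE.
    HANDLE h = CreateFile(L"\xd83d\xdca9\xdca9\xd83d.txt", GENERIC_WRITE, 7, nullptr,
                          OPEN_ALWAYS, FILE_ATTRIBUTE_NORMAL, nullptr);
    DWORD len;
    WriteFile(h, "That's it!\n", 11, &len, nullptr);
    CloseHandle(h);
    return len == 11;  // exit code 1 if all 11 bytes were written
}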
into Visual Studio and compiled it and ran it (on Windows 7 we still have at $work, but Windows 7 > Windows 2000) and got a file named by that string, which you can clearly see is a surrogate sequence and reverse of the same sequence, which obviously can't be correct both ways.
Most programs, including most system ones, interpret the surrogate sequences if they can, making it kinda UTF-16. But the system still accepts lone surrogates and invalid surrogate sequences, making it UCS-2. Or perhaps WTF-16 is more appropriate.
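To make "invalid surrogate sequences" concrete, here is a minimal Python sketch of a well-formedness check over 16-bit code units. The function name `is_well_formed_utf16` is invented for this example; it illustrates the UTF-16 pairing rule and is not a description of what Windows does internally.

```python
def is_well_formed_utf16(units: list[int]) -> bool:
    """Return True if the 16-bit code units form well-formed UTF-16."""
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:        # high (lead) surrogate
            if i + 1 >= len(units) or not 0xDC00 <= units[i + 1] <= 0xDFFF:
                return False             # lead surrogate without a trail
            i += 2
        elif 0xDC00 <= u <= 0xDFFF:      # trail surrogate without a lead
            return False
        else:
            i += 1
    return True

pile = [0xD83D, 0xDCA9]                  # U+1F4A9 encoded as a surrogate pair
name = pile + pile[::-1] + [ord(c) for c in ".txt"]   # the file name from the C++ snippet

print(is_well_formed_utf16(pile))   # True
print(is_well_formed_utf16(name))   # False
```

The first pair decodes fine; the reversed pair starts with a trail surrogate, so a strict UTF-16 decoder would reject the name that the filesystem accepted.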
-
@Bulb said in Case (in)?sensitive filesystems are :
Most programs, including most system ones, interpret the surrogate sequences if they can, making it kinda UTF-16. But the system still accepts lone surrogates and invalid surrogate sequences, making it UCS-2. Or perhaps WTF-16 is more appropriate.
As that article says:
Windows applications normally use UTF-16, but the file system treats path and file names as an opaque sequence of WCHARs (16-bit code units).
But yeah, WTF-16 sounds about right :D
-
@Bulb said in Case (in)?sensitive filesystems are :
the system still accepts lone surrogates and invalid surrogate sequences, making it UCS-2.
Except it isn't UCS-2 either, because UCS-2 is an encoding for code points inside the Unicode BMP, not an encoding for arbitrary 16-bit values, and there are not now and never will be any code points assigned in the range 0xD800..0xDFFF. The Unicode spec specifically says that attempts to encode such values in any encoding should be treated as encoding errors.
Windows "Unicode" APIs really are UTF-16, because valid surrogate pairs will be decoded as single Unicode code points for display, in accordance with the UTF-16 decoding rules. It's just that none of the APIs barf if you feed them broken UTF-16 strings.
Windows does prohibit certain values in filenames (all the control characters plus assorted punctuation). Failure to prohibit values in 0xD800..0xDFFF as well is, I'm completely sure, just an oversight rather than an actual design decision.
perhaps WTF-16 is more appropriate
We should work up an April 1 RFC on it.
-
@flabdablet said in Case (in)?sensitive filesystems are :
Failure to prohibit values in 0xD800..0xDFFF as well is, I'm completely sure, just an oversight rather than an actual design decision.
I wouldn't be so sure. I would rather expect it to be compatidebility, because Unicode 1.0 did not prohibit them (they were simply not yet assigned) and Microsoft probably feared (or, worse, perhaps knew) there were already systems out there that had such codes in filenames.
You are right that UCS-2 isn't the right name either. In fact, just like Linux treats filenames as bytestrings, Windows treats them as wordstrings. Not really much of an encoding at the system call level.
-
@Gurth said in Case (in)?sensitive filesystems are :
English is far worse than French — at least with French, if you see a word written down you can probably pronounce it correctly (given a basic knowledge of the rules, of course).
Really? How do you pronounce "fils", then? Or "est"?
(Hint: for both of these words, the second of which is extremely common, there are two different meanings that have different pronunciations and the same spelling.)
(For reference: the two pronunciations of "fils" are "feece" when it means "son" or "sons" and "feel" when it means "wires". "Est" is pronounced "eh" when it means "is", or "esst" when it means "east".)
-
@Gurth said in Case (in)?sensitive filesystems are :
@Magus Basically, yes — but that’s more a problem for writing than for pronunciation, once you know that, for instance, eau is /o/, but /o/ could be o, au, eau, and whatever else.
I remember enumerating the different ways of spelling terminal "-eh" in French:
- é
- ée
- és
- ées
- ai
- ais
- ait
- aient
- er
- ez
- et
-
@Gurth said in Case (in)?sensitive filesystems are :
@masonwheeler True, but the point is that implying the u is a misspelling because it’s not pronounced means the same could be said for the second o. Then again, at least it’s not Irish. I don’t think I’ve encountered any language where half the letters written in a word don’t seem to be pronounced, and (seemingly) half the sounds in the words aren’t represented when it’s written down.
Sort of. Turns out the apparently unpronounced letters (vowels) influence the pronunciation of the adjacent consonants. Example: s surrounded by "narrow" vowels (e, i) is pronounced "sh", while surrounded by "wide" vowels (a, o, u) it is pronounced "s". If surrounded by a mix of wide and narrow, it is a spelling error.
It's a bit like that "e" at the end of "breathe", which changes the pronunciation of the "th" compared to "breath". Or the difference between "strip" and "striped" or between "stripped" and "striped".
-
@Steve_The_Cynic said in Case (in)?sensitive filesystems are :
I remember enumerating the different ways of spelling terminal "-eh" in French:
Phonetically, there are (at least) two "eh" in French, although many people (and/or regional accents) don't differentiate them that well. They are quite obvious in the middle of many words (and when written with é or è) but the end of words -ais/-ait and -ez, for example, are not always pronounced the same, which is more subtle.
Right now I can't find an example where this could lead to different meanings but I'm sure there must be cases where it does.
-
How about more evil fun with Unicode normalized/denormalized file names?
-
@wft The apple fanclub is
-
@Steve_The_Cynic said in Case (in)?sensitive filesystems are :
@Gurth said in Case (in)?sensitive filesystems are :
English is far worse than French — at least with French, if you see a word written down you can probably pronounce it correctly (given a basic knowledge of the rules, of course).
Really? How do you pronounce "fils", then? Or "est"?
I said probably pronounce it correctly, I didn’t say you’d be able to 100% of the time. Also, as you indicated in your post, you can get the pronunciation from context; without context, any of the possible pronunciations would be valid.
-
@Khudzlin said in Case (in)?sensitive filesystems are :
@masonwheeler said in Case (in)?sensitive filesystems are :
Is it pronounced with a U? In Britain or anywhere else?
What the hell does pronunciation have to do with case sensitivity or variant spellings? Also, English is among the worst languages for its spelling-pronunciation relationship (right up there with French).
@Rhywden After some trips to Bavaria, I still wonder how anyone can understand Bavarians when they speak (yes, including other Bavarians).
The parts of Bavaria to the west of the Spessart are easily understandable.
(Because the dialects there are entirely unrelated to anything else in Bavaria, they're a lot like the ones in Hesse)
-
@Rhywden Computing is done in English.
-
@lucas1 said in Case (in)?sensitive filesystems are :
@Rhywden Computing is done in English.
Because every person who uses a PC knows how to read English?
-
@aliceif Stack Overflow is in English only for this very reason.
-
@lucas1 said in Case (in)?sensitive filesystems are :
@aliceif Stack Overflow is in English only for this very reason.
You are stuck at least a few years in the past.
Official Stack Overflow communities in non-English languages, run by the same company as Stack Overflow.
-
@aliceif Okay, fair enough, but it is irrelevant. If you are working in IT you need to require English.
BTW I have just (Skype) interviewed for 2 jobs in Belgium, 3 jobs in Germany and 1 job in Thailand.
They don't care that I don't speak the local language they care that I speak English.
-
@lucas1 said in Case (in)?sensitive filesystems are :
If you are working in IT you need to require English.
Not everyone who interacts with a file system is an IT professional.
-
@aliceif Fair point, but almost everyone on this forum is one or is a power user.
-
@lucas1 Explain Nagesh.
-
@Arantor Sorry the burden of proof is on you.
-
@lucas1 I was merely giving you an example that defines your 'almost' everyone.
I see your debating skills have improved. I hope to one day have an actual conversation with you.
-
@Arantor if I knew you wasn't being disingenuous maybe we could have one.
-
@lucas1 And there we have the measure of it... someone who gives out trolling so willingly doesn't like it when they get it back.
-
@Arantor I don't troll often. I argue often.
-
@lucas1 said in Case (in)?sensitive filesystems are :
@Arantor if I knew you ~~wasn't~~ weren't being disingenuous maybe we could have one.
How exactly does a native speaker get this wrong?