Big list of software that cannot handle spaces or accents in paths

blakeyrat

@laoc said in Big list of software that cannot handle spaces or accents in paths:

I can only guess

How about you don't guess anything and just read the words on the screen.

Gąska

@dkf said in Big list of software that cannot handle spaces or accents in paths:

@gąska said in Big list of software that cannot handle spaces or accents in paths:

Take a random WPF project.

I don't have any at all.

I wasn't being literal.

Indeed, my current projects are both entirely using other tech stacks and, for good internal reasons, virtually all of them have no GUI code either. (Their “user interface” is in the form of commands within a scripting language; for what I'm doing that makes much more sense.)

So most likely, you don't use filenames and paths for anything except identifying files. If that's the case, you'd need only minimal changes to your codebase if you were to switch to IDs.

But that nitpick aside, there are a substantial number of complications for you to think about as part of your cunning plan.

But of course! And I'd love to talk about them!

First of them is that files (which are in general bags of bytes, but which might be programs or serialisations of objects or any number of other mostly-fixed things) have quite a need to refer to other files. That means that you're going to need to embed these IDs

Yes. I've mentioned earlier that serialization is one of the things that can go wrong. Though it's not significantly harder than serializing paths (some might argue that it's even easier, assuming file IDs are of constant size). But there's much less chance IDs go wrong than paths - you don't join them, you don't split them, you don't slice them, you don't transform them (directly). The only thing that might unintentionally alter an ID is corruption when reading or writing to a file.
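A minimal sketch of the constant-size point, in Python. It assumes a file ID is a 128-bit value; UUIDs are only a stand-in here, since the thread never fixes an ID format, and `pack_ids` / `unpack_ids` are hypothetical helper names.

```python
import struct
import uuid

ID_SIZE = 16  # assumption: a file ID is a fixed 128-bit value


def pack_ids(ids):
    """Serialize file IDs as a 4-byte count followed by fixed-size records."""
    return struct.pack("<I", len(ids)) + b"".join(i.bytes for i in ids)


def unpack_ids(blob):
    """Read the IDs back; no delimiters, quoting or escaping are involved."""
    (count,) = struct.unpack_from("<I", blob, 0)
    return [
        uuid.UUID(bytes=blob[4 + n * ID_SIZE : 4 + (n + 1) * ID_SIZE])
        for n in range(count)
    ]


refs = [uuid.uuid4(), uuid.uuid4()]   # references to two other files
blob = pack_ids(refs)
restored = unpack_ids(blob)
```

Because every record has the same length, the reader never has to decide where one reference ends and the next begins, which is where serialized path strings tend to need separators or length prefixes.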

and that means that you need your system to handle all that side of stuff.

Not really. Storage can be done fully by application. Data files don't really need to be human readable.

You'll also need ways to group files together because they are related in some way (as opposed to just using the filename with a different extension), which means that you'll need some sort of generalised relational graph system.

Or just tags - simple, arbitrary strings attached to file metadata. There are many possibilities.
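As a rough illustration of the tag idea, here is a toy Python sketch. The `TagIndex` class, the integer file IDs and the tag strings are all made up for the example; nothing in the thread specifies how tags would be stored.

```python
from collections import defaultdict


class TagIndex:
    """Toy tag store: file ID -> tags, plus a reverse index tag -> file IDs."""

    def __init__(self):
        self._tags_by_file = defaultdict(set)
        self._files_by_tag = defaultdict(set)

    def tag(self, file_id, *tags):
        """Attach one or more arbitrary strings to a file ID."""
        for t in tags:
            self._tags_by_file[file_id].add(t)
            self._files_by_tag[t].add(file_id)

    def files_with(self, *tags):
        """Return the IDs that carry every one of the given tags."""
        sets = [self._files_by_tag.get(t, set()) for t in tags]
        return set.intersection(*sets) if sets else set()


index = TagIndex()
index.tag(101, "project-x", "source")
index.tag(102, "project-x", "build-output")
index.tag(103, "project-y", "source")

related = index.files_with("project-x")
sources_of_x = index.files_with("project-x", "source")
```

Grouping then becomes a set lookup on the reverse index rather than a convention encoded in file names or extensions.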

There are probably other tricky bits too, but this is a rabbit hole I've never jumped down.

I'm sure there are - but not with the file organization itself (it can be done exactly like current filesystems are done, with hierarchical directories and all). A bigger problem would be that you need an API call for every operation people currently do with strings. It would also require a major paradigm shift in how applications identify and organize files. These are all very hard and interesting problems that I would love to have a serious conversation about. But apparently no one here is interested in serious discussions.

The biggest problem you have is that a vast preponderance of existing technology simply doesn't work that way.

That's why it's more of a utopia to me than an actual thing I'd like to get done.

A related problem is that the implementations that have been have also been catastrophically nasty and slow to use: I know a little bit about the generalised relational graph stuff

I never said anything about generalized relational graph. A simple directory tree would work just as well for the stated purposes. The change would be in HOW we use that directory tree.

I am, of course, not giving you any money.

I feel insulted by the fact you thought it's necessary to say that.

I think what you're contemplating trying to do is a Bad Idea for reasons that you're ignoring.

You've got two things wrong:

I'm not ignoring anything;
I don't want to actually implement anything.

It's just a fancy idea that poses interesting design and technical challenges. Figuring them out is simply a fun exercise to me. I think it would be pretty cool to make a working proof of concept that's not absolutely horrible to use. But I don't have any hope it would ever catch on, and I'm not ever going to try making a full production-grade OS, even if I had time and money for that.

LaoC

@gąska said in Big list of software that cannot handle spaces or accents in paths:

So you want to store the ID in the same structures that file systems commonly use for the name (that which every file system in existence already supports) … and then add names where?

Metadata, of course. Windows and MacOS already

And "every filesystem in existence", eh? Does every Filename->ID lookup trigger something akin to for id in $(find /); do [ $(getfattr -n filename $id) == $filename ] && echo $filename; done then?

Not that this changed anything; for a name-to-ID-and-back translation to work you still need a bijective mapping between the two.

And good luck keeping that consistent.

No, I liked the MacOS's idea

If you want to pick spelling nits, do it properly. "MacOS" is a proper name.

According to your link, I was right.

*sigh* Which part of that don't you understand?

Proper Nouns Ending in S
AP: Add an apostrophe.
Charlaine Harris’ books

The Chicago Manual of Style has a lot of good advice but it's not The Gospel.

that an application open files by ID so it never has to do string manipulation on paths. Moving etc. is another discussion that I don't have a side in (though the problem you talk about can be avoided with RW locks).

So you'd forbid doing the exact thing people found cool about the feature in the 90s to hack around the problems you'd get from implementing the feature in a modern environment.

For the last time: I DON'T SUPPORT BEING ABLE TO RENAME A FILE IN USE. NEITHER AM I AGAINST IT. STOP PESTERING ME ABOUT THINGS I HAVE NO OPINION ON.

That's what @blakeyrat said was cool. If you only want it to avoid quoting problems, you're breaking nuts with a sledgehammer.

@blakeyrat linked an article recently that proposes much simpler and less hacky fixes.

How many pages ago was it? I must have missed it.

Fixing Unix/Linux/POSIX Filenames: Control Characters (such as Newline), Leading Dashes, and Other Problems

Also, I don't see anything "hacky" about having a nice, strongly-typed, system-wide API for every operation you might ever need.

No, there isn't. It exists and it's called the libc API; none of it requires any kind of quoting or escaping to deal with file names. The problem is the untyped CLI sitting on top of it.

As a bandaid like that, chattr +i file ... has you covered.

But I'm still the owner then, and programs run as me still can delete it.

No.

So, sticky bit disallows owner from deleting it too? That's not what I was told earlier in the topic.

No, chattr doesn't set the sticky bit. The above sets the "immutable" attribute that keeps a file from being deleted or changed. The sticky bit does let you delete your own files.

Also, how well supported is it in practice? I'm asking because for example, running script via bash script ignores executable permission - and many people execute scripts that way rather than by ./script. If sticky bit wasn't always enforced, it wouldn't be very useful (not totally useless, but not that useful either).

Both sticky and immutable are enforced.

Maybe that better design you're missing hasn't been implemented because most people see it as a worse design.

Or maybe because there wasn't enough incentive to try anything else. Look at C++ - the 2011 version was objectively better in every way than the 2003 version with no downsides, yet it took years for most companies to make the switch (many still haven't done it, though thankfully it's a dying species).

Well, companies aren't as quick to change as three-man hobbyist teams. Once your developers are familiar with it you have to make sure your whole toolchain is compatible, that mixing it with legacy code doesn't cause any unpleasant surprises etc. It's the same with Python 3 which I think has been out for 10 years now.

The point is, no one using something isn't proof that something is bad.

No doubt. Just a possibility to consider.

@blakeyrat said in Big list of software that cannot handle spaces or accents in paths:

@laoc said in Big list of software that cannot handle spaces or accents in paths:

I can only guess

How about you don't guess anything and just read the words on the screen.

OK THANK YOU FOR EXPLAIN I AM ALSO A ROBOT WHO DOES NOT GET HUMOR SO EVERY THING POSTED TO THIS FORUM IS LITERAL BEEP BEEP BEEP
<beep>

Gąska

@laoc said in Big list of software that cannot handle spaces or accents in paths:

@gąska said in Big list of software that cannot handle spaces or accents in paths:

So you want to store the ID in the same structures that file systems commonly use for the name (that which every file system in existence already supports) … and then add names where?

Metadata, of course. Windows and MacOS already

And "every filesystem in existence", eh? Does every Filename->ID lookup trigger something akin to for id in $(find /); do [ $(getfattr -n filename $id) == $filename ] && echo $filename; done then?

Most filesystems use filename as the file identifier (a primary key in DB nomenclature). If you don't use filename as identifier, you don't need to know its name or path anywhere nearly as often.

Not that this changed anything; for a name-to-ID-and-back translation to work you still need a bijective mapping between the two.

And good luck keeping that consistent.

Why would it be hard at all? At shell level, it would map paths to IDs the same way we currently map paths to inodes. At program level, you don't have to do any mappings at all.
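For reference, the existing path-to-inode mapping can be observed directly. The Python sketch below assumes a POSIX-style filesystem that supports hard links; it shows only today's behaviour, not the hypothetical ID-based system.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    original = os.path.join(d, "report.txt")
    alias = os.path.join(d, "same report, other name.txt")

    with open(original, "w") as f:
        f.write("hello")

    os.link(original, alias)  # second directory entry for the same inode

    ino_original = os.stat(original).st_ino
    ino_alias = os.stat(alias).st_ino
    same_file = os.path.samefile(original, alias)
```

Both names resolve to one inode number, so the path-to-inode mapping is many-to-one already; the name lives in the directory entry, not in the file.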

Proper Nouns Ending in S
AP: Add an apostrophe.
Charlaine Harris’ books

The Chicago Manual of Style has a lot of good advice but it's not The Gospel.

And neither is AP. Unless there's something I don't know?

that an application open files by ID so it never has to do string manipulation on paths. Moving etc. is another discussion that I don't have a side in (though the problem you talk about can be avoided with RW locks).

So you'd forbid doing the exact thing people found cool about the feature in the 90s to hack around the problems you'd get from implementing the feature in a modern environment.

For the last time: I DON'T SUPPORT BEING ABLE TO RENAME A FILE IN USE. NEITHER AM I AGAINST IT. STOP PESTERING ME ABOUT THINGS I HAVE NO OPINION ON.

That's what @blakeyrat said was cool.

Yes. Blakeyrat. Not me.

If you only want it to avoid quoting problems, you're breaking nuts with a sledgehammer.

Is there any other solution for bad programmers routinely fucking up path handling other than not letting (or at least strongly discouraging) programmers to touch paths?

@blakeyrat linked an article recently that proposes much simpler and less hacky fixes.

How many pages ago was it? I must have missed it.

Fixing Unix/Linux/POSIX Filenames: Control Characters (such as Newline), Leading Dashes, and Other Problems

This doesn't help with Bash splitting file arguments on spaces. This doesn't help with accidentally making an absolute path from a relative path by leaving "/" in front after splitting. This doesn't help with messing up argument order and random text accidentally being treated as a path. There are a million other things that can (and do all the time) go wrong with paths other than ls beeping at you.
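Two of these failure modes are easy to reproduce. The Python sketch below uses `shlex` to mimic POSIX shell word splitting and `posixpath.join` for path joining; the file and directory names are invented for the example.

```python
import posixpath
import shlex

# Unquoted word splitting, as a POSIX shell would do it: one file name
# containing a space becomes two separate arguments.
split_args = shlex.split("rm my file.txt")

# Quoting the name first keeps it in one piece.
quoted_args = shlex.split("rm " + shlex.quote("my file.txt"))

# A leading "/" left on a fragment silently discards the base directory.
joined = posixpath.join("/home/user/project", "/build")
```

Here `split_args` has three elements, `quoted_args` has two, and `joined` is just `/build`: the project directory has vanished from the result without any error being raised.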

Also, I don't see anything "hacky" about having a nice, strongly-typed, system-wide API for every operation you might ever need.

No, there isn't.

Then why did you call it a hack? Or maybe, what did you call a hack?

It exists

No it doesn't. You might have lost it in context, but the API I was talking about is the hypothetical API used in a system that identifies files by ID and not path string, where you can do everything with just IDs so you never have to use path strings.

and it's called the libc API; none of it requires any kind of quoting or escaping to deal with file names.

But it requires checking if your filename contains / if you want a file in current directory named exactly like the name you have in a variable you're using to open the file.
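The check in question might look like the following Python sketch. `open_in_dir` is a hypothetical helper, not a libc or standard-library function; it relies on `os.open` with `dir_fd`, which is available on platforms that provide `openat`.

```python
import os
import tempfile


def open_in_dir(dir_fd, name, flags=os.O_RDONLY):
    """Open `name` strictly as an entry of the directory behind dir_fd.

    Hypothetical helper: rejects anything that is not a plain entry name.
    """
    if name in ("", ".", "..") or "/" in name or "\0" in name:
        raise ValueError(f"not a plain file name: {name!r}")
    return os.open(name, flags, dir_fd=dir_fd)


with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, "notes.txt"), "w") as f:
        f.write("ok")

    dir_fd = os.open(d, os.O_RDONLY)
    try:
        fd = open_in_dir(dir_fd, "notes.txt")
        content = os.read(fd, 16)
        os.close(fd)

        try:
            open_in_dir(dir_fd, "../etc/passwd")
            rejected = False
        except ValueError:
            rejected = True
    finally:
        os.close(dir_fd)
```

The sketch does not deal with symbolic links; a real implementation would probably also want `O_NOFOLLOW` or an equivalent.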

The problem is the untyped CLI sitting on top of it.

Yes, and in case you can't remember more than five posts back, I have already covered that - my idea would most likely require revamping the fundamentals of how CLI works.

As a bandaid like that, chattr +i file ... has you covered.

But I'm still the owner then, and programs run as me still can delete it.

No.

So, sticky bit disallows owner from deleting it too? That's not what I was told earlier in the topic.

No, chattr doesn't set the sticky bit. The above sets the "immutable" attribute that keeps a file from being deleted or changed. The sticky bit does let you delete your own files.

So there's two hacks made to compensate for the fact Unix doesn't differentiate between write and delete permissions. I wonder what they'll name the third hack when a use case arises that isn't covered by either of the existing ones.

Also, how well supported is it in practice? I'm asking because for example, running script via bash script ignores executable permission - and many people execute scripts that way rather than by ./script. If sticky bit wasn't always enforced, it wouldn't be very useful (not totally useless, but not that useful either).

Both sticky and immutable are enforced.

That's good to know.

Maybe that better design you're missing hasn't been implemented because most people see it as a worse design.

Or maybe because there wasn't enough incentive to try anything else. Look at C++ - the 2011 version was objectively better in every way than the 2003 version with no downsides, yet it took years for most companies to make the switch (many still haven't done it, though thankfully it's a dying species).

Well, companies aren't as quick to change as three-man hobbyist teams. Once your developers are familiar with it you have to make sure your whole toolchain is compatible, that mixing it with legacy code doesn't cause any unpleasant surprises etc. It's the same with Python 3 which I think has been out for 10 years now.

I'm glad you understand that.

The point is, no one using something isn't proof that something is bad.

No doubt. Just a possibility to consider.

I know. And I assure you I have considered it thoroughly. In fact, I've mentioned this topic exactly because I want to see whether this idea is bad, or just unlucky!

Look how many keystrokes we would have both saved if we didn't assume the other side is a fucking idiot!

LaoC

@gąska said in Big list of software that cannot handle spaces or accents in paths:

Metadata, of course. Windows and MacOS already

And "every filesystem in existence", eh? Does every Filename->ID lookup trigger something akin to for id in $(find /); do [ $(getfattr -n filename $id) == $filename ] && echo $filename; done then?

Is that a yes? Because it's a to any kind of performance. It would be fine if you proposed some kind of database but with existing filesystems it's lunacy.

Most filesystems use filename as the file identifier (a primary key in DB nomenclature). If you don't use filename as identifier, you don't need to know its name or path anywhere nearly as often.

I don't care if it happens only once a session but takes an hour because the system has to make a linear scan over half of the gazillion files on the volume.

Not that this changed anything; for a name-to-ID-and-back translation to work you still need a bijective mapping between the two.

And good luck keeping that consistent.

Why would it be hard at all? At shell level, it would map paths to IDs the same way we currently map paths to inodes. At program level, you don't have to do any mappings at all.

What keeps me from setting the path metadata of two file IDs to the same value?

Proper Nouns Ending in S
AP: Add an apostrophe.
Charlaine Harris’ books

The Chicago Manual of Style has a lot of good advice but it's not The Gospel.

And neither is AP. Unless there's something I don't know?

[Image: "Not sure if trolling or just stupid" (broken image link)]

If you only want it to avoid quoting problems, you're breaking nuts with a sledgehammer.

Is there any other solution for bad programmers routinely fucking up path handling other than not letting (or at least strongly discouraging) programmers to touch paths?

There is no cure against bad programmers (and yes, that is gospel).

Fixing Unix/Linux/POSIX Filenames: Control Characters (such as Newline), Leading Dashes, and Other Problems

This doesn't help with Bash splitting file arguments on spaces. This doesn't help with accidentally making an absolute path from a relative path by leaving "/" in front after splitting. This doesn't help with messing up argument order and random text accidentally being treated as a path. There are a million other things that can (and do all the time) go wrong with paths other than ls beeping at you.

It fixes the most problematic cases that are hard to catch even if you know the rules. Someone who still gets bitten by space splitting or argument order will just screw up somewhere else.

Also, I don't see anything "hacky" about having a nice, strongly-typed, system-wide API for every operation you might ever need.

No, there isn't.

Then why did you call it a hack? Or maybe, what did you call a hack?

That IDs-instead-of-filenames thing of yours.

It exists

No it doesn't. You might have lost it in context, but the API I was talking about is the hypothetical API used in a system that identifies files by ID and not path string, where you can do everything with just IDs so you never have to use path strings.

Then what's that "strongly typed" about if it fixes mainly problems that are only problems in shell scripts?

and it's called the libc API; none of it requires any kind of quoting or escaping to deal with file names.

But it requires checking if your filename contains / if you want a file in current directory named exactly like the name you have in a variable you're using to open the file.

Libraries exist. In Perl I use Path::Class for transparent cross-platform path handling; if I did C++ it would probably be Boost.Filesystem. There's no cure for NIH syndrome but pain.

The problem is the untyped CLI sitting on top of it.

Yes, and in case you can't remember more than five posts back, I have already covered that - my idea would most likely require revamping the fundamentals of how CLI works.

For small values of "covered".

Gurth

@gąska said in Big list of software that cannot handle spaces or accents in paths:

interactive shells are as easy to use as those we have now

/var/usr/etc/bin/blakeyrant.sh --start

Gąska

@laoc said in Big list of software that cannot handle spaces or accents in paths:

@gąska said in Big list of software that cannot handle spaces or accents in paths:

it's kinda funny how you totally fucked up the quote

Is that a yes? Because it's a to any kind of performance.

The best performing code is one that never runs. You don't have to display filenames nearly as often as you have to open them. And even then, there's nothing that prevents implementing the name getting command as shell intrinsic, and having the actual name stored in dedicated field separate from other metadata. This would bring performance in line with what it is on current systems.

It would be fine if you proposed some kind of database but with existing filesystems it's lunacy.

How about a new filesystem? I never said it wouldn't require a new filesystem. I'm pretty sure it can't be done without a new filesystem. It's not a big deal - we're already replacing the entirety of how inter-process and system-process communication works. There wouldn't even be standard output as we know it - we'd have to make it strongly typed!

Not that this changed anything; for a name-to-ID-and-back translation to work you still need a bijective mapping between the two.

And good luck keeping that consistent.

Why would it be hard at all? At shell level, it would map paths to IDs the same way we currently map paths to inodes. At program level, you don't have to do any mappings at all.

What keeps me from setting the path metadata of two file IDs to the same value?

There's no path metadata. There's filename metadata. And what prevents you is the operating system - there would be a check similar to the one already performed on every system to ensure every file in a directory has a unique name.
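A toy Python sketch of such a check, under the assumption that each directory keeps a name-to-ID index alongside the ID-to-name metadata. The `Directory` class and its methods are invented for illustration and are not taken from any real filesystem.

```python
class Directory:
    """Toy directory in a hypothetical ID-based filesystem."""

    def __init__(self):
        self._id_by_name = {}   # name -> file ID (the uniqueness index)
        self._name_by_id = {}   # file ID -> name (metadata lookup, no scan)

    def set_name(self, file_id, name):
        """Set or change a file's name, refusing names already in use."""
        owner = self._id_by_name.get(name)
        if owner is not None and owner != file_id:
            raise FileExistsError(f"{name!r} already names file {owner}")
        old = self._name_by_id.get(file_id)
        if old is not None:
            del self._id_by_name[old]
        self._id_by_name[name] = file_id
        self._name_by_id[file_id] = name

    def lookup(self, name):
        return self._id_by_name[name]

    def name_of(self, file_id):
        return self._name_by_id[file_id]


d = Directory()
d.set_name(1, "report.txt")
d.set_name(2, "notes.txt")

try:
    d.set_name(2, "report.txt")   # would duplicate an existing name
    clash_rejected = False
except FileExistsError:
    clash_rejected = True

d.set_name(1, "final report.txt")  # renaming changes metadata, not the ID
```

With both dictionaries maintained together, name-to-ID and ID-to-name lookups are each a single hash lookup, so neither direction needs a scan over the volume.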

Proper Nouns Ending in S
AP: Add an apostrophe.
Charlaine Harris’ books

The Chicago Manual of Style has a lot of good advice but it's not The Gospel.

And neither is AP. Unless there's something I don't know?

[Image: "Not sure if trolling or just stupid" (broken image link)]

Both. I actually know nothing about formal English language standards. Neither do I care. Oh, and the link is broken.

If you only want it to avoid quoting problems, you're breaking nuts with a sledgehammer.

Is there any other solution for bad programmers routinely fucking up path handling other than not letting (or at least strongly discouraging) programmers to touch paths?

There is no cure against bad programmers (and yes, that is gospel).

Garbage collector cured 99.9% of memory access violation errors. Static typing cured 99.9% of improper cast errors. File IDs would cure 99.9% of accidentally accessing wrong files errors. There will always be ways to fuck up, but there would be much less of them. It's definitely a win and a goal worthy of pursuing. Maybe not worth throwing away 50 years of software legacy, but a good goal nonetheless.

Fixing Unix/Linux/POSIX Filenames: Control Characters (such as Newline), Leading Dashes, and Other Problems

This doesn't help with Bash splitting file arguments on spaces. This doesn't help with accidentally making an absolute path from a relative path by leaving "/" in front after splitting. This doesn't help with messing up argument order and random text accidentally being treated as a path. There are a million other things that can (and do all the time) go wrong with paths other than ls beeping at you.

It fixes the most problematic cases that are hard to catch even if you know the rules.

Depends on what you mean by problematic. If you mean how often they come up in practice - it's really rare for special characters to get included in a filename. Far more common is

Someone who still gets bitten by space splitting or argument order will just screw up somewhere else.

Not necessarily. And it won't necessarily be as destructive. Fucking up paths can result in lots of data loss, or even bricking the entire system, as shown time and time again by various software - NVidia, Steam, npm...

Also, I don't see anything "hacky" about having a nice, strongly-typed, system-wide API for every operation you might ever need.

No, there isn't.

Then why did you call it a hack? Or maybe, what did you call a hack?

That IDs-instead-of-filenames thing of yours.

Why's it a hack? What's hacky about it?

It exists

No it doesn't. You might have lost it in context, but the API I was talking about is the hypothetical API used in a system that identifies files by ID and not path string, where you can do everything with just IDs so you never have to use path strings.

Then what's that "strongly typed" about if it fixes mainly problems that are only problems in shell scripts?

Problems happen in regular apps too. The worst fuckups are almost always in shell, but path parsing bugs are widespread in regular applications too - especially when spawning processes through API that accepts shell command line, and most process spawning APIs accept shell command lines.
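The difference between the two spawning styles can be shown with Python's `subprocess` module. The sketch assumes a POSIX shell is available for the `shell=True` calls; the file name is invented, and the child process only prints the arguments it received.

```python
import shlex
import subprocess
import sys

name = "my file.txt"
show_argv = "import sys; print(sys.argv[1:])"

# Argument vector: no shell is involved, so the name arrives as one argument.
as_list = subprocess.run(
    [sys.executable, "-c", show_argv, name],
    capture_output=True, text=True, check=True,
).stdout.strip()

# Shell command line with the name pasted in unquoted: the shell splits it.
command = f"{shlex.quote(sys.executable)} -c {shlex.quote(show_argv)} {name}"
as_shell = subprocess.run(
    command, shell=True, capture_output=True, text=True, check=True,
).stdout.strip()

# Same command line, but with the name quoted before it is pasted in.
quoted = command.rsplit(name, 1)[0] + shlex.quote(name)
as_quoted = subprocess.run(
    quoted, shell=True, capture_output=True, text=True, check=True,
).stdout.strip()
```

Only the unquoted shell variant delivers two arguments to the child. The list form needs no quoting at all, which is why it is usually the safer default when an API offers both.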

and it's called the libc API; none of it requires any kind of quoting or escaping to deal with file names.

But it requires checking if your filename contains / if you want a file in current directory named exactly like the name you have in a variable you're using to open the file.

Libraries exist.

But they're non-standardized, usually per-language, each has its own bugs, and they're susceptible to NIH syndrome.

In Perl I use Path::Class for transparent cross-platform path handling; if I did C++ it would probably be Boost.Filesystem. There's no cure for NIH syndrome but pain.

You can't exactly NIH the basic system APIs. Although I've seen it done in my career (but that was on very custom platform).

The problem is the untyped CLI sitting on top of it.

Yes, and in case you can't remember more than five posts back, I have already covered that - my idea would most likely require revamping the fundamentals of how CLI works.

For small values of "covered".

What do you mean?

Gąska

@gurth said in Big list of software that cannot handle spaces or accents in paths:

@gąska said in Big list of software that cannot handle spaces or accents in paths:

interactive shells are as easy to use as those we have now

/var/usr/etc/bin/blakeyrant.sh --start

Easier* to use.

marczellm

This became a "Big thread of people that cannot handle disagreement on paths"

Gąska

@marczellm if I couldn't handle it, I wouldn't argue about it.

Erufael

@gąska I feel like there's a file handle joke in there somewhere, but I can't quite find it.... Oh well.

Tsaukpaetra

@erufael said in Big list of software that cannot handle spaces or accents in paths:

@gąska I feel like there's a file handle joke in there somewhere, but I can't quite find it.... Oh well.

I was going to mention it, but decided not to.

LaoC

@gąska said in Big list of software that cannot handle spaces or accents in paths:

The best performing code is one that never runs. You don't have to display filenames nearly as often as you have to open them.

As long as "not nearly as often" does not equal "never" you'd better make sure it's not a linear scan over the entire volume.

How about a new filesystem? I never said it wouldn't require a new filesystem.

In case you can't remember more than five posts back, you said pretty much every file system in existence already supported what you need to have this parallel ID/filename scheme.

[Image: "Not sure if trolling or just stupid" (broken image link)]

Both. I actually know nothing about formal English language standards. Neither do I care.

OK, good. I don't care about paths that much either.

Gąska

@laoc said in Big list of software that cannot handle spaces or accents in paths:

@gąska said in Big list of software that cannot handle spaces or accents in paths:

The best performing code is one that never runs. You don't have to display filenames nearly as often as you have to open them.

As long as "not nearly as often" does not equal "never" you'd better make sure it's not a linear scan over the entire volume.

Agreed.

How about a new filesystem? I never said it wouldn't require a new filesystem.

In case you can't remember more than five posts back, you said pretty much every file system in existence already supported what you need to have this parallel ID/filename scheme.

No. I said that every filesystem supports a unique ID for every file in form of path - as a rebuttal to the claim that every file having unique identifier is absurd.

LaoC

@gąska said in Big list of software that cannot handle spaces or accents in paths:

In case you can't remember more than five posts back, you said pretty much every file system in existence already supported what you need to have this parallel ID/filename scheme.

No. I said that every filesystem supports a unique ID for every file in form of path - as a rebuttal to the claim that every file having unique identifier is absurd.

I'm not sure who you think claimed that.

Gąska

@laoc said in Big list of software that cannot handle spaces or accents in paths:

@gąska said in Big list of software that cannot handle spaces or accents in paths:

In case you can't remember more than five posts back, you said pretty much every file system in existence already supported what you need to have this parallel ID/filename scheme.

No. I said that every filesystem supports a unique ID for every file in form of path - as a rebuttal to the claim that every file having unique identifier is absurd.

I'm not sure who you think claimed that.

Reading those posts again, I think we've both misunderstood each other there.

Tsaukpaetra

@gąska said in Big list of software that cannot handle spaces or accents in paths:

we've both misunderstood each other there.

A common theme, recently.... :P

ixvedeusi

@laoc said in Big list of software that cannot handle spaces or accents in paths:

In case you can't remember more than five posts back, you said pretty much every file system in existence already supported what you need to have this parallel ID/filename scheme.

That was me, if you're referring to this post. If it wasn't clear, that was more of a quick thought I threw out to see what could come of it, rather than a finalized design spec, so thanks for your feedback!

I wouldn't expect such an implementation to be particularly well-performing, but I suppose with some optimizations (indexing, caching etc) it could reach "usable" levels. Creating a dedicated file system which is actually optimized for this use case would of course be preferable, but more work to get to a first functional prototype.

Concerning

How do things like for f in *.txt *.log; do ... work

First off, this has nothing to do with file names (as others have already pointed out), you want to query by type. If I was to set out to try and actually design such a hypothetical nirvana file system, I'd definitely try to get rid of these bizarre Reverse Hungarian Notation warts at the same time, because the whole point of the exercise is to move file metadata out of the file name to where it actually belongs.

Second off, globbing is IMHO one other very problematic piece of functionality. Apart from the classic "who is responsible for it" Windows-vs-Linux problem, it mixes data and structure, and thus makes it impossible to treat file names as opaque binary blobs. As such it has the exact same issues (plus some more) that paths have and that triggered this whole discussion.

So probably the goal would be to get something like
for f in glob(cwd, type="text/plain") do ...
which does a look-up in some kind of metadata index which would ideally permit efficient filtering on any type of metadata (names, types, creation/modification dates, permissions etc). cwd would be an argument passed to the script by the shell, which contains the ID of the working directory.
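To make the look-up idea a bit more tangible, here is a toy Python sketch. `FileMeta`, `MetadataIndex`, the integer IDs and the MIME types are all assumptions made up for the example; a real index would need to cover many more attributes and stay consistent under updates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileMeta:
    file_id: int
    parent: int       # ID of the containing directory
    name: str         # display name, treated as opaque metadata
    mime_type: str


class MetadataIndex:
    """Toy index keyed on (parent directory ID, MIME type)."""

    def __init__(self, files):
        self._by_parent_and_type = {}
        for meta in files:
            key = (meta.parent, meta.mime_type)
            self._by_parent_and_type.setdefault(key, []).append(meta)

    def glob(self, cwd, type):
        """Return the IDs of files of the given type directly inside cwd."""
        # The parameter is called `type` only to mirror the pseudocode above.
        return [m.file_id for m in self._by_parent_and_type.get((cwd, type), [])]


index = MetadataIndex([
    FileMeta(10, parent=1, name="read me", mime_type="text/plain"),
    FileMeta(11, parent=1, name="photo", mime_type="image/png"),
    FileMeta(12, parent=1, name="*.txt", mime_type="text/plain"),
    FileMeta(13, parent=2, name="elsewhere", mime_type="text/plain"),
])

cwd = 1  # in the proposal, the shell would pass this directory ID to the script
matches = index.glob(cwd, type="text/plain")
```

Note that the file whose name is literally `*.txt` is matched by its type like any other text file; the name never takes part in the query, so it cannot be mistaken for a pattern.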

Yes, this would probably involve some indexing and parallel storing of data to get any kind of production-scale performance, which would probably be hard to get right. But AFAIK so does any kind of efficient lookup in SQL databases.

I'd say the "file names vs. file IDs" debate is essentially the "natural key vs. surrogate key" issue, with many of the same arguments on both sides applying to either discussion.

blakeyrat

@ixvedeusi said in Big list of software that cannot handle spaces or accents in paths:

because the whole point of the exercise is to move file metadata out of the file name to where it actually belongs.

(Another thing Macintosh did.)

@ixvedeusi said in Big list of software that cannot handle spaces or accents in paths:

I'd say the "file names vs. file IDs" debate is essentially the "natural key vs. surrogate key" issue, with many of the same arguments on both sides applying to either discussion.

Except it's not a debate because there's no such thing as a natural key. They're mythical. The only way you might get served something that works as a natural key is if it's a surrogate key in someone else's database, and they're good at not fucking it up.

Similarly, there's no (rational) argument here that the name of a file or the path of a file is data and not meta-data. It's just decades of broken computers, and a bunch of people without the imagination to even visualize a better way of doing things.

dkf

@blakeyrat said in Big list of software that cannot handle spaces or accents in paths:

The only way you might get served something that works as a natural key is if it's a surrogate key in someone else's database, and they're good at not fucking it up.

Content hashes also work (with a good algorithm like SHA-512). They're not the sort of ID that most users like, being even less nice than UUIDs/GUIDs.

blakeyrat

@dkf said in Big list of software that cannot handle spaces or accents in paths:

Content hashes also work (with a good algorithm like SHA-512).

But by definition, if you've hashed it, you've created a surrogate key.

Rhywden

@tsaukpaetra said in Big list of software that cannot handle spaces or accents in paths:

@gąska said in Big list of software that cannot handle spaces or accents in paths:

we've both misunderstood each other there.

A common theme, recently.... :P

Recently? YMBNH.

Every single person on this forum is at least adept in talking past each other.

Gąska

@dkf said in Big list of software that cannot handle spaces or accents in paths:

@blakeyrat said in Big list of software that cannot handle spaces or accents in paths:

The only way you might get served something that works as a natural key is if it's a surrogate key in someone else's database, and they're good at not fucking it up.

Content hashes also work

No they don't. First, you can't guarantee their uniqueness. The other first, they change when content changes. The last first, you can't have identical files with different hashes.

(with a good algorithm like SHA-512)

This is even worse - not only do you have a non-unique ID that changes whenever you modify the file, it's also very computationally expensive, and the cost grows with file size!

They're not the sort of ID that most users like, being even less nice than UUIDs/GUIDs.

If the user sees the ID, you've already failed.

Tsaukpaetra

@rhywden said in Big list of software that cannot handle spaces or accents in paths:

YMBNH

Perhaps, my memory grows worse every day. No, I do not know how old I am, why do you ask?

dkf

@gąska said in Big list of software that cannot handle spaces or accents in paths:

First, you can't guarantee their uniqueness.

Have you ever found a collision in SHA-512? Do you have any idea how unlikely that is? It comes well below the likelihood of bugs in the OS joining files up in other ways.

The other concerns you raise are at the level of not making sense; if two files are identical, what does it matter which one you've got? (A directory can be argued to be a mapping from names to IDs; saving generates a new tree up to the root in a way that any functional programming language aficionado should recognise.) Also, history is a matter of producing a graph of versions, which is precisely what a VCS does, and while the ID is expensive to compute, it's something that the OS could cache safely.
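To make the "mapping from names to IDs" idea concrete, here is a minimal sketch of a content-addressed store in which saving a file produces a new root. The `store` dict, the JSON directory encoding and the function names are assumptions made for the example, not a description of any real file system.

```python
import hashlib
import json

store = {}  # hypothetical object store: SHA-512 hex digest -> bytes

def put(data: bytes) -> str:
    """Store an object under the hash of its content and return that ID."""
    key = hashlib.sha512(data).hexdigest()
    store[key] = data
    return key

def put_dir(entries: dict) -> str:
    """A directory is just a serialised mapping from names to IDs."""
    return put(json.dumps(entries, sort_keys=True).encode("utf-8"))

def get_dir(dir_id: str) -> dict:
    return json.loads(store[dir_id].decode("utf-8"))

def save(root_id: str, path: list, data: bytes) -> str:
    """Write data at path and return the ID of the new root directory.

    Nothing is modified in place: every directory on the way up is
    rebuilt, so the old root still describes the old state.
    """
    entries = get_dir(root_id)
    name = path[0]
    if len(path) == 1:
        entries[name] = put(data)
    else:
        child = entries.get(name) or put_dir({})
        entries[name] = save(child, path[1:], data)
    return put_dir(entries)

empty_root = put_dir({})
root_v1 = save(empty_root, ["docs", "a.txt"], b"hello")
root_v2 = save(root_v1, ["docs", "a.txt"], b"hello, world")
```

Both `root_v1` and `root_v2` remain valid afterwards, which is where the "graph of versions" comes from.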

Gąska

@dkf said in Big list of software that cannot handle spaces or accents in paths:

@gąska said in Big list of software that cannot handle spaces or accents in paths:

First, you can't guarantee their uniqueness.

Have you ever found a collision in SHA-512? Do you have any idea how unlikely that is? It comes well below the likelihood of bugs in the OS joining files up in other ways.

It's not about likelihood - it's that you cannot control it, and once it happens, there is no way around it but to alter the files - which you might not be able to access anymore because of the conflict!

The other concerns you raise are at the level of not making sense; if two files are identical, what does it matter which one you've got?

Because you might want to modify it further, and the random conflicting file shouldn't change. Imagine having two separate installations of the same program. Or log files that start out empty.

(A directory can be argued to be a mapping from names to IDs; saving generates a new tree up to the root in a way that any functional programming language aficionado should recognise.)

Yeah. Imagine the performance.

Also, history is a matter of producing a graph of versions

Wait, when did anybody say anything about history? We're talking about regular filesystems meant to store only the current state of files.

and while the ID is expensive to compute, it's something that the OS could cache safely.

Caching doesn't help when the content changes.

Hashes make sense for a VCS. But a general-purpose filesystem is an entirely different problem than a VCS, with entirely different goals, and entirely different problems that need to be solved in an entirely different way.

LaoC

@ixvedeusi said in Big list of software that cannot handle spaces or accents in paths:

@laoc said in Big list of software that cannot handle spaces or accents in paths:

In case you can't remember more than five posts back, you said pretty much every file system in existence already supported what you need to have this parallel ID/filename scheme.

That was me, if you're referring to this post. If it wasn't clear, that was more of a quick thought I threw out to see what could come of it, rather than a finalized design spec, so thanks for your feedback!

Nah, I wouldn't have been saying that if I wasn't sure I remembered :)

I wouldn't expect such an implementation to be particularly well-performing, but I suppose with some optimizations (indexing, caching etc) it could reach "usable" levels. Creating a dedicated file system which is actually optimized for this use case would of course be preferable, but would take more work to get to a first functional prototype.

Sure, for a toy system where you can always assume all names will comfortably fit into a RAM cache, that's fine.

How do things like for f in *.txt *.log; do ... work

First off, this has nothing to do with file names (as others have already pointed out), you want to query by type. If I was to set out to try and actually design such a hypothetical nirvana file system,

The one that releases files from the cycle of death and rebirth? Mounting /dev/null should come close

I'd definitely try to get rid of these bizarre Reverse Hungarian Notation warts at the same time, because the whole point of the exercise is to move file metadata out of the file name to where it actually belongs.

That's ugly indeed. AmigaOS had a better filetype system but it was also dog slow, doing essentially what file(1) does, only with a flexible plugin system.

LaoC

@gąska said in Big list of software that cannot handle spaces or accents in paths:

Have you ever found a collision in SHA-512? Do you have any idea how unlikely that is? It comes well below the likelihood of bugs in the OS joining files up in other ways.

It's not about likelihood - it's that you cannot control it, and once it happens, there is no way around it but to alter the files - which you might not be able to access anymore because of the conflict!

If you created 1000 files a second, it would take on average almost 56 million years until two distinct files yielded the same SHA512. Several orders of magnitude less likely than your computer being destroyed by a meteor.

The other concerns you raise are at the level of not making sense; if two files are identical, what does it matter which one you've got?

Because you might want to modify it further, and the random conflicting file shouldn't change. Imagine having two separate installations of the same program. Or log files that start out empty.

Copy-on-changed-hash? AFAIK that's basically what ZFS is doing already with its built-in deduplication. They already do calculate block (or extent? probably extent, dunno.) checksums, which is a good idea anyway, so while you may not want to recalculate the SHAsum of your entire 50 GB raw video file just because you changed your name in the metadata in the first block, you could just as well hash the block checksums.
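The "hash the block checksums" idea can be sketched roughly as follows. This is only an illustration and makes no claim about how ZFS works internally; the block size, the choice of SHA-256 per block and the function names are all assumptions.

```python
import hashlib

BLOCK_SIZE = 4  # tiny on purpose; a real file system would use KiB or more

def block_hashes(data: bytes) -> list:
    """Checksum every fixed-size block separately."""
    return [
        hashlib.sha256(data[i:i + BLOCK_SIZE]).digest()
        for i in range(0, len(data), BLOCK_SIZE)
    ]

def file_id(hashes: list) -> str:
    """Derive the file-level ID from the block checksums, not the content."""
    return hashlib.sha512(b"".join(hashes)).hexdigest()

def rewrite_block(hashes: list, index: int, new_block: bytes) -> list:
    """After a write to one block, only that block is hashed again."""
    updated = list(hashes)
    updated[index] = hashlib.sha256(new_block).digest()
    return updated

data = b"AAAABBBBCCCCDDDD"
hashes = block_hashes(data)
changed = rewrite_block(hashes, 0, b"ZZZZ")
```

Changing the first block then costs one block hash plus one hash over the (much shorter) list of checksums, rather than a pass over the whole file.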

Tsaukpaetra

@laoc said in Big list of software that cannot handle spaces or accents in paths:

Copy-on-changed-hash? AFAIK that's basically what ZFS is doing already with its built-in deduplication. They already do calculate block (or extent? probably extent, dunno.) checksums, which is a good idea anyway, so while you may not want to recalculate the SHAsum of your entire 50 GB raw video file just because you changed your name in the metadata in the first block, you could just as well hash the block checksums.

Yes, if you enable Deduplication (not really recommended unless you really do have a ton of duplicate data) each block's hash is kept in memory and if a new block comes along that hashes to an existing block it just stores the link to the existing block.

Otherwise, it's copy on write against blocks that are changing from the last snapshot, which isn't exactly what we're talking about in this thread...

Gąska

@laoc said in Big list of software that cannot handle spaces or accents in paths:

I wouldn't expect such an implementation to be particularly well-performing, but I suppose with some optimizations (indexing, caching etc) it could reach "usable" levels. Creating a dedicated file system which is actually optimized for this use case would of course be preferable, but would take more work to get to a first functional prototype.

Sure, for a toy system where you can always assume all names will comfortably fit into a RAM cache, that's fine.

Why is it a requirement for all filenames or IDs to fit in RAM? I can't think of any scenario where this is necessary. Creating a file, opening a file, reading a file's name, moving, copying, deleting - none of these operations require or benefit from having all filenames or IDs in memory all the time.

I'd definitely try to get rid of these bizarre Reverse Hungarian Notation warts at the same time, because the whole point of the exercise is to move file metadata out of the file name to where it actually belongs.

That's ugly indeed. AmigaOS had a better filetype system but it was also dog slow, doing essentially what file(1) does, only with a flexible plugin system.

AmigaOS also had orders of magnitude less power available, and had a few decades less computer science theory to work with. They might've also designed or implemented it wrong. But there's no reason we should repeat their mistakes.

@laoc said in Big list of software that cannot handle spaces or accents in paths:

@gąska said in Big list of software that cannot handle spaces or accents in paths:

Have you ever found a collision in SHA-512? Do you have any idea how unlikely that is? It comes well below the likelihood of bugs in the OS joining files up in other ways.

It's not about likelihood - it's that you cannot control it, and once it happens, there is no way around it but to alter the files - which you might not be able to access anymore because of the conflict!

If you created 1000 files a second, it would take on average almost 56 million years until two distinct files yielded the same SHA512. Several orders of magnitude less likely than your computer being destroyed by a meteor.

The thing about probability is that it usually bites someone in the ass much faster than it theoretically should.

The other concerns you raise are at the level of not making sense; if two files are identical, what does it matter which one you've got?

Because you might want to modify it further, and the random conflicting file shouldn't change. Imagine having two separate installations of the same program. Or log files that start out empty.

Copy-on-changed-hash? AFAIK that's basically what ZFS is doing already with its built-in deduplication.

Because it makes sense to use hashes for deduplication. They are a great tool for exactly the kind of problem that deduplication needs to solve. But deduplication is completely different from identification. There's a completely different set of problems to be solved. None of the useful properties of hashes are useful for identification, and all the downsides become major problems. But even if we assume for a moment that SHA-512 is fast enough, unique enough, and that there will never be two separate files (inodes) with identical content: why would you use hashes if you could use, say, consecutive numbers? They're fast to compute, they are actually unique, they stay the same, and you can have as many duplicate files as you want (they could be optimized into one physical block on disk via hashing etc., but the identifiers themselves don't have to be the same). Is there any reason at all to use SHA-512 rather than consecutive numbers?
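A toy model of that split, with consecutive numbers as identifiers and hashes used only for deduplicating storage, might look like this. `TinyFS` and its methods are invented for the example, and the sketch never frees blobs that are no longer referenced.

```python
import hashlib
import itertools

class TinyFS:
    """Toy model: consecutive numbers identify files, hashes only deduplicate."""

    def __init__(self):
        self._next_id = itertools.count(1)
        self._content_of = {}   # file ID -> content hash
        self._blobs = {}        # content hash -> bytes, stored once

    def create(self, data: bytes = b"") -> int:
        file_id = next(self._next_id)
        self.write(file_id, data)
        return file_id

    def write(self, file_id: int, data: bytes) -> None:
        digest = hashlib.sha512(data).digest()
        self._blobs.setdefault(digest, data)   # identical content shares a blob
        self._content_of[file_id] = digest     # the ID itself never changes

    def read(self, file_id: int) -> bytes:
        return self._blobs[self._content_of[file_id]]

    def stored_blobs(self) -> int:
        return len(self._blobs)

fs = TinyFS()
log_a = fs.create()          # two log files that both start out empty
log_b = fs.create()
fs.write(log_a, b"started\n")
```

Here `log_a` and `log_b` share storage while they are both empty, yet writing to one leaves the other untouched because the identifiers were never tied to the content.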

LaoC

@gąska said in Big list of software that cannot handle spaces or accents in paths:

If you created 1000 files a second, it would take on average almost 56 million years until two distinct files yielded the same SHA512. Several orders of magnitude less likely than your computer being destroyed by a meteor.

The thing about probability is that it usually bites someone in the ass much faster than it theoretically should.

https://youtu.be/ggJboWWo7lM?t=6s

PleegWat

@laoc said in Big list of software that cannot handle spaces or accents in paths:

@gąska said in Big list of software that cannot handle spaces or accents in paths:

Have you ever found a collision in SHA-512? Do you have any idea how unlikely that is? It comes well below the likelihood of bugs in the OS joining files up in other ways.

It's not about likelihood - it's that you cannot control it, and once it happens, there is no way around it but to alter the files - which you might not be able to access anymore because of the conflict!

If you created 1000 files a second, it would take on average almost 56 million years until two distinct files yielded the same SHA512. Several orders of magnitude less likely than your computer being destroyed by a meteor.

You are guaranteed a duplicate by the time you have generated 2^512 + 1 files. But this becomes a problem well before a guaranteed duplicate. See the birthday problem (where only 70 people are enough for a 99.9% chance of a duplicate). A one-in-a-million chance of collision will already be problematic.

Since the formula for this case includes numbers like (2^512)! I'm going to be lazy and not do the actual calculations.

Birthday problem - Wikipedia
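For scale, the factorials can be avoided by using the standard birthday approximation n ≈ sqrt(2·N·ln(1/(1−p))). The sketch below assumes hash outputs behave like uniformly random 512-bit values and reuses the 1000-files-per-second rate quoted in the thread; the function name is made up.

```python
import math

def hashes_for_collision(bits: int, probability: float) -> float:
    """Approximate number of random hashes needed for a collision.

    Uses the standard birthday approximation n ~ sqrt(2 * N * ln(1 / (1 - p)))
    with N = 2**bits possible values, so no factorials are needed.
    """
    space = 2.0 ** bits
    return math.sqrt(2.0 * space * math.log(1.0 / (1.0 - probability)))

SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60
RATE = 1000  # files hashed per second, the figure used in the thread

n_half = hashes_for_collision(512, 0.5)
years = n_half / RATE / SECONDS_PER_YEAR
print(f"{n_half:.3e} hashes, {years:.3e} years")
```

Under these assumptions the 50% point lands a little above 2^256 hashes, and even a one-in-a-million chance needs a number of hashes that is only about three orders of magnitude smaller.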

Gąska

@pleegwat he already accounted for that. 56 million years is actually a billion times lower than the actual number. Still not as good as consecutive numbers.

boomzilla

@gurth said in Big list of software that cannot handle spaces or accents in paths:

@gąska said in Big list of software that cannot handle spaces or accents in paths:

interactive shells are as easy to use as those we have now

/var/usr/etc/bin/blakey\ rant.sh --start

STFY

dkf

@gąska said in Big list of software that cannot handle spaces or accents in paths:

he already accounted for that. 56 million years is actually a billion times lower than the actual number. Still not as good as consecutive numbers.

Yep. 2²⁵⁶ is still a really large number.

115792089237316195423570985008687907853269984665640564039457584007913129639936

dfdub

@lb_ said in Big list of software that cannot handle spaces or accents in paths:

Thankfully we do have at least one modern widely-used filesystem that works properly: Google Drive. Filenames can contain any characters at all, the concept of a path doesn't make sense and you never need to type it, a file can exist in multiple directories at once, etc. although I think unfortunately a file can only have one name, changing the name in one place changes it everywhere

I don't see much in your description that doesn't apply to most Linux file systems, except for paths being a thing. Hard links can even have different names.

Most of the problems are the programs' fault (with the worst offender being the shell), the filesystems work perfectly fine.

Gąska

@dfdub the programs' faults are caused in large part by the operating system forcing programs to use raw strings for manipulating files. It's similar to memory problems in C - it's not that C is inherently buggy; it just makes shooting yourself in the foot extremely easy, much easier than not shooting yourself.

PleegWat

@gąska Well, you can get around parts of the string manipulation by using repeated calls to openat instead.

Gąska

@pleegwat the only thing it helps with is CWD shenanigans, and only if you have an FD of something you want to be relative to. Or in the unlikely scenario where you receive the path as fragments already, so instead of joining them, you open each directory until you open the target file - in which case you unnecessarily open many directories, and in 99% of cases, you have to do the path splitting yourself anyway.
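For reference, the repeated-openat approach being discussed looks roughly like this in Python, which exposes it through the `dir_fd` argument of `os.open`. This is a sketch that assumes a POSIX system where `dir_fd` and `os.O_DIRECTORY` are available; `open_by_components` is a made-up helper, and the demo still uses `os.path.join` to set up its test files.

```python
import os
import tempfile

def open_by_components(base_fd: int, components: list) -> int:
    """Open a file by walking directory FDs, openat()-style.

    The components are never joined into one path string; each lookup is
    relative to the directory opened in the previous step.
    """
    dir_fd = os.dup(base_fd)
    try:
        for name in components[:-1]:
            next_fd = os.open(name, os.O_RDONLY | os.O_DIRECTORY, dir_fd=dir_fd)
            os.close(dir_fd)
            dir_fd = next_fd
        return os.open(components[-1], os.O_RDONLY, dir_fd=dir_fd)
    finally:
        os.close(dir_fd)

# Demo with a directory name that contains spaces.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "dir with space", "sub"))
    with open(os.path.join(tmp, "dir with space", "sub", "file.txt"), "w") as f:
        f.write("ok")
    base = os.open(tmp, os.O_RDONLY | os.O_DIRECTORY)
    fd = open_by_components(base, ["dir with space", "sub", "file.txt"])
    content = os.read(fd, 16)
    os.close(fd)
    os.close(base)
```

It does show the cost mentioned above: one extra `open` per intermediate directory, and the caller has to have the components as a list in the first place.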

Rhywden

@gąska said in Big list of software that cannot handle spaces or accents in paths:

Is there any reason at all to use SHA-512 rather than consecutive numbers?

They make sense for distributed systems - i.e. 2+ devices creating items which then have to be synced over all devices later on.

Gąska

@rhywden there are other, better ways to make non-colliding UIDs in a distributed environment.
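The post does not say which schemes it has in mind. One common option, shown here purely as an illustration, is to pair a per-device identifier with a local counter; the class name is invented and the sketch assumes each device has already been given a distinct `node_id`.

```python
import itertools

class NodeIdAllocator:
    """Hands out IDs of the form (node_id, sequence_number)."""

    def __init__(self, node_id: int):
        self.node_id = node_id          # assumed unique per device
        self._sequence = itertools.count(1)

    def new_id(self) -> tuple:
        # No coordination with other devices is needed to stay unique.
        return (self.node_id, next(self._sequence))

laptop = NodeIdAllocator(node_id=1)
phone = NodeIdAllocator(node_id=2)
ids = [laptop.new_id(), laptop.new_id(), phone.new_id(), phone.new_id()]
```

Such IDs cannot collide as long as the node IDs differ, and they do not depend on file content.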

Rhywden

@gąska said in Big list of software that cannot handle spaces or accents in paths:

@rhywden there are other, better ways to make non-colliding UIDs in a distributed environment.

One collision in 56 million years at a hash rate of 1000 1/s? I think you should worry more about unexpected gravitationally-challenged pianos.

Gąska

@rhywden as I mentioned earlier, collisions aren't the only problem with using SHA for file IDs.

dkf

@gąska … and identifying things is very much not the only problem, especially when dealing with distributed systems (with their natural +20 to Fuck You rolls).

Gąska

@dkf identifying things is pretty much the only problem to solve when tasked with identifying things. Besides, the discussion wasn't about distributed systems - it was about filesystems and OS interaction with programs within a single machine.

dkf

@gąska said in Big list of software that cannot handle spaces or accents in paths:

identifying things is pretty much the only problem to solve when tasked with identifying things

But do you understand what that means in the first place?

Gąska

@dkf no, I don't understand what it means when you say it. But when I say it, it means the only topic I've been talking about the whole time in this thread - something that lets you refer to a specific file so you can open, read, modify and delete it. Like a filename, but not a filename, because a filename creates some very significant problems when used as a file ID.

Gąska

I wonder who the downvote is from.

Tsaukpaetra

@gąska said in Big list of software that cannot handle spaces or accents in paths:

I wonder who the downvote is from.

@boomzilla , obviously.

boomzilla

@tsaukpaetra said in Big list of software that cannot handle spaces or accents in paths:

@gąska said in Big list of software that cannot handle spaces or accents in paths:

I wonder who the downvote is from.

@boomzilla , obviously.

Ironic.