Go, Go, Gadget Depends!

LB_

Wait, so if I have two hardlinks to the same file, changing the permissions on one changes the permissions for all of them? That's...really bad. I thought metadata was stored separately from the file content?

PleegWat

The only filesystem I know anything about is the ext series. Here each file has exactly one inode, which contains all metadata about the file and the data block references, but not the name. If there are 3 links, that means there are 3 directory entries somewhere on the filesystem which point to this inode.

Here is the output of a couple of stat commands to different links to the same filesystem object:

$ stat /
  File: ‘/’
  Size: 4096          Blocks: 8          IO Block: 4096   directory
Device: 804h/2052d    Inode: 2           Links: 28
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2015-10-10 20:02:01.026730531 +0200
Modify: 2015-09-29 18:40:34.847794477 +0200
Change: 2015-09-29 18:40:34.847794477 +0200
 Birth: -
$ stat /bin/..
  File: ‘/bin/..’
  Size: 4096          Blocks: 8          IO Block: 4096   directory
Device: 804h/2052d    Inode: 2           Links: 28
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2015-10-10 20:02:01.026730531 +0200
Modify: 2015-09-29 18:40:34.847794477 +0200
Change: 2015-09-29 18:40:34.847794477 +0200
 Birth: -
$ stat /sbin/..
  File: ‘/sbin/..’
  Size: 4096          Blocks: 8          IO Block: 4096   directory
Device: 804h/2052d    Inode: 2           Links: 28
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2015-10-10 20:02:01.026730531 +0200
Modify: 2015-09-29 18:40:34.847794477 +0200
Change: 2015-09-29 18:40:34.847794477 +0200
 Birth: -

Of the information printed here, only the file path and device number don't originate from the inode.
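To make the original question concrete, here is a minimal sketch of the same idea with a regular file rather than a directory. It assumes a POSIX filesystem that supports hard links; the function name and file names are made up for illustration.

```python
import os
import stat
import tempfile


def chmod_through_one_link():
    """Create two hard links to one file, chmod one, and stat the other."""
    with tempfile.TemporaryDirectory() as tmp:
        a = os.path.join(tmp, "A.txt")
        b = os.path.join(tmp, "B.txt")
        with open(a, "w") as f:
            f.write("foobar\n")
        os.link(a, b)  # second directory entry pointing at the same inode

        os.chmod(a, 0o600)  # change permissions through the first name only
        st_a, st_b = os.stat(a), os.stat(b)
        return {
            "same_inode": (st_a.st_dev, st_a.st_ino) == (st_b.st_dev, st_b.st_ino),
            "links": st_b.st_nlink,
            "mode_b": stat.S_IMODE(st_b.st_mode),
        }


print(chmod_through_one_link())
```

Because the mode lives in the inode, the second name reports the new permissions even though it was never touched.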

I believe there are filesystems which use data deduplication where distinct files can share data segments. I'd assume however that in this case there are copy-on-write semantics when one of the references to a shared data block has its data changed.

LB_

Ok, great, but... are we still having the same conversation?

When I'm talking about giving UUIDs to files, it should not change even if the file moves to a different filesystem (e.g. from hard drive to flash drive). It shouldn't be related to the path at all. Obviously I'm talking about an imaginary feature of a filesystem that doesn't exist, though I was interested if any OS had a high-level API that worked in a similar manner with existing filesystems.

TwelveBaud

Windows NT's Distributed Link Tracking Coordinator?

PleegWat

Sorry, I must have missed the UUID discussion entirely. I've been out for a couple of days so I've been skimming. My last point was mainly meant to clarify how (to my understanding) inodes work.

The problem with a UUID for a file across filesystems probably arises when the data may be in multiple locations, and particularly when it may be in multiple locations at the same time. I don't think Linux supports it right now - AFAIK it is the combination of device number and inode that is unique, so the ID changes with a cross-filesystem move.
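A rough sketch of what that identity looks like from userspace, assuming a POSIX system. The helper names are hypothetical. A copy stands in for the cross-filesystem case here, since a move between filesystems is effectively a copy followed by a delete.

```python
import os
import shutil
import tempfile


def file_id(path):
    """Return the (device, inode) pair for the file that path currently names."""
    st = os.stat(path)
    return (st.st_dev, st.st_ino)


def rename_vs_copy():
    """Compare a file's identity after a same-filesystem rename and after a copy."""
    with tempfile.TemporaryDirectory() as tmp:
        original = os.path.join(tmp, "notes.txt")
        with open(original, "w") as f:
            f.write("hello\n")
        before = file_id(original)

        copied = os.path.join(tmp, "copy.txt")
        shutil.copy(original, copied)  # new inode, same contents

        renamed = os.path.join(tmp, "renamed.txt")
        os.rename(original, renamed)  # same filesystem: the inode is kept

        return before == file_id(renamed), before == file_id(copied)


print(rename_vs_copy())
```

The rename keeps the pair; the copy does not, which is the behaviour you would see after moving a file to another device.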

There may be ways to accomplish it still (e.g. have a data file on Dropbox, encrypt it, cache the plaintext locally, and mount the plaintext as its own filesystem) but then it's no longer separate filesystems in userspace.

ScholRLEA

@PleegWat said:

The problem with a UUID for a file across filesystems probably arises when the data may be in multiple locations, and particularly when it may be in multiple locations at the same time. I don't think Linux supports it right now - AFAIK it is the combination of device number and inode that is unique, so the ID changes with a cross-filesystem move.

This was anticipated. A long time ago, in fact. Maybe not successfully addressed, but it is a Known Issue in at least some circles.

LB_

The UUID wouldn't be stored with the file on the filesystem as metadata, it would be managed by the OS. Restarting your computer changes all the UUIDs. Replugging the flash drive changes all the UUIDs. Basically, it's just meant to be a way of tracking files so that if you move a text file from your desktop to your flash drive, it can stay opened properly in your text editor.

PleegWat

That could probably be made to work. Huge practical problems if you'd want to implement it on an existing system, but that was noted at the start.

Hardlinks would stay iffy though - as soon as one of the participating filesystems is dismounted the relation is lost, data can get out of sync, and if you're not storing the relations at all it's impossible to relink later. It's then up to the user to remember which items on different filesystems are really the same file, and they can't be trusted to remember that.

LB_

Again: with hardlinks/symlinks, each link has a different UUID.

PleegWat

Symlinks, sure, those are a reference to a path and might as well be a UUID pointing to another UUID.
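A small sketch of the difference, assuming a POSIX system where the current user may create symlinks; the function and file names are illustrative only.

```python
import os
import tempfile


def inspect_symlink():
    """Show that a symlink is its own object which merely stores a path."""
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "target.txt")
        link = os.path.join(tmp, "link.txt")
        with open(target, "w") as f:
            f.write("data\n")
        os.symlink(target, link)

        # stat() follows the link, lstat() looks at the link itself
        follows = os.stat(link).st_ino == os.stat(target).st_ino
        own_inode = os.lstat(link).st_ino != os.stat(target).st_ino
        stores_path = os.readlink(link) == target

        # Removing the target leaves the link behind, pointing at nothing
        os.remove(target)
        dangling = os.path.islink(link) and not os.path.exists(link)
        return follows, own_inode, stores_path, dangling


print(inspect_symlink())
```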

On hardlinks I think we'll have to agree to disagree - I do not think it is logical to assign different UUIDs to different links to the same file. That makes the UUID a property of the path, which to me seems in conflict with the things staying the same during moves.

It might make sense to a non-power user, but I think hardlinks are very much a power user feature.

I do think something like this would be required for editors keeping files open across moves to be useful in practice - someone moving a file is more likely to be moving it to a different storage medium than reorganizing their files on the same medium, and most user interfaces (even the Linux CLI) don't distinguish between the two.
At the same time, the feature would break down with a manual copy+delete operation, and that seems undesirable. Note that at least on Windows the default operation for a file drag&drop between directories is move within a filesystem, but copy across filesystems.

LB_

Notepad++ has a feature to rename or delete the file you are currently editing. How do you propose that work if each hardlink has the same UUID?

PleegWat

I'm tempted to call , but I have run into that problem. You can actually create a new link to an open file (linkat(fd, "", AT_FDCWD, path, AT_EMPTY_PATH), may fail if all links to the file have been removed), but you cannot delete existing references if you don't know their names, and there is no generic mechanism to query all references to an existing file or even (AFAIK) to verify if a path and a file descriptor point to the same inode. I agree this is a shortcoming, though being able to query for references would be technically complicated.
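On the last point, one commonly used check is to compare the device and inode numbers reported by fstat on the descriptor with those reported by stat on the path. A minimal sketch, assuming a POSIX system; the helper names are made up, and the answer only holds for the instant of the check, since the path can be relinked right afterwards.

```python
import os
import tempfile


def fd_matches_path(fd, path):
    """True if the open descriptor and the path currently name the same inode."""
    try:
        return os.path.samestat(os.fstat(fd), os.stat(path))
    except FileNotFoundError:
        return False


def check_after_rename():
    """Keep a file open, rename it, and test both the old and new names."""
    with tempfile.TemporaryDirectory() as tmp:
        old = os.path.join(tmp, "A.txt")
        new = os.path.join(tmp, "D.txt")
        fd = os.open(old, os.O_RDWR | os.O_CREAT, 0o644)
        try:
            os.rename(old, new)  # the descriptor stays valid across the rename
            return fd_matches_path(fd, old), fd_matches_path(fd, new)
        finally:
            os.close(fd)


print(check_after_rename())
```

This still does not tell you which other names exist for the file, which is the harder problem.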

I also think you are conflating two concepts, which are different in the presence of hardlinks: the file, and its name. These separate concepts are also relevant on Windows, as NTFS supports hardlinks. A file on disk is uniquely identified by its device and inode number (or equivalent on other filesystems). A file's name is uniquely identified by its device number or equivalent, and the path from the mountpoint.

A third concept, what the user considers to be the file, is much more vague and context-dependent. They may consider different links to the same file to be different in some circumstances, but they will likely consider the copy in My Documents, the copy on the company file server, the copy on their backup USB stick (:gasp:), and the copy they just emailed to you to be the same file, and the computer should obviously know this since they were all copies made on a computer.

If you're trying to tackle the third concept, then good luck, you'll need it. And you won't be able to avoid the one-file-stored-in-multiple-places problem.

dkf

@LB_ said:

How do you propose that work if each hardlink has the same UUID?

What is the identity of the file in the first place, for which the UUID is to be a convenient shorthand? Until we can sort that out, we'll have horrible misunderstandings.

It obviously can't depend on the contents of the file, since files are mutable. Files can also be moved between computers; surely the identity wouldn't change if we had the file on a networked filesystem (even if the computer it was accessed from is packed up and shipped to another continent and accesses the network from a dodgy hotel wifi connection) and accessed the file from two computers, or if it was on a memory stick and was accessed from two computers at different times. Yet it also can't depend on the name, since “two” hard-linked files are really one; if you modify the file by a handle obtained for one name, anything reading the file using a handle obtained on the other name will see the change, and if that's not evidence that they are actually the same thing, I don't know what is. (This is separate from symbolic links; they're not the same thing, but rather an OS level way of saying “see over here instead”.)
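The hard-link claim is easy to try out. A minimal sketch, assuming a POSIX filesystem with hard-link support and an editor that rewrites the file in place; names are illustrative.

```python
import os
import tempfile


def write_one_read_other():
    """Rewrite a file through one name and read it through a handle on the other."""
    with tempfile.TemporaryDirectory() as tmp:
        a = os.path.join(tmp, "A.txt")
        b = os.path.join(tmp, "B.txt")
        with open(a, "w") as f:
            f.write("original\n")
        os.link(a, b)

        with open(b) as reader:  # handle obtained via the second name
            with open(a, "w") as writer:  # in-place rewrite via the first name
                writer.write("changed\n")
            return reader.read()


print(repr(write_one_read_other()))
```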

I don't think that the identity of things is a simple matter. What happens if we compress the file and store it on tape for a while, then restore it back exactly as it was? Let's say that we didn't happen to look for it while it was compressed, and once it is back, it's the same in all reasonable ways to how it was before being archived. Yet it didn't exist for quite a while! What of the identity then?

Myself, I stop trying to solve the problem perfectly and instead use a simpler thing such as the disk ID and the inode number, with the filename as a proxy for that. It doesn't solve everything, but it's computable with a reasonable amount of effort and doesn't drag everything into the tarpit of perfect file identity.
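As a rough illustration of that pragmatic approach, the sketch below groups a list of names by the (device, inode) pair they resolve to. It assumes a POSIX system; `group_by_identity` and `demo` are hypothetical helpers, not part of any library.

```python
import os
import tempfile
from collections import defaultdict


def group_by_identity(paths):
    """Group path names that resolve to the same (device, inode) pair."""
    groups = defaultdict(list)
    for path in paths:
        st = os.stat(path)
        groups[(st.st_dev, st.st_ino)].append(path)
    return list(groups.values())


def demo():
    """A.txt and B.txt are hard links; C.txt is a separate file with equal contents."""
    with tempfile.TemporaryDirectory() as tmp:
        a, b, c = (os.path.join(tmp, n) for n in ("A.txt", "B.txt", "C.txt"))
        for path in (a, c):
            with open(path, "w") as f:
                f.write("foobar\n")
        os.link(a, b)
        groups = group_by_identity([a, b, c])
        return [[os.path.basename(p) for p in group] for group in groups]


print(demo())
```

Equal contents do not merge C.txt into the group, and different names do not split A.txt from B.txt.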

LB_

I don't understand where the confusion is coming from here.

Let's say I have two files open in my fancy text editor: A.txt, and B.txt. They both refer to the same data on disk, be it via hardlinks or symlinks - it doesn't matter for this.

If I move A.txt to my flash drive and rename it to D.txt, it should update accordingly in my fancy text editor, but nothing should happen to B.txt. I didn't rename, move, or delete B.txt, so my fancy text editor shouldn't receive any events from the OS about it.

The identity of a file is my ability to move, rename, or delete it. Not its path. Not its filename. Not its data on disk. Not some magical metadata. It's that I can see it when I run dir or ls.
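To pin down the A.txt/B.txt scenario, here is a toy in-memory model of the imaginary feature. Everything in it is hypothetical: no OS offers this class, it never touches a real filesystem, and the IDs are thrown away when the process ends.

```python
import uuid


class SessionFileIds:
    """Toy model: one throwaway ID per directory entry, valid for this session only."""

    def __init__(self):
        self._path_by_id = {}
        self._id_by_path = {}

    def track(self, path):
        """Return the session ID for a name, creating one on first use."""
        if path not in self._id_by_path:
            file_id = uuid.uuid4()
            self._id_by_path[path] = file_id
            self._path_by_id[file_id] = path
        return self._id_by_path[path]

    def move(self, old_path, new_path):
        """Record a move or rename; only the moved name's entry changes."""
        file_id = self._id_by_path.pop(old_path)
        self._id_by_path[new_path] = file_id
        self._path_by_id[file_id] = new_path

    def current_path(self, file_id):
        """Where the tracked name lives now, or None if it is unknown."""
        return self._path_by_id.get(file_id)


ids = SessionFileIds()
a_id = ids.track("A.txt")
b_id = ids.track("B.txt")  # a second link gets its own ID
ids.move("A.txt", "/media/flash/D.txt")
print(ids.current_path(a_id), ids.current_path(b_id))
```

An editor holding `a_id` would follow the file to its new name, while the entry for B.txt is never touched.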

dkf

What you seek to do is something we cannot do. It would require a single information infrastructure with a single global identity system for everything. Online systems like DropBox and Google Drive can approximate to this, but it is an approximation and the seams definitely still show at the moment. Offline systems can't do it at all because of all the communication required to resolve what an ID really means.

I admit I sprang a trap on you by constructing “obvious” things in my previous message that force offline solutions to not work. ;) But they are real problems, and the ID/object linkage problem is a huge one that might have no solution at all. There are a number of semi-solutions that work really well, but they do so by rejecting one of the parts of your vision. Offline really forces that there's no central authority. File mutability means that the contents can't be used to generate IDs. Getting the OS to do all the reconciliation will be very complicated and utterly horrible when working with multiple OSes, and it might be better to use some local ID instead that is only system-unique and then map between it and the global ID space (though what that really means, I don't know). And so on…

Gurth

@LB_ said:

If I move A.txt to my flash drive and rename it to D.txt, it should update accordingly in my fancy text editor, but nothing should happen to B.txt. I didn't rename, move, or delete B.txt, so my fancy text editor shouldn't receive any events from the OS about it.

Interesting … I just thought that for fun, I’d see what happens in OS X if I tried the above. First I made a text file and hardlinked to it:

[code]$ echo foobar > A.txt
$ ln A.txt B.txt[/code]

Double-clicking A.txt in the Finder opened it in TextEdit; double-clicking B.txt did something unexpected: it popped up the window for A.txt. With another text editor (for the record: skEdit), I can open both files simultaneously, and if I change the text in A.txt and save it, the TextEdit window for it will immediately update to the new contents.

However, B.txt doesn’t change along — and will now open separately in TextEdit with its original contents. That, I should really have expected, since OS X’s file save API first creates a temporary file in some hidden directory deep in the file system before replacing the original file with that one (totally breaking any other hard links to it, of course).
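The effect is easy to reproduce outside OS X. A minimal sketch of a save-via-temporary-file routine, assuming a POSIX filesystem; unlike the behaviour described above, this hypothetical helper puts the temporary file in the same directory as the target so that the final swap stays on one filesystem.

```python
import os
import tempfile


def save_via_temp_file(path, text):
    """Write text to a temporary file next to path, then swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as f:
            f.write(text)
        os.replace(tmp_path, path)  # path now names a brand-new inode
    except BaseException:
        os.unlink(tmp_path)
        raise


def demo():
    """Save A.txt this way and see what its hard link B.txt still contains."""
    with tempfile.TemporaryDirectory() as tmp:
        a = os.path.join(tmp, "A.txt")
        b = os.path.join(tmp, "B.txt")
        with open(a, "w") as f:
            f.write("foobar\n")
        os.link(a, b)

        save_via_temp_file(a, "changed\n")
        with open(a) as fa, open(b) as fb:
            return fa.read(), fb.read(), os.stat(a).st_ino == os.stat(b).st_ino


print(demo())
```

After the save, B.txt still names the old inode with the old contents, so the two names no longer refer to the same file.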

As expected, moving A.txt to a different volume has similar effects: double-clicking it on the volume I moved it to while it remains open in TextEdit brings the same TextEdit window to the front, as the OS still considers it to be the same file even though I moved it, but double-clicking B.txt opens it in a new window instead of popping the one with A.txt up again.

dkf

@Gurth said:

OS X’s file save API

No, that's not strictly true. The API has several ways it can be used. It's just that for ordinary text files, making a temporary file and then switching it into place is by far the best way to do it. (That's true on all modern platforms.) Other types of file need a different approach.

OSX actually uses plain POSIX for this stuff. NSFileHandle is just a wrapper.

LB_

@dkf said:

It would require a single information infrastructure with a single global identity system for everything.

What part of "restarting the computer resets all the UUIDs" didn't you understand? The UUIDs are completely temporary and only last until the filesystem is unmounted. There's no need for any central identity system because identities are a local concept - they are local to your system.

dkf

@LB_ said:

The UUIDs are completely temporary and only last until the filesystem is unmounted.

So an involuntary network disconnect will change all the UUIDs and confuse the fuck out of the software even more than before? Sounds like a plan, O bearded guardian of wisdom!

LB_

If the network disconnects, it's as if the file was deleted. What's the issue? I mean, I guess you could cache it until the internet comes back, but that's a whole different can of worms.

blakeyrat

@dkf said:

What you seek to do is something we cannot do. It would require a single information infrastructure with a single global identity system for everything.

Right. Because GUIDs do not exist.

Listen to the wisdom of dkf.

blakeyrat

@LB_ said:

If the network disconnects, it's as if the file was deleted. What's the issue? I mean, I guess you could cache it until the internet comes back, but that's a whole different can of worms.

Right. Because network file caching doesn't exist. It hasn't been in Windows since like Windows 2000 Pro or anything.

Listen to the wisdom of LB_.

Lotsa wisdom all up in this thread.

LB_

@blakeyrat said:

Right. Because network file caching doesn't exist. It hasn't been in Windows since like Windows 2000 Pro or anything.

I wasn't suggesting that at all, I was talking about the fact that you have no idea what happened to files while the network was disconnected and so could not do anything sane with the UUIDs when network returned.

Gurth

@dkf said:

No, that's not strictly true. The API has several ways it can be used.

I stand corrected — I’ve never looked deeply into it, just wrote some programs that saved stuff and I observed this behaviour when finding out how to make them do that.

PleegWat

@LB_ said:

@blakeyrat said:
Right. Because network file caching doesn't exist. It hasn't been in Windows since like Windows 2000 Pro or anything.

I wasn't suggesting that at all, I was talking about the fact that you have no idea what happened to files while the network was disconnected and so could not do anything sane with the UUIDs when network returned.

In many current network file systems, you don't have a clue what happens while the network is connected either.

dkf

@PleegWat said:

In many current network file systems, you don't have a clue what happens while the network is connected either.

Anyone who thinks this is simple hasn't tried to do it for real. Please continue to think it is simple.

JBert

You really want new front page articles, don't you?

LB_

Something amazing just happened. In Windows I had a folder open that was in my Google Drive. I renamed the folder via the Google Drive web interface. Magically, the Explorer window updated too! Google Drive, Windows, and Windows Explorer all agreed that the folder had been renamed, rather than deleted with an identical copy being created under a different name. It seems to also work for files, but only with Explorer.

So, it seems only Explorer understands what is really happening when Google Drive renames a file or folder due to tight coupling with Windows. Oh well.

dcon

@LB_ said:

So, it seems only Explorer understands what is really happening when Google Drive renames a file or folder due to tight coupling with Windows. Oh well.

Something worked as expected and you're complaining?

LB_

I guess you missed most of the previous discussion here? I didn't post to this thread at random.

dkf

@LB_ said:

I didn't post to this thread at random.

YMBNH!