Why does nobody use extended attributes?



  • @dkf said:

    I've used that a few times to my advantage, to do things like extracting the text content so that I could recover critical data and get working again. All with generic tools. Saved my ass.

    That advantage deserves being repeated in larger type.



  • @EvanED said:

    If I want to attach metadata to a .cpp file, I can't zip up that file with the metadata I want and then say gcc -c mything.zip and have it work.

    No. But what you can do is build a really thin wrapper around gcc, based on completely standard archive manipulation tools, that does process your metadata any way you like (including ignoring it entirely) - and you can implement that wrapper easily, and completely portably, on any platform that gcc runs on and guarantee that it will work the same way everywhere when fed one of your augmented .cppx files.
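
    For instance, here's a minimal sketch of what such a wrapper could look like (the gccx name and the convention of keeping the real source as source.cpp inside the archive are made up for illustration):

    #!/bin/sh
    # gccx: hypothetical thin wrapper -- pull the real source out of a .cppx
    # archive with standard zip tooling, hand it to the ordinary gcc, clean up.
    # Any metadata members in the archive are simply ignored.
    set -e
    cppx=$1; shift
    tmp=$(mktemp -d)
    unzip -q "$cppx" source.cpp -d "$tmp"
    gcc -c "$tmp/source.cpp" -o "${cppx%.cppx}.o" "$@"
    rm -rf "$tmp"

    Invoked as gccx mything.cppx -O2 or similar; anything after the archive name is passed straight through to gcc.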


  • BINNED

    This makes perfect sense. In a recent project I had a bunch of .json files, then got more csv/json files with extra information; the only change I made was having the parser read the .json.meta[X] files and append them to the .json.
    While the thin wrapper for gcc makes sense, nothing stops the OP from adding an extension to gcc itself to pick up .cpp.meta[X] annotations while parsing .cpp files. While at it, the OS file manager could be changed too, to always copy the .meta[X] files along when copy-pasting, similar to how copying a .html file on Windows copies along the directory that comes with it.



  • @Maciejasjmj said:

    Then you're missing thumbnails, album arts, and basically reduce the whole thing to a string-string keystore. And a dumb one to boot - without any standard on the keys, all you can do is dump them for viewing and maybe edit them.
    Holy crap. Look, just because I said they're useful even if you don't know what keys are present doesn't mean you can't have some standard key sets. For example, EXIF could still have come along and said "hey, use these xattr keys for these things", and then JPEG could have said "we suggest using EXIF for metadata." It's (IMO) a strictly better situation than what we have now.

    @Maciejasjmj said:

    Across any conceivable content? Is an artist in a MP3 file an "author", or a less-standardized "artist"?
    Whatever people agree is more useful. It's not like the ID3 "artist" tag is unambiguous, but it's still probably the third-most important tag on audio. Calling it "author" instead would probably make the most sense; it would be a minor departure from the natural terminology for a more generalized thing.

    @Maciejasjmj said:

    Thumbnails, for example. Or chapter markers. Dunno really, that's why I said "potentially".
    I don't see why you couldn't put those in xattrs. (Actually, thumbnails are a really good example of something that could very productively go into xattrs right now; there's a quick sketch at the end of this post. It's something that could be kept with the file, and yet can be easily reconstructed if it is lost.)

    @Maciejasjmj said:

    As for total size - I'd expect to be able to read 1000 bytes from a 1000-byte file, and I bet lots of programs would too.
    Not if we had started off in this world in the beginning...

    Besides, how many programs even look at the size of the file they're reading? I'm sure some do, but I'd guess they're a minority. I strongly suspect almost all programs that use a stream interface rather than mmap (which I'd guess is a majority) either just read until EOF, or else read offsets from a header and seek around the file reporting errors if the seek fails.

    @Maciejasjmj said:

    And even if we totally disregard all those things, I still have yet to see one argument for storing those in ADS/xattrs instead of having them at the top of the file (in any format).
    If every format supported the same way to do it, then fine, that would work. But now you've just reinvented xattrs.
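
    Coming back to the thumbnail point above, here's a rough sketch of doing it today with ImageMagick and the Linux attr tools. The user.thumbnail key is invented, and note that many filesystems cap a single xattr value at a few KB, so the thumbnail has to stay small.

    # Make a small thumbnail and stash it, base64-encoded, in a user xattr.
    thumb=$(convert photo.jpg -thumbnail 48x48 jpg:- | base64 -w0)
    setfattr -n user.thumbnail -v "$thumb" photo.jpg

    # Read it back out later:
    getfattr --only-values -n user.thumbnail photo.jpg | base64 -d > thumb.jpg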



  • @dse said:

    nothing stops the OP from adding an extension to gcc itself to pick up .cpp.meta[X] annotations while parsing .cpp files. While at it, the OS file manager could be changed too

    Personally I think making any such annotation depend on having got the .cpp concerned from inside a .cppx, then packing the annotations back inside the .cppx alongside the original .cpp, makes more sense than adding behaviors to the OS file manager.

    This is exactly the kind of thing that thin wrapper scripts around application programs are good for; the internal layout of a .cppx can then be fooled about with as required to suit the purposes of the build suite rather than needing to be shoehorned into whatever format makes lowest-common-denominator sense to a file manager.

    I actually rate the _files thing for Windows HTML saving as a pretty crude hack. The Windows file manager already has largely-transparent zip file handling inbuilt, and the only thing that would have stopped zip files being used as the standard format for saving web pages is an implementation detail.

    Windows Explorer doesn't provide any way for a zip file to behave like a current directory: opening an archive member with Explorer will generally extract it and then hand it off to the right application program, but that application then gets no reasonable way to open other subfiles from the same zip. Which is a shame, especially given that NTFS has this concept of general-purpose reparse points which ought to make building proper loop-mount-on-demand functionality into Explorer fairly easy; easier, I would have thought, than the special file copy bodge for .htm and .html files. And it would have got rid of that whole unfortunate business with being unable to run certain applications directly from inside a zipped folder because they depend on other things that Explorer as it stands right now will never know it's supposed to unpack.



  • @flabdablet said:

    @EvanED said:
    Imagine if images were distributed this way -- you had to download an archive (of image data + format metadata + EXIF data in separate files or something) and then open something in it every time you wanted to look at an image.

    They are distributed this way.


    Sorry, I left out something important in what I said. I'll try again:

    Imagine if images were distributed this way -- you had to download an archive (of image data + format metadata + EXIF data in separate files or something) and then **explicitly extract the archive before you can** look at an image.

    @flabdablet said:

    No. But what you can do is build a really thin wrapper around gcc, based on completely standard archive manipulation tools, that does process your metadata any way you like (including ignoring it entirely) - and you can implement that wrapper easily, and completely portably, on any platform that gcc runs on and guarantee that it will work the same way everywhere when fed one of your augmented .cppx files.
    Great. And then I just have to write a really thin wrapper around Emacs too, and then svn, and then Git, and then diff, and then grep, and then etags. And then I'll have most of what I need for that one kind of file. And then convince everyone to use my wrappers.



  • After which you would end up better off than an xattr-based solution, because you'd have metadata not subject to falling off the edge of the world on any cross-system file copy. And at worst you'd end up with a file you could still use on a system that didn't have the wrappered tools and didn't need your extra metadata, just by doing the unpack and repack steps by hand.



  • By the way, the thin-wrappers thing already exists to a certain extent; there are zless, zgrep, zcat and so on that allow you to do stuff with files whether or not they're gzipped. It's the same basic idea, and it really isn't much work.
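
    A minimal sketch in the same spirit (the xcat name is made up; it just falls back to plain cat for non-gzipped input):

    #!/bin/sh
    # xcat: print each file's contents whether or not it happens to be gzipped.
    for f in "$@"; do
        case $(file -b --mime-type "$f") in
            application/gzip|application/x-gzip) gzip -dc "$f" ;;
            *)                                   cat "$f" ;;
        esac
    done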



  • @EvanED said:

    and then explicitly extract the archive before you can look at an image.

    The point is that if image file formats were a specific case of a general-purpose archive format, then image-processing tools would do the archive-handling step internally - much as OpenOffice does for .odt or MS Office does for .xlsx or Java does for .jar.
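
    You can already see how that works with the zip-based formats, using nothing but a stock unzip (the file names here are just examples):

    unzip -l report.odt                        # list the members of an OpenDocument file
    unzip -p report.odt content.xml | less     # the actual document body, as XML
    unzip -p book.xlsx xl/workbook.xml | less  # same idea for an Excel workbook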



  • @flabdablet said:

    ...you'd have metadata not subject to falling off the edge of the world on any cross-system file copy.
    ...which is missing my entire f'ing point: the fact that you have to worry about losing xattrs is really unfortunate, and you could do a lot of neat, currently-unrealistic/annoying stuff if they had wider and better support! (And yes, the stuff you'd have to do to get adoption of my .cppx thing, for example, even within an organization, let alone the broader community, is effectively completely unrealistic.)

    I'm not saying that tools should put stuff in xattrs given the current landscape, because I agree that's a bad idea for the reasons that, again, have been reiterated repeatedly in this thread.

    @flabdablet said:

    The point is that if image file formats were a specific case of a general-purpose archive format, then image-processing tools would do the archive-handling step internally - much as OpenOffice does for .odt or MS Office does for .xlsx or Java does for .jar.
    And again, if it was widely supported to do this, then that would be fine; I would say it's just a different implementation of what I'm saying. But it doesn't have close to wide enough support to be useful for things outside of those specific formats.



  • @EvanED said:

    you could do a lot of neat, currently-unrealistic/annoying stuff if they had wider and better support!

    Exactly the same applies for my archive-augmented source files, which I still maintain are a Better Thing than xattrs because they're (a) more flexible (b) require no support at all from operating systems.



  • What about just having a file next to the source code file with the archive of the extra stuff? That way, you can still read the text data with a text editor directly and you don't need any special filesystem support.



  • @EvanED said:

    But now you've reinvented xattrs

    Except they move with the file, and are tied to the file, not the filesystem. And that's what the hassle is mostly about.

    Not addressing your other points, because if it were the '70s, then maybe it could be made to work, fine. But as far as I'm concerned, that standardized keystore would be better off in the file itself.



  • @flabdablet said:

    Exactly the same applies for my archive-augmented source files, which I still maintain are a Better Thing than xattrs because they're (a) more flexible (b) require no support at all from operating systems.
    I maintain that the solutions are effectively the same. (And also that the chance of implementing either in a working way in the next few decades is also the same, effectively zero.) FWIW, that means I would be very content with your idea as well.

    @Maciejasjmj said:

    Except they move with the file
    Xattrs would move with the file as well if they had reasonable support. You could also imagine a version of cp, scp etc. that zips everything up when moving to a FS without xattr support and then unzips it back (if in the same format) when moving back. But again, if most programs supported putting alternate streams into a single file, I'd be fine with that too.
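
    For what it's worth, some of that plumbing exists today on Linux; a quick sketch, assuming GNU cp/tar, rsync and the attr tools:

    # Local copy that keeps user xattrs (GNU cp).
    cp --preserve=xattr,timestamps src.jpg dst.jpg

    # Across machines: rsync -X carries xattrs if both ends support them.
    rsync -aX src.jpg remote:/data/

    # Crossing a filesystem with no xattr support: bundle them into a tar
    # on the way out and reapply them on the way back in.
    tar --xattrs -cf bundle.tar src.jpg
    tar --xattrs -xf bundle.tar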



  • Hello, yes? Do you ever listen to anything I say here?

    Apple started going to shit when they took on the engineers who made the (failed!) NeXT OS and put them in charge of their GUI. Everything turned all Unix-y, the old "Mac way of doing things" got shat on, and all forward progress stopped.

    That's why I switched to Windows in the first place. I'd never have considered using Windows for anything other than video games until about 10.4 came out.



  • @flabdablet said:

    (b) require no support at all from operating systems.
    Actually, you know, maybe this has more of a benefit than I'm giving you credit for. Because you could implement respectable support for this using something like LD_PRELOAD on *nix or Detours on Windows by intercepting file system calls. For example, in my foo.cppx example, when GCC calls open("foo.cppx", ...) and then reads from it, it could transparently redirect that to the primary file. If something accessed foo.cppx:stream, it could transparently redirect that to some other stream in that file.

    Now, this still sucks because a naive implementation will still lose everything but the primary stream (when cp copies foo.cppx, its reads would be transparently redirected to just the primary stream...), so you'd need at least a way to switch it on and off per process. But maybe this would provide a vaguely realistic migration path...

    (Of course, you could do the same at the OS, but (1) the programming would be harder and (2) you'd need root permission to use it.)
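
    A sketch of how such an interposer would be wired up, assuming a hypothetical cppx_shim.c that implements the open() redirection described above:

    # Build the shim as a shared object.
    gcc -shared -fPIC -o libcppx.so cppx_shim.c -ldl

    # Run an unmodified gcc with the shim injected; every open() the compiler
    # makes passes through the shim before it reaches libc, so the source can
    # be transparently served from foo.cppx's primary stream.
    LD_PRELOAD=$PWD/libcppx.so gcc -c foo.cpp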



  • @EvanED said:

    the solutions are effectively the same

    except that xattrs typically have some fairly restrictive size and format limits, while archived file members inherit all the useful things you can do with filesystems (including hierarchical directories and standard filesystem metadata for each member).



  • @EvanED said:

    For example, in my foo.cppx example, when GCC calls open("foo.cppx", ...) and then reads from it, it could transparently redirect that to the primary file.

    Better still, when GCC calls open("foo.cpp", ...) you could have it check for foo.cppx and if found, open its primary stream instead. And you could make the same LD_PRELOAD wrapper do Reiser-ish things like open("foo.cpp/blah", ...) to open the blah subfile of foo.cppx. None of it is terribly difficult.



  • Fine, you can work around it and make the simple tools more complicated, but why? How are xattrs better than an in-file header or wrapper?



  • Archiving wouldn't work for something like storing HTTP caching-related headers.



  • Show me a single use case of a C/C++ source that would be so immensely useful as to justify all that extra work and support from the compiler, I dare you.


  • Discourse touched me in a no-no place

    @wft said:

    Show me a single use case of a C/C++ source that would be so immensely useful as to justify all that extra work and support from the compiler, I dare you.

    You do realise that “all that extra work” is going to be about as much as writing a line or two of makefile? In fact, here's a contribution (assuming that we adopt a standard filename inside the zip archive):

    .SUFFIXES: .cppx .cpp .cx .c

    .cppx.cpp:
            unzip -p $< source.cpp >$@
    .cx.c:
            unzip -p $< source.c >$@
    


  • Why store the files inside a zip and make your source control history stupid when you can use the filesystem for storing files?


  • Java Dev

    @EvanED said:

    Totally feasible. With z/OS pipes, commands can have multiple input and/or multiple output pipes that you can connect up arbitrarily. I think they might be numbered instead of named, so that would have to change, but it's a minor point.

    Reminds me of linux. By default, you have 3 streams: An input stream (0), an output stream (1), and an error stream (2). But you can extend that to any number you want. The following program expects to be passed an extra output channel on descriptor 4:

    #include <stdio.h>
    
    int main(int argc, char * argv[])
    {
        /* The caller is expected to have opened descriptor 4 for writing,
           e.g. via a shell redirection such as 4>&1 or 4>somefile. */
        FILE * ch4 = fdopen(4, "w");
        if (ch4 == NULL) {
            fprintf( stderr, "descriptor 4 is not open\n" );
            return 1;
        }
    
        fprintf( stdout, "Hi from stdout!\n" );
        fprintf( stderr, "Hi from stderr!\n" );
        fprintf( ch4, "Hi from channel 4!\n" );
        return 0;
    }
    
    $ ./test 4>&1
    Hi from stdout!
    Hi from stderr!
    Hi from channel 4!
    


  • @ben_lubar said:

    Archiving wouldn't work for something like storing HTTP caching-related headers.

    I can't see why a headers subfolder inside a zip-compatible .httpx archive could not be used for that.

    About the only use case I can see where xattrs enable something you couldn't do just as easily with subfiles inside an archive is access control lists.



  • Let's say you run wget to get some png file that you want to use for your website. But it downloads a httpx file which you have no idea how to open. Well, whatever, put it on the website anyway. Repeat the process N times and you now have a file that is 90% overhead and 10% the actual file you wanted.



  • You're assuming that wget would make a httpx from something you ask it to download, without your having told it specifically to save the output in that format; that's unlikely to be true. If a hypothetical extended wget were to support the hypothetical httpx format, it would most likely do so via something akin to its existing --save-headers option, which you could equally well abuse to create the broken scenario you describe.



  • Or, using xattrs, wget could download the file, transparently store the ETag and Last-Modified headers, and then have an option to re-fetch the file only if it has changed, handling a 304 response.



  • A httpx-capable wget could do exactly the same thing by saving downloaded files in an archive that includes a headers sub-file (or even a sub-folder, if you wanted to use one subfile per header). The difference would be that subsequent uploading would require either explicit extraction of the primary subfile from the archive using a standard unzip tool, or a httpx-aware uploader that does the same thing implicitly.

    What it all boils down to is that xattrs or alternate data streams are a better choice for metadata that you don't care about losing, because there are any number of opportunities for a file's xattrs to go missing.

    Metadata about the primary content rather than about the file as such, like ID3 tags or EXIF data, needs to be bundled up into the file itself along with the primary content; historically that's been done by defining file formats with typed chunks, and there are endless ad-hoc methods for doing that. Personally I vastly prefer formats like jar, odt, deb and xlsx that re-purpose standard archive formats for the chunk-bundling job, simply because being able to manipulate all of these formats with standard archive-processing tools is really useful.

    And there is really very little that xattrs and ADSs can do that can't also be achieved with plain subfiles inside an archive, albeit a bit more explicitly. Access control, where the xattrs control whether or not the file content can be processed at all, is the notable exception.



  • But something like the server's last modification time or the etag isn't a property of the data contained in the file, it's a property of the file itself. Just like ACLs. Different servers compute the ETag in different ways, so knowing it is useless to everyone except the person checking whether the file is up to date.

    This would be something where you give a function a file handle and a URL and it gives back an error which is null if the file contains the data referred to by the URL. The user doesn't care about the etag - that's just transparently handled by the library.


  • Discourse touched me in a no-no place

    @ben_lubar said:

    But something like the server's last modification time or the etag isn't a property of the data contained in the file, it's a property of the file itself. Just like ACLs.

    Technically, you're wrong. Subtly.

    The etag is a property of the provenance of a file, of where it came from and how it was created. The modification time and the ACL are properties of the file itself; they don't depend on where the file came from, but rather what was done to it recently on the current system. You don't copy ACLs from an HTTP server (and it would make no sense to do so) but copying provenance information is entirely sensible, and is in fact a major use case for provenance tracking.

    This is all rather too close to the sorts of things I work on professionally. 😃



  • Do you actually have a use for an HTTP downloader that uses xattrs as you've described in the OP? Because whipping one up as a shell script wrapper around wget should be pretty quick and easy.
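
    Something along these lines, perhaps (a rough sketch only; it uses curl rather than wget because the status code and response headers are easier to capture in a script, and the user.http.* key names are invented):

    #!/usr/bin/env bash
    # Fetch a URL, remember its ETag/Last-Modified in user xattrs on the saved
    # file, and skip the download when the server answers 304 Not Modified.
    # Needs curl plus the attr tools (getfattr/setfattr).
    url=$1 out=$2

    cond=()
    etag=$(getfattr --only-values -n user.http.etag "$out" 2>/dev/null)
    lastmod=$(getfattr --only-values -n user.http.last_modified "$out" 2>/dev/null)
    [ -n "$etag" ]    && cond+=(-H "If-None-Match: $etag")
    [ -n "$lastmod" ] && cond+=(-H "If-Modified-Since: $lastmod")

    status=$(curl -s -o "$out.tmp" -D "$out.hdr" -w '%{http_code}' "${cond[@]}" "$url")

    if [ "$status" = 304 ]; then
        rm -f "$out.tmp" "$out.hdr"
        echo "$out is up to date" >&2
        exit 0
    fi

    mv "$out.tmp" "$out"
    etag=$(sed -n 's/^[Ee][Tt]ag: *//p' "$out.hdr" | tr -d '\r')
    lastmod=$(sed -n 's/^[Ll]ast-[Mm]odified: *//p' "$out.hdr" | tr -d '\r')
    [ -n "$etag" ]    && setfattr -n user.http.etag -v "$etag" "$out"
    [ -n "$lastmod" ] && setfattr -n user.http.last_modified -v "$lastmod" "$out"
    rm -f "$out.hdr"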


  • Java Dev

    I doubt wget would store a new httpx format as its default mode of operation, given how much existing scripting relies on the current behaviour.



  • Same applies to the likelihood of it doing anything in particular with xattrs by default.


  • Java Dev

    To a lesser degree. I could see it storing some key headers in xattrs by default if they ever start seeing regular use, and using them for cache control. With xattrs (either native or a kernel-based zipfile emulation) other applications still see the same primary data stream, which significantly reduces the impact on existing scripts.

    Still, at the moment xattrs are stuck in a 'nobody uses them because nobody uses them' deadlock.



  • Samba uses them to emulate NTFS ACLs, IIRC.



  • IIRC SELinux and real Linux ACLs use them as well.
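
    You can dump those from the command line by asking getfattr for every namespace rather than the default user.* one (illustrative output from an SELinux box):

    $ getfattr -d -m - somefile
    # file: somefile
    security.selinux="unconfined_u:object_r:user_home_t:s0"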



  • So there you go. The answer to the original question of why nobody uses xattrs is that in fact they do.



  • But they're kernel-space xattrs, which can't be interacted with by user-space code.


  • Discourse touched me in a no-no place

    @ben_lubar said:

    But they're kernel-space xattrs, which can't be interacted with by user-space code.

    Which OS are we talking about here? User-space code on OSX most certainly uses xattrs (e.g., to flag a file as having been downloaded from the internet). It's just that only a minority of programs use xattrs, but then that's actually what you'd expect: the data is typically more important than the metadata.

    But you need the filesystem to support xattrs. Most modern FSes do, but some (I'm looking at you, FAT-derivatives!) don't.
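
    Userspace on OS X can poke at them directly with the stock xattr tool; a freshly downloaded file typically carries something like this (the key names are real, the listing is illustrative):

    $ xattr ~/Downloads/somefile.dmg
    com.apple.metadata:kMDItemWhereFroms
    com.apple.quarantine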



  • @dkf said:

    I'm looking at you, FAT-derivatives!

    I recently had to explain to my dad that FAT32 does work on 64 bit computers.



  • @ben_lubar said:

    FAT32 does work on 64 bit computers
    I initially misread that as "does not," and was about to chastise you, but then I read it again. :facepalm:



  • It turns out the reason he couldn't read his files from the drive was that it was broken, and in the 20% of the time that it worked, it contained an ext3 filesystem with my Steam library on it. From back when I ran Fedora on a Pentium 4.



  • @wft said:

    Show me a single use case of a C/C++ source that would be so immensely useful as to justify all that extra work and support from the compiler, I dare you.
    The compiler wouldn't have to support anything. It would do the same damn thing that it does now.

    Everything would be supplied by either the OS or the libc equivalent! (Or, with the thing I said, an LD_PRELOAD that interposes between the application and libc.) And then everything would get at least some support for this kind of stuff, not just the compiler.



  • @PleegWat said:

    @EvanED said:
    Totally feasible. With z/OS pipes, commands can have multiple input and/or multiple output pipes that you can connect up arbitrarily. I think they might be numbered instead of named, so that would have to change, but it's a minor point.

    Reminds me of linux. By default, you have 3 streams: An input stream (0), an output stream (1), and an error stream (2). But you can extend that to any number you want.
    Sorta. I think it's closer to process substitution (the diff <(do_thing_1) <(do_thing_2) syntax). There's nothing fundamental that z/OS gives you that you couldn't do in Linux, it's just that (1) the suite of command line tools is designed around being able to do this and actually takes advantage of it (just because the program can fdopen fd #4 doesn't mean that it's actually likely to -- this is why I say process substitution is closer) and (2) I think the shell syntax is richer in terms of the pipelines you can build. (It's been a decade since I did anything with z/OS, and that only during a summer internship. So I forget everything that you can do.)


  • Java Dev

    Yeah, the only two standard Linux tools I can think of right now that can handle higher file descriptors are the bash builtin read and flock. And you're gonna have to write some pretty tricky bash code to set up pipelines where the input FD isn't #0.


  • Discourse touched me in a no-no place

    @PleegWat said:

    And you're gonna have to write some pretty tricky bash code to set up pipelines where the input FD isn't #0.

    Is /dev/fd/4 (or whatever number) not a thing on your operating system? That makes it easy to tell most tools to take input from (or write output to) an alternate stream.


  • Java Dev

    There's that, and named pipes, on the process side. The problem's on the bash side. foo | bar is always stdout to stdin. You can do something like

    bar 4< <(foo)
    

    But, as mentioned, that gets unwieldy quickly, if it even scales at all.


  • kills Dumbledore

    @dkf said:

    I've used that a few times to my advantage, to do things like extracting the text content so that I could recover critical data and get working again. All with generic tools. Saved my ass.

    I used to maintain an application that imported data from XLSX spreadsheets. It was very useful to be able to change the extension and look at the XML data. The same tool also dealt with XLS by manipulating the binary data; that was firmly in the "don't touch in case you break it" bucket, and any problems meant either finding a workaround or convincing the client to switch to XLSX instead.
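
    For the XLSX side you don't even need to rename anything; a stock unzip reads the package directly (book.xlsx is a placeholder name, and the member paths are the usual OOXML layout):

    unzip -l book.xlsx                                  # see the package layout
    unzip -p book.xlsx xl/worksheets/sheet1.xml | less  # the first worksheet's XML
    unzip -p book.xlsx xl/sharedStrings.xml | less      # the shared string table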

