UNIX/WIN wildcards



  • So, today my mind was completely blown when I learnt that wildcards in file names are handled completely differently on UNIX and WIN (or rather, by their command interpreters).
    I thought that someone else not familiar with both worlds might find it interesting.

    If you have a folder which contains:

    File1.txt
    File2.txt
    File3.txt

    On Windows, if you run
    del *.txt
    you're running the del command, passing *.txt as the argument.

    On Unix, if you run
    rm *.txt
    bash expands *.txt to File1.txt File2.txt File3.txt and runs the rm command with those three arguments.

    This is not a WTF per se, different systems work in a different way. I've always assumed the Windows way so I was surprised.

    IMHO, however, there are a couple of minor :wtf: with the UNIX way:

    1. There's a limit to the number of files that you can process, which is determined by the maximum length of the expanded argument list.
      Practically speaking, this usually doesn't seem to be a problem.
    2. Bash will expand the wildcard ONLY if it can find some files. If no files are matched, instead of expanding to nothing, it will just use the original expression. So, if the files are already deleted in the previous example, it will execute rm *.txt instead of rm. In this particular case it doesn't cause problems, but if you're writing a script this could cascade to bigger stuff. It can be changed (shopt -s nullglob), but this is the default behaviour.
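
    For anyone who wants to poke at point 2 without deleting anything, here's a rough Python sketch of what bash does by default. The helper name expand_like_bash is made up for illustration, and it only imitates the "no match, keep the pattern" rule, not bash's full globbing (the sort order here is by code point, not by locale).

```python
import glob
import os
import tempfile


def expand_like_bash(pattern):
    """Expand a glob the way bash does by default (nullglob off)."""
    matches = sorted(glob.glob(pattern))
    # No match: bash passes the pattern through untouched instead of dropping it.
    return matches if matches else [pattern]


def demo():
    """Recreate the three-file folder from the post and expand two patterns."""
    with tempfile.TemporaryDirectory() as folder:
        for name in ("File1.txt", "File2.txt", "File3.txt"):
            open(os.path.join(folder, name), "w").close()
        hit = expand_like_bash(os.path.join(folder, "*.txt"))
        miss = expand_like_bash(os.path.join(folder, "*.log"))
        # Strip the temporary directory so only the file names remain.
        return ([os.path.basename(p) for p in hit],
                [os.path.basename(p) for p in miss])
```

    The second result is the part that bites in scripts: the command receives the literal pattern as if it were a file name.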


  • @Zmaster The advantage of the unix way is that it's consistent - the same glob pattern will expand in the same way regardless of which command you are executing. This does not apply in windows, and I know I've read in the past there are commands which deviate significantly from what the majority does.

    There are certain things you can do on windows which don't work on linux (like rename *.htm *.html, which wouldn't work like that on linux), but I don't think it's a great loss. In general though it's hard to compare - I have very limited experience with cmd.exe, and I know for a fact bash is significantly more powerful such that comparing individual merits is tricky.



  • @PleegWat said in UNIX/WIN wildcards:

    (like rename *.htm *.html, which wouldn't work like that on linux)

    The command on Linux is rename 's/\.htm$/.html/' *.htm. Regular expressions are fun!


  • Winner of the 2016 Presidential Election

    @Zmaster There are some UNIX commands, like rsync, which can handle wildcards themselves. You just have to remember to escape the asterisks or use single quotes.



  • Another important difference between Windows and Unix:

    Arguments on the command line in Windows are passed as a single string.

    Arguments on the command line on Unix are passed separately.

    This means that programs don't separate arguments, shells do.
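
    A small way to see the "passed separately" part from Python, as a sketch: hand subprocess a list and no shell gets involved, so the child sees each element exactly as given. The child_argv helper is just an illustrative name.

```python
import subprocess
import sys


def child_argv(*args):
    """Run a child Python process and return the arguments it received."""
    # The list is handed over without going through a shell, so nothing
    # expands wildcards or splits on spaces along the way.
    script = "import sys; print('\\n'.join(sys.argv[1:]))"
    result = subprocess.run(
        [sys.executable, "-c", script, *args],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()


received = child_argv("*.txt", "two words")
```

    Here received should hold the two strings unchanged, wildcard and space included.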



  • @Zmaster said in UNIX/WIN wildcards:

    There's a limit to the number of files that you can process which is determined by the maximum length of the expanded argument.
    Practically speaking, this usually doesn't seem to be a problem.

    Until there's a runaway script that generates a million files and you can't just delete the entire directory (because there are things inside you need).

    Wanna know a handy command that neatly solves this problem?

    Me too! The way I've been doing it was with crap like this: find /path/to/dir --max-depth 1 -type f -something-else -im-doing-this-from-memory -exec rm {} \;



  • @cartman82 if you want to delete an entire directory, just rm -r directoryname.



  • @ben_lubar said in UNIX/WIN wildcards:

    @cartman82 if you want to delete an entire directory, just rm -r directoryname.

    Duh.

    Clarified.



  • @cartman82 find -name '*.foo' -remove, where *.foo is the wildcard for the files you want to delete.



  • @ben_lubar said in UNIX/WIN wildcards:

    @PleegWat said in UNIX/WIN wildcards:

    (like rename *.htm *.html, which wouldn't work like that on linux)

    The command on Linux is rename 's/\.htm$/.html/' *.htm. Regular expressions are fun!

    Wouldn't the shell expand that last *.htm first? Oops.



  • @dcon said in UNIX/WIN wildcards:

    @ben_lubar said in UNIX/WIN wildcards:

    @PleegWat said in UNIX/WIN wildcards:

    (like rename *.htm *.html, which wouldn't work like that on linux)

    The command on Linux is rename 's/\.htm$/.html/' *.htm. Regular expressions are fun!

    Wouldn't the shell expand that last *.htm first? Oops.

    That's the intention. The command expects a perl/sed regex replacement expression as the first argument and the names of the files to rename as the rest.
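
    Roughly the same split of responsibilities in Python, as a sketch: the caller supplies the already-expanded file names, and the function applies the substitution to each one. plan_renames and apply_renames are hypothetical names, not part of any rename tool.

```python
import os
import re


def plan_renames(pattern, replacement, filenames):
    """Return (old, new) pairs for every name the substitution changes."""
    pairs = []
    for old in filenames:
        new = re.sub(pattern, replacement, old)
        if new != old:
            pairs.append((old, new))
    return pairs


def apply_renames(pairs, directory="."):
    """Perform the renames computed by plan_renames inside directory."""
    for old, new in pairs:
        os.rename(os.path.join(directory, old), os.path.join(directory, new))


# The list stands in for what the shell would produce from *.htm.
plan = plan_renames(r"\.htm$", ".html", ["a.htm", "b.htm", "notes.txt"])
```

    Keeping the plan separate from the rename makes it easy to print the pairs first and only then call apply_renames.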



  • @ben_lubar said in UNIX/WIN wildcards:

    @dcon said in UNIX/WIN wildcards:

    @ben_lubar said in UNIX/WIN wildcards:

    @PleegWat said in UNIX/WIN wildcards:

    (like rename *.htm *.html, which wouldn't work like that on linux)

    The command on Linux is rename 's/\.htm$/.html/' *.htm. Regular expressions are fun!

    Wouldn't the shell expand that last *.htm first? Oops.

    That's the intention. The command expects a perl/sed regex replacement expression as the first argument and the names of the files to rename as the rest.

    Oh - got it. So the usual Unix way - totally change the order of arguments.



  • @ben_lubar said in UNIX/WIN wildcards:

    @cartman82 find -name '*.foo' -remove, where *.foo is the wildcard for the files you want to delete.

    WhyTF does find remove files? "Do one thing, and maybe sneak in another if you feel like it, we're not judging"?



  • @Maciejasjmj find is the generalized "find things on the filesystem and do stuff with them" command.



  • @ben_lubar said in UNIX/WIN wildcards:

    find things on the filesystem and do stuff with them

    THAT'S TWO THINGS WHARRGARBL

    As for wildcards, what kinda sucks about it is that MS has been introducing the incredibly useful ** wildcard, but the way they wrote it, it's impossible to put into the older tools without a rewrite. Would probably require going back to CP/M days to fix that though.


  • Winner of the 2016 Presidential Election

    @Maciejasjmj said in UNIX/WIN wildcards:

    THAT'S TWO THINGS WHARRGARBL

    If you really want to keep the two things separate (I usually do in this instance), use something like

    find -name '*.foo' -print0 | xargs -0 rm

    Where find does the finding, xargs passes the found things as parameters to a command, and the command does things with them.
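
    Why bother with -print0 and -0? A quick Python sketch of the two ways of splitting the stream. It ignores the quote and backslash handling that plain xargs also does, and the sample names are invented.

```python
def split_nul(stream):
    """Split find -print0 style output: every name ends with a NUL byte."""
    return [name for name in stream.split(b"\0") if name]


def split_whitespace(stream):
    """Split on any whitespace, roughly what plain xargs does with its input."""
    return stream.split()


# Invented sample names, including the awkward kinds.
names = [b"plain.foo", b"two words.foo", b"new\nline.foo"]

nul_result = split_nul(b"\0".join(names) + b"\0")
ws_result = split_whitespace(b"\n".join(names) + b"\n")
```

    The NUL version gets the three names back intact; the whitespace version ends up with five fragments, none of which name the last two files.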



  • @ben_lubar said in UNIX/WIN wildcards:

    Arguments on the command line in Windows are passed as a single string.

    Not sure what you mean. If you write a C or C++ program in windows, your main looks like this:

    int main(int argc, char** argv) {
    // your program here
    return 0;
    }

    argc tells you the number of arguments, and argv points to an array of null terminated strings, where each is an argument to the program. The shell does some work on the arguments, too. It splits on whitespace, unless you use quotes, in which case it gives you the quoted string without the quotes (so -arg="a quoted string" looks to your application as -arg=a quoted string).



  • @Kian While we're on the subject, C/C++ on UNIX tend to use POSIX's getopt (and glibc's getopt_long) to parse command line arguments.



  • @Kian as far as I know, there's no equivalent to execvp on Windows, or at least you can't access it easily from Microsoft's programming languages like C#.



  • @ben_lubar said in UNIX/WIN wildcards:

    @Kian as far as I know, there's no equivalent to execvp on Windows, or at least you can't access it easily from Microsoft's programming languages like C#.

    ... Process class, StartInfo object, Arguments property.



  • @powerlord said in UNIX/WIN wildcards:

    @ben_lubar said in UNIX/WIN wildcards:

    @Kian as far as I know, there's no equivalent to execvp on Windows, or at least you can't access it easily from Microsoft's programming languages like C#.

    ... Process class, StartInfo object, Arguments property.

    That is quite clearly a string and not an array or a list or an IEnumerable.



  • @ben_lubar No, but if you read the documentation, you'd notice that it mentions splitting on spaces that aren't in quotation marks. Which makes sense since even C# applications receive arguments as a string array.



  • @powerlord said in UNIX/WIN wildcards:

    @ben_lubar No, but if you read the documentation, you'd notice that it mentions splitting on spaces that aren't in quotation marks. Which makes sense since even C# classes receive arguments as a string array.

    My point is that on Unix, you give the OS an array of strings and it gives those strings to the program you're running. On Windows, you do some string manipulation that every program gets wrong in a different way and then the program's startup code parses the arguments to get the string array back.



  • @powerlord That's a convenient way to unify the handling of the command line. Not sure if windows has something similar.

    @ben_lubar said in UNIX/WIN wildcards:

    as far as I know, there's no equivalent to execvp on Windows, or at least you can't access it easily from Microsoft's programming languages like C#.

    Not sure I understand what execvp does. Does it kill your current process and replace it with another?

    I'm not sure how easy it is to call the win32 api functions from C#, I can't imagine it's too hard or that there aren't C# alternatives. But there you have CreateProcess to launch new processes (passing whatever command line arguments you want), and from GUI Unicode applications (that might not have access to a normal main) you can call GetCommandLine and CommandLineToArgvW to get it split into separate strings the way argv gets it by default.

    Ah, here we go. Environment.GetCommandLineArgs does the same from C#.

    For reference, this took like one google search per method. I don't even use C# and was able to find the docs for how to do it.



  • @Kian I have yet to find a standard library function for encoding the string you give to the command line for Windows programs, whereas fork/exec* on Linux just takes the arguments as an array in the first place.

    Programs and runtimes shouldn't be doing the shell's job.



  • @ben_lubar said in UNIX/WIN wildcards:

    @cartman82 find -name '*.foo' -remove, where *.foo is the wildcard for the files you want to delete.

    I think you mean -delete, not -remove.

    @Dreikin said in UNIX/WIN wildcards:

    If you really want to keep the two things separate (I usually do in this instance), use something like

    find -name '*.foo' -print0 | xargs -0 rm

    Where find does the finding, xargs passes the found things as parameters to a command, and the command does things with them.

    :pendant: -delete, -print0, -0 are all GNU extensions, not parts of POSIX. If portability matters, you might want to avoid them.

    Unix filenames have a lot of interesting edge cases: https://www.dwheeler.com/essays/fixing-unix-linux-filenames.html



  • @DCoder said in UNIX/WIN wildcards:

    @ben_lubar said in UNIX/WIN wildcards:

    @cartman82 find -name '*.foo' -remove, where *.foo is the wildcard for the files you want to delete.

    I think you mean -delete, not -remove.

    @Dreikin said in UNIX/WIN wildcards:

    If you really want to keep the two things separate (I usually do in this instance), use something like

    find -name '*.foo' -print0 | xargs -0 rm

    Where find does the finding, xargs passes the found things as parameters to a command, and the command does things with them.

    :pendant: -delete, -print0, -0 are all GNU extensions, not parts of POSIX. If portability matters, you might want to avoid them.

    Unix filenames have a lot of interesting edge cases: https://www.dwheeler.com/essays/fixing-unix-linux-filenames.html

    Also, a protip for anyone using commands like find -name 'bar*' -exec foo {} \; as a replacement for foo bar*: that's horribly inefficient because you're spawning a foo for each file. Instead, find -name 'bar*' -exec foo {} +, which as an added bonus takes fewer characters.
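
    What the + form does can be sketched in a few lines of Python: pack as many names as fit under a size budget, then start a new batch. This is only an approximation; the real limit also counts the environment, and the arg_max fallback value is an assumption, not something from a spec.

```python
import os


def arg_max(default=131072):
    """Ask the OS for its argument-size limit; fall back to a guess."""
    try:
        return os.sysconf("SC_ARG_MAX")
    except (AttributeError, ValueError, OSError):
        return default  # assumption: a conservative stand-in value


def batches(paths, budget):
    """Yield lists of paths whose combined size stays within budget bytes."""
    batch, used = [], 0
    for path in paths:
        cost = len(os.fsencode(path)) + 1  # +1 for the terminating NUL
        if batch and used + cost > budget:
            yield batch
            batch, used = [], 0
        # A single oversized path still gets a batch of its own.
        batch.append(path)
        used += cost
    if batch:
        yield batch


example = list(batches(["aa", "bb", "cc"], 6))
```

    Each yielded batch would become one invocation of the command, instead of one invocation per file.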


  • Discourse touched me in a no-no place

    @Kian said in UNIX/WIN wildcards:

    Not sure what you mean. If you write a C or C++ program in windows, your main looks like this:

    Sure, but main isn't the actual process entry point. That's actually in the particular runtime that you're using (the MSVC one is common, but not universally used) and it is the runtime that does the wildcard expansion. This is a problem precisely because different runtimes do the handling of quoting and wildcards a little bit differently; reliably passing an arbitrary (not very long) string via the Windows process invocation system call is irritatingly difficult as a result.

    I forget what exactly were the awkward cases (except for some tricky bits with calling builtins of cmd.exe like start, as those don't use the most common runtime routines for the parsing). I've really not dealt with this in a very long time.


  • Discourse touched me in a no-no place

    @ben_lubar said in UNIX/WIN wildcards:

    a standard library function for encoding the string you give to the command line

    There isn't one. The string you need to generate depends on the peculiarities of the runtime that the target program is using.
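
    For what it's worth, CPython's subprocess module does carry a helper, list2cmdline, that builds a string following the Microsoft C runtime's parsing rules. As said above, that only helps if the target program actually parses its command line that way. A sketch, with shlex.join shown for the POSIX shell side:

```python
import shlex
import subprocess

args = ["-arg=a quoted string", 'say "hi"', "plain"]

# Quoting aimed at programs that parse with the Microsoft C runtime rules.
windows_line = subprocess.list2cmdline(args)

# Quoting for a POSIX shell, for comparison (needs Python 3.8 or later).
posix_line = shlex.join(args)

# The POSIX form round-trips through the matching parser.
round_trip = shlex.split(posix_line)
```

    There is no equivalent guarantee on the Windows side, since nothing forces the receiving program to use the same parser.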



  • @ben_lubar said in UNIX/WIN wildcards:

    @cartman82 if you want to delete an entire directory, just rm -r directoryname.

    I'm sure that -r is there for a reason, but to me it's equally annoying as cd /D in Windows.

    (For the *nix folks: In Windows you can't cd D:\stuff if the current directory is on the C: drive. You need cd /D D:\stuff)



  • @cartman82 said in UNIX/WIN wildcards:

    Me too! The way I've been doing it was with crap like this: find /path/to/dir --max-depth 1 -type f -something-else -im-doing-this-from-memory -exec rm {} \;

    You don't want to do -exec rm {} \; as that invokes a new rm for each file. Specifically for deleting, you want to use the dedicated -delete action. For other commands, you want to use -exec somecommand {} + which passes as many files as fit on the command line, or -print0 | xargs -0 if you need more complicated invocations than find can handle (rare).
    This can be handy if you want to invoke something like grep on a carefully selected list of files.

    As mentioned by others, if you need to do this in a posix-compliant fashion you may need find -print | xargs instead, which isn't safe for filenames containing whitespace or control characters.

    @Kian said in UNIX/WIN wildcards:

    Not sure I understand what execvp does. Does it kill your current process and replace it with another?

    Starting a new process on unix is a two-step process. First you use fork to create a copy of your current process image. Both the parent and the child return from this system call with different return values (0 for the child, the child's pid for the parent, -1 for the parent and no child on error). You can distinguish based on that, and have the child do setup tasks like closing open files and setting up the child end of pipes before calling a function from the exec family to replace the process image, while other key attributes like the pid stay the same.
    In windows, there is no real equivalent to fork. Instead, a function like CreateProcess creates a new process with specified executable image in a single call. This is faster (because you do not need to copy a large process image in fork) but limits what setup work you can do for the new process.

    Posix does also specify the posix_spawn function with a CreateProcess-like semantic, but this is typically implemented as a library function.
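
    The fork-then-exec dance looks roughly like this from Python (POSIX only). One difference from the C interface: os.fork raises OSError on failure instead of returning -1. The spawn helper and the exit code 127 for a failed exec are choices made for this sketch.

```python
import os
import sys


def spawn(argv):
    """Run argv via fork + exec and return the child's exit code."""
    pid = os.fork()
    if pid == 0:
        # Child: this is where pipes or file descriptors would be set up.
        try:
            os.execvp(argv[0], argv)  # replaces the process image, pid stays
        except OSError:
            os._exit(127)  # exec failed, so leave without running cleanup
    # Parent: fork returned the child's pid, so wait for it to finish.
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1


exit_code = spawn([sys.executable, "-c", "raise SystemExit(7)"])
```

    Note that the arguments go in as a list all the way down; no command-line string is ever built.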



  • @PleegWat said in UNIX/WIN wildcards:

    In windows, there is no real equivalent to fork. Instead, a function like CreateProcess creates a new process with specified executable image in a single call. This is faster (because you do not need to copy a large process image in fork) but limits what setup work you can do for the new process.

    fork() uses copy-on-write, so setting up a few MMU pages is usually the bulk of the work; that's much faster than on Windows. There's an Apache MPM module that actually forks for every incoming request.



  • @LaoC Well, just last month I tracked down a performance bottleneck in a 40 GB parent process to it taking about a full second to fork a child. For 8ish child jobs per minute. And that time goes into all threads, since the whole process is paused during fork.



  • I like how all the people correcting and pointing out nuances in my find code are just proving my point.


  • Discourse touched me in a no-no place

    @PleegWat said in UNIX/WIN wildcards:

    Well, just last month I tracked down a performance bottleneck in a 40 GB parent process to it taking about a full second to fork a child.

    The best fix, though an intrusive one, is to launch a child process as early as possible (preferably while the amount of memory allocated is in the few hundred MB range or less) to handle the forking on the parent's behalf, assuming that the child is going to do an execve() system call. That avoids a lot of trouble. There's a reasonable discussion linked below, though be aware that there's a fair amount of cargo-cult stupidity in how people tend to try to handle the problem, based on them not understanding what's really wrong in the first place.



  • @cartman82 The whole affair tends to be a hardening circlejerk. People who use the shell manually tend to avoid problematic filenames, and I think the maximum argument list memory is on the order of megabytes in modern systems - I've done rm * on some pretty huge directories without errors.

    And how common is non-GNU userland anyway? Tiny routers which run busybox? BSD? HP-UX? Does that stuff even still exist?

    @dkf said in UNIX/WIN wildcards:

    The best fix, though an intrusive one, is to launch a child process as early as possible (preferably while the amount of memory allocated is in the few hundred MB range or less) to handle the forking on the parent's behalf

    Yeah, that's what I ended up doing. Other alternatives I found involved vfork or clone(CLONE_VM) and didn't give me a warm fuzzy feeling.


  • Discourse touched me in a no-no place

    @PleegWat said in UNIX/WIN wildcards:

    Other alternatives I found involved vfork or clone(CLONE_VM) and didn't give me a warm fuzzy feeling.

    vfork pauses the parent process. Don't know clone nearly well enough to comment on it.



  • @dkf vfork pauses until exec, but doesn't clone memory, so may be faster in certain circumstances.

    The clone approach basically starts a hybrid between a process and a thread, which shares memory space but nothing else with the calling process. It gains its own memory image on exec. It just feels off, and more concretely I have no clue how I'd go about cleaning up that thread's stack.


  • area_pol

    @cartman82 said in UNIX/WIN wildcards:

    Until there's a runaway script that generates a million files and you can't just delete the entire directory (because there are things inside you need).
    Wanna know a handy command that neatly solves this problem?
    Me too!

    import os, glob
    for fp in glob.glob('*.html'): os.remove(fp)
    


  • @Adynathos said in UNIX/WIN wildcards:

    import os, glob
    for fp in glob.glob('*.html'): os.remove(fp)

    Nice.

    I think I'll do that next time.


  • area_pol

    @cartman82 said in UNIX/WIN wildcards:

    Nice.
    I think I'll do that next time.

    :)
    Before you delete things, print them, I did not test that.

    If you want to run python directly in shell: http://xon.sh/
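
    In the "print before you delete" spirit, here's a slightly more careful sketch with a dry-run switch. remove_matching is a made-up name, and unlike glob it will also match dotfiles, since fnmatch does not treat a leading dot specially.

```python
import fnmatch
import os
import tempfile


def remove_matching(directory, pattern, dry_run=True):
    """Delete regular files in directory matching pattern; return their names."""
    hits = []
    with os.scandir(directory) as entries:
        # Collect first so the directory isn't modified while iterating.
        for entry in entries:
            if (entry.is_file(follow_symlinks=False)
                    and fnmatch.fnmatchcase(entry.name, pattern)):
                hits.append(entry.name)
    for name in hits:
        if dry_run:
            print("would remove", name)
        else:
            os.remove(os.path.join(directory, name))
    return sorted(hits)


def demo():
    """Try a dry run and a real run on a throwaway directory."""
    with tempfile.TemporaryDirectory() as d:
        for name in ("a.html", "b.html", "keep.txt"):
            open(os.path.join(d, name), "w").close()
        planned = remove_matching(d, "*.html")                 # dry run
        removed = remove_matching(d, "*.html", dry_run=False)  # for real
        return planned, removed, sorted(os.listdir(d))
```

    Since the matching happens inside the process, the argument-length limit never comes into play, however many files there are.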



  • @PleegWat said in UNIX/WIN wildcards:

    @LaoC Well, just last month I tracked down a performance bottleneck in a 40 GB parent process to it taking about a full second to fork a child. For 8ish child jobs per minute. And that time goes into all threads, since the whole process is paused during fork.

    True, it can get slower than CreateProcess if the process is already huge. However, I think the fastest CreateProcess on my laptop was something like 5 ms for a small benchmark program; if I do the same thing with fork I get about 200 µs, in a Perl script.



  • @Adynathos said in UNIX/WIN wildcards:

    import os, glob
    for fp in glob.glob('*.html'): os.remove(fp)
    

    Or perl -e'unlink <*.html>'



  • @PleegWat said in UNIX/WIN wildcards:

    @LaoC Well, just last month I tracked down a performance bottleneck in a 40 GB parent process to it taking about a full second to fork a child. For 8ish child jobs per minute. And that time goes into all threads, since the whole process is paused during fork.

    Wait... a 40GB process? Which kind of process uses so much memory?

    Then... is there any other reason a process would fork() itself, other than starting other programs? Cause for threads, you don't need (or want) a separate address space.



  • @ben_lubar said in UNIX/WIN wildcards:

    I have yet to find a standard library function for encoding the string you give to the command line for Windows programs,

    That slides firmly into "why would you do this?"

    The Windows model is that the application's functional logic is in a library separate from any interfaces. (Both user interfaces, like CLIs or GUIs, and machine interfaces, like Services.)

    If you're ever calling a user interface from another user interface in Windows, you done screwed something up. Just link to the library directly.



  • @cartman82 said in UNIX/WIN wildcards:

    I like how all the people correcting and pointing out nuances in my find code are just proving my point.

    Reading this essay https://www.dwheeler.com/essays/fixing-unix-linux-filenames.html leads me to the conclusion that it's literally impossible to do "correctly" for all POSIX OSes. You can get 99% there on all of them, but that last 1% is a bitch.



  • @marczellm said in UNIX/WIN wildcards:

    (For the *nix folks: In Windows you can't cd D:\stuff if the current directory is on the C: drive. You need cd /D D:\stuff)

    Sure you can. And the current directory on D: changes. You just happen to stay on C:. (Now I can do somecmd filesonC filesonD)



  • @Maciejasjmj It's not "two things". find just traverses the filesystem and applies a function/command to the nodes. It's fmap for filesystems.



  • @LaoC said in UNIX/WIN wildcards:

    @PleegWat said in UNIX/WIN wildcards:

    @LaoC Well, just last month I tracked down a performance bottleneck in a 40 GB parent process to it taking about a full second to fork a child. For 8ish child jobs per minute. And that time goes into all threads, since the whole process is paused during fork.

    True, it can get slower than CreateProcess if the process is already huge. However, I think the fastest CreateProcess on my laptop was something like 5 ms for a small benchmark program; if I do the same thing with fork I get about 200 µs, in a Perl script.

    I wouldn't know, I don't really do windows, so I don't know how its details work.

    @Zmaster said in UNIX/WIN wildcards:

    @PleegWat said in UNIX/WIN wildcards:

    @LaoC Well, just last month I tracked down a performance bottleneck in a 40 GB parent process to it taking about a full second to fork a child. For 8ish child jobs per minute. And that time goes into all threads, since the whole process is paused during fork.

    Wait... a 40GB process? Which kind of process uses so much memory?

    Then... is there any other reason a process would fork() itself, other than starting other programs? Cause for threads, you don't need (or want) a separate address space.

    Major data processing daemon. Forks are needed to start followup processing, mostly out-of-process database insert jobs. The fact we do that multiple times per batch is someone else's design decision which I see no overly urgent reason to fix (now that the forks are out of the main process anyway).

    EDIT: Actually, probably the best way would be to watch for incoming files with inotify or similar. We'll put that on the 'when I get to do pet architecture changes' list.


  • area_can

    @ben_lubar said in UNIX/WIN wildcards:

    find things on the filesystem and do stuff with them

    I mean, that is what 90% of *Nix programs do...

