GNU coreutils `sort -u`
-
`sort -u` is impressively bad performance-wise on large files with a large number of duplicate lines. (`-u` is "unique".) In particular, it appears not to take advantage of the `-u` flag until output, so if you have a 25-million-line file, it will read in all 25 million lines, sort all 25 million lines, and then output just the unique ones. At least that's what I infer from the performance.

I have a file that's about that size: 25.2 million lines, but only 44,565 that are unique. Let's see how `sort` does.

```
$ time sort -u demo > /dev/null
sort -u demo > /dev/null  173.57s user 10.67s system 545% cpu 33.804 total
$ time sort demo > /dev/null
sort demo > /dev/null  171.00s user 8.68s system 558% cpu 32.195 total
$ time (sort demo | uniq > /dev/null)
( sort demo | uniq > /dev/null; )  1.07s user 0.07s system 3% cpu 33.121 total
```
So `sort` and `sort -u` take about the same amount of time. Actually, `-u` makes it take more, though maybe that's just jitter. I tossed in a comparison with `sort | uniq` for the curious; that usually winds up faster for me, actually. (Which means that the authors of `sort` went out of their way to make it less Unixy so that they could make it perform worse. Yay?)

(Incidentally, and also interestingly, `cat demo | sort` is far slower than `sort -u`. By watching `htop`, it's easy to see why: `sort` parallelizes the sort when you give it the file name, but not otherwise.)

I wonder if we can do better by not being stupid.
```
$ time (cat demo | python -c $'import sys\nfor line in sorted(set(sys.stdin)): sys.stdout.write(line)' > /dev/null)
( cat demo | python -c  > /dev/null; )  1.14s user 0.05s system 99% cpu 1.190 total
```
Yup. 27 times faster, in Python, with no attempt at all at parallelization. 54 times faster than the `cat demo | sort` version that doesn't parallelize the sort. 1/162 of the total CPU time of the parallelized version.
-
python might have some other reasons for yielding faster results
relevant:
-
The question is how else to grab the unique values from a set. If you don't do some sort of hash-based sieving (which makes sense, but may be tough to implement), sorting is pretty much the best option - removing duplicates from a sorted set is trivial. Much better than nested loops, anyway.

And as for parallelization - duh: if the data is streamed to you, there isn't really much to parallelize if you use a streaming algorithm.
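For concreteness, here's a minimal Python sketch of what "hash-based sieving" means in this context (the function name is mine, not anything from coreutils):

```python
def sieve(lines):
    # Hash sieve: O(N) expected work to find the K unique lines,
    # then O(K log K) to sort only those survivors.
    seen = set()
    for line in lines:
        seen.add(line)
    return sorted(seen)

# e.g. sieve(open("demo")) touches each of the 25.2M lines once,
# but only ever sorts the ~44K unique ones.
```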
-
> sorting is pretty much the best option
The best technique is probably a heap sort using a balanced tree as store, where you discard duplicates (which is a trivial change to the tree insertion function). The data structure builds the sorted data anyway, so you get exactly what you want straight off, and there's some high-quality library implementations that could be used/adapted.
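A toy sketch of that tree-insertion idea in Python (an unbalanced BST for brevity; a real implementation would use a balanced tree such as red-black or AVL, and all the names here are mine):

```python
class Node:
    __slots__ = ("key", "left", "right")
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None

def insert(root, key):
    # The "trivial change": on an equal key, return without inserting,
    # so duplicates are discarded at insertion time.
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    elif key > root.key:
        root.right = insert(root.right, key)
    return root

def in_order(root):
    # In-order traversal yields the unique keys already sorted.
    stack, out = [], []
    while stack or root:
        while root:
            stack.append(root)
            root = root.left
        root = stack.pop()
        out.append(root.key)
        root = root.right
    return out
```

Building the tree sorts and deduplicates in one pass, so "extract the output" is just the traversal.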
-
> python might have some other reasons for yielding faster results
Do I feel a timsort joke coming up?
-
```
$ time (LC_ALL=C sort -u demo > coreutils.txt)
real  0m38.675s
user  0m26.633s
sys   0m1.169s
$ time (LC_ALL=C sort -u < demo > coreutils2.txt)
real  0m41.449s
user  0m26.171s
sys   0m1.098s
$ time (LC_ALL=C sort demo | uniq > coreutils3.txt)
real  0m41.141s
user  0m28.906s
sys   0m1.286s
$ time (LC_ALL=C sort < demo | uniq > coreutils4.txt)
real  0m43.910s
user  0m28.912s
sys   0m1.861s
$ time (cat demo | python -c $'import sys\nfor line in sorted(set(sys.stdin)): sys.stdout.write(line)' > python.txt)
real  0m8.374s
user  0m5.486s
sys   0m0.364s
$ time (cat demo | python -c $'import sys\nfor line in set(sorted(sys.stdin)): sys.stdout.write(line)' > python2.txt)
real  0m54.794s
user  0m40.112s
sys   0m1.751s
$ time (cat demo | python -c $'import sys\nfor line in sorted(set(sorted(sys.stdin))): sys.stdout.write(line)' > python3.txt)
real  0m59.434s
user  0m40.475s
sys   0m1.371s
$ sha1sum *.txt
0e78a6a4ed68c02390320f23ae12ed7624cc6ec6  coreutils2.txt
0e78a6a4ed68c02390320f23ae12ed7624cc6ec6  coreutils3.txt
0e78a6a4ed68c02390320f23ae12ed7624cc6ec6  coreutils4.txt
0e78a6a4ed68c02390320f23ae12ed7624cc6ec6  coreutils.txt
2f48d1cc17fecbd475fb36ba9a81a730f2032980  python2.txt
0e78a6a4ed68c02390320f23ae12ed7624cc6ec6  python3.txt
0e78a6a4ed68c02390320f23ae12ed7624cc6ec6  python.txt
$ head python2.txt
12747
15563
10354
21407
8244
34386
598
43901
1939
25690
$ head coreutils.txt
0
1
10
100
1000
10000
10001
10002
10003
10004
```
data generated using this program: http://play.golang.org/p/chisTa2HNg
The reason python is faster is that you're comparing "sort an array with 25,200,000 elements" with "sort an array with 44,565 elements". `sort -u` does exactly what `sort | uniq` does, assuming you don't have any special sorting comparison function. That's how it's defined. If you make both programs do the same thing, python is slower, and if you want python to actually return the correct results, it's even slower than that.
-
> python might have some other reasons for yielding faster results

That's actually really interesting, and adding `LC_ALL=C` helps a great deal. `sort -u` takes 3.8 seconds and `sort` takes 5.8 seconds, which argues against my earlier hypothesis. Python still wins, though, by more than 3x.
(The numbers above are with `cat demo | sort` rather than `sort demo` because they are actually faster. `sort demo` takes 5.2 seconds, and `sort -u demo` takes 6.2 seconds. So fun?)

> The question is, how else to grab unique values from a set. ... (which makes sense, but may be tough to implement) ...

Wut? I showed you how to do it in two lines, including an `import`! And it's demonstrably faster by several times, even in Python instead of C. And that's with next to zero effort on my part. Would a balanced tree be better? What about an array-based heap? A sorted array? How much faster could you get it? Who knows!

> sort -u does exactly what sort | uniq does, assuming you don't have any special sorting comparison function. That's how it's defined.
That's... kind of my point. That's the WTF. One of the main reasons that the Unix approach of composing a bunch of small tools isn't always the right thing to do is performance. In this case, if I issue `sort | uniq`, then `sort` can't know what the reader is going to do and has to sort everything. But `sort -u` has that knowledge. It can optimize! Why else put the `-u` switch into `sort` in the first place, if I might as well just pipe it into `uniq`? Because it saves three characters?

And this ignores the fact that I would be mildly surprised if `uniq` were used (intentionally) without a sorted input (or other knowledge that any duplicates will be adjacent), which means I'm not talking about an edge case but about the common use case of uniquifying a file.
-
> And it's demonstrably faster by several times
Good. Now test it on a set with all unique lines and I bet my ass you'll see the performance go fully down the drain.

I'm too lazy to dig through the sources for both Python and coreutils, but if the uniqueness test is done with nested loops, then yeah, no shit it goes close to O(n) when the data don't have many unique values. But in the worst case it'll be O(n^2), while sort-then-test would take as much as the sort, which is O(n log n) if they aren't doing anything silly.

Also, "I did it in two lines" doesn't count if the first line imports every function you need.
-
> I'm too lazy to dig through sources for both Python and coreutils, but if the uniqueness test is done with nested loops, then yeah, no shit it goes close to O(n) when the data don't have many unique values.
ITYM O(n^2) there. And I'd be willing to bet that Python backs the set with a hash table of some kind; those have great performance in normal cases. Oh look, it does exactly that.

The only questions remaining are whether coreutils could use a better implementation than Python currently does and, if so, what that algorithm would be. I think we've demonstrated that it can certainly do better than it does at the moment (its current performance is miserable), though not whether something like a tree-based set would be superior; I'd have to write code to work that out, so fuck it. ;)
-
> ITYM O(n^2) there.
When all values are the same, you only need to check one value for each one and get an immediate match.
> And I'd be willing to bet that Python backs the set with a hash table of some kind; those have great performance in normal cases. Oh look, it does exactly that.
Yeah, so as I said in my first post, hash sieving is better.
-
> When all values are the same, you only need to check one value for each one and get an immediate match.
OK, O(n) is the hard minimum; you've got to look at each value in the input at least once. ;)
-
If there are many duplicate lines and the input is already mostly sorted, you can take advantage of that (and of the UNIX paradigm of splitting tasks across separate CPUs) like so:

```
(export LC_ALL=C; uniq | sort -u) < demo
```
-
> Why else put the -u switch into sort in the first place if I might as well just pipe it into uniq?
Because `sort -u` uses the same equality rules for sorting as it does for uniqueness testing, and `sort` has more flexible field-selection rules than `uniq` does.

When you're working with whole lines, `sort -u` is indeed equivalent to `sort | uniq`. But as far as I know there is no way to express the equivalent of e.g. `sort -k1 -u` as a `sort | uniq` pipeline.

Also, `uniq` has no numeric-parsing rules: lines, or fields within lines, which are numerically but not textually equal will be treated as equal by `sort -n` but not by `uniq`.
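A rough Python emulation of what that keyed uniqueness does (the function names and the keep-first-per-key detail are my own sketch, not sort's actual implementation):

```python
def sort_unique_by_key(lines, key=lambda line: line.split()[0]):
    """Emulate keyed `sort -u`: sort by the key field, then keep only
    the first line per key; `uniq` compares whole lines and can't do this."""
    out, seen = [], set()
    for line in sorted(lines, key=key):  # stable sort by the key field
        k = key(line)
        if k not in seen:  # -u: drop later lines whose key compares equal
            seen.add(k)
            out.append(line)
    return out
```

So `["b 2", "a 1", "a 9"]` keeps `"a 1"` and `"b 2"`, while a whole-line `uniq` would keep all three because no two lines are textually equal.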
-
Also, `uniq` requires the data to be sorted to produce unique results across the whole set.
-
It doesn't actually require it, but it will certainly yield usually-useless results if the input isn't sorted.
-
How do you define "requires", since it's obviously not "condition that must be met to produce the expected result"? Would it have to refuse to run if the data isn't sorted? Format your hard drive? Call a drone strike on you?
-
Exiting with a nonzero return code would be enough to convince me. But uniq doesn't do that, because despite the name it does have valid uses beyond finding unique lines in sorted input, like cleaning up repetitious log file spam for example.
-
That just means it doesn't bother to check that the content is sorted first. That you can use a tool for something other than what it was intended for (using uniq to remove repeated lines instead of producing a list of unique lines) is just a handy side-benefit. But someone could change the implementation and ruin your use case while still conforming to the specification.

Essentially, you're using the tool outside of its specification. Requirements are part of the specification, not the implementation.
-
The side-benefit is actually the reverse. Per http://linux.die.net/man/1/uniq:

> Filter adjacent matching lines from INPUT (or standard input), writing to OUTPUT (or standard output).
>
> Note: 'uniq' does not detect repeated lines unless they are adjacent. You may want to sort the input first, or use 'sort -u' without 'uniq'.
So finding all unique lines is a special case when the input is sorted. The normal case is just removing duplicate adjacent lines.
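The adjacent-only behavior is easy to see with a small Python sketch built on itertools.groupby (names mine):

```python
from itertools import groupby

def uniq(lines):
    # Like uniq(1): collapse runs of *adjacent* equal lines only.
    return [key for key, _run in groupby(lines)]

print(uniq(["a", "b", "a"]))           # ['a', 'b', 'a'] - unsorted: 'a' survives twice
print(uniq(sorted(["a", "b", "a"])))   # ['a', 'b']      - sorted first: truly unique
```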
-
A version I use occasionally is `sort | uniq -c | sort -n`. `-c` is a flag to uniq that counts the number of duplicates, and is not available from sort.
-
> The best technique is probably a heap sort using a balanced tree as store, where you discard duplicates (which is a trivial change to the tree insertion function).
That is what I did when I had to do this in Haskell: insert the list elements into a Set (which is internally a balanced tree, and which automatically excludes duplicates), then extract the elements, already sorted.
-
That was the first thing I thought of when I read "25.2 million lines, but only 44,565 that are unique". But that could mean a block of 44,565 different lines repeated 500+ times. Or, almost worse, a concatenation of 500+ different permutations of those 44,565 lines.
-
I'll have more to say on this topic, which actually turned into a pretty interesting discussion (and I'll probably retract some of what I said), when I get more of a spare moment than I had today, but I can hit on this point now:

@pixelbeat said:

> (export LC_ALL=C; uniq | sort -u) < demo

That is actually almost guaranteed to do nothing in my case; there's almost no situation (and... maybe literally no situation) in which the process that produced the files I was operating on can generate two consecutive equal lines. And indeed, if I arbitrarily pick a "similar" file to the one I was picking on before:

```
$ wc -l file
38065170 file
$ uniq badpass.eas | wc -l
38065170
```

Obviously there are many situations in which that's not the case, but I did think about it. (Actually, the first thing I tried, instead of reading everything into a Python `set`, was a little script that would read in 10,000 lines (an arbitrarily-chosen number), sort them, then output them; then I could do a couple of rounds of alternating that script with `uniq` before piping into the final `sort`. That was before I decided to just see how well reading into a set would work, expecting it to perform pretty badly.)
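For the curious, that chunked sort-then-uniq alternation is easy to sketch in Python (the chunk size is from the comment above; the function names are mine):

```python
from itertools import islice, groupby

def sort_chunks(lines, chunk_size=10000):
    # Sort each chunk independently, so duplicates *within* a chunk
    # become adjacent and a following uniq() pass can drop them.
    it = iter(lines)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            break
        yield from sorted(chunk)

def uniq(lines):
    # Collapse adjacent duplicates, like uniq(1).
    for key, _run in groupby(lines):
        yield key

# One round of the alternation described above: sort_chunks | uniq.
# The idea is to repeat a couple of rounds, then do one final full sort.
```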
-
If the similarities are at a higher level than individual lines, then I suggest you tell sort to do more focused work initially by using the --buffer-size option. By default that is a percentage of RAM and thus very large, so you'd want to reduce it to the same magnitude as the "structures" in your file. You may want to adjust --batch-size above its default value of 16, too.
-
> If the similarities are on a higher level than each line, then I suggest you tell sort to do more focused work initially by using the --buffer-size option. By default that is a % of RAM and thus very large, so you'd want to reduce to the same magnitude as the "structures" in your file. You may want to adjust --batch-size above the default value of 16 too
And we'd suggest using a better algorithm for `sort -u`. Unique-ing (and counting) sorts should use something much more akin to a bin sort, so that resource consumption scales with the number of unique keys rather than the number of input elements, and the time can then go from O(N log N) to O(N + K log K). (Which term dominates will depend on the input dataset, of course.)
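A minimal Python sketch of that O(N + K log K) shape, with a hash-based counter standing in for the bins (names mine):

```python
from collections import Counter

def sort_unique_counted(lines):
    # O(N) expected to bin/count the input (one hash lookup per line),
    # then O(K log K) to sort the K unique keys. Memory scales with K,
    # not N; the output is roughly what `sort | uniq -c` produces.
    counts = Counter(lines)
    return [(line, counts[line]) for line in sorted(counts)]
```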
-
> (Which means that the authors of sort went out of their way to make it less Unixy so that they could make it perform worse. Yay?)
Must... avoid... dig at... `systemd`...