GNU coreutils `sort -u`
-
`sort -u` is impressively bad performance-wise on large files with a large number of duplicate lines. (`-u` is "unique".) In particular, it appears not to take advantage of the `-u` flag until output, so if you have a 25-million-line file, it will read in all 25 million lines, sort all 25 million lines, and then output just the unique ones. At least that's what I infer from the performance.

I have a file that's about that size: 25.2 million lines, but only 44,565 that are unique. Let's see how `sort` does.

```
$ time sort -u demo > /dev/null
sort -u demo > /dev/null  173.57s user 10.67s system 545% cpu 33.804 total
$ time sort demo > /dev/null
sort demo > /dev/null  171.00s user 8.68s system 558% cpu 32.195 total
$ time (sort demo | uniq > /dev/null)
( sort demo | uniq > /dev/null; )  1.07s user 0.07s system 3% cpu 33.121 total
```
So `sort` and `sort -u` take about the same amount of time. Actually, `-u` makes it take more, though maybe that's just jitter. I tossed in a comparison with `sort | uniq` for the curious; that usually winds up faster for me, actually. (Which means that the authors of `sort` went out of their way to make it less Unixy so that they could make it perform worse. Yay?)

(Incidentally, and also interestingly, `cat demo | sort` is far slower than `sort -u`. By watching `htop`, it's easy to see why: `sort` parallelizes the sort when you give it the file name, but not otherwise.)

I wonder if we can do better by not being stupid.
```
$ time (cat demo | python -c $'import sys\nfor line in sorted(set(sys.stdin)): sys.stdout.write(line)' > /dev/null)
( cat demo | python -c  > /dev/null; )  1.14s user 0.05s system 99% cpu 1.190 total
```
Yup. 27 times faster, in Python, with no attempt at all at parallelization. 54 times faster than the `cat demo | sort` version that doesn't parallelize the sort. 1/162 of the total CPU time of the parallelized version.
-
python might have some other reasons for yielding faster results
relevant:
-
The question is how else to grab the unique values from a set. If you don't do some sort of hash-based sieving (which makes sense, but may be tough to implement), sorting is pretty much the best option - removing duplicates from a sorted set is trivial. Much better than nested loops, anyway.

And as for parallelization - duh: if the data is streamed to you, there isn't really much to parallelize if you use a streaming algorithm.
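For concreteness, here's a minimal Python sketch of what "hash-based sieving" means in this context (the function name is mine, not anything from coreutils):

```python
def sieve(lines):
    # Hash sieve: O(N) expected work to find the K unique lines,
    # then O(K log K) to sort only those survivors.
    seen = set()
    for line in lines:
        seen.add(line)
    return sorted(seen)

# e.g. sieve(open("demo")) touches each of the 25.2M lines once,
# but only ever sorts the ~44K unique ones.
```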
-
> sorting is pretty much the best option
The best technique is probably a heap sort using a balanced tree as store, where you discard duplicates (which is a trivial change to the tree insertion function). The data structure builds the sorted data anyway, so you get exactly what you want straight off, and there's some high-quality library implementations that could be used/adapted.
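A toy sketch of that tree-insertion idea in Python (an unbalanced BST for brevity; a real implementation would use a balanced tree such as red-black or AVL, and all the names here are mine):

```python
class Node:
    __slots__ = ("key", "left", "right")
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None

def insert(root, key):
    # The "trivial change": on an equal key, return without inserting,
    # so duplicates are discarded at insertion time.
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    elif key > root.key:
        root.right = insert(root.right, key)
    return root

def in_order(root):
    # In-order traversal yields the unique keys already sorted.
    stack, out = [], []
    while stack or root:
        while root:
            stack.append(root)
            root = root.left
        root = stack.pop()
        out.append(root.key)
        root = root.right
    return out
```

Building the tree sorts and deduplicates in one pass, so "extract the output" is just the traversal.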
-
> python might have some other reasons for yielding faster results
Do I feel a timsort joke coming up?
-
```
$ time (LC_ALL=C sort -u demo > coreutils.txt)
real  0m38.675s
user  0m26.633s
sys   0m1.169s
$ time (LC_ALL=C sort -u < demo > coreutils2.txt)
real  0m41.449s
user  0m26.171s
sys   0m1.098s
$ time (LC_ALL=C sort demo | uniq > coreutils3.txt)
real  0m41.141s
user  0m28.906s
sys   0m1.286s
$ time (LC_ALL=C sort < demo | uniq > coreutils4.txt)
real  0m43.910s
user  0m28.912s
sys   0m1.861s
$ time (cat demo | python -c $'import sys\nfor line in sorted(set(sys.stdin)): sys.stdout.write(line)' > python.txt)
real  0m8.374s
user  0m5.486s
sys   0m0.364s
$ time (cat demo | python -c $'import sys\nfor line in set(sorted(sys.stdin)): sys.stdout.write(line)' > python2.txt)
real  0m54.794s
user  0m40.112s
sys   0m1.751s
$ time (cat demo | python -c $'import sys\nfor line in sorted(set(sorted(sys.stdin))): sys.stdout.write(line)' > python3.txt)
real  0m59.434s
user  0m40.475s
sys   0m1.371s
$ sha1sum *.txt
0e78a6a4ed68c02390320f23ae12ed7624cc6ec6  coreutils2.txt
0e78a6a4ed68c02390320f23ae12ed7624cc6ec6  coreutils3.txt
0e78a6a4ed68c02390320f23ae12ed7624cc6ec6  coreutils4.txt
0e78a6a4ed68c02390320f23ae12ed7624cc6ec6  coreutils.txt
2f48d1cc17fecbd475fb36ba9a81a730f2032980  python2.txt
0e78a6a4ed68c02390320f23ae12ed7624cc6ec6  python3.txt
0e78a6a4ed68c02390320f23ae12ed7624cc6ec6  python.txt
$ head python2.txt
12747
15563
10354
21407
8244
34386
598
43901
1939
25690
$ head coreutils.txt
0
1
10
100
1000
10000
10001
10002
10003
10004
```
data generated using this program: http://play.golang.org/p/chisTa2HNg
The reason python is faster is that you're comparing "sort an array with 25,200,000 elements" with "sort an array with 44,565 elements". `sort -u` does exactly what `sort | uniq` does, assuming you don't have any special sorting comparison function. That's how it's defined. If you make both programs do the same thing, python is slower, and if you want python to actually return the correct results, it's even slower than that.
-
> python might have some other reasons for yielding faster results

That's actually really interesting, and adding `LC_ALL=C` helps a great deal. `sort -u` takes 3.8 seconds and `sort` takes 5.8 seconds, which argues against my earlier hypothesis. Python still wins, though, by more than 3x.
(The numbers above are with `cat demo | sort` rather than `sort demo` because they are actually faster. `sort demo` takes 5.2 seconds, and `sort -u demo` takes 6.2 seconds. So fun?)

> The question is, how else to grab unique values from a set. ... (which makes sense, but may be tough to implement) ...

Wut? I showed you how to do it in two lines, including an `import`! And it's demonstrably faster by several times, even in Python instead of C. And that's with next to zero effort on my part. Would a balanced tree be better? What about an array-based heap? A sorted array? How much faster could you get it? Who knows!

> sort -u does exactly what sort | uniq does, assuming you don't have any special sorting comparison function. That's how it's defined.
That's... kind of my point. That's the WTF. One of the main reasons that the Unix approach of composing a bunch of small tools isn't always the right thing to do is performance. In this case, if I issue `sort | uniq`, then `sort` can't know what the reader is going to do and has to sort everything. But `sort -u` has that knowledge. It can optimize! Why else put the `-u` switch into `sort` in the first place, if I might as well just pipe it into `uniq`? Because it saves three characters?

And this ignores the fact that I would be mildly surprised if `uniq` were used (intentionally) without a sorted input (or other knowledge that any duplicates will be adjacent), which means I'm not talking about an edge case but about the common use case of uniquifying a file.
-
> And it's demonstrably faster by several times
Good. Now test it on a set with all unique lines and I bet my ass you'll see the performance go fully down the drain.

I'm too lazy to dig through the sources for both Python and coreutils, but if the uniqueness test is done with nested loops, then yeah, no shit it goes close to O(n) when the data don't have many unique values. But in the worst case it'll be O(n^2), while sort-then-test would take as much as the sort, which is O(n log n) if they aren't doing anything silly.

Also, "I did it in two lines" doesn't count if the first line imports every function you need.
-
> I'm too lazy to dig through sources for both Python and coreutils, but if the uniqueness test is done with nested loops, then yeah, no shit it goes close to O(n) when the data don't have many unique values.
ITYM O(n^2) there. And I'd be willing to bet that Python backs the set with a hash table of some kind; those have great performance in normal cases. Oh look, it does exactly that.

The only questions remaining are whether coreutils could use a better implementation than Python currently does and, if so, what that algorithm would be. I think we've demonstrated that it can certainly do better than it does at the moment (its current performance is miserable), though not whether something like a tree-based set would be superior; I'd have to write code to work that out, so fuck it. ;)
-
> ITYM O(n^2) there.
When all values are the same, you only need to check one value for each one and get an immediate match.
> And I'd be willing to bet that Python backs the set with a hash table of some kind; those have great performance in normal cases. Oh look, it does exactly that.
Yeah, so as I said in my first post, hash sieving is better.
-
> When all values are the same, you only need to check one value for each one and get an immediate match.
OK, O(n) is the hard minimum; you've got to look at each value in the input at least once. ;)
-
If there are many duplicate lines and the input is already mostly sorted, you can take advantage of that (and of the UNIX paradigm of splitting tasks across separate CPUs) like so:

```
(export LC_ALL=C; uniq | sort -u) < demo
```
-
> Why else put the -u switch into sort in the first place if I might as well just pipe it into uniq?
Because `sort -u` uses the same equality rules for sorting as it does for uniqueness testing, and `sort` has more flexible field-selection rules than `uniq` does.

When you're working with whole lines, `sort -u` is indeed equivalent to `sort | uniq`. But as far as I know there is no way to express the equivalent of e.g. `sort -k1 -u` as a `sort | uniq` pipeline.

Also, `uniq` has no numeric-parsing rules: lines, or fields within lines, which are numerically but not textually equal will be treated as equal by `sort -n` but not by `uniq`.
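A rough Python emulation of what that keyed uniqueness does (the function names and the keep-first-per-key detail are my own sketch, not sort's actual implementation):

```python
def sort_unique_by_key(lines, key=lambda line: line.split()[0]):
    """Emulate keyed `sort -u`: sort by the key field, then keep only
    the first line per key; `uniq` compares whole lines and can't do this."""
    out, seen = [], set()
    for line in sorted(lines, key=key):  # stable sort by the key field
        k = key(line)
        if k not in seen:  # -u: drop later lines whose key compares equal
            seen.add(k)
            out.append(line)
    return out
```

So `["b 2", "a 1", "a 9"]` keeps `"a 1"` and `"b 2"`, while a whole-line `uniq` would keep all three because no two lines are textually equal.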
-
Also, `uniq` requires the data to be sorted to produce unique results across the whole set.
-
It doesn't actually require it, but it will certainly yield usually-useless results if the input isn't sorted.
-
How do you define "requires", since it's obviously not "condition that must be met to produce the expected result"? Would it have to refuse to run if the data isn't sorted? Format your hard drive? Call a drone strike on you?
-
Exiting with a nonzero return code would be enough to convince me. But uniq doesn't do that, because despite the name it does have valid uses beyond finding unique lines in sorted input, like cleaning up repetitious log file spam for example.
-
That just means it doesn't bother to check that the content is sorted first. That you can use a tool for something other than what it was intended for (using uniq to remove repeated lines instead of producing a list of unique lines) is just a handy side-benefit. But someone could change the implementation and ruin your use case while still conforming to the specification.

Essentially, you're using the tool outside of its specification. Requirements are part of the specification, not the implementation.
-
The side-benefit is actually the reverse. Per http://linux.die.net/man/1/uniq:

> Filter adjacent matching lines from INPUT (or standard input), writing to OUTPUT (or standard output).
>
> Note: 'uniq' does not detect repeated lines unless they are adjacent. You may want to sort the input first, or use 'sort -u' without 'uniq'.
So finding all unique lines is a special case when the input is sorted. The normal case is just removing duplicate adjacent lines.
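The adjacent-only behavior is easy to see with a small Python sketch built on itertools.groupby (names mine):

```python
from itertools import groupby

def uniq(lines):
    # Like uniq(1): collapse runs of *adjacent* equal lines only.
    return [key for key, _run in groupby(lines)]

print(uniq(["a", "b", "a"]))           # ['a', 'b', 'a'] - unsorted: 'a' survives twice
print(uniq(sorted(["a", "b", "a"])))   # ['a', 'b']      - sorted first: truly unique
```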
-
A version I use occasionally is `sort | uniq -c | sort -n`. `-c` is a flag to uniq that counts the number of duplicates, and is not available from sort.
-
> The best technique is probably a heap sort using a balanced tree as store, where you discard duplicates (which is a trivial change to the tree insertion function).
That is what I did when I had to do this in Haskell: insert the list elements into a Set (which is internally a balanced tree, and which automatically excludes duplicates), then extract the elements, already sorted.
-
That was the first thing I thought of when I read "25.2 million lines, but only 44,565 that are unique". But that could mean a block of 44,565 different lines repeated 500+ times. Or, almost worse, a concatenation of 500+ different permutations of those 44,565 lines.
-
I'll have more to say on this topic, which actually turned into a pretty interesting discussion (and I'll probably retract some of what I said), when I get more of a spare moment than I had today, but I can hit on this point now:

@pixelbeat said:

> (export LC_ALL=C; uniq | sort -u) < demo

That is actually almost guaranteed to do nothing in my case; there's almost no situation (and... maybe literally no situation) in which the process that produced the files I was operating on can generate two consecutive equal lines. And indeed, if I arbitrarily pick a "similar" file to the one I was picking on before:

```
$ wc -l file
38065170 file
$ uniq badpass.eas | wc -l
38065170
```

Obviously there are many situations in which that's not the case, but I did think about it. (Actually, the first thing I tried, instead of reading everything into a Python `set`, was a little script that would read in 10,000 lines (an arbitrarily-chosen number), sort them, then output them; then I could do a couple of rounds of alternating that script with `uniq` before piping into the final `sort`. That was before I decided to just see how well reading into a set would work, expecting it to perform pretty badly.)
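For the curious, that chunked sort-then-uniq alternation is easy to sketch in Python (the chunk size is from the comment above; the function names are mine):

```python
from itertools import islice, groupby

def sort_chunks(lines, chunk_size=10000):
    # Sort each chunk independently, so duplicates *within* a chunk
    # become adjacent and a following uniq() pass can drop them.
    it = iter(lines)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            break
        yield from sorted(chunk)

def uniq(lines):
    # Collapse adjacent duplicates, like uniq(1).
    for key, _run in groupby(lines):
        yield key

# One round of the alternation described above: sort_chunks | uniq.
# The idea is to repeat a couple of rounds, then do one final full sort.
```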
-
If the similarities are at a higher level than individual lines, then I suggest you tell sort to do more focused work initially by using the --buffer-size option. By default that is a percentage of RAM and thus very large, so you'd want to reduce it to the same magnitude as the "structures" in your file. You may want to adjust --batch-size above its default value of 16, too.
-
> If the similarities are on a higher level than each line, then I suggest you tell sort to do more focused work initially by using the --buffer-size option. By default that is a % of RAM and thus very large, so you'd want to reduce to the same magnitude as the "structures" in your file. You may want to adjust --batch-size above the default value of 16 too
And we'd suggest using a better algorithm for `sort -u`. Unique-ing (and counting) sorts should use something much more akin to a bin sort, so that resource consumption scales with the number of unique keys rather than the number of input elements, and the time can then go from O(N log N) to O(N + K log K). (Which term dominates will depend on the input dataset, of course.)
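A minimal Python sketch of that O(N + K log K) shape, with a hash-based counter standing in for the bins (names mine):

```python
from collections import Counter

def sort_unique_counted(lines):
    # O(N) expected to bin/count the input (one hash lookup per line),
    # then O(K log K) to sort the K unique keys. Memory scales with K,
    # not N; the output is roughly what `sort | uniq -c` produces.
    counts = Counter(lines)
    return [(line, counts[line]) for line in sorted(counts)]
```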
-
> (Which means that the authors of sort went out of their way to make it less Unixy so that they could make it perform worse. Yay?)
Must... avoid... dig at... `systemd`...