WTF: Intel CPUs



  • Here's a story about a recent development that reveals quite an annoying WTF about Intel CPUs. 

    A few weeks ago, one of the other x264 developers was doing some timing tests of the 16x16 sum of absolute differences (SAD) operation on his Core 2 Duo, and noticed something odd. The timings looked something like this (in clock cycles):

    3500 48 48 48 240 48 48 48 240 48 48 48 240 [this pattern (48-48-48-240) repeats 64 times total] 3500 48 48 48 240...

    Interesting, he thought. After some testing, it became clear what was happening: since the nature of the data forced the assembly code to use unaligned loads, 1/4 of the loads crossed a cache line boundary, and 1/64 of those cache line crossings also fell on page boundaries, producing this pattern. Loading across a cache line boundary was so expensive that the assembly code was nearly as slow as the C code! So we were curious: what about other CPUs? We tested it on a whole bucketload of other processors. The results were shocking: all Intel chips had the same problem, not just with SSE2 but with all loads going back to MMX! Athlons had no penalty for loading across cache lines, and a mere 5% penalty for crossing page boundaries! When we looked at amortized time, the Athlon actually considerably outperformed the Core 2 clock-for-clock, because the cache line crossings dragged down the Core 2's speed.
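
    For the curious, here's a quick back-of-the-envelope check of those fractions; this is just a sanity-check sketch assuming the usual 64-byte cache lines and 4 KiB pages, not anything from the patch:

    [code]
    /* Count, over every byte offset in a page, how often a 16-byte
     * unaligned load crosses a 64-byte cache line, and how often that
     * crossing also crosses a 4096-byte page boundary. */
    #include <stdio.h>

    int main(void)
    {
        const int LINE = 64, PAGE = 4096, LOAD = 16;
        int line_cross = 0, page_cross = 0;

        for (int off = 0; off < PAGE; off++) {
            int last = off + LOAD - 1;           /* last byte touched */
            if (off / LINE != last / LINE) {     /* spans two cache lines */
                line_cross++;
                if (off / PAGE != last / PAGE)   /* spans two pages too */
                    page_cross++;
            }
        }
        /* prints 960/4096 (15/64, just under 1/4) line crossings,
         * of which 15/960 (exactly 1/64) are also page crossings */
        printf("line crossings: %d/%d\n", line_cross, PAGE);
        printf("page crossings among them: %d/%d\n", page_cross, line_cross);
        return 0;
    }
    [/code]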

    So, we figured, there must be a way to deal with this. The most obvious was LDDQU, an SSE3 operation that can load 128 bits (16 bytes) of data across a cache line without the penalty; a simple branch would swap between LDDQU and the regular load whenever the data crossed a cache line. We tested it on the Core 2... and there was no speed change. Weird, we thought. So we tested it on all the other Intel CPUs with SSE3, the Core 1 and the Prescott... and it worked perfectly; LDDQU entirely eliminated the cache line penalty. In other words, Intel allowed quite a nasty regression into the Core 2!
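
    For reference, the branch looks something like this written with intrinsics; a minimal sketch of the idea, not the actual x264 code (which is assembly):

    [code]
    /* Use LDDQU (SSE3) only when the 16-byte load would span two 64-byte
     * cache lines; otherwise use a plain MOVDQU. Build with -msse3. */
    #include <emmintrin.h>
    #include <pmmintrin.h>  /* _mm_lddqu_si128 */
    #include <stdint.h>

    static __m128i load16(const uint8_t *p)
    {
        /* offsets 49-63 within a line make a 16-byte load cross it */
        if (((uintptr_t)p & 63) > 48)
            return _mm_lddqu_si128((const __m128i *)p);  /* penalty-free */
        return _mm_loadu_si128((const __m128i *)p);      /* regular load */
    }
    [/code]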

    So we had to come up with ways to deal with this on the Core 2, and just as importantly, on non-SSE3 Intel CPUs. For the Core 2, a hack involving PALIGNR was used. For non-SSE3 chips, a genius hack was contrived involving 16 copies of the load code, one for each possible misalignment; this was actually still drastically faster than loading across a cache line.
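
    The rough shape of the PALIGNR version, again sketched with intrinsics rather than the real asm (see the patch link below); note that PALIGNR is SSSE3, which the Core 2 has, and its shift count must be an immediate, which is exactly why one copy per misalignment is needed:

    [code]
    /* Two aligned loads never cross a cache line; PALIGNR stitches the
     * 16 wanted bytes out of them. Assumes, like x264's padded frame
     * buffers, that reading a little past p is safe. Build with -mssse3. */
    #include <tmmintrin.h>
    #include <stdint.h>

    static __m128i load16_palignr(const uint8_t *p)
    {
        uintptr_t shift = (uintptr_t)p & 15;
        const __m128i *base = (const __m128i *)(p - shift);
        __m128i lo = _mm_load_si128(base);      /* aligned load */
        __m128i hi = _mm_load_si128(base + 1);  /* aligned load */

        switch (shift) {  /* one copy per misalignment: immediate operand */
        case 0:  return lo;
    #define CASE(n) case n: return _mm_alignr_epi8(hi, lo, n)
        CASE(1);  CASE(2);  CASE(3);  CASE(4);  CASE(5);
        CASE(6);  CASE(7);  CASE(8);  CASE(9);  CASE(10);
        CASE(11); CASE(12); CASE(13); CASE(14); CASE(15);
    #undef CASE
        }
        return lo;  /* unreachable */
    }
    [/code]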

    The result was a near 50% reduction in clock cycles for the SAD operation on Intel CPUs. Interestingly enough, the same trick could also apply to motion compensation, which has the same unaligned-load issue. If it were implemented in FFDshow, for example, it could speed up H.264 playback by a solid 10-20%.

    The moral of the story: never trust the processor documentation. Never assume that every assembly operation does what it should on every chip. And never trust averaged clock cycles; an average can blend two separate sets of numbers into a third number of clock cycles that the operation never actually takes (e.g. 48, 48, 48 averaged with 240 gives 96).

     Of course, even though SAD is the most commonly used operation in the entire program, the Core 2s have historically trashed the Athlon 64s clock-for-clock, even without this change.

     P.S. For those masochistic enough to want to look at the assembly code itself, check out [url="http://trac.videolan.org/x264/changeset/696"]the patch[/url].
     



  • so... would this 16-copy hack yield a performance increase in ffmpeg even on, say, a certain coppermine-based SSE1 mobile celeron?

    the XBMC community would be really pleased to find that kind of optimization in H.264 playback...  there are some scattered reports of 720p30 H.264 being achievable on the xbox, and an extra 10% or 20% of processor headroom might really make that possible, especially over SMB, which apparently isn't happening at the moment.



  • @misguided said:

    so... would this 16-copy hack yield a performance increase in ffmpeg even on, say, a certain coppermine-based SSE1 mobile celeron?

    the XBMC community would be really pleased to find that kind of optimization in H.264 playback...  there are some scattered reports of 720p30 H.264 being achievable on the xbox, and an extra 10% or 20% of processor headroom might really make that possible, especially over SMB, which apparently isn't happening at the moment.

    I assume that the Xbox processor is probably using MMX for decoding, not SSE, given that MMX is likely faster on the Coppermine core.

     It would be nice to be able to test timings on that thing to see if it has the same issue the Pentium 3 does (we didn't test any chips older than the Pentium 3 for the cache line issue).  I would suspect it does, however, making this just as useful there.
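
     Something like this per-offset timing loop would do it; a bare-bones sketch (GCC-style, x86-only, with a hypothetical iteration count), not our actual test code, which timed the asm SAD itself:

     [code]
     /* Time an unaligned 16-byte SSE2 load at every offset within a
      * 64-byte cache line; offsets 49-63 cross into the next line.
      * Build with something like: gcc -O2 -msse2 timing.c */
     #include <emmintrin.h>
     #include <stdint.h>
     #include <stdio.h>
     #include <x86intrin.h>  /* __rdtsc */

     #define ITERS 1000      /* hypothetical iteration count */

     int main(void)
     {
         static uint8_t buf[128] __attribute__((aligned(64)));
         __m128i acc = _mm_setzero_si128();

         for (int off = 0; off < 64; off++) {
             uint64_t t0 = __rdtsc();
             for (int i = 0; i < ITERS; i++) {
                 acc = _mm_add_epi8(acc,
                     _mm_loadu_si128((const __m128i *)(buf + off)));
                 __asm__ volatile("" ::: "memory");  /* force a real load */
             }
             uint64_t t1 = __rdtsc();
             printf("offset %2d: %llu cycles/load\n", off,
                    (unsigned long long)((t1 - t0) / ITERS));
         }
         /* keep acc live so the loop isn't optimized away */
         volatile int sink = _mm_cvtsi128_si32(acc);
         (void)sink;
         return 0;
     }
     [/code]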

     Squeezing true 720p out of the Xbox would be quite nice :)

     One thing I'm quite curious about is how "known" this issue is in general; I've never heard of it before this, so I wonder how many proprietary programs have similar code.
     



  • @Dark Shikari said:

    I assume that the Xbox processor is probably using MMX for decoding, not SSE, given that MMX is likely faster on the Coppermine core.

      It would be nice to be able to test timings on that thing to see if it has the same issue the Pentium 3 does (we didn't test any chips older than the Pentium 3 for the cache line issue).  I would suspect it does, however, making this just as useful there.

     Squeezing true 720p out of the Xbox would be quite nice :)

     One thing I'm quite curious about is how "known" this issue is in general; I've never heard of it before this, so I wonder how many proprietary programs have similar code.

    I thought coppermine was a p3?

    well, you won't get true 720p60 out of it; the processor just can't handle it...  but if you halve the framerate, your real-time load obviously goes way down.  and who needs 60fps anyway?



  • @misguided said:

    @Dark Shikari said:
    I assume that the Xbox processor is probably using MMX for decoding, not SSE, given that MMX is likely faster on the Coppermine core.

      It would be nice to be able to test timings on that thing to see if it has the same issue the Pentium 3 does (we didn't test any chips older than the Pentium 3 for the cache line issue).  I would suspect it does, however, making this just as useful there.

     Squeezing true 720p out of the Xbox would be quite nice :)

     One thing I'm quite curious about is how "known" this issue is in general; I've never heard of it before this, so I wonder how many proprietary programs have similar code.

    I thought coppermine was a p3?

    Ah yeah, those were the early P3 line.

