WTF: Intel CPUs



  • Here's a story about a recent development that reveals quite an annoying WTF about Intel CPUs. 

    A few weeks ago, one of the other x264 developers was doing some timing tests on the 16x16 sum of absolute differences (SAD) operation on his Core 2 Duo, and noticed something odd. The timings looked something like this (in clock cycles):

    3500 48 48 48 240 48 48 48 240 48 48 48 240 [this pattern (48-48-48-240) repeats 64 times total] 3500 48 48 48 240...

    Interesting, he thought. After some testing, it became clear what was happening: since the assembly code was forced to use unaligned loads due to the nature of the data, 1/4 of the loads crossed a cacheline boundary, and 1/64 of those cacheline crossings were also page crossings, resulting in this pattern. Loading across a cacheline boundary was so expensive that the assembly code was nearly as slow as the C code! So we were curious: what about other CPUs? We tested it on a whole bucketload of other processors. The results were shocking: all Intel chips had the same problem, not just with SSE2 but with all loads going back to MMX! Athlons had no penalty for loading across cachelines, and a mere 5% penalty for crossing page boundaries! This was somewhat shocking to us: when we looked at amortized time, the Athlon actually considerably outperformed the Core 2 clock-for-clock, because the cacheline-split loads dragged the Core 2's speed down.
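
    To make the numbers concrete (this snippet is purely illustrative and not from x264): assuming the test stepped through the block 16 bytes at a time from a misaligned start, which matches the repeating pattern above, every fourth load straddles a 64-byte cacheline, and one cacheline boundary per 4 KB page is also a page boundary, giving the 48-48-48-240 pattern with a 3500 every 256 loads.

        /* Illustrative only: count cacheline and page splits for 16-byte loads
           stepping by 16 through one 4 KB page from a misaligned offset. */
        #include <stdint.h>
        #include <stdio.h>

        int main(void)
        {
            const uintptr_t base = 8;              /* arbitrary misaligned start */
            int line_splits = 0, page_splits = 0;
            for (int i = 0; i < 256; i++) {        /* 256 loads * 16 bytes = 4 KB */
                uintptr_t first = base + (uintptr_t)i * 16;
                uintptr_t last  = first + 15;
                if (first / 64   != last / 64)   line_splits++;   /* crosses a cacheline */
                if (first / 4096 != last / 4096) page_splits++;   /* crosses a page */
            }
            /* prints: line splits: 64/256, page splits: 1/256 */
            printf("line splits: %d/256, page splits: %d/256\n",
                   line_splits, page_splits);
            return 0;
        }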

    So, we figured, there must be a way to deal with this. The most obvious was the LDDQU SSE3 instruction, which can load 16 bytes (128 bits) of data across a cacheline without the penalty; internally it does a wider aligned load and shifts out the bytes it needs, so it is vastly faster in the case where the data crosses a cacheline. A simple branch would swap between LDDQU and the regular unaligned load. We tested it on the Core 2... and there was no speed change. Weird, we thought. So we tested it on all the other Intel CPUs with SSE3: the Core 1 and the Prescott... and it worked perfectly; LDDQU entirely eliminated the cacheline penalty. In other words, Intel allowed quite a nasty regression on the Core 2!
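
    As a minimal sketch of the "branch to LDDQU" idea in intrinsics (the actual x264 code is hand-written assembly; this is just an illustration of the dispatch):

        /* Sketch only: use LDDQU when the 16 bytes straddle a 64-byte cacheline,
           and a plain unaligned load (MOVDQU) otherwise.  Requires SSE3. */
        #include <stdint.h>
        #include <pmmintrin.h>   /* _mm_lddqu_si128 */

        static inline __m128i load16(const uint8_t *p)
        {
            if (((uintptr_t)p & 63) > 48)                  /* load crosses a cacheline */
                return _mm_lddqu_si128((const __m128i *)p);
            return _mm_loadu_si128((const __m128i *)p);    /* ordinary MOVDQU */
        }

    On the Core 2, where LDDQU turned out to be no better than a regular unaligned load, a branch like this buys nothing, which matches the "no speed change" result above.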

    So we had to come up with ways to deal with this on the Core 2, and, just as importantly, on non-SSE3 Intel CPUs. For the Core 2, a hack involving PALIGNR was used. For non-SSE3 chips, a genius hack was contrived involving 16 copies of the load statement, one for each possible misalignment; this was actually still drastically faster than loading across a cacheline.
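
    For illustration, here is a rough sketch of the PALIGNR idea in intrinsics (PALIGNR is actually an SSSE3 instruction, which the Core 2 has). The real patch is hand-written assembly and presumably dispatches once per SAD call rather than per load, so treat this purely as a sketch: do two aligned loads that bracket the wanted 16 bytes and stitch them together, so no memory access ever crosses a cacheline. Because PALIGNR takes an immediate shift count, there has to be one code path per misalignment, the same shape as the 16-copy hack used on the older chips (which presumably stitch the halves together with shifts and ORs, since PALIGNR isn't available there).

        /* Sketch only: unaligned 16-byte load built from two aligned loads plus
           PALIGNR, dispatched over the 16 possible misalignments. */
        #include <stdint.h>
        #include <tmmintrin.h>   /* SSSE3: _mm_alignr_epi8 (PALIGNR) */

        static inline __m128i load16_split(const uint8_t *p)
        {
            uintptr_t off = (uintptr_t)p & 15;
            if (off == 0)                               /* already 16-byte aligned */
                return _mm_load_si128((const __m128i *)p);

            const __m128i *base = (const __m128i *)(p - off);
            __m128i lo = _mm_load_si128(base);          /* aligned block holding p[0]  */
            __m128i hi = _mm_load_si128(base + 1);      /* aligned block holding p[15] */

            switch (off) {                              /* PALIGNR needs an immediate */
        #define CASE(n) case n: return _mm_alignr_epi8(hi, lo, n);
            CASE(1)  CASE(2)  CASE(3)  CASE(4)  CASE(5)
            CASE(6)  CASE(7)  CASE(8)  CASE(9)  CASE(10)
            CASE(11) CASE(12) CASE(13) CASE(14) CASE(15)
        #undef CASE
            }
            return lo;   /* unreachable */
        }

    The SAD kernel itself would then just run PSADBW (_mm_sad_epu8) over rows loaded this way; only the loads need special treatment, the arithmetic is unchanged.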

    The result was a nearly 50% reduction in clock cycles for the SAD operation on Intel CPUs. Interestingly enough, the same trick could also apply to motion compensation, since it has the same issue of unaligned loading. If the hack were implemented in FFDshow, for example, it could speed up H.264 playback by a solid 10-20%.

    The moral of the story: never trust the processor documentation. Never assume that every assembly operation does what it should on every chip. And never trust averaged clock cycles; an average can blend two separate sets of numbers into a cycle count the operation never actually takes (e.g. 48, 48, 48 averaged with 240 gives roughly 96, a figure that never appears in the raw timings).

    Of course, even though SAD is the most commonly used operation in the entire program, the Core 2 has historically trashed the Athlon 64 clock-for-clock overall, even without this change.

    P.S. For those masochistic enough to want to look at the assembly code itself, check out [url="http://trac.videolan.org/x264/changeset/696"]the patch itself[/url].
     



  • so... would this 16-copy hack apply a performance increase in ffmpeg even to, say, a certain coppermine-based SSE1 mobile celeron?

    the XBMC community would be really pleased to find that kind of optimization in 264 playback...  there are some scattered reports of 720p30 264 being achievable on the xbox, and an extra 10% or 20% in processor headroom might really make that possible, especially over smb which is apparently currently not happening.



  • @misguided said:

    so... would this 16-copy hack apply a performance increase in ffmpeg even to, say, a certain coppermine-based SSE1 mobile celeron?

    the XBMC community would be really pleased to find that kind of optimization in 264 playback...  there are some scattered reports of 720p30 264 being achievable on the xbox, and an extra 10% or 20% in processor headroom might really make that possible, especially over smb which is apparently currently not happening.

    I assume that the Xbox processor is probably using MMX for decoding, not SSE, given that MMX is likely faster on the Coppermine core.

     It would be nice to be able to test timings on that thing to see if it has the same issue as Pentium 3s do (we didn't test any chips older than the Pentium 3s for the cache line issue).  I would suspect it does, however, making this just as useful.

     Squeezing true 720p out of the Xbox would be quite nice 🙂

     One thing I'm quite curious about is how "known" this issue is in general; I've never heard of it before this, so I wonder how many proprietary programs have similar code.
     



  • @Dark Shikari said:

    I assume that the Xbox processor is probably using MMX for decoding, not SSE, given that MMX is likely faster on the Coppermine core.

     It would be nice to be able to test timings on that thing to see if it has the same issue as Pentium 3s do (we didn't test any chips older than the Pentium 3s for the cache line issue).  I would suspect it does, however, making this just as useful.

     Squeezing true 720p out of the Xbox would be quite nice 🙂

     One thing I'm quite curious about is how "known" this issue is in general; I've never heard of it before this, so I wonder how many proprietary programs have similar code.

    I thought coppermine was a p3?

    well you won't get true 720p60 out of it, the processor just can't handle it...  but if you halve the framerate your real-time load obviously goes way down.  and who needs 60fps anyway?



  • @misguided said:

    @Dark Shikari said:
    I assume that the Xbox processor is probably using MMX for decoding, not SSE, given that MMX is likely faster on the Coppermine core.

     It would be nice to be able to test timings on that thing to see if it has the same issue as Pentium 3s do (we didn't test any chips older than the Pentium 3s for the cache line issue).  I would suspect it does, however, making this just as useful.

     Squeezing true 720p out of the Xbox would be quite nice 🙂

     One thing I'm quite curious about is how "known" this issue is in general; I've never heard of it before this, so I wonder how many proprietary programs have similar code.

    I thought coppermine was a p3?

    Ah yeah, those were the early P3 line.

