The new memcpy!



  • It's well known that memcpy on 32-bit Linux in glibc isn't exactly the most optimized piece of code--while the 64-bit version uses all sorts of nice MMX and prefetch tricks, the 32-bit version isn't as great.  And it's likely not as efficient on Windows, either.  So in a heavily performance-intensive program where every bit counts, and there's lots of memcpying, it's worth it to write one's own routine for the specific task.  One example in the program I'm working on is copying an image frame in memory--it uses a combination of prefetch and MMX registers to very quickly copy the frame by prefetching the next row ahead of time.
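    A minimal plain-C sketch of that kind of row-prefetching frame copy (the function name and layout are made up for illustration; the real routine is hand-written MMX/SSE, and `__builtin_prefetch` is a GCC/Clang builtin that this version leans on instead):

    ```c
    #include <assert.h>
    #include <stddef.h>
    #include <string.h>

    /* Hypothetical sketch: copy an image frame row by row, hinting the
     * prefetcher at the next source row while the current one is copied. */
    static void copy_frame(unsigned char *dst, const unsigned char *src,
                           size_t stride, size_t rows)
    {
        for (size_t y = 0; y < rows; y++) {
            if (y + 1 < rows)
                __builtin_prefetch(src + (y + 1) * stride);  /* next row */
            memcpy(dst + y * stride, src + y * stride, stride);
        }
    }
    ```

    The prefetch is only a hint: the copy is correct with or without it, which makes this kind of routine easy to benchmark against plain memcpy.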

    So, we were discussing the fact that a struct of a decent size (a few hundred bytes) was memcpy'd in a very commonly called function, and might make up a decent percentage of execution time.  So I mentioned the idea of writing a simple assembly routine to copy the struct.  It'd be easy enough, since the struct consists primarily of a big array of bytes, plus a couple of other variables which can easily be copied with C code.

    The actual function looks something like this:

    CopyOfCABACStruct = ActualCABACStruct

    entropy_encode(CopyOfCABACStruct)

    bits += CopyOfCABACStruct.bits

    In other words, a struct is being copied, passed to a function, and then thrown away.  The copy is used so that we can run the function without affecting the actual struct. 

    So this chat ensued:

    [01:44:17]    <pengvado>    because 320 bytes stay in register throughout the CABAC encode.
    [01:44:28]    <Dark_Shikari>    Wait--you mean to back up the original?
    [01:44:37]    <Dark_Shikari>    so instead of copying it, and passing the copy
    [01:44:40]    <Dark_Shikari>    you back up the original?
    [01:44:43]    <pengvado>    yes
    [01:44:51]    <Dark_Shikari>    and pass the original

    Pengvado had proposed an "ingenious" idea: instead of memcpying the struct, using it in the function, and throwing it away... you copy most of the struct to the mmx/SSE registers on the processor, store the last few bytes in memory, and pass the original struct to the function!  Then, when the function finishes, you empty the vector registers back into memory.  This would definitely be faster, since it means you only have to copy a small portion of the bytes of the struct, and the rest you just back up in the registers.
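    Expressed in portable C, the proposed pattern would look roughly like this sketch (all names are hypothetical).  The "ingenious" part was that `backup` would live in the mm/xmm registers rather than on the stack--which is precisely the part that only works if the compiler never touches those registers:

    ```c
    #include <assert.h>
    #include <string.h>

    typedef struct { unsigned char state[320]; int bits; } cabac_t;  /* hypothetical */

    /* Stand-in for the real encoder: mutates the state, accumulates bits. */
    static void entropy_encode(cabac_t *cb) { cb->state[0] ^= 1; cb->bits += 7; }

    static int trial_encode(cabac_t *cb)
    {
        unsigned char backup[320];      /* in the proposal: mm/xmm registers */
        int bits_backup = cb->bits;
        memcpy(backup, cb->state, sizeof backup);   /* "fill the registers" */

        entropy_encode(cb);             /* run on the ORIGINAL struct */
        int cost = cb->bits - bits_backup;

        memcpy(cb->state, backup, sizeof backup);   /* "spill them back" */
        cb->bits = bits_backup;
        return cost;
    }
    ```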

    Of course, I think we both saw the WTF in this insanity...

    [01:45:07]    <Dark_Shikari>    and this relies on knowing GCC will never use an mm or xmm register during CABAC?
    [01:45:13]    <pengvado>    yes
    [01:45:17]    <Dark_Shikari>    what if someone decides to compile with Intel?
    [01:45:24]    <pengvado>    then it breaks

    Fortunately, being an experienced coder, pengvado wasn't entirely serious about this one.


  • Discourse touched me in a no-no place

    @Dark Shikari said:

    Of course, I think we both saw the WTF in this insanity...
    That you were doing premature optimisation without actually doing any profiling first?



  •  As always, security (aka working on a copy of the data) is good, but trusting is faster. If your copies are a problem, don't try to play with MMX to copy the data--just don't copy it. It's easy in C to mark a variable as const; the compiler will prevent attempts at modifying it, so it wouldn't be a problem to ensure the data is not altered! You are heading for a maintenance nightmare ^^
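    A minimal illustration of that point (hypothetical function, not from the original code): take the data through a pointer-to-const and the compiler rejects any write to it, so no defensive copy is needed to guarantee the caller's data survives:

    ```c
    #include <assert.h>
    #include <stddef.h>

    /* The const qualifier makes the no-modification promise checkable:
     * uncommenting the marked line is a compile-time error. */
    static size_t count_zero_bytes(const unsigned char *data, size_t n)
    {
        size_t zeros = 0;
        for (size_t i = 0; i < n; i++) {
            /* data[i] = 0;  <-- error: assignment of read-only location */
            if (data[i] == 0)
                zeros++;
        }
        return zeros;
    }
    ```

    Of course, const only catches accidental writes; it does nothing for the case in the original post, where entropy_encode() *must* mutate the state.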



  • If you really think the data copying is the problem, then you should instead look into the entropy_encode() function.

    It probably takes the bits from the struct, encodes them and then copies them back into the struct. So if you change the destination, you do not need the copy.

    But please - do a profiling before changing anything. The entropy_encode() function most likely takes up most of the execution time anyway.




  •  We've done more profiling than you can shake a stick at, don't worry about that.  I'm almost sick of staring at oprofile results at this point!  And obviously before I actually implemented such a function, I'd check out the exact number of clock cycles spent on memcpy with some quick "bench.h" magic.

    The entropy_encode() function has to write things back into the struct because the struct represents the state of the context-adaptive encoder, and every bit depends on past bits.  We already take a shortcut by avoiding writing certain things to the bitstream that don't cause such a dependency, and instead calculating how many bits they *would* have cost had we written them.

    In some modes this function can take up 20-30% of execution time, and the function (the one that calls entropy_encode) takes about 10k clocks--so shaving off 500 would be an acceptable speed boost.

    Clearly though, it's a great idea to intentionally break the Intel C compiler ;)



  • @Dark Shikari said:

     We've done more profiling than you can shake a stick at, don't worry about that.
     

    Alright, I am curious now. Just how much profiling CAN you shake a stick at?



  • @Dark Shikari said:

    The entropy_encode() function has to write things back into the struct because the struct represents the state of the context-adaptive encoder, and every bit depends on past bits.

     

    Why does the state have to be stored in the same place all the time?  e.g. Ping-pong it between two buffers.
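    A sketch of that ping-pong idea (all names invented): have the encoder write its updated state to a caller-chosen destination, then flip an index to accept a trial run--no copy in either direction:

    ```c
    #include <assert.h>

    typedef struct { int bits; } state_t;   /* hypothetical encoder state */

    /* Stand-in for entropy_encode(): reads *src, writes the new state to
     * *dst, so the caller decides where the result lands. */
    static void entropy_encode_to(state_t *dst, const state_t *src)
    {
        dst->bits = src->bits + 7;
    }

    static int ping_pong_demo(void)
    {
        state_t buf[2] = {{0}, {0}};
        int cur = 0;
        /* Trial: encode from buf[cur] into the other buffer; the source
         * state is untouched, so discarding the trial costs nothing. */
        entropy_encode_to(&buf[1 - cur], &buf[cur]);
        /* Accept the trial by flipping the index -- still no copy. */
        cur = 1 - cur;
        return buf[cur].bits;
    }
    ```

    Whether this is practical depends on how many places hold pointers into the "current" state--every one of them has to follow the index.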




  • @MasterPlanSoftware said:

    Alright, I am curious now. Just how much profiling CAN you shake a stick at?
     

    About the same amount as how much wood a woodchuck would chuck if it could chuck wood. 



  • @Dark Shikari said:

    It's well known that memcpy on 32-bit Linux in glibc isn't exactly the most optimized piece of code
     

    Well, I mean, of course... that's obvious...

    Seriously?  "Well known?"



  • @Heron said:

    @MasterPlanSoftware said:

    Alright, I am curious now. Just how much profiling CAN you shake a stick at?
     

    About the same amount as how much wood a woodchuck would chuck if it could chuck Norris

    Answering that age old question 



  • I actually didn't know that the MMX/SSE extensions include such vector registers. That is a great improvement to the general lack of registers on the x86. (I normally work with embedded DSPs)

    In the "great" context of hoping the compiler does not use the registers (I've actually seen things like this in working code), maybe it could be done the other way around - copying the struct into the registers and then encoding that copy.

    It would probably take a rewrite of the entropy_encode() function, but could possibly be faster due to the values already being in registers.  But would still be a huge WTF.

     

    I am a bit curious: Do I understand correctly that you just need the number of bits used, or do you need the encoded bits? If the latter then I assume the bits += is overloaded with some kind of memcpy too?




  • @olm said:

    I actually didn't know that the MMX/SSE extensions includes such vector registers. That is a great improvement to the general lack of registers on the x86. (I normally work with embedded DSPs)

     

    The vector "registers" are just appropriated FPU registers. IA-32 still sucks for register count.



  • @TehFreek said:

    @olm said:

    I actually didn't know that the MMX/SSE extensions includes such vector registers. That is a great improvement to the general lack of registers on the x86. (I normally work with embedded DSPs)

     

    The vector "registers" are just appropriated FPU registers. IA-32 still sucks for register count.

    I think the MMX registers are overlaid with the FPU ones; the XMM ones, however, are not.



  • If this approach really gives a substantial performance boost, why not just use a preprocessor macro to restrict it to (certain versions of) GCC? I've certainly done this kind of thing before, though admittedly on proprietary projects where we knew there was essentially no chance of using a different compiler; I don't know if portability standards are different in the open-source world.
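    One hedged way such a guard might look (macro name invented; note that ICC also defines `__GNUC__`, so it has to be excluded explicitly).  Both branches are shown as plain `memcpy` here--the GCC branch is where the register-backed version would go:

    ```c
    #include <assert.h>
    #include <string.h>

    /* Hypothetical compile-time switch: fragile fast path only on plain
     * GCC, ordinary copy everywhere else. */
    #if defined(__GNUC__) && !defined(__INTEL_COMPILER)
    #  define BACKUP_STATE(dst, src, n) memcpy((dst), (src), (n))  /* register trick would go here */
    #else
    #  define BACKUP_STATE(dst, src, n) memcpy((dst), (src), (n))  /* portable fallback */
    #endif
    ```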



  • Most every memcpy I've seen in Windows (at least out of a half-recent compiler) uses the REP MOVS instruction.
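    For illustration, here is roughly what a `rep movsb` copy looks like in GCC extended inline assembly (a sketch, with a portable fallback so it still builds off x86; the function name is made up):

    ```c
    #include <assert.h>
    #include <stddef.h>
    #include <string.h>

    /* Byte-granularity string-move copy, the core of many compiler-emitted
     * memcpys.  DF is clear per the ABI, so the copy runs forward. */
    static void rep_movs_copy(void *dst, const void *src, size_t n)
    {
    #if defined(__x86_64__) || defined(__i386__)
        __asm__ volatile ("rep movsb"
                          : "+D" (dst), "+S" (src), "+c" (n)
                          :
                          : "memory");
    #else
        memcpy(dst, src, n);   /* non-x86 fallback */
    #endif
    }
    ```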



  • @djork said:

    Seriously?  "Well known?"

     

    Well known by people who know it.  Less well known by the rest of us. 

    http://softwarecommunity.intel.com/wiki/linux/719.htm




  • @Heron said:

    @MasterPlanSoftware said:

    Alright, I am curious now. Just how much profiling CAN you shake a stick at?
     

    About the same amount as how much wood a woodchuck would chuck if it could chuck wood. 

     

    The answer is quite simple: The letter 3!



  • As a general response: even in the hypothetical situation where that code works in all compilers, can you imagine someone trying to read this after you guys leave the company?

    You can always create a modified object which only contains deltas that way you don't need to clone. But the question is: Do you REALLY need that optimization?
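    A sketch of that delta idea (all names hypothetical): record only the fields the trial run changes and roll them back afterwards.  The catch is that you must know exactly which fields get touched--which, for a real CABAC state where every bit depends on past bits, is the hard part:

    ```c
    #include <assert.h>

    typedef struct { unsigned char state[320]; int bits; } cabac_t;   /* hypothetical */
    typedef struct { int old_bits; unsigned char old_state0; } delta_t;

    /* Stand-in encoder that touches exactly one state byte. */
    static void entropy_encode(cabac_t *cb) { cb->state[0] ^= 1; cb->bits += 7; }

    static int trial_with_delta(cabac_t *cb)
    {
        delta_t d = { cb->bits, cb->state[0] };  /* save only what will change */
        entropy_encode(cb);
        int cost = cb->bits - d.old_bits;
        cb->state[0] = d.old_state0;             /* roll back via the delta */
        cb->bits = d.old_bits;
        return cost;
    }
    ```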



  • @MasterPlanSoftware said:

    @Dark Shikari said:

     We've done more profiling than you can shake a stick at, don't worry about that.
     

    Alright, I am curious now. Just how much profiling CAN you shake a stick at?


    Depends.  How much profiling can you fit in a stadium? 



  • @dlikhten said:

    You can always create a modified object which only contains deltas that way you don't need to clone. But the question is: Do you REALLY need that optimization?
    Unless I'm mistaken, the code is from x264, an open-source H.264 encoder, and knowing several users of said encoder, they'll appreciate every speedup, no matter how small.

