#if FAST_MODE
-
From the Tandberg/Nokia/Ericsson H.265 video compression proposal (A119 here). Bonus points: their PowerPoint says "Clean and fast software written from scratch using C".
/*-----------------------------------------------------------------------------------------
Function:   SAD_64x64
Purpose:    Calculate SAD for a 64x64 block
Input:      *a      - Pointer to first block
            *b      - Pointer to second block
            stridea - stride of first block
            strideb - stride of second block
Return:     sad     - SAD value
Parameters: None
-------------------------------------------------------------------------------------------*/
static inline unsigned int SAD_64x64(const unsigned char *a,
                                     const unsigned char *b,
                                     const int stridea,
                                     const int strideb)
{
    int i,j;
    unsigned int sad = 0;
#if FAST_MODE
    for (i=0;i<64;i++)
    {
        unsigned char b_[64];
        for (j=0;j<64;j++)
            b_[j] = b[i*strideb+j];
        for (j=0;j<64;j++)
            sad += abs(a[stridea*i+j] - b_[j]);
    }
#else
    for (i=0;i<64;i++)
        for (j=0;j<64;j++)
            sad += abs(a[stridea*i+j] - b[i*strideb+j]);
#endif
    return(sad);
}
This "FAST_MODE" pattern is repeated for about 10 different copy-pastes of the same function with different input sizes. Sometimes they use memcpy instead of a for loop. I still am not quite sure what the author of this was trying to accomplish.
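For comparison, here is a minimal sketch of how the ten or so per-size copies might collapse into one function. The name sad_block and its width/height parameters are hypothetical and not taken from the proposal; the wrapper only shows that the original SAD_64x64 signature could be kept.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical generic SAD: block size is passed in instead of being
   baked into one copy of the function per size. */
static inline unsigned int sad_block(const unsigned char *a,
                                     const unsigned char *b,
                                     int stridea, int strideb,
                                     int width, int height)
{
    unsigned int sad = 0;

    assert(width > 0 && height > 0);
    for (int i = 0; i < height; i++)
        for (int j = 0; j < width; j++)
            sad += (unsigned int)abs(a[stridea * i + j] - b[strideb * i + j]);
    return sad;
}

/* Thin wrapper with the same name and signature as the posted function. */
static inline unsigned int SAD_64x64(const unsigned char *a,
                                     const unsigned char *b,
                                     const int stridea,
                                     const int strideb)
{
    return sad_block(a, b, stridea, strideb, 64, 64);
}
```

Because the wrappers pass constant sizes to an inline function, a compiler is free to specialize each one, so this need not cost anything over the copy-pasted versions.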
-
@Dark Shikari said:
I still am not quite sure what the author of this was trying to accomplish.
Don't be bad to the author. He has to spend his time at work writing very sad sad code with lots of sad values :( Maybe he's just depressed?
@Dark Shikari said:
Return: sad - SAD value
-
@Dark Shikari said:
This "FAST_MODE" pattern is repeated for about 10 different copy-pastes of the same function with different input sizes. Sometimes they use memcpy instead of a for loop. I still am not quite sure what the author of this was trying to accomplish.
Perhaps the FAST_MODE stuff is more efficient on some processors; maybe that's why it has two separate code paths like that.
-
I was going to fix that post to actually have monospaced fonts, but was scared away by CS' horrible attempts at producing HTML.
Here's a C/P which will hopefully look good.
/*-----------------------------------------------------------------------------------------
Function:   SAD_64x64
Purpose:    Calculate SAD for a 64x64 block
Input:      *a      - Pointer to first block
            *b      - Pointer to second block
            stridea - stride of first block
            strideb - stride of second block
Return:     sad     - SAD value
Parameters: None
-------------------------------------------------------------------------------------------*/
static inline unsigned int SAD_64x64(const unsigned char *a,
                                     const unsigned char *b,
                                     const int stridea,
                                     const int strideb)
{
    int i,j;
    unsigned int sad = 0;
#if FAST_MODE
    for (i=0;i<64;i++)
    {
        unsigned char b_[64];
        for (j=0;j<64;j++)
            b_[j] = b[i*strideb+j];
        for (j=0;j<64;j++)
            sad += abs(a[stridea*i+j] - b_[j]);
    }
#else
    for (i=0;i<64;i++)
        for (j=0;j<64;j++)
            sad += abs(a[stridea*i+j] - b[i*strideb+j]);
#endif
    return(sad);
}
-
@zzo38 said:
@Dark Shikari said:
This "FAST_MODE" pattern is repeated for about 10 different copy-pastes of the same function with different input sizes. Sometimes they use memcpy instead of a for loop. I still am not quite sure what the author of this was trying to accomplish.
Perhaps the FAST_MODE stuff is more efficient on some processors; maybe that's why it has two separate code paths like that.
Yeah, it could be that the compiler automatically vectorizes the FAST_MODE loop. A comment line explaining this would have been nice, though.
-
The programmer probably found that copying each row of b into the local b_ array first, rather than indexing b[i*stride+j] in the middle of another expression, gave better performance due to cache locality, and kept the "normal mode" for reference or as a fallback in case he missed something or the trick didn't work. Or it's just premature optimization and too little faith in the compiler.
-
Yes, it's either data locality or (possibly) register spilling.
I would like to know the target compiler and system... Of course, the author did miss a trick -- once a stride is used, it is likely to be used a lot. Precomputing the stride offsets can lift the multiplies out of the inner loop. Can we see some more code, please?
TRWTF is the simple lack of detailed comments here, but given that the poster has decided it is a WTF, maybe the explanation is simply elsewhere.
-
TRWTF is the comments. The FAST_MODE code does a strict superset of the calculations in the non-FAST_MODE path; it does all the same loads, so it cannot possibly be "cache locality". Even x86 has enough registers to make that trick worthless, and this code isn't targeted at a 16-bit 8086. The only possible benefit is SIMD -- and if they wanted to use that, they could just write two lines of intrinsics.
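To make the intrinsics remark concrete, here is a rough sketch of what that could look like, assuming an x86 target with SSE2. The function name sad_64x64_simd is made up for illustration and nothing here comes from the proposal's code. _mm_sad_epu8 is the SSE2 intrinsic that sums absolute byte differences over 16 bytes at a time; on other targets the sketch falls back to the plain loop.

```c
#include <assert.h>
#include <stdlib.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical SIMD variant of the posted SAD_64x64. */
static unsigned int sad_64x64_simd(const unsigned char *a,
                                   const unsigned char *b,
                                   int stridea, int strideb)
{
    assert(a != NULL && b != NULL);
#if defined(__SSE2__)
    __m128i acc = _mm_setzero_si128();

    for (int i = 0; i < 64; i++) {
        for (int j = 0; j < 64; j += 16) {
            /* Unaligned loads: no alignment is assumed for either block. */
            __m128i va = _mm_loadu_si128((const __m128i *)(a + i * stridea + j));
            __m128i vb = _mm_loadu_si128((const __m128i *)(b + i * strideb + j));
            /* Each call yields two partial sums, one per 64-bit lane. */
            acc = _mm_add_epi64(acc, _mm_sad_epu8(va, vb));
        }
    }
    /* Add the low and high lanes; the total is at most 64*64*255,
       so it fits comfortably in 32 bits. */
    return (unsigned int)_mm_cvtsi128_si32(acc) +
           (unsigned int)_mm_cvtsi128_si32(_mm_srli_si128(acc, 8));
#else
    /* Portable fallback: the same loop as the non-FAST_MODE path. */
    unsigned int sad = 0;

    for (int i = 0; i < 64; i++)
        for (int j = 0; j < 64; j++)
            sad += (unsigned int)abs(a[stridea * i + j] - b[strideb * i + j]);
    return sad;
#endif
}
```

The core really is the two intrinsic calls in the inner loop; the rest is loads and the final lane sum.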
-
The "non-fast" loop may access 3 separate regions. If there are two cache lines available, AND the stride gives a hit within them, recoding it this way minimizes the cache misses.
Of course the accesses are the same -- it's the temporality that counts.
-
What's wrong with pointer math?
He could eliminate two mults and two adds per inner-loop iteration...
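A minimal sketch of the pointer-math version, with a hypothetical name (sad_64x64_ptr) and otherwise the same arguments as the posted function. The row pointers advance by the stride once per row, so the inner loop has no multiplies at all. An optimizing compiler will commonly do this strength reduction on the indexed form by itself, so treat this as an illustration rather than a guaranteed speedup.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical pointer-walking variant of the posted SAD_64x64. */
static unsigned int sad_64x64_ptr(const unsigned char *a,
                                  const unsigned char *b,
                                  int stridea, int strideb)
{
    unsigned int sad = 0;

    assert(a != NULL && b != NULL);
    for (int i = 0; i < 64; i++) {
        const unsigned char *pa = a;   /* start of row i in the first block  */
        const unsigned char *pb = b;   /* start of row i in the second block */

        for (int j = 0; j < 64; j++)
            sad += (unsigned int)abs(*pa++ - *pb++);

        a += stridea;                  /* step to the next row: adds only */
        b += strideb;
    }
    return sad;
}
```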