Premature optimization...

Carnildo

Just found this one. Someone needs to teach this guy Knuth's aphorism about optimizing.

/* GetByteSize returns a byte size based on the record size and number of
records. (The reason I do it this way is to avoid excessive multiplication for
many memory_refs that have 1 byte record sizes).  */

long GetByteSize (short rec_size, long num_recs)
{
	switch (rec_size) {
		
		case 1:
			return	num_recs;
			
		case 2:
			return	num_recs << 1;
			
		case 4:
			return	num_recs << 2;

		case 8:
			return	num_recs << 3;

		case 16:
			return	num_recs << 4;
		
			break;
	}
	
	return	num_recs * rec_size;
}

tiller

The funny thing is that this is most likely not even an optimization.

The case 1: might be(Don't think so, but even if it is, it's still stupid) but I am pretty sure that a compare+branch+shift is slower then a multiplication in most cases.

Not to talk about the fact that if rec_size is not in(1,2,4,8,16) then the function will have to make 5 compares before the multiplication. (And why is the function not inlined. Calling it most likely cost much much more then a multiplication anyway :}

Dudehole

unsigned int NumTimes = 100000000;

long GetByteSizeA(short rec_size, long num_recs)
{
    switch (rec_size)
    {

        case 1:
            return num_recs;

        case 2:
            return num_recs << 1;

        case 4:
            return num_recs << 2;

        case 8:
            return num_recs << 3;

        case 16:
            return num_recs << 4;
    }

    return num_recs * rec_size;
}

long GetByteSizeB(short rec_size, long num_recs)
{
    return num_recs * rec_size;
}

void TestA()
{
	printf("Testing A %u times\n", NumTimes);
	for (unsigned int i = 0; i < NumTimes; i++)
	{
		int recSize = (rand() + 1) / 32 - 1; // limit to some predefined sane amount
		int numRecs = rand();
		GetByteSizeA(recSize, numRecs);
	}
	printf("Done testing.\n");
}

void TestB()
{
	printf("Testing B %u times\n", NumTimes);
	for (unsigned int i = 0; i < NumTimes; i++)
	{
		int recSize = (rand() + 1) / 32 - 1; // limit to some predefined sane amount
		int numRecs = rand();
		GetByteSizeB(recSize, numRecs);
	}
	printf("Done testing.\n");
}

So I did a best out of 5 for each. Curiously...

>timer fooA.exe

fooA.exe
With A
Testing A 100000000 times
Done testing.
Results of execution:

Exit code 0
Time of execution 4.024

>timer fooB.exe
fooB.exe
With B
Testing B 100000000 times
Done testing.
Results of execution:

Exit code 0
Time of execution 4.023

Paid by KLOC?

lolwtf

Oh, I see the problem.

/* GetByteSize returns a byte size based on the record size and number of

records. (The reason I do it this way is to avoid excessive multiplication for

many memory_refs that have 1 byte record sizes).  */
long GetByteSize (short rec_size, long num_recs)

{

switch (rec_size) {
	case 1:
		return	num_recs;
		
	case 2:
		return	num_recs &lt;&lt; 1;
		
	case 4:
		return	num_recs &lt;&lt; 2;

	case 8:
		return	num_recs &lt;&lt; 3;

	case 16:
		return	num_recs &lt;&lt; 4;
	
	default:
		return	num_recs * rec_size;
}

}

That's what the 'default' statement is for!

bugmenot11

But structs are normally aligned, so we can assume, that one of the switch-branches is called.
And is recSize always 16 and numRecs i (the loop variable), we get:
TestA:
00:00:01.492
00:00:01.311

TestB:
00:00:01.211
00:00:01.131

And with optimization (g++ -O3)
TestA:
00:00:00.240
00:00:00.240
 00:00:00.230
TestB:
 00:00:00.340
 00:00:00.220
 00:00:00.240

Repeated ten times (additional outer loop)
A:
  00:00:00.951
 00:00:00.961
TestB:
 00:00:00.931
 00:00:00.911

tgape

@Carnildo said:

Just found this one. Someone needs to teach this guy Knuth's aphorism about optimizing.

Actually, I think the real lesson that individual needs to learn is that you should always performance test your optimizations. Since that particular optimization is guaranteed to always result in equal or worse performance (and only yields equal performance if the processor's ancient and either (rec_size == 1) or (rec_size in (1, 2, 4, 8, 16) *and* the compiler uses a jump table instead of a cascading if to do the switch).)...

Of course, it is possible that lack of understanding of data variance and statistics could lead some people to make false conclusions about the relative performance from time to time. That shouldn't happen very often - and even in those cases, the individual should also learn the next lesson of optimization: don't spend more time optimizing code than the maximum amount of time your optimization could possibly save.

Bulb

@Dudehole said:

...

		int recSize = (rand() + 1) / 32 - 1; // limit to some predefined sane amount
		int numRecs = rand();

...

rand() is probably several orders of magnitude slower than the GetByteSize function, so the difference in the tested function can easily be shadowed by the complexity of the testing code.

By the way, I tried looking at the assembly gcc generates and it does actually use a jump table for size 0 to 16. The jump table will be probably predicted correctly by the CPU and I assume gcc generated the condition as likely to go into the table, though I can't check that. Given that the function overhead is 6 instructions inside the function plus 3 at the call site, the extra code for the switch is still comparably small penalty. Of course the amount of code for all the cases might hurt in causing extra cache miss or extra page fault, which can easily cost two or three orders of magnitude more than the function itself.

akozakie

Best of all - this means he does a function call to do a simple multiplication. Even if the switch would speed this up a lot (and it won't at all), it will still be slower (unless the compiler inlines it automatically, which is less probable with a longer function).

If this was done as a macro or inline, it would be naive. This way, it's simply stupid.

JoelKatz1

What I really can't understand is what neural process lead to him placing that single 'break' statement where he did. Did he mean to put a 'default:' statement before the break? What's also strange is that his comment doesn't at all justify the particular way he did this.

If his logic is that 1 is common, why not:

return (rec_size==1) ? num_recs : (num_recs * recv_size);

Or, in GCC land:

return __builtin_expect(rec_size==1, 1) ? num_recs : (num_recs*recv_size);

Obviously, it's still a pointless optimization.

Dudehole

I haven't done any C coding in a long time (assuming that's what this is), but that makes sense to me. It also occured to me that I don't even know what a "record" is. I am assuming it's a struct, but it could just as easily be some blob of bytes given the context. Makes it kind of hard to define a problem domain. Anyway, you probably don't even need to test this function to see it's a bad idea.

tgape

@Bulb said:

rand() is probably several orders of magnitude slower than the GetByteSize
function, so the difference in the tested function can easily be shadowed by the complexity of the testing code.

Not to mention, for it to be truly valid, you'd really want the same comparisons performed both times. So one would be better off with something like:

@What Dudehole surely meant to post, but didn't have enough far too much time on his hands said:



#include <sys/time.h>

#include <time.h>

#include <stdio.h>

#include <stdlib.h>
#define NumTimes	10000000000 // I had to add a couple zeros because it ran too

// fast after I fixed the code. Yeah, it's a lot of memory.  Upgrade, dude, memory's cheap.
short recSize[NumTimes];

long numRecs[NumTiimes];
long GetByteSizeA(short rec_size, long num_recs)

{

switch (rec_size)

{

case 1:

return num_recs;
case 2:
	return num_recs &lt;&lt; 1;

case 4:
	return num_recs &lt;&lt; 2;

case 8:
	return num_recs &lt;&lt; 3;

case 16:
	return num_recs &lt;&lt; 4;
}

return num_recs * rec_size;

}
long GetByteSizeB(short rec_size, long num_recs)

{

return num_recs * rec_size;

}
long TestA()

{

long A = 0;
printf("Testing A %u times\n", NumTimes);
for (unsigned int i = 0; i &lt; NumTimes; i++)
{
	<b>//</b> I used to compute recSize and numRecs here, but that distorted the results.
	<b>//</b> int recSize = (rand() + 1) / 32 - 1; // limit to some predefined sane amount
	<b>//</b> int numRecs = rand();
	<b>A +=</b> GetByteSizeA(recSize<b>[i]</b>, numRecs<b>[i]</b>);
}
printf("Done testing.\n");
<b>return A;</b>

}
long TestB()

{

long B = 0;
printf("Testing B %u times\n", NumTimes);
for (unsigned int i = 0; i &lt; NumTimes; i++)
{
	<b>//</b> I used to compute recSize and numRecs here, but that distorted the results.
	<b>//</b> int recSize = (rand() + 1) / 32 - 1; // limit to some predefined sane amount
	<b>//</b> int numRecs = rand();
	<b>B +=</b> GetByteSizeB(recSize<b>[i]</b>, numRecs<b>[i]</b>);
}
printf("Done testing.\n");
<b>return B;</b>

}
int main() {

struct timeval start, middle, end, res;

int a, b;

for (unsigned int i = 0; i &lt; NumTimes; i++)
{
	recSize[i] = (rand() + 1) / 32 - 1; // limit to some predefined sane amount
	numRecs[i] = rand();
}

start = gettimeofday(&start, NULL);
a = testA();
middle = gettimeofday(&start, NULL);
b = testB();
end = gettimeofday(&start, NULL);

timersub(&middle, &start, &res);
printf("testA: %ld.%06lds\n", (long)res.tv_sec, (long)res.tv_usec);
timersub(&end, &middle, &res);
printf("testB: %ld.%06lds\n", (long)res.tv_sec, (long)res.tv_usec);

// compare a and b so that we're clearly using all of the return values.
if (a != b) printf("You messed something up.\n");
return 0;

}

So I did a best out of 5 for each. Curiously...



Testing A 100000000 times

Done testing.

Testing B 100000000 times

Done testing.

testA: 7.998337s

testB: 5.020133s

But the really bizarre shit is, inlining GetByteSizeB resulted in a best time of 5.034771s for B.

Paid by KLOC?

FTFY

dwilliss

If the function was inlined and if passed a constant for the size (like sizeof(somestruct)), a good optimizing compiler would optimize out the compare and branches, figure out the size and take the path that does the correct << if applicable.

Of course if num_recs was unsigned, a good optimizing compiler would do also recgonize if rec_size is a power of two and just do a shift anyway.

(why it would only do it for an unsigned num_recs is left as an excersize to the reader)

tgape

@dwilliss said:

Of course if num_recs was unsigned, a good optimizing compiler would do also recgonize if rec_size is a power of two and just do a shift anyway.

Given the performance considerations of multiply versus shift on modern processors, I doubt it¹. That's kind of why this is a WTF.

Just in case you missed it:

@tgape, speaking for Dudehole said:



So I did a best out of 5 for each. Curiously...
Testing A 100000000 times

Done testing.

Testing B 100000000 times

Done testing.

testA: 7.998337s

testB: <font size=3>5.020133s</font>

¹ It's possible there's a recent processor out there where it would do this - but it's not one of the ones I have.