Premature optimization...



  • Just found this one. Someone needs to teach this guy Knuth's aphorism about optimizing.

    /* GetByteSize returns a byte size based on the record size and number of
    records. (The reason I do it this way is to avoid excessive multiplication for
    many memory_refs that have 1 byte record sizes).  */
    
    long GetByteSize (short rec_size, long num_recs)
    {
    	switch (rec_size) {
    		
    		case 1:
    			return	num_recs;
    			
    		case 2:
    			return	num_recs << 1;
    			
    		case 4:
    			return	num_recs << 2;
    
    		case 8:
    			return	num_recs << 3;
    
    		case 16:
    			return	num_recs << 4;
    		
    			break;
    	}
    	
    	return	num_recs * rec_size;
    }


  • The funny thing is that this is most likely not even an optimization.

    The case 1: might be (I don't think so, but even if it is, it's still stupid), but I am pretty sure that a compare+branch+shift is slower than a multiplication in most cases.

    Not to mention that if rec_size is not in (1, 2, 4, 8, 16), the function has to make 5 compares before the multiplication. (And why is the function not inlined? Calling it most likely costs much, much more than a multiplication anyway :}
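
    The "just inline it" part would be, roughly, a one-liner like the sketch below (my sketch of what's being suggested, not code from the original program):

    /* Minimal sketch: inlined, the multiply costs no more than the call
       overhead of the original function, and the compiler is free to turn a
       constant power-of-two rec_size into a shift on its own. */
    static inline long GetByteSize(short rec_size, long num_recs)
    {
        return num_recs * rec_size;
    }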

     




  • #include <stdio.h>
    #include <stdlib.h>
    
    unsigned int NumTimes = 100000000;
    
    long GetByteSizeA(short rec_size, long num_recs)
    {
        switch (rec_size)
        {
    
            case 1:
                return num_recs;
    
            case 2:
                return num_recs << 1;
    
            case 4:
                return num_recs << 2;
    
            case 8:
                return num_recs << 3;
    
            case 16:
                return num_recs << 4;
        }
    
        return num_recs * rec_size;
    }
    
    long GetByteSizeB(short rec_size, long num_recs)
    {
        return num_recs * rec_size;
    }
    
    void TestA()
    {
    	printf("Testing A %u times\n", NumTimes);
    	for (unsigned int i = 0; i < NumTimes; i++)
    	{
    		int recSize = (rand() + 1) / 32 - 1; // limit to some predefined sane amount
    		int numRecs = rand();
    		GetByteSizeA(recSize, numRecs);
    	}
    	printf("Done testing.\n");
    }
    
    void TestB()
    {
    	printf("Testing B %u times\n", NumTimes);
    	for (unsigned int i = 0; i < NumTimes; i++)
    	{
    		int recSize = (rand() + 1) / 32 - 1; // limit to some predefined sane amount
    		int numRecs = rand();
    		GetByteSizeB(recSize, numRecs);
    	}
    	printf("Done testing.\n");
    }
    
    
    So I did a best out of 5 for each. Curiously...
    >timer fooA.exe
    

    fooA.exe
    With A
    Testing A 100000000 times
    Done testing.
    Results of execution:

    Exit code 0
    Time of execution 4.024
    

    >timer fooB.exe
    fooB.exe
    With B
    Testing B 100000000 times
    Done testing.
    Results of execution:

    Exit code 0
    Time of execution 4.023
    

     

    Paid by KLOC?



  • Oh, I see the problem.

    /* GetByteSize returns a byte size based on the record size and number of
    records. (The reason I do it this way is to avoid excessive multiplication for
    many memory_refs that have 1 byte record sizes). */

    long GetByteSize (short rec_size, long num_recs)
    {
    	switch (rec_size) {
    
    		case 1:
    			return	num_recs;
    
    		case 2:
    			return	num_recs << 1;
    
    		case 4:
    			return	num_recs << 2;
    
    		case 8:
    			return	num_recs << 3;
    
    		case 16:
    			return	num_recs << 4;
    
    		default:
    			return	num_recs * rec_size;
    	}
    }

    That's what the 'default' statement is for!



  • But structs are normally aligned, so we can assume that one of the switch branches is taken.
    And if recSize is always 16 and numRecs is i (the loop variable), we get:
    TestA:
    00:00:01.492
    00:00:01.311
    
    TestB:
    00:00:01.211
    00:00:01.131
    
    And with optimization (g++ -O3)
    TestA:
    00:00:00.240
    00:00:00.240
    00:00:00.230
    TestB:
    00:00:00.340
    00:00:00.220
    00:00:00.240
    
    Repeated ten times (additional outer loop):
    TestA:
    00:00:00.951
    00:00:00.961
    TestB:
    00:00:00.931
    00:00:00.911
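
    (The loop for those runs was presumably something like the sketch below - reconstructed from the description above; the accumulator is my addition so the optimizer can't simply discard the calls.)

    long sum = 0;
    for (unsigned int i = 0; i < NumTimes; i++)
        sum += GetByteSizeA(16, i);    /* GetByteSizeB(16, i) for the B runs */
    printf("%ld\n", sum);              /* keep the result live */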
    


  • @Carnildo said:

    Just found this one. Someone needs to teach this guy Knuth's aphorism about optimizing.

    Actually, I think the real lesson that individual needs to learn is that you should always performance-test your optimizations.  That particular optimization is guaranteed to give equal or worse performance; it only yields equal performance if the processor is ancient and either (rec_size == 1), or rec_size is in (1, 2, 4, 8, 16) *and* the compiler uses a jump table instead of a cascading if to implement the switch...

    Of course, it is possible that lack of understanding of data variance and statistics could lead some people to make false conclusions about the relative performance from time to time.  That shouldn't happen very often - and even in those cases, the individual should also learn the next lesson of optimization: don't spend more time optimizing code than the maximum amount of time your optimization could possibly save.



  • @Dudehole said:

    ...

    		int recSize = (rand() + 1) / 32 - 1; // limit to some predefined sane amount
    		int numRecs = rand();
    

    ...

    rand() is probably several orders of magnitude slower than the GetByteSize function, so the difference in the tested function can easily be shadowed by the complexity of the testing code.

    By the way, I tried looking at the assembly gcc generates, and it does actually use a jump table for sizes 0 to 16. The jump table will probably be predicted correctly by the CPU, and I assume gcc generated the condition as likely to go into the table, though I can't check that. Given that the function overhead is 6 instructions inside the function plus 3 at the call site, the extra code for the switch is still a comparatively small penalty. Of course, the amount of code for all the cases might hurt by causing an extra cache miss or page fault, which can easily cost two or three orders of magnitude more than the function itself.



  • Best of all - this means he does a function call to do a simple multiplication. Even if the switch would speed this up a lot (and it won't at all), it will still be slower (unless the compiler inlines it automatically, which is less probable with a longer function).

    If this was done as a macro or inline, it would be naive. This way, it's simply stupid.
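
    The "naive but at least free" version would have been something along the lines of the sketch below (a hypothetical macro, not from the original code):

    /* Hypothetical macro version: still a pointless micro-optimization,
       but at least there is no function-call overhead. */
    #define GET_BYTE_SIZE(rec_size, num_recs)  ((long)(num_recs) * (rec_size))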



  •  What I really can't understand is what neural process led him to place that single 'break' statement where he did. Did he mean to put a 'default:' statement before the break? What's also strange is that his comment doesn't at all justify the particular way he did this.

     If his logic is that 1 is common, why not:

    return (rec_size==1) ? num_recs : (num_recs * rec_size);

     Or, in GCC land:

    return __builtin_expect(rec_size==1, 1) ? num_recs : (num_recs * rec_size);

     Obviously, it's still a pointless optimization.

     



  • I haven't done any C coding in a long time (assuming that's what this is), but that makes sense to me. It also occurred to me that I don't even know what a "record" is. I assume it's a struct, but it could just as easily be some blob of bytes, given the context. That makes it kind of hard to define a problem domain. Anyway, you probably don't even need to test this function to see it's a bad idea.



  • @Bulb said:

    rand() is probably several orders of magnitude slower than the GetByteSize
    function, so the difference in the tested function can easily be shadowed by the complexity of the testing code.

    Not to mention, for it to be truly valid, you'd really want the same comparisons performed both times.  So one would be better off with something like:

    @What Dudehole surely meant to post, but didn't have far too much time on his hands, said:


    #include <sys/time.h>
    #include <time.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define NumTimes 10000000000 // I had to add a couple zeros because it ran too
    // fast after I fixed the code. Yeah, it's a lot of memory. Upgrade, dude, memory's cheap.

    short recSize[NumTimes];
    long numRecs[NumTimes];

    long GetByteSizeA(short rec_size, long num_recs)
    {
        switch (rec_size)
        {
            case 1:
                return num_recs;
    
            case 2:
                return num_recs << 1;
    
            case 4:
                return num_recs << 2;
    
            case 8:
                return num_recs << 3;
    
            case 16:
                return num_recs << 4;
        }
    
        return num_recs * rec_size;
    }
    
    long GetByteSizeB(short rec_size, long num_recs)
    {
        return num_recs * rec_size;
    }

    long TestA()
    {
        long A = 0;
    
        printf("Testing A %u times\n", NumTimes);
        for (unsigned int i = 0; i < NumTimes; i++)
        {
            // I used to compute recSize and numRecs here, but that distorted the results.
            // int recSize = (rand() + 1) / 32 - 1; // limit to some predefined sane amount
            // int numRecs = rand();
            A += GetByteSizeA(recSize[i], numRecs[i]);
        }
        printf("Done testing.\n");
        return A;
    }
    
    long TestB()
    {
        long B = 0;
    
        printf("Testing B %u times\n", NumTimes);
        for (unsigned int i = 0; i < NumTimes; i++)
        {
            // I used to compute recSize and numRecs here, but that distorted the results.
            // int recSize = (rand() + 1) / 32 - 1; // limit to some predefined sane amount
            // int numRecs = rand();
            B += GetByteSizeB(recSize[i], numRecs[i]);
        }
        printf("Done testing.\n");
        return B;
    }

    int main() {
        struct timeval start, middle, end, res;
        long a, b;
    
        for (unsigned int i = 0; i < NumTimes; i++)
        {
            recSize[i] = (rand() + 1) / 32 - 1; // limit to some predefined sane amount
            numRecs[i] = rand();
        }
    
        gettimeofday(&start, NULL);
        a = TestA();
        gettimeofday(&middle, NULL);
        b = TestB();
        gettimeofday(&end, NULL);
    
        timersub(&middle, &start, &res);
        printf("testA: %ld.%06lds\n", (long)res.tv_sec, (long)res.tv_usec);
        timersub(&end, &middle, &res);
        printf("testB: %ld.%06lds\n", (long)res.tv_sec, (long)res.tv_usec);
    
        // compare a and b so that we're clearly using all of the return values.
        if (a != b) printf("You messed something up.\n");
        return 0;
    }

    So I did a best out of 5 for each. Curiously...


    Testing A 100000000 times
    Done testing.
    Testing B 100000000 times
    Done testing.
    testA: 7.998337s
    testB: 5.020133s

    But the really bizarre shit is, inlining GetByteSizeB resulted in a best time of 5.034771s for B.

     

    Paid by KLOC?

    FTFY



  •  If the function were inlined and passed a constant for the size (like sizeof(somestruct)), a good optimizing compiler would optimize out the compares and branches, figure out the size, and take the path that does the correct << if applicable. (See the sketch below.)

    Of course, if num_recs were unsigned, a good optimizing compiler would also recognize when rec_size is a power of two and just do a shift anyway.

    (Why it would only do it for an unsigned num_recs is left as an exercise for the reader.)
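
    Something like this sketch (struct and function names made up) is what "inlined and passed a constant" could look like; with sizeof known at compile time, a decent optimizer can evaluate the switch itself:

    struct rec { char payload[16]; };   /* hypothetical 16-byte record */

    static inline long GetByteSize(short rec_size, long num_recs)
    {
        switch (rec_size) {
            case 1:  return num_recs;
            case 2:  return num_recs << 1;
            case 4:  return num_recs << 2;
            case 8:  return num_recs << 3;
            case 16: return num_recs << 4;
        }
        return num_recs * rec_size;
    }

    long TotalBytes(long num_recs)
    {
        /* rec_size is the compile-time constant sizeof(struct rec), so the
           compares and branches can be folded away, leaving just the shift
           (or a multiply, which costs about the same on modern CPUs). */
        return GetByteSize(sizeof(struct rec), num_recs);
    }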



  • @dwilliss said:

    Of course if num_recs was unsigned, a good optimizing compiler would do also recgonize if rec_size is a power of two and just do a shift anyway.

    Given the performance considerations of multiply versus shift on modern processors, I doubt it [1].  That's kind of why this is a WTF.

    Just in case you missed it:

    @tgape, speaking for Dudehole said:


    So I did a best out of 5 for each. Curiously...

    Testing A 100000000 times
    Done testing.
    Testing B 100000000 times
    Done testing.
    testA: 7.998337s
    testB: 5.020133s

    [1] It's possible there's a recent processor out there where it would do this - but it's not one of the ones I have.

