500 KB saved is 500 KB earned



  • Now that's not really a wtf. That's actually a pretty clever way of reducing memory allocation calls, which, for a game, is very useful. While reviewing some code from the other team (which I thought we might reuse for our project), I bumped into this:

    It's code for a class height_map that reads the heights for a 512x512-unit area. The data in the file is stored as 16-bit integers, which are then converted to floating-point heights.

    bool height_map::read_heights(float* out_heights)
    {
        char* data = reinterpret_cast<CHAR*>(out_heights + AREA/2);
        _file.read(data, AREA*sizeof(short));
        short* iheights = reinterpret_cast<SHORT*>(data);

        for(int i=0; i<AREA; ++i)
        {
            *out_heights++ = float(*iheights++)*H_FACTOR;
        }

        return true;
    }

    For the non-C++ folks out there: a short is a 16-bit integer, a float is a 32-bit floating-point type. This code uses the 1 MB array it needs to fill with heights to also hold the unconverted data from the file. Since a short is half the size of a float, the raw data fits in the upper half of the array. That's clever memory usage :)
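    To make the arithmetic concrete, here is a minimal, self-contained sketch of the same in-place trick (the AREA and H_FACTOR values, and the memcpy standing in for _file.read, are assumptions for illustration, not the original class):

    #include <cstring>

    static const int AREA = 512*512;                // 512x512 heightmap samples
    static const float H_FACTOR = 1.0f/30000.0f;    // sample scaling, made up for the sketch

    void convert_in_place(float* out_heights, const short* file_data)
    {
        // AREA floats occupy 4*AREA bytes; AREA shorts need only 2*AREA bytes,
        // so the raw samples fit exactly in the upper half of the output buffer.
        short* staged = reinterpret_cast<short*>(out_heights + AREA/2);
        std::memcpy(staged, file_data, AREA*sizeof(short)); // stands in for _file.read

        // The float write cursor advances 4 bytes per step, the short read cursor 2,
        // and the read cursor starts 2*AREA bytes ahead, so it is never overtaken.
        for(int i=0; i<AREA; ++i)
            *out_heights++ = float(*staged++)*H_FACTOR;
    }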



  • With all the memory and caching issues and whatnot: are you sure that regular block-wise reading, converting from the block as you go, wouldn't be faster? Have you profiled it?



  • Hint: your selected forum editor interprets <something> as an HTML tag, and promptly deletes it. This tends to make code, such as yours, invalid.



  •  The perfect example of premature optimisation.

    You can be sure that the simpler, less space-efficient version is much faster, much easier to code, and much less prone to errors; and, after the compiler has had its way, probably smaller as well.



  • Actually I tried it. It turns out that on an Intel Core Duo @ 2.13 GHz the tricky version is slightly faster, by about 50 ms per 100 calls, or 0.5 ms per call. I suspect that on a single-core CPU, or on a CPU without streaming extensions, the margin will be greater. Granted, nowadays this doesn't buy much, but when it was written (2004) it probably wasn't that premature.

    Here is my test. Try it for yourselves:

    #include <iostream>
    #include <cstdlib>   // srand, rand
    #include <cstring>   // memcmp
    #include <duration/Duration.h> //http://www.codeproject.com/KB/cpp/duration.aspx

    // Fill the buffer with pseudo-random "file" data. The fixed seed makes
    // every call produce the same sequence, so both readers get identical input.
    void read_data(short* outdata, size_t shorts_count)
    {
        srand(50);

        for(size_t i=0; i<shorts_count; ++i)
        {
            *outdata++ = short(rand());
        }
    }

    static const size_t HM_AREA = 512*512; //heightmap size
    static const int DENOM = 30000; //sample value

    // Simple version: separate temporary buffer allocated from the heap.
    void read_simple(float* out)
    {
        short* sdata = new short[HM_AREA];
        short* sptr = sdata;
        read_data(sdata, HM_AREA);

        for(size_t i=0; i<HM_AREA; ++i)
        {
            *out++ = float(*sptr++)/DENOM;
        }

        delete[] sdata;
    }

    // Tricky version: stage the shorts in the upper half of the output buffer.
    void read_tricky(float* out)
    {
        short* sdata = reinterpret_cast<short*>(out + (HM_AREA>>1));
        read_data(sdata, HM_AREA);

        for(size_t i=0; i<HM_AREA; ++i)
        {
            *out++ = float(*sdata++)/DENOM;
        }
    }

    int main()
    {
        using namespace std;

        float* hm1 = new float[HM_AREA];
        float* hm2 = new float[HM_AREA];

        CDuration timer;

        timer.Start();
        for(int i=0; i<100; ++i)
            read_simple(hm1);
        timer.Stop();

        cout << "Simple timing: " << timer.GetDuration()/1000 << " milliseconds" << endl;

        timer.Start();
        for(int i=0; i<100; ++i)
            read_tricky(hm2);
        timer.Stop();

        cout << "Tricky timing: " << timer.GetDuration()/1000 << " milliseconds" << endl;

        cout << "Sanity check: " << (memcmp(hm1, hm2, HM_AREA*sizeof(float)) == 0 ? "ok" : "wtf") << endl;

        delete[] hm1;
        delete[] hm2;

        return 0;
    }

     

    It outputs:


    Simple timing: 1278.71 milliseconds
    Tricky timing: 1221.91 milliseconds

    I tried it 5 or 6 times; the 50 ms difference is stable.



  • Your test isn't quite comparing apples with apples. You're basically profiling the new operator. To make the test fairer, try using the stack instead of a costly heap allocation.

    Try replacing:

     short* sdata = new short[HM_AREA];

    with:

     short sdata[HM_AREA];

    And don't forget to remove:

     delete[] sdata;
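
    Putting the suggestion together, a stack-based variant of read_simple from the test above might look like this sketch (note the assumption that the thread stack can hold the 512 KB array, which default stack sizes don't always guarantee):

     // Hypothetical stack-based variant; reuses HM_AREA, DENOM and read_data
     // from the test program above.
     void read_simple_stack(float* out)
     {
         short sdata[HM_AREA];            // 512*512 shorts = 512 KB on the stack
         short* sptr = sdata;
         read_data(sdata, HM_AREA);

         for(size_t i=0; i<HM_AREA; ++i)
         {
             *out++ = float(*sptr++)/DENOM;
         }
     }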



  • That's not applicable, because although in my sample HM_AREA can be determined at compile time, in the actual code it can't. And yes, the whole point of the operation is to save the 'new' call.



  •  I don't really see the point of using shorts. They're much slower than ints. Sure, you do save a bit of memory, but with most systems out there being 32-bit, extra work needs to be done to ensure that the short behaves like a short.



  • The point of using shorts is to be able to read them with a single "read" call. If ints were used the whole read process would have been a lot more tedious and ultimately slower.



  • @wraith said:

    The point of using shorts is to be able to read them with a single "read" call.
    bool height_map::read_heights(float* out_heights)
    {
        char* data = reinterpret_cast<char*>(out_heights);
        _file.read(data, AREA*sizeof(DWORD));
        DWORD* iheights = reinterpret_cast<DWORD*>(data);

        for(int i=0; i<AREA; ++i)
        {
            *out_heights++ = float(*iheights++)*H_FACTOR;
        }

        return true;
    }

    Am I missing something here? I see only one "read" call. :P (Unless for hysterical reasons the on-disk heightmap must be shorts.)



  • Yep. For disk-space reasons the heightmap files use shorts. Instead of 1 GB, they take 500 MB.



  •  @wraith said:

    @ActionMan said:
    Your test isn't quite comparing apples with apples. You're basically profiling the new operator. To make the test fairer, try using the stack instead of a costly heap allocation.

    Try replacing:

     short* sdata = new short[HM_AREA];

    with:

     short sdata[HM_AREA];

    And don't forget to remove:

     delete[] sdata;

    That's not applicable, because although in my sample HM_AREA can be determined at compile time, in the actual code it can't. And yes, the whole point of the operation is to save the 'new' call.

    In an actual example, could an upper bound for HM_AREA be determined at compile time? If so, you could still use stack-based storage, or pre-allocate some heap storage for temporary use (both acceptable ways of removing the new call).

    I think it would be interesting to profile these 3 approaches (#1 your method, #2 temp buffer on the stack, #3 pre-allocated temp buffer on the heap) to see which performs better. Depending on the architecture there could actually be different performance results between these 3 solutions...
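
    For reference, approach #3 might look something like this sketch (the buffer name and its one-time setup are assumptions for illustration; HM_AREA, DENOM and read_data are reused from the test program above):

    // Hypothetical pre-allocated scratch buffer, reused by every call instead
    // of allocating and freeing a temporary on each read.
    static short* g_scratch = new short[HM_AREA];   // allocated once, kept around

    void read_prealloc(float* out)
    {
        read_data(g_scratch, HM_AREA);
        const short* sptr = g_scratch;

        for(size_t i=0; i<HM_AREA; ++i)
        {
            *out++ = float(*sptr++)/DENOM;
        }
    }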



  • Actually the preallocated buffer is the first thing that comes to mind. And it will probably be the fastest, because of read/write-only memory optimizations, but nobody likes to have a 2 MB buffer standing there and being used only once in a while. :)

    With modern computers and memory sizes I guess this approach would be better, because if your game uses 1 GB of memory these 2 MB will indeed be negligible.

    Once again hardware makes us go against our sense and will as programmers. Oh, woe is us, poor souls, trapped between common sense and hyper-threading and hardware caching... and... um... I got a little carried away here.

    Bottom line is that I really liked the original approach. My colleagues did show a little lateral thinking. :)



  •  Wait a minute.

    How is the data aligned in modern computers?

    I remember that in Pascal you could align short types (8-bit) to 16-bit length for the processor's comfort.

    I'm not sure you actually win 500 KB at all.

