On Uninitialized Pointers

phaedrus

I am not the hero of this tale. I am merely the only actor--my own enemy and my own victim. Which is good, as I would not have wanted to inflict this upon anyone.

My company makes an embedded device, and we need it to interface with an embedded device made by multiple other companies. I'm sorry about being vague, but it's unannounced and in its early stages. Enough so that there is only a draft spec for the protocol we're all using (it runs over USB). As you can guess, the code below is heavily anonymized.

So, I got asked to write an emulator for our developers to test against while developing the software. It's been a fun project. It runs on our Linux workstations and it's written in C. Anyway, on Monday, I spent my afternoon debugging an amazing land-mine of a segfault. I'd planted it myself via copy/pasting some code from another function within the program.

The interesting code looked like this:

void send_crypto_key(void) {
    uint8_t *buf;
    uint8_t *b;
    pkt_t *pkt; // internal representation of the packet
    int i;
    msg_t msg;
    key_msg_t key;

    // [ ... snip ... ]
    // The packet to be sent gets setup in the msg and key
    // locals here. They are packed structs representing
    // the bitfields in the packet for the wire protocol.

    // copy msg and key into buf
    buf = malloc(sizeof(msg_t) + sizeof(key_msg_t));
    assert(buf);
    b = buf;

    memcpy(b, &msg, sizeof(msg_t));
    b += sizeof(msg_t);
    memcpy(b, &key, sizeof(key_msg_t));

    // now that we have buf with the data for the wire, we
    // store it in the pkt_t *pkt struct, which then gets
    // shipped over to the 'dispatcher' thread which handles
    // communication with the hardware:

    pkt->tag = TAG_KEY_MESSAGE;
    pkt->payload = buf;
    pkt->len = sizeof(msg_t) + sizeof(key_msg_t);

    // Enqueue the pkt_t to the dispatcher queue so it gets
    // sent over the wire:

    pkt_enqueue(pkt);
    // don't free() buf or pkt, the dispatcher will do that.
}

So, my program would get to this function and segfault. My normal reaction to a segfault is to fire up gdb and get a back trace. I always compile with debugging symbols on, so I was shocked when I saw this:

$$ gdb ./emulator
(gdb) run
...
Program received signal SIGSEGV, Segmentation fault.
0x00000084 in ?? ()
(gdb) bt
#0 0x00000084 in ?? ()
#1 0x00000108 in ?? ()
#2 0x000011b0 in ?? ()
#3 0x0000a000 in ?? ()
(gdb)

I then reran the program (actually, I did this about eleventy billion times, I was so confused). I set a breakpoint and reran the function and stepped into the segfault. It happened right at the end of the send_crypto_key() function, every time. And the stack on that thread would be so completely fubar that gdb couldn't backtrace it for me.

So, asuffield, have you figured it out yet?

And now, for the spoiler (You'll want to be familiar with how stack frames are setup, see [url=http://en.wikipedia.org/wiki/Call_stack]the wikipedia article on Call Stacks[/url]):

Did you see what I forgot? Yep. I didn't copy these two lines of code:

pkt = malloc(sizeof(pkt_t));
assert(pkt);

There are two other problems with this chunk of code which set me up for this monster of a land mine. First and foremost, notice the local variable declarations. I didn't initialize my pointers to NULL. If I had done that, the segfault would have occured on this line:

pkt->tag = TAG_KEY_MESSAGE;

And I'd have gotten a backtrace with line numbers and have been done with this in five minutes.

The key is that my code is littered with these memcpy() calls:

memcpy(b, &msg, sizeof(msg_t));

Where I'm taking a pointer to a local variable and using it as a parameter to a function call. Local variables and function parameters both get stored on the stack, thus, my stack page was littered with pointers back into the stack. My luck was such that every time I ran the program, pkt didn't hold garbage, but a pointer aimed right at the activation record for send_crypto_key(). The address 0x00000084 in the backtrace happens to be exactly equal to sizeof(msg_t) + sizeof(key_msg_t). And it was that value every time I ran the program. pkt->len dereferenced to my return address.

[url=http://xkcd.com/371/]I leave you with today's XKCD. I'm sorry computer. I fixed it as fast as I could.[/url]

PJH

@phaedrus said:

[...]
    buf = malloc(sizeof(msg_t) + sizeof(key_msg_t));
    assert(buf);
[...]
    pkt = malloc(sizeof(pkt_t));
    assert(pkt);

I see two more WTF's. And they'll only happen in non-debug code. I can only hope you don't use this idiom in production code.

asuffield

@phaedrus said:

So, asuffield, have you figured it out yet?

Saw it when I eyeballed the code after reading that it segfaulted. But that kind of output from gdb always means you smashed the stack (the key is in the hex numbers that you're used to ignoring: the first one is not anywhere near where the stack goes - learn the address ranges for your system). The fastest way to turn that into a line number is usually to run the program under valgrind, which will spit it right out for most of them (if it's available on your platform). The second fastest is to recognise the value on the first line of the trace (sometimes it's obvious). Failing that, what you do is to scatter fprintf(stderr, "1\n"), fprintf(stderr, "2\n"), (etc) calls through the code until you get it down to one function, then single-step it - it's most likely going to crash on either a function call or return somewhere, once you find it the rest is simple. With a little practice you can do a binary-chop search, and trace even the most extremely outsized codebases by moving one printf call and iterating the build/test sequence log2 n times for n lines of code (usually less than 10 iterations).

I have lost count of the number of times somebody has brought me one of these. Everybody tells students how to read a stack trace, nobody tells them what it means when the stack trace fails. Yours is a relatively common variation (in x86 C code, the addresses of stack variables are what you most frequently find on the stack - integer arithmetic tends to stay in registers, x86 floating point has its own stack, and people rarely put anything else in stack variables).

asuffield

@PJH said:

@phaedrus said:
[...]
    buf = malloc(sizeof(msg_t) + sizeof(key_msg_t));
    assert(buf);
[...]
    pkt = malloc(sizeof(pkt_t));
    assert(pkt);
I see two more WTF's. And they'll only happen in non-debug code.

Who actually bothers using -DNDEBUG any more?

Grimoire

@PJH said:

@phaedrus said:
[...]
    buf = malloc(sizeof(msg_t) + sizeof(key_msg_t));
    assert(buf);
[...]
    pkt = malloc(sizeof(pkt_t));
    assert(pkt);
I see two more WTF's. And they'll only happen in non-debug code. I can only hope you don't use this idiom in production code.

<sarcasm>But if you didn't run out of memory in Debug mode, why would you run out of memory in Release?</sarcasm>

dtech

If I were you I would make amends to your computer.

[url=http://www.xkcd.com/371/][img]http://imgs.xkcd.com/comics/compiler_complaint.png[/img][/url]

(don't forget to read the title text on the comic's images)

PJH

@asuffield said:

@PJH said:
@phaedrus said:
[...]
    buf = malloc(sizeof(msg_t) + sizeof(key_msg_t));
    assert(buf);
[...]
    pkt = malloc(sizeof(pkt_t));
    assert(pkt);
I see two more WTF's. And they'll only happen in non-debug code.
Who actually bothers using -DNDEBUG any more?

Assuming there was no sarcasm intended in that comment, I do.

512MB for the OS and applications on the embedded devices I'm dealing with soon fills up.

Regardless, gracefully dealing with memory failures is, I would have thought, preferable to "I give up," and simply quitting in most situations.

Lingerance

@dtech said:

If I were you I would make amends to your computer.
snip
(don't forget to read the title text on the comic's images)

Why did you bother posting the image when the link is already in the OP (presuming OP == "opening post").

GalacticCowboy

Or "original"... same difference.

belgariontheking

make pretend

@Lingerance said:

(presuming OP == "opening post").

I always thought it was "Original Poster" like "Original Gangster"

phaedrus

@PJH said:

@phaedrus said:
[...]
    buf = malloc(sizeof(msg_t) + sizeof(key_msg_t));
    assert(buf);
[...]
    pkt = malloc(sizeof(pkt_t));
    assert(pkt);
I see two more WTF's. And they'll only happen in non-debug code. I can only hope you don't use this idiom in production code.

Yes. I am being lazy. No, this is not production code.

dtech

@Lingerance said:

@dtech said:
If I were you I would make amends to your computer.
snip
(don't forget to read the title text on the comic's images)
Why did you bother posting the image when the link is already in the OP (presuming OP == "opening post").

I saw a minute ago. Read right over it in the OP. Just read the start of the message and immediately thought of today's xkcd...

Sorry :(

phaedrus

@PJH said:

@asuffield said:
@PJH said:
@phaedrus said:
[...]
    buf = malloc(sizeof(msg_t) + sizeof(key_msg_t));
    assert(buf);
[...]
    pkt = malloc(sizeof(pkt_t));
    assert(pkt);
I see two more WTF's. And they'll only happen in non-debug code.
Who actually bothers using -DNDEBUG any more?
Assuming there was no sarcasm intended in that comment, I do.
512MB for the OS and applications on the embedded devices I'm dealing with soon fills up.
Regardless, gracefully dealing with memory failures is, I would have thought, preferable to "I give up," and simply quitting in most situations.

I want to know what you do about error returns on malloc(). I've never been able to come up with anything besides fail and die. The embedded systems I work with (which have 32 to 256MiB of RAM) reboot when they run out of memory. That design decision had been in place for a long time by the time I started. The Linux kernel's best is to randomly kill processes, which is hardly graceful.

phaedrus

@dtech said:

@Lingerance said:
@dtech said:
If I were you I would make amends to your computer.
snip
(don't forget to read the title text on the comic's images)
Why did you bother posting the image when the link is already in the OP (presuming OP == "opening post").
I saw a minute ago. Read right over it in the OP. Just read the start of the message and immediately thought of today's xkcd...
Sorry :(

'saright. Great minds think alike, although, I've posted a nasty gaff, so maybe you don't want me comparing you to me.

LoztInSpace

It's been a long time since I did C, but why did you not get an error along the lines of "uninitialised variable being used" (pkt)?

asuffield

@PJH said:

@asuffield said:
Who actually bothers using -DNDEBUG any more?
Assuming there was no sarcasm intended in that comment, I do.
512MB for the OS and applications on the embedded devices I'm dealing with soon fills up.

A call to assert() is probably no larger than any other error handling technique. It's about four instructions plus the string, on most platforms.

Regardless, gracefully dealing with memory failures is, I would have thought, preferable to "I give up," and simply quitting in most situations.

It's really very hard to gracefully deal with memory allocation failures - there's not a whole lot you can do about it other than freeze the application and whinge at the user for help. Most of the time it isn't worth bothering, since you have to make the application crash-safe anyway - if you arrange things so that a crash never loses significant data, then crashing is a good enough solution for the majority of applications.

And that's all assuming that you aren't running on a regular linux system in the default overcommit mode or something similar, where malloc only returns NULL when you're out of virtual address space, not physical memory. If you're targeting platforms like that, it's a waste of time bothering.

Obfuscator

@PJH said:

@phaedrus said:
[...]
    buf = malloc(sizeof(msg_t) + sizeof(key_msg_t));
    assert(buf);
[...]
    pkt = malloc(sizeof(pkt_t));
    assert(pkt);
I see two more WTF's. And they'll only happen in non-debug code. I can only hope you don't use this idiom in production code.

Why not use assert? Assert is designed to make assertions. The option to disable assertions is only available so that you can sacrifice safety for speed. If you use your own assert, you're doing the exact same thing, but you remove the option to turn it off. That's just ridiculous.

PJH

@Obfuscator said:

@PJH said:
@phaedrus said:
[...]
    buf = malloc(sizeof(msg_t) + sizeof(key_msg_t));
    assert(buf);
[...]
    pkt = malloc(sizeof(pkt_t));
    assert(pkt);
I see two more WTF's. And they'll only happen in non-debug code. I can only hope you don't use this idiom in production code.
Why not use assert?

Why not, you know, not bomb out (with debug enabled) or crash unpredictably (without,) and die gracefully when malloc() returns 0?

Obfuscator

@PJH said:

@Obfuscator said:
@PJH said:
@phaedrus said:
[...]
    buf = malloc(sizeof(msg_t) + sizeof(key_msg_t));
    assert(buf);
[...]
    pkt = malloc(sizeof(pkt_t));
    assert(pkt);
I see two more WTF's. And they'll only happen in non-debug code. I can only hope you don't use this idiom in production code.
Why not use assert?
Why not, you know, not bomb out (with debug enabled) or crash unpredictably (without,) and die gracefully when malloc() returns 0?

I can imagine that being a requirement in some high-availability programs, but generally I would be against such a thing. In order to be productive, you need to set a limit on what you assume can go wrong (of course depending on the specific project). There are always things that can go wrong, and maintaining a healthy balance between what you take into account and what you consider being runtime errors is one of the most important design decisions in my eyes. It affects a lot in a project.

I was, however, mainly referring to the use of assert to guard invariants, postconditions etc, which I think is a good thing. Of course, using assert and turning it of could cause your program to run in an undefined state, but I'm arguing that sometimes you're willing to take that risk to gain performance , and that's exactly why there's an option to turn them off. That's also why I think that you shouldn't turn them of too early, since that would be premature optimization.

Some people that I've discussed this with argue that assert should only be used for expensive debug checks, and that you should always check for invariants and post conditions with other means than assert, in order to always fulfill the contract of the affected function. I haven't really made up my mind about that, but it would be nice to hear peoples opinions on it.

operagost

OT: Today's XKCD

After looking at the rather topical XKCD, I clicked on the RANDOM button. I was greeted with this comic: http://xkcd.com/221/

The irony is that when I clicked the RANDOM button again, I saw comic 221 again.

I LOLed.