I am not the hero of this tale. I am merely the only actor--my own enemy and my own victim. Which is good, as I would not have wanted to inflict this upon anyone.
My company makes an embedded device, and we need it to interface with embedded devices made by several other companies. I'm sorry about being vague, but the product is unannounced and in its early stages, so early that there is only a draft spec for the protocol we're all using (it runs over USB). As you can guess, the code below is heavily anonymized.
So, I got asked to write an emulator for our developers to test against while developing the software. It's been a fun project. It runs on our Linux workstations and it's written in C. Anyway, on Monday, I spent my afternoon debugging an amazing land-mine of a segfault. I'd planted it myself via copy/pasting some code from another function within the program.
The interesting code looked like this:
void send_crypto_key(void) {
uint8_t *buf;
uint8_t *b;
pkt_t *pkt; // internal representation of the packet
int i;
msg_t msg;
key_msg_t key;
// [ ... snip ... ]
// The packet to be sent gets set up in the msg and key
// locals here. They are packed structs representing
// the bitfields in the packet for the wire protocol.
// copy msg and key into buf
buf = malloc(sizeof(msg_t) + sizeof(key_msg_t));
assert(buf);
b = buf;
memcpy(b, &msg, sizeof(msg_t));
b += sizeof(msg_t);
memcpy(b, &key, sizeof(key_msg_t));
// now that we have buf with the data for the wire, we
// store it in the pkt_t *pkt struct, which then gets
// shipped over to the 'dispatcher' thread which handles
// communication with the hardware:
pkt->tag = TAG_KEY_MESSAGE;
pkt->payload = buf;
pkt->len = sizeof(msg_t) + sizeof(key_msg_t);
// Enqueue the pkt_t to the dispatcher queue so it gets
// sent over the wire:
pkt_enqueue(pkt);
// don't free() buf or pkt, the dispatcher will do that.
}
So, my program would get to this function and segfault. My normal reaction to a segfault is to fire up gdb and get a backtrace. I always compile with debugging symbols on, so I was shocked when I saw this:
$ gdb ./emulator
(gdb) run
...
Program received signal SIGSEGV, Segmentation fault.
0x00000084 in ?? ()
(gdb) bt
#0 0x00000084 in ?? ()
#1 0x00000108 in ?? ()
#2 0x000011b0 in ?? ()
#3 0x0000a000 in ?? ()
(gdb)
I then reran the program (actually, I did this about eleventy billion times, I was so confused). I set a breakpoint and reran the function and stepped into the segfault. It happened right at the end of the send_crypto_key() function, every time. And the stack on that thread would be so completely fubar that gdb couldn't backtrace it for me.
So, asuffield, have you figured it out yet?
And now, for the spoiler (you'll want to be familiar with how stack frames are set up; see [url=http://en.wikipedia.org/wiki/Call_stack]the Wikipedia article on call stacks[/url]):
Did you see what I forgot? Yep. I didn't copy these two lines of code:
pkt = malloc(sizeof(pkt_t));
assert(pkt);
There are two other problems with this chunk of code which set me up for this monster of a land mine. First and foremost, notice the local variable declarations. I didn't initialize my pointers to NULL. If I had done that, the segfault would have occurred on this line:
pkt->tag = TAG_KEY_MESSAGE;
And I'd have gotten a backtrace with line numbers and have been done with this in five minutes.
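For comparison, here's a minimal sketch of what the function should have looked like, with the pointers initialized and the forgotten allocation put back. The struct layouts, the tag value, and the name build_key_packet() are made up for illustration (the real ones are anonymized); the sizes are picked only so that the total comes out to 0x84. The sketch returns the packet instead of calling pkt_enqueue() so that it stands on its own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for the anonymized wire structs. */
typedef struct { uint8_t hdr[4]; }   msg_t;
typedef struct { uint8_t key[128]; } key_msg_t;

/* Hypothetical internal packet representation. */
typedef struct {
    int      tag;
    uint8_t *payload;
    size_t   len;
} pkt_t;

enum { TAG_KEY_MESSAGE = 1 }; /* made-up value */

/* Builds the packet and hands it back; the caller owns pkt and pkt->payload. */
pkt_t *build_key_packet(const msg_t *msg, const key_msg_t *key)
{
    /* Pointers start out NULL, so a forgotten malloc() faults on first use. */
    uint8_t *buf = NULL;
    pkt_t   *pkt = NULL;

    buf = malloc(sizeof(msg_t) + sizeof(key_msg_t));
    assert(buf);
    memcpy(buf, msg, sizeof(msg_t));
    memcpy(buf + sizeof(msg_t), key, sizeof(key_msg_t));

    /* The two lines that were missing from the original. */
    pkt = malloc(sizeof(pkt_t));
    assert(pkt);

    pkt->tag     = TAG_KEY_MESSAGE;
    pkt->payload = buf;
    pkt->len     = sizeof(msg_t) + sizeof(key_msg_t);
    return pkt;
}
```

With pkt starting out as NULL, deleting the malloc() again would make the very first pkt-> access fault right where the mistake is, instead of quietly scribbling somewhere else.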
The second problem, and the key to the whole mystery, is that my code is littered with memcpy() calls like this one:
memcpy(b, &msg, sizeof(msg_t));
Here I'm taking a pointer to a local variable and passing it as a parameter to a function call. Local variables and function parameters both get stored on the stack, so my stack page was littered with pointers back into the stack. My luck was such that every time I ran the program, the uninitialized pkt didn't hold random garbage, but a leftover pointer aimed right at the activation record for send_crypto_key(). The address 0x00000084 in the backtrace happens to be exactly equal to sizeof(msg_t) + sizeof(key_msg_t), and it was that value every time I ran the program. The assignment to pkt->len landed squarely on my saved return address, so when send_crypto_key() returned, the CPU jumped to 0x00000084 and segfaulted there, with the stack already trashed.
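To make the overwrite concrete, here's a small simulation rather than a reproduction of the real crash (actually smashing the stack is undefined behavior and depends on the compiler and platform). It models the stack as an array of 32-bit words. The layout of fake_pkt_t, the slot numbers, and the "good" return address are all assumptions picked for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 32-bit layout of the packet struct. */
typedef struct {
    uint32_t tag;
    uint32_t payload; /* stands in for a 32-bit pointer */
    uint32_t len;
} fake_pkt_t;

enum { FRAME_WORDS = 8, RETURN_SLOT = 5 }; /* made-up frame layout */

/* Simulates `pkt->len = value` where pkt is a stale pointer that
 * happens to point at word `pkt_word` of the simulated stack. */
void store_len_through_stale_pkt(uint32_t *stack, size_t pkt_word,
                                 uint32_t value)
{
    unsigned char *base = (unsigned char *)&stack[pkt_word];
    memcpy(base + offsetof(fake_pkt_t, len), &value, sizeof value);
}

/* Returns the simulated return address after the stale write. */
uint32_t clobbered_return_address(void)
{
    uint32_t stack[FRAME_WORDS] = {0};
    stack[RETURN_SLOT] = 0x08048abcu; /* made-up valid return address */

    /* The stale pkt points just far enough below the return slot
     * that its `len` field overlaps that slot. */
    size_t stale = RETURN_SLOT - offsetof(fake_pkt_t, len) / sizeof(uint32_t);

    /* 0x84 is sizeof(msg_t) + sizeof(key_msg_t) in the real program. */
    store_len_through_stale_pkt(stack, stale, 0x84u);
    return stack[RETURN_SLOT];
}
```

After the simulated store, the return slot holds 0x84 instead of a real code address, which is the same bogus value gdb reported as frame #0.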
[url=http://xkcd.com/371/]I leave you with today's XKCD. I'm sorry computer. I fixed it as fast as I could.[/url]