debugging a crash in someone else's code

ben_lubar

I have a plugin for a third-party, closed-source program. It crashes on Windows and Wine for a specific input, but not on the native Linux version.

The crash looks like this:

Unhandled exception: page fault on read access to 0x0000005b in 32-bit code (0x00b76e68).
Register dump:
 CS:0023 SS:002b DS:002b ES:002b FS:0063 GS:006b
 EIP:00b76e68 ESP:02c4dbf0 EBP:15a3c7a0 EFLAGS:00010286(  R- --  I S - -P- )
 EAX:e8a796be EBX:5cd1c0d8 ECX:00000007 EDX:00000000
 ESI:00000000 EDI:00000000
Stack dump:
0x02c4dbf0:  1b048e90 02c4dca8 e08e8a72 5d46f388
0x02c4dc00:  1b049f78 5d585858 e8a796be 00000004
0x02c4dc10:  00000000 00000009 ffffffff ffffffff
0x02c4dc20:  00000000 5d61a518 5d61a51c 5d61a51c
0x02c4dc30:  5a63f000 7bcbe000 5d61a4e8 5d61a4ec
0x02c4dc40:  5d61a4ec 00000000 02c4dde8 15a3cc28
Backtrace:
=>0 0x00b76e68 in [redacted] (+0x776e68) (0x15a3c7a0)
0x00b76e68: movl	0x54(%ecx),%edx

How would I go about figuring out what's causing the crash? Both the program and the plugin are insanely complex.
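
As a sanity check on the dump above, the faulting instruction and the registers already account for the bad address: `movl 0x54(%ecx),%edx` reads from ECX + 0x54, and ECX holds 7, which is not a plausible pointer. The sketch below only redoes that arithmetic. The image base is inferred from the backtrace line (EIP minus the `+0x776e68` offset); the dump does not state it directly.

```python
# Values copied from the register dump and backtrace above.
ecx = 0x00000007          # ECX at the time of the fault
displacement = 0x54       # from: movl 0x54(%ecx),%edx
eip = 0x00B76E68          # address of the faulting instruction
module_offset = 0x776E68  # "(+0x776e68)" in the backtrace line

# The instruction reads 4 bytes at ECX + 0x54.
fault_address = ecx + displacement

# Inferred, not stated in the dump: where the module is loaded.
image_base = eip - module_offset

print(f"fault address: {fault_address:#010x}")  # 0x0000005b
print(f"image base:    {image_base:#010x}")     # 0x00400000
```

The computed fault address matches the `0x0000005b` in the exception message, so whatever ECX was supposed to hold (probably an object pointer, given the fixed 0x54 offset) had been replaced by the small integer 7 by the time this instruction ran.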

PJH

Raymond's in the middle of a series on debugging; it starts here: https://blogs.msdn.microsoft.com/oldnewthing/20160608-00/?p=93615

Anything there of use?

blakeyrat

@ben_lubar Do you have the source to the plugin? Or is it also closed-source?

pydsigner

@ben_lubar Is there a specific reason you're not reporting this to the plugin author(s) and waiting for them to fix it?

PJH

@pydsigner said in debugging a crash in someone else's code:

@ben_lubar Is there a specific reason you're not reporting this to the plugin author(s) and waiting for them to fix it?

It's made by CDCK Inc.? ‹/snark›

ben_lubar

@blakeyrat said in debugging a crash in someone else's code:

Do you have the source to the plugin?

Yes, it's here: https://github.com/BenLubar/df-ai

blakeyrat

@ben_lubar Just a couple days ago you told me that wasn't a plugin, it was just a DLL loader hack. Plugin implies the program it's modifying has some sort of API.

So you know which action of yours causes the crash, does that help narrow it any?

ben_lubar

@blakeyrat said in debugging a crash in someone else's code:

So you know which action of yours causes the crash

That's the problem: I don't, because as far as I've been able to determine, the crash doesn't happen after any specific function call in my plugin.

blakeyrat

@ben_lubar Does it happen if your plugin is not involved at all?

ben_lubar

@blakeyrat said in debugging a crash in someone else's code:

@ben_lubar Does it happen if your plugin is not involved at all?

No, but my plugin is driving all the input to the program, so it's very unlikely that it would not be related to my plugin.

blakeyrat

@ben_lubar Right; but can you replicate the input without the plugin being involved?

What I'm getting at here is, did you find a bug in their code that occurs at a specific input, or do you have a bug in your code that stomps all over a data structure somewhere?

ben_lubar

Ok, so someone in #dfhack introduced me to cl-linux-debug's browse-addr function.

The EBX register points to a pile of camel fat. The EBP register points to the RENDER_FAT reaction. So at this point I'm pretty sure something changed in the data structure for kitchens.

Tsaukpaetra

@ben_lubar said in debugging a crash in someone else's code:

So at this point I'm pretty sure something changed in the data structure for kitchens.

Well at least it's not changed in the data structure for raisins?

ben_lubar

Ok, I'm able to reproduce the crash on a fresh install of the program with default plugins and an empty dfhack.init file. I can't reproduce the crash with all plugins unloaded.

Edit: reported:

crash on render fat · Issue #943 · DFHack/dfhack

To reproduce: Construct a kitchen and a butcher shop. Mark the pack animals for butchering. Wait until fat is rendered. This happens on Windows and Wine, but not on native Linux. The crash does not...

fbmac

@ben_lubar said in debugging a crash in someone else's code:

Ok, so someone in #dfhack introduced me to cl-linux-debug's browse-addr function.

The EBX register points to a pile of camel fat. The EBP register points to the RENDER_FAT reaction. So at this point I'm pretty sure something changed in the data structure for kitchens.

it took me a while to accept you meant what you said

sloosecannon

@fbmac said in debugging a crash in someone else's code:

@ben_lubar said in debugging a crash in someone else's code:

Ok, so someone in #dfhack introduced me to cl-linux-debug's browse-addr function.

The EBX register points to a pile of camel fat. The EBP register points to the RENDER_FAT reaction. So at this point I'm pretty sure something changed in the data structure for kitchens.

it took me a while to accept you meant what you said

Clearly new to Dwarf Fortress, I see

Matches

@ben_lubar can you not just add an unhandled exception handler so that fatal application errors can go to the handler where you've conveniently added writing the error stack trace to disk, which includes a line number and file of the offending code?

It's like 8 lines of code.

ben_lubar

@Matches said in debugging a crash in someone else's code:

@ben_lubar can you not just add an unhandled exception handler so that fatal application errors can go to the handler where you've conveniently added writing the error stack trace to disk, which includes a line number and file of the offending code?

It's like 8 lines of code.

I'm sure that's possible to do with SEGFAULTs on executables with no symbols.

drurowin

@ben_lubar said in debugging a crash in someone else's code:

Ok, so someone in #dfhack introduced me to cl-linux-debug's browse-addr function.

The EBX register points to a pile of camel fat. The EBP register points to the RENDER_FAT reaction. So at this point I'm pretty sure something changed in the data structure for kitchens.

I believe I've just had a minor seizure. Anyway, since you have the memory map, why not just directly update the variable that contains the output of the render_fat reaction? Also, consider converting to ntfs. It's easier to render camel ntfs than camel fat.

ben_lubar

@drurowin said in debugging a crash in someone else's code:

the variable that contains the output

Yeah, I'll just put the result of the play_a_video_game function into the variable. That makes sense.

drurowin

@ben_lubar Well doesn't it like give you some other raw material? Just bump up the raw material or resource you need.

ben_lubar

@drurowin said in debugging a crash in someone else's code:

@ben_lubar Well doesn't it like give you some other raw material? Just bump up the raw material or resource you need.

It's not running my code when it crashes. In fact, it's not running any part of the code I can touch when it crashes. Something somewhere in DFHack is corrupting some value that eventually causes the rendering of fat to dereference an invalid pointer.

dkf

@ben_lubar Bad pointers can be a complete ass to hunt down. Is it possible to run things with a memory debugging tool like valgrind or efence? Those can tell you a great deal even without source, though it helps if you've got a test case that can trigger the problem rapidly as they've got a lot of overhead. (I don't know the state of availability on Windows, and I try to make my own code not require techniques like that, but they're very good indeed when you need them…)

cvi

@dkf said in debugging a crash in someone else's code:

I don't know the state of availability on Windows, and I try to make my own code not require techniques like that, but they're very good indeed when you need them…

There's Dr. Memory, which is supposed to have similar features to the vanilla valgrind (i.e., not cachegrind). I don't have as much experience with it as with valgrind, though, so YMMV.

But, yeah, one of these could tell you whose memory is being stomped on, and possibly how. After that... GLHF.

dkf

@cvi said in debugging a crash in someone else's code:

cachegrind

BTW, that's a very nice tool despite being a PITA to work with. It's possible to chisel away quite a bit of performance trouble with the help of the detailed metrics it produces.

drurowin

@ben_lubar said in debugging a crash in someone else's code:

@drurowin said in debugging a crash in someone else's code:

@ben_lubar Well doesn't it like give you some other raw material? Just bump up the raw material or resource you need.

It's not running my code when it crashes. In fact, it's not running any part of the code I can touch when it crashes. Something somewhere in DFHack is corrupting some value that eventually causes the rendering of fat to dereference an invalid pointer.

So just don't run the code that renders fat, just update the variable that render_fat updates with whatever value you need it to have. Using a Minecraft example, if you know the address that stores the number of Dark Oak Wood Planks you have, and the code to craft Dark Oak Wood Planks from Dark Oak Wood Logs crashes, why not just directly update the Dark Oak Wood Planks value and skip the crashing code?

ben_lubar

@drurowin said in debugging a crash in someone else's code:

So just don't run the code that renders fat

It's an [AUTOMATIC] reaction, so it doesn't get run by my code.

@drurowin said in debugging a crash in someone else's code:

why not just directly update the Dark Oak Wood Planks value and skip the crashing code?

Because I don't feel like intercepting a thing that's being intercepted and crashing because of the interception is a good idea.

Anyway, the workaround is to unload the eventful plugin.

OH GOD WHY

ben_lubar

Checking linux vtables:
VTable size mismatch: active_script_varst (active_script_varst) - expected 7, found 1
VTable size mismatch: script_varst (script_varst) - expected 1, found 0
VTable size unchecked: ui_build_selector
VTable size unchecked: renderer
Checking windows vtables:
Argument size mismatch: 0043e6f0 item(itemst)::addImprovementFromJob #86 - expected 44, found 52 bytes.
Argument size mismatch: 00b6c7b0 reaction_product(reaction_productst)::produce #2 - expected 36, found 44 bytes.
VTable size unchecked: layer_object
VTable size unchecked: active_script_varst
VTable size unchecked: build_req_choicest
VTable size unchecked: script_varst
VTable size unchecked: ui_build_selector
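
The two "Argument size mismatch" lines only appear in the Windows half of the output. A small sketch that pulls the numbers out of those two lines (copied from above) and computes the difference; the regular expression and the `parse_mismatches` helper are made up for this example and are not part of DFHack's checker, and reading the number after `#` as a vtable index is an assumption.

```python
import re

# The two Windows-only lines from the checker output above.
LOG = """\
Argument size mismatch: 0043e6f0 item(itemst)::addImprovementFromJob #86 - expected 44, found 52 bytes.
Argument size mismatch: 00b6c7b0 reaction_product(reaction_productst)::produce #2 - expected 36, found 44 bytes.
"""

# Hypothetical pattern written for this example only.
PATTERN = re.compile(
    r"Argument size mismatch: (?P<addr>[0-9a-f]+) (?P<method>\S+) #(?P<index>\d+)"
    r" - expected (?P<expected>\d+), found (?P<found>\d+) bytes\."
)

def parse_mismatches(text):
    """Return (method, index, expected bytes, found bytes, difference) tuples."""
    rows = []
    for m in PATTERN.finditer(text):
        expected, found = int(m["expected"]), int(m["found"])
        rows.append((m["method"], int(m["index"]), expected, found, found - expected))
    return rows

mismatches = parse_mismatches(LOG)
for method, index, expected, found, diff in mismatches:
    # In 32-bit code each stack argument slot is 4 bytes wide.
    print(f"{method} #{index}: off by {diff} bytes ({diff // 4} stack slots)")
```

Both methods are off by the same 8 bytes, which in 32-bit code is two 4-byte stack slots. One of them is `reaction_product::produce`, which would fit a crash that shows up while a reaction such as RENDER_FAT is producing its output, though that link is a guess rather than something the checker output proves.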

Adynathos

@ben_lubar Maybe the binaries have been compiled with different compilers?

drurowin

@ben_lubar said in debugging a crash in someone else's code:

@drurowin said in debugging a crash in someone else's code:

So just don't run the code that renders fat

It's an [AUTOMATIC] reaction

Patch the image in memory so the whole reaction is replaced with the 0x90 (NOP) opcode.

@drurowin said in debugging a crash in someone else's code:

why not just directly update the Dark Oak Wood Planks value and skip the crashing code?

Because I don't feel like intercepting a thing that's being intercepted and crashing because of the interception is a good idea.

You don't necessarily even need to go to the trouble of gathering the fat and ending up at the point where you run into the offending code anyway. Just update whatever resources you need manually if you have the memory map.