Debugging multithreaded issue

Kermos

Working on a multi-writer & reader lock-free stack implementation. For the most part, it works great when I allow it to leak memory all over the place.

In my test case I have 2 threads simultaneously popping values onto the stack and 2 threads popping them back off. In the end, I take the values that the 2 threads popped off and compare them against what they should have popped off and it all matches perfectly, well minus the mem leak.

Now the issue is, when popping off a node from the stack can't immediately free it, another thread may still be referencing the same node. One solution to the issue is to use a free node list. So when the node is popped off the thread, instead of being deleted, it is being put into another thread-safe container that stores that now unused node. When it becomes time to push a value back onto the stack, instead of allocating a new node, can just grab one from the free node list. As an added bonus this helps keep memory allocations down as well.

Mostly this is working but despite all writes being done via atomic operations, somewhere there is a race condition that is causing things to go all bad. Somehow eventually a node ends up pointing to itself (the stack is implemented as a singly linked list).

Any debug output I put in there adds processing time to the thread which makes the race condition go away and as a result 'solves' the problem. So by trying to LOOK at the issue, I make it go away!

Any suggestions? :)

fyjham

Simply leave the debug code in when you go live.

Just kidding, I'd normally just say lock the linked list, but it sounds like you're trying to achieve the opposite.

If adding debug code prevents debugging the only remaining course of action is simply to look at it logically. My guess would be that 2 threads try to add it to the end of the list at the same time, one gets added then the next one tries to put it on the end of the list which now consists of itself. Is the adding process thread-safe or? Other than that you could try putting assert (or your language's equivalent) statements in on the commands that change the linked list so you can find out which one is causing the issue, wouldn't have as much impact as debug output...

arty

First off, if you're on a *nix-ish OS, you might try valgrind with the ``helgrind'' tool.

You might give qemu's userland emulation a try, having it wait for a gdb connection (-g port) when you start it. You'll be able to do instruction accurate debugging without disturbing much.

On windows, I've always had good luck with good ol' windbg. You might try, instead of doing console or file handle output, using OutputDebugString. Generally, it won't disturb timing as much.

Kermos

@fyjham said:

Simply leave the debug code in when you go liv

Sadly I know a guy who would do almost just that. His solution to this problem after seeing this would be to just add a delay...he loves fixing things with delays somehow (of course, it ultimately just results in a WTF).

@fyjham said:

Just kidding, I'd normally just say lock the linked list, but it sounds like you're trying to achieve the opposite.

Yea, it's supposed to be lock-free. Using locks like Mutexes and such doesn't scale terribly well as the # of threads go up.

The adding process is supposed to be thread-safe, but obviously isn't else I wouldn't be seeing this issue. I think that the very thing that I'm trying to fix by having this free node list is what may also be causing the problem.

I just did have an idea though how to rewrite some of my code a bit simpler and better and will do that first thing in the morning. Maybe that'll help, at the very minimum it should simplify my debugging.

fyjham

Yeah, more what I meant was if you can identify the point where your code definitively decides to add something to the end of the list (The point after which it won't check if it's already there again) and the point where it gets it's reference to the final node in the list (Which I presume are close together - otherwise you're in a world of pain). Once you narrow down where the thread must be when the other one changes the data you can simply freeze a single thread at that point and see whether the other threads wait for it or whether they update accordingly (Unless you think the check and the adding are an atomic step - which if you're not using mutexes I'm not sure how they would be).

Kermos

@fyjham said:

(Unless you think the check and the adding are an atomic step - which if you're not using mutexes I'm not sure how they would be).

Actually, they *are* atomic, or else lockless would not be possible. It's commonly referred to as Compare-and-swap or CAS.

It's a special instruction that roughly does the following in one atomic hardware operation:

int cas(int*destination, int currentValue, int newValue)
{
    if(*destination == currentValue)
    {
        *destination = newValue;
        return 1;
    }
    return 0;
}

This is available in either 32-bit or 64-bit length. Here's a wikipedia on it:

http://en.wikipedia.org/wiki/Compare-and-swap

There are a couple of other instructions like this as well, fetch-and-add for instance, all of which are atomic.

This allows me to do the following inside my stack class for instance:

do
{
    // Point node->next to what currently is the top element on the stack
    node->next = head->next;
    // Use CAS to set head->next pointer to the new node element
}while(!cas(head->next, node->next, node));

This will ONLY update head->next if head->next has *not* changed and the previous assignment still holds true. If the topmost element has changed, the loop runs again, gets the new topmost element pointer and retries.

And like I mentioned earlier, I tested this and it works great. I had two threads pushing and popping almost 20 million items on the stack with no issues as long as I didn't try to delete the topmost node after popping it off the stack. The reason I can't delete it is because another thread may still be referencing the node in a loop as you see above causing a crash on the node->next access.

Two ways around the problem:

1. Reference counting to avoid deleting the node too early, though that comes at a major cost of performance. If I push 20 million items, I have 20 million memory allocs and frees, not to mention refcount overhead.

2. Don't delete the node, put it in an empty node list instead. This even increases performance as it drastically reduces the number of required memory allocations. Assuming a perfect world, technically speaking 1 memory allocation would be sufficient if the item is immediately consumed by another thread after being pushed. It also solves the issue if another thread is still referencing the node as the memory pointer remains valid, but the node is no longer in use. The cas however should fail because head->next has now changed. I suspect that this area is where the issue is happening.

fyjham

Yeah, that's an atomic assignment, but you're still making the non-atomic choice (at least in your example) to add node to the list.

I mean, are 2 threads able to have hold of node at the same time before it gets added to the stack? And what stops them both adding it? I mean sure, the add would be threadsafe, but the second add would be setting it's next to head->next, which would be itself. Sorta need to be positive that head->next != node before you enter that loop, and need to be sure of it in an atomic way with the actions of the loop, which would be quite a challenge. Unless of course node isn't shared between multiple threads before it enters the stack?

Kermos

@fyjham said:

Yeah, that's an atomic assignment, but you're still making the non-atomic choice (at least in your example) to add node to the list.
I mean, are 2 threads able to have hold of node at the same time before it gets added to the stack? And what stops them both adding it? I mean sure, the add would be threadsafe, but the second add would be setting it's next to head->next, which would be itself. Sorta need to be positive that head->next != node before you enter that loop, and need to be sure of it in an atomic way with the actions of the loop, which would be quite a challenge. Unless of course node isn't shared between multiple threads before it enters the stack?

I think after sleeping on it for one night I have figured out what is going on. You're correct that two or more threads can get a hold of the same node at the same time. That however, is and isn't a problem.

If I always allocate a new node when I need it and don't worry about immediately deleting nodes when I pop then off the stack (either just let them leak or offload them to a delete-later list, also an option) then it works fine. The reason being, even if two threads get a hold of the same top node and try to push a value at the same time, each thread is pushing a new pointer onto the stack which changes head->next and causes the other thread's loop to re-run with new data. In that case, it is no issue. My test case confirms this.

When I use my free-node list however, this is what can happen:

We'll call stack's head node, node A.
Thread 1: Pops node A off stack, pushes it onto free list.
Thread 2: Currently is trying to push a node onto the stack and is referencing the same node A while Thread 1 is popping.
Thread 3: Right after Thread 1 pops a node off but BEFORE Thread 2 pushes a new node, it requests a new node from the free list and receives node A as that was just pushed onto the list. It now proceeds to go ahead and push this node onto the stack.
Thread 2: CAS *incorrectly* succeeds because it expects Node A, sees Node A, but it isn't the *same* node A anymore.

ABA problem - Wikipedia

That essentially, or something like that, is what is happening. What confirms this is that the code works correctly when I run my test case, including free-list, with only a single reader and single writer. The moment I have multiple readers or writers, the above can and does occur. One solution generally is to add a 'counter' to the pointer that is incremented with each use. That way when the pointer is re-used, it has a different counter value and the above can't happen (Thread 2 would fail as it would expect the previous counter value not the new one and correctly re-run it's loop). This works great for 32-bit apps as CAS uses 64-bit. However, with 64-bit apps and 64-bit pointers, this becomes a bit more difficult. I'll either have to use double-cas or just squeeze a counter into the upper bits of the pointer that are unused. Don't wanna make a 640kb is enough statement here but I really I don't think we'll see machines with 2^64 bytes of memory anytime soon...this is of course assuming all the various operating systems currently leave some of the upper pointer bits free for me to do this, will have to check that.

flop

@Kermos said:

One solution generally is to add a 'counter' to the pointer that is incremented with each use. That way when the pointer is re-used, it has a different counter value and the above can't happen (Thread 2 would fail as it would expect the previous counter value not the new one and correctly re-run it's loop). This works great for 32-bit apps as CAS uses 64-bit. However, with 64-bit apps and 64-bit pointers, this becomes a bit more difficult. I'll either have to use double-cas or just squeeze a counter into the upper bits of the pointer that are unused. Don't wanna make a 640kb is enough statement here but I really I don't think we'll see machines with 2^64 bytes of memory anytime soon...this is of course assuming all the various operating systems currently leave some of the upper pointer bits free for me to do this, will have to check that.

The wikipedia page you linked to says to use the lower bits (which should be save as long as your pointer are aligned to at least 4 bytes).

It might be not so good cache-wise but instead of an "free stack" (as you seem to use, as read by "push" and "pop") you could use a real "free list", and make sure that the number of (free) items on it is at least the number of CPUs (or threads) working with it ... then such collisions might be a non-issue again.

Kermos

@flop said:

The wikipedia page you linked to says to use the lower bits (which should be save as long as your pointer are aligned to at least 4 bytes).
It might be not so good cache-wise but instead of an "free stack" (as you seem to use, as read by "push" and "pop") you could use a real "free list", and make sure that the number of (free) items on it is at least the number of CPUs (or threads) working with it ... then such collisions might be a non-issue again.

However, lower bits on 4 byte alignemnt is only 2 bits. Not a real lot to work with...

That is why 32-bit implementations use the upper 32-bit of the cas value as a counter. That makes the tag value so ridiculously large that the chances of a collision occuring are essentially none and it's the same reason why I'm leaning towards using the upper bits instead of lower bits of the pointer. For one, don't need to worry about alignment and it gives me more bits (hopefully).

I've also considered using a list as you suggested for the free list instead of a stack. You're correct in that, if I allow it to grow to a size larger than the # of threads before removing nodes from it again that it should solve the problem. In a perfect world anyway, in reality I am a bit worried about the problem thread being pre-empted long enough for the very node it is referencing to come back around despite a list instead of a stack being used. As the # of threads scales up, this is more likely to occur. I suppose making the list large enough helps avoid that though.

I was just looking at the various operating system limits though and it seems that Vista tops out at 16TB and Linux at 128TB. So going by that I would be safe using the upper 16-bits as a tag. A little reluctant to use double cas as apparently some older AMD processors don't support it.

Kermos

Well I seem to have a running implementation now, just how reliably is a question that yet remains to be seen.

The free node list is now implemented as a singly linked list rather than a stack. It's setup to create brand-new nodes while its internal list is below a certain threshold. Only when it goes past that threshold does it start reusing nodes from it's head.

As I was afraid of though, if I set the threshold low enough, things go really bad because another thread may hold a reference long enough for the same node to be returned too early. At that point in time, things can go bad in all sorts of ways, ranging from corrupted data to endlessing spinning threads to crashes. If I set the threshold high enough, everything is happy. I suppose though that the exact same thing would happen with tagging the nodes if the # of bits is too small and the counter rolls over. Currently not sure which approach would be faster performance-wise, but I'm liking my current approach as it is 32/64-bit agnostic and doesn't rely on certain bits in pointers being free. Long-term, definitely a safer solution as long as it turns out to be reliable.

For the fun of it, here's my output from my test case which has 2 threads each pushing 10 million values and 2 threads each popping 10 million values:

Starting push threads
Starting pop threads
Push Complete
Pop Complete in 14.849 seconds which is 673446 items per second. Checking data for errors...
Error Check Complete with 0 errors found
Alloc: 8845997 Reuse: 11154003

The timing value is relatively meaningless right now as most of the overhead is spent storing data in hashtables for the error check at the end, plus tracking allocation / reuse count also isn't exactly free as it requires atomic increments. If I remove all that and just check raw performance it's about 3 seconds on average for 20 million items. Plenty fast enough for my needs. :)

flop

@Kermos said:

I was just looking at the various operating system limits though and it seems that Vista tops out at 16TB and Linux at 128TB. So going by that I would be safe using the upper 16-bits as a tag. A little reluctant to use double cas as apparently some older AMD processors don't support it.

I think you're missing some point here ... maybe that's the maximum RAM that can be used, but you could get any virtual address for your allocations, especially with address randomization techniques (which are employed more and more for added security).

On a 32bit machine, with a 2GB:2GB address split (kernel:userspace), you can only guarantee a single upper bit to be free for your tagging uses ... and that's even less than what you'd get for 4-byte alignment. (If you have some structure there, you could easily go to 16, 32 or even 64bytes alignment - which would help).

And on a 64bit machine you might not have much more room, too ... but there the 64bit ops should be fine performance-wise.

For your 'free' list you could try locking (or at least marking) the to-be-acquired slot, so that you can detect (and handle) duplicate usage - and resort to a simple alloc, which should give some really unused memory.

Maybe you could take some hints from RCU; the wiki page has a lot of references to documentation.

Kermos

@flop said:

I think you're missing some point here ... maybe that's the maximum RAM that can be used, but you could get any virtual address for your allocations, especially with address randomization techniques (which are employed more and more for added security).

Even addresses pointing into kernel space? Anything in user space should be in the lower half of the virtual address space and therefore have the upper bits free. I think this is actually one of the reasons, from what I read on an MSDN blog somewhere, why windows keeps its address space at 40 bits only: So that they have some tag bits to use in their interlocked slist.

However, I do agree with you. Using the upper bits isn't something I'm completely happy with either. What I may just do is force 8-byte alignment of the pointers and that'll give me an extra bit at the bottom even on 32-bit systems.

Thanks for all the comments, I do appreciate it. Nice to talk about this with someone that actually undestands it and doesn't look at me like a deer staring into a car's oncoming headlights with glazed over eyes. :)

PJH

@Kermos said:

@flop said:
I think you're missing some point here ... maybe that's the maximum RAM that can be used, but you could get any virtual address for your allocations, especially with address randomization techniques (which are employed more and more for added security).

Even addresses pointing into kernel space? Anything in user space should be in the lower half of the virtual address space and therefore have the upper bits free. I think this is actually one of the reasons, from what I read on an MSDN blog somewhere, why windows keeps its address space at 40 bits only: So that they have some tag bits to use in their interlocked slist.

The closest thing I've seen with restricting bits on MSDN is Mr Chen's series of articles starting with http://blogs.msdn.com/oldnewthing/archive/2008/08/27/8898863.aspx but that's only dealing with HANDLEs, and the bits concerned are at the bottom of the number:

Kernel handles are always a multiple of four; the bottom two bits are available for applications to use. But why would an application need those bits anyway?

The short answer is extending the handle namespace. The long answer will take a few days to play out. [...]

But we'll start with a warm-up. If you need some sentinel values for a HANDLE, you need to make sure your chosen sentinel value will never conflict with a valid HANDLE value.

Kermos

@PJH said:

The closest thing I've seen with restricting bits on MSDN is Mr Chen's series of articles starting with http://blogs.msdn.com/oldnewthing/archive/2008/08/27/8898863.aspx but that's only dealing with HANDLEs, and the bits concerned are at the bottom of the number:

Yea, the thing I saw was also on some msdn blog but it was in the discussion part of the blog between a bunch of people. Figure though there is some reason why Windows restricts the address space the way it does so there might be some merit to it.

What I think I'm going to end up doing I've decided is to most likely use Hazard pointers instead. Been reading up on them for the past few days and letting them simmer in sponge between my ears for a while. The concept is pretty easy really but one thing I kept fighting with at first was how to give each thread it's own hazard pointer list as the container classes have no reference to the QThread object they are executing on (using QT API). Then I just smacked myself in the head this morning, QtThread::currentThread(). *sigh* Yea, I have idiot moments sometimes.

Now all's I gotta do is subclass QThread, call it like HazardThread or something, override it's currentThread() too to return a HazardThread* instead of a QTHread* for convenience so I'm not casting all over the place and I'm done. Create a hazardpointer container which it being single writer multi-reader isn't too terribly hard. ABA problem solved and delete problem solved.

Better solution than tags as far as I'm concerned, regardless of if the tags are in the lower or upper bits. Tag bits aren't 100% secure either as they can and do roll-over. On a 32-bit implementation, I suppose one is fairly safe with a 32-bit tag. On a 64-bit implementation however without a 128-bit CAS, it gets tough as we've discussed. :)

I really enjoy this actually, this is actually something that challenges my mind a little bit and gets me to learn new things and improve my abilities. Much better than adding yet another frigging button to some form!

tgape

@Kermos said:

t's a special instruction that roughly does the following in one atomic hardware operation

Is this restricted to run on only one processor core?

If not, what's to stop it from running on two different processor cores on the same cycle? Just because it's atomic on a uni-processor, doesn't mean that it's truly atomic.

Admittedly, that's the stuff of the exceedingly intermittent bugs that plague, well, a lot of software, really. Depending on the size of the routine, the processor speed, the number of cores, it could take weeks or even months to encounter just once on a test system. It could, however, still cause some major issues.

I only ask, because this is normally the sort of thing one mentions when one indicates how one knows that something is truly happening atomically, and I didn't see you mention it.

Kermos

@tgape said:

Is this restricted to run on only one processor core?

If not, what's to stop it from running on two different processor cores on the same cycle? Just because it's atomic on a uni-processor, doesn't mean that it's truly atomic.

On Intel/AMD architectures, the LOCK instruction makes it atomic. It basically in assembly works out to the following:

LOCK
CMPXCH8B m64

Where m64 is the operand that is compared to the memory at address EDX:EAX and if equal, the memory at that address is set to the value in ECX:EBX

By running LOCK prior to CMPXCH8B, the processor is told that the next instruction needs exclusive access of the memory it is addressing. The processor can then either assert a lock signal to tell other processors / cores about this (which makes it work in multi-processor situations) or if it isn't in cache already, simply read that memory into its cache. If it does the latter, atomic operation is guaranteed by the cache coherency mechanism. The LOCK instruction can actually be used with a bunch of different commands to allow atomic adds, subtracts, increment, decrement, fetch and add, fetch and set etc.

For a more detailed description you can check the Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 2A: Instruction Set Reference A-M

So the issue with lock-free containers isn't really the atomic part of it. That part is pretty much guaranteed.

The issue is memory management. I've gone as far as implementing hazard pointers and that works. However, performance-wise on a quad-core system, the overhead for maintaining and dealing with the hazard pointers is so high that performance gain is negilible. From benchmarks that I've seen, once you get to 8 cores and more, that's when benefits start to show but that isn't something I really need to worry about yet. Even the new processors are stilly really only quad-core with hyperthreading sprinkled on top.

When I find the time I may give pointer tagging a try and see how it works out in practice, the challenge there being finding bits to stuff the tag into on a 64-bit architecture. May try some form of pointer compression possibly, but there again i'm afraid that the overhead of doing so may exceed current performance benefits until we get more cores.

tgape

@Kermos said:

@tgape said:
Is this restricted to run on only one processor core?

If not, what's to stop it from running on two different processor cores on the same cycle? Just because it's atomic on a uni-processor, doesn't mean that it's truly atomic.

On Intel/AMD architectures, the LOCK instruction makes it atomic.

That's great. Just making sure you had your bases covered, and I thought you were going for a

@Kermos said:

multi-writer & reader lock-free stack implementation

and thus wouldn't be using that lock stuff. My bad.