The update lost the race to the cleanup crew



  • We have stale cache data left in the DB apparently. It happens after removing a property from a widget.

    I can't reproduce. So I walk over to the reporter. He can't reproduce either. No wait he can. It's intermittent.

    var queryResult = db.Widgets.update(selector, theDirtyUpdate, reportErrors);
    db.Widgets.update(selector, theCleanup);
    

    :headdesk: What happens here is that the first update runs asynchronously because it is passed the function reportErrors as a callback. The two updates race each other, and if theCleanup wins, theDirtyUpdate will not be cleaned up until the next property is removed.

    :wtf: BUT WHY? There was a time when I didn't understand what that callback parameter to .update() would do exactly. I thought it was just a more convenient alternative to checking the result. So bonus points for forgetting to remove the unused queryResult in there.

    Where else did I use this racy pattern? I can find a few instances of reportErrors used like that, and some of those instances could be race conditions of their own. Fuck.
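
    A minimal sketch of the footgun, with a stand-in update() (hypothetical, mimicking the Mongo-style API): without a callback it runs synchronously; with one, the work is deferred to a later tick, so the two calls above can complete in either order.

    ```javascript
    // Stand-in for the Mongo-style collection API (hypothetical): passing a
    // callback defers the work to a later tick; omitting it runs synchronously.
    const order = [];

    function update(selector, modifier, callback) {
      if (callback) {
        setImmediate(() => { order.push(modifier); callback(null, 1); });
        return; // no result in async mode -- queryResult above was always useless
      }
      order.push(modifier);
      return 1;
    }

    // The racy pattern from the post:
    update({}, 'theDirtyUpdate', function reportErrors() {});
    update({}, 'theCleanup');
    // Right after both calls, only the synchronous one has run:
    // order is ['theCleanup']; 'theDirtyUpdate' lands on a later tick.

    // The fix: sequence the cleanup inside the first update's callback.
    function removeProperty() {
      update({}, 'theDirtyUpdate', function reportErrors(err) {
        if (err) return console.error(err);
        update({}, 'theCleanup'); // guaranteed to run after the dirty update
      });
    }
    ```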


  • Discourse touched me in a no-no place

    @gleemonk said:

    Where else did I use this racy pattern? I can find a few instances of reportErrors used like that, and some of those instances could be race conditions of their own. Fuck.

    Guess it's time to audit your code. Race conditions are hell to find via debugging; you're going to have to approach them from the other side and check for each damn place that might possibly have them and change the code so that you can prove that there is no race.

    Assume that the computer is an evil swine and will spring a trap on you when dealing with code races. It's not quite true, but it puts you in the right frame of mind…


  • I survived the hour long Uno hand

    This is 80% of the problems I have when writing Node.js code. Usually it shows up as my tests returning before the method under test does, because I forgot it takes a callback.



  • @dkf said:

    Guess it's time to audit your code. Race conditions are hell to find via debugging; you're going to have to approach them from the other side and check for each damn place that might possibly have them and change the code so that you can prove that there is no race.

    Assume that the computer is an evil swine and will spring a trap on you when dealing with code races. It's not quite true, but it puts you in the right frame of mind…

    I just spent more than a day writing a cache updater that I assume will always settle on the last version even when started in parallel. Unlike the atomic updates used in the code above, in this one I couldn't avoid reading and then writing. The trick I ended up with is to just run the process repeatedly until the DB stops reporting modifications. That's much simpler than the counting scheme I'd devised initially.

    So I was in the right mindset when the problem popped up. The bad news is that this failure is rather benign compared to other possible races. It was easy to find too because the racing lines were right next to each other. I can see a few "distributed" races which will only show rarely under heavy load. :unamused: I hope I can transform some of those read-write cycles into atomic updates without additional complexity.
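
    The run-until-settled scheme described above can be sketched like this (names hypothetical; the real updater would be asynchronous and report something like Mongo's nModified count):

    ```javascript
    // Run the cache update repeatedly until the DB reports no modifications.
    // updateStaleEntries is a hypothetical function returning how many
    // documents the pass modified (akin to Mongo's nModified).
    function settleCache(updateStaleEntries) {
      let productivePasses = 0;
      while (updateStaleEntries() > 0) {
        productivePasses += 1;
      }
      return productivePasses + 1; // count the final pass that found nothing to do
    }

    // Example: a fake updater that needs two productive passes before settling.
    const pending = [3, 1, 0];
    const passes = settleCache(() => pending.shift());
    // passes === 3: two passes modified something, the third confirmed quiescence
    ```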



  • I recently came across some race conditions. I'd schedule something to happen in the background, but the transaction that created the DB row the task depended on would sometimes take too long and not be committed before the background task started, so the background action threw exceptions and stopped.

    Of course, you go back to look, and everything is there, so :wtf:? I think I just added something like a 30-second delay (which should be way more than enough).


  • kills Dumbledore

    Couldn't you have the background task explicitly wait for the row to be committed? Waiting an arbitrary amount of time is a bit of a code smell (not that I haven't ever done it myself...)



  • Yes, that was probably a better solution, but if it's taking that long there are probably bigger issues. I admit it was the lazy way out. Even waiting for the row to show up amounts to practically the same thing, since you'd only want to check back so many times... at some point, you'd have to assume the transaction got rolled back instead of committed, so the row is never going to show up.
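
    The explicit wait could look something like this (a hedged sketch; findRow and the limits are hypothetical), giving up after a bounded number of attempts on the assumption that the transaction was rolled back:

    ```javascript
    // Poll for the committed row with a cap, instead of a fixed 30-second sleep.
    // findRow is a hypothetical lookup returning the row or null.
    async function waitForRow(findRow, { attempts = 30, intervalMs = 1000 } = {}) {
      for (let i = 0; i < attempts; i += 1) {
        const row = findRow();
        if (row) return row;
        await new Promise(resolve => setTimeout(resolve, intervalMs));
      }
      // Bounded retries: past this point, assume the transaction rolled back.
      throw new Error('row never appeared; giving up');
    }
    ```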



  • Ugh, that reminds me of a nut I still have to crack: Change tracking on an oracle database.

    I have a couple of SQL queries, and I need a reasonably effective proxy for 'has the result changed?'. If the result changes I need to regenerate a bunch of transformed on-disk artifacts, which can be expensive.

    I've currently got something that examines the query plan to determine the source tables for the queries, and tracks the modification time for those tables. However, the modification time is remembered in a trigger which runs out-of-transaction (because deadlocks). Transactions have been getting longer on the source side, and we're starting to run into race conditions.


  • Impossible Mission Players - A

    This reminds me of some code I had to clean up recently. Apparently, a previous developer thought it was a good idea to .destroy() something and then .close() it just after. Somehow this worked fine, because .destroy() was, I guess, an asynchronous call that took a while to actually do its work.

    The fun part is that there was no need to destroy the object at all, since it would be re-created later anyway.
    The only reason this consistently didn't fail was that the library was slow enough for it not to matter. Then it got optimized, and people wondered why the app kept crashing when they tried to submit things...


  • Discourse touched me in a no-no place

    @PleegWat said:

    Change tracking on an oracle database.

    Set up a trigger that notifies you when the underlying data might have changed? You'll still need to requery when things are complex, but at least it might save you a bunch of work…

    And yes, you need the DBA on board with this. ;)



  • I've currently got a trigger (which updates a modification time in a dedicated table), but it runs out of transaction, which makes it unreliable. I'm thinking of moving it in-transaction, but then I need a better signalling mechanism. The consumer is an external binary which generates files.

    The only thing I can think of currently is inserting into a table which can then track modifications, but I'm afraid of uncontrolled growth in that tracking table. And I'm unsure how best to track read/unread notifications if I've got multiple consumers.
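
    One way to keep such a tracking table from growing, sketched here in plain JavaScript (all names hypothetical): keep a single 'last modified' watermark per source table, plus a 'last seen' watermark per consumer and table, instead of one row per notification.

    ```javascript
    // Per-consumer watermarks: O(tables) + O(consumers x tables) entries,
    // regardless of how many modifications occur. All names hypothetical.
    const lastModified = new Map(); // table -> timestamp (written by the trigger)
    const lastSeen = new Map();     // `${consumer}:${table}` -> timestamp

    function recordModification(table, ts) {
      lastModified.set(table, Math.max(lastModified.get(table) || 0, ts));
    }

    function pendingTables(consumer) {
      const stale = [];
      for (const [table, ts] of lastModified) {
        if ((lastSeen.get(`${consumer}:${table}`) || 0) < ts) stale.push(table);
      }
      return stale;
    }

    function markSeen(consumer, table) {
      lastSeen.set(`${consumer}:${table}`, lastModified.get(table) || 0);
    }
    ```

    Each consumer only ever advances its own watermark, so read/unread state for multiple consumers falls out for free.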


  • Discourse touched me in a no-no place

    @PleegWat said:

    which makes it unreliable.

    But it stops the trigger from angering the main DB client owners with extra processing time for something that's really only of use to you and not them. If updates are pretty frequent, it doesn't matter if you miss one, as the next one will let you catch up. If updates are infrequent, you could almost think about switching to a regular batch job.

    Keeping everything exactly accurate all the time has a significant cost.



  • You know how it is, PM wants everything and it's not allowed to cost anything.



  • @Tsaukpaetra said:

    Only reason why this consistently didn't fail was that the library was slow enough not to matter. Then it got optimized,

    :laughing: Classic.



  • @boomzilla said:

    ...at some point, you'd have to imagine the transaction got rolled back instead of committed, so it's never going to show up.

    I think @Jaloopa was referring to a background task that's always there waiting for rows that are missing the 'task_was_run' flag. That way you don't even need to schedule anything during the transaction. It will happen automatically when the transaction is committed (and not when it's rolled back). Of course this requires a field in the DB to track the status, as well as a watching process. A timeout is certainly easier :-)

    I quite like the facilities I get with the Node framework we use. I can just tell it to watch for additions of documents fulfilling a certain condition, like having empty cache fields, then run the updates from those callbacks. Of course, the only reason we need those updaters is that we have to duplicate data that would be an easy join on an RDBMS.
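
    The watch-for-empty-cache-fields idea amounts to something like this (a toy in-memory sketch; this is not the real framework's API, and all names are hypothetical):

    ```javascript
    // Toy version of an observe-style updater: every inserted document that
    // matches a predicate is passed to a callback, which fills in the cache
    // field that would otherwise be a join on an RDBMS.
    function makeCollection(predicate, onMatch) {
      const docs = [];
      return {
        insert(doc) {
          docs.push(doc);
          if (predicate(doc)) onMatch(doc); // the "watcher" firing
          return doc;
        },
        docs,
      };
    }

    const widgets = makeCollection(
      doc => doc.cachedOwnerName === undefined,
      doc => { doc.cachedOwnerName = doc.owner.toUpperCase(); }
    );

    const w = widgets.insert({ owner: 'gleemonk' });
    // w.cachedOwnerName === 'GLEEMONK'
    ```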



  • @gleemonk said:

    I think @Jaloopa was referring to a background task that's always there waiting for rows that are missing the 'task_was_run' flag.

    Oh. I see. Well, technically, it was a background task that noticed there was a task to be done in the background.

    @gleemonk said:

    It will happen automatically when the transaction is committed (and not when it's rolled back).

    Yeah, I didn't realize that requesting the task wouldn't happen before the current transaction finished. In retrospect, it kind of makes sense. In fact, the task scheduling bit seems to have a separate DB connector.


  • Discourse touched me in a no-no place

    @PleegWat said:

    You know how it is, PM wants everything and it's not allowed to cost anything.

    I know just what the PM wants!
    http://www.cambridgepm.co.uk/wp-content/uploads/2013/10/Cambridge-publishing-management-moon-on-a-stick0ea52.png

