Why we get deadlocks

snoofle

In the bowels of our main class (pseudocode):

while (true) {
  // read message from queue
  try {
    // create db lock for records to be processed
  } catch (Exception e) {
  }

  ...

  if (thisIsAProductionRun()) {
    // process the data
    // 1000 lines of intervening code
    if (success) {
      // release the db locks
    }
  }
}

And then they wonder why every day or so they need to bounce the cluster.

Sutherlands

That's a resource leak. It's not a memory leak, but it is a resource leak.
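
For illustration, here is a minimal sketch of one way to close that kind of leak, assuming the db lock can be wrapped in an AutoCloseable handle. The names RecordLock, heldLocks, and handle are hypothetical and not from the original code, and the in-memory set only stands in for the real lock table. The point is that the release happens on every exit path, not just the successful production one.

```java
import java.util.HashSet;
import java.util.Set;

class LockLeakSketch {
    /** Hypothetical stand-in for the db lock table; not from the original code. */
    static final Set<String> heldLocks = new HashSet<>();

    /** Hypothetical lock handle: constructing it takes the lock, close() always releases it. */
    static final class RecordLock implements AutoCloseable {
        private final String key;

        RecordLock(String key) {
            this.key = key;
            heldLocks.add(key);
        }

        @Override
        public void close() {
            heldLocks.remove(key);
        }
    }

    /** Handles one message; returns true only if it was processed successfully. */
    static boolean handle(String key, boolean productionRun, boolean failProcessing) {
        try (RecordLock lock = new RecordLock(key)) {
            if (!productionRun) {
                return false; // dry run: try-with-resources still releases the lock
            }
            if (failProcessing) {
                // simulated failure somewhere in the 1000 intervening lines
                throw new IllegalStateException("simulated processing failure");
            }
            return true;
        } catch (IllegalStateException e) {
            // the lock has already been released by the time this block runs
            return false;
        }
    }
}
```

Whether the real lock can be modelled this way depends on how it is created in the database, which the thread doesn't show.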

derula

@snoofle said:

while (true) {
  // read message from queue
  try {
    try {
      // create db lock for records to be processed
    } catch (Exception e) {
      // retry to create the db lock
    }
  } catch (Exception e) {
  }
  ...

There, fix'd.

Ilya_Ehrenburg

Like.

Are they at least consistent enough to refuse to let you fix it?

snoofle

@Ilya Ehrenburg said:

Like.
Are they at least consistent enough to refuse to let you fix it?

They originally didn't want to risk rocking the boat, but after a flurry of db reboots during busy-time, they relented. I fixed and tested it in about an hour and they're stress testing it now. I suggested that, in addition to running the good cases, they also attempt to create a deadlock with a few forced failures by planting bad data (they're going to have a meeting to discuss IF it's appropriate). They would like to deploy by Mar 31 next year, if we can get it done with confidence by then. Mmmm-K.
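
To illustrate the forced-failure idea, here is a rough sketch that mirrors the control flow of the original loop and reports which locks are still held after a batch. Everything in it is an assumption for illustration: runLeaky is a hypothetical name, the set stands in for the db lock table, and a message starting with "bad" plays the role of planted bad data.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class ForcedFailureSketch {
    /**
     * Mirrors the original control flow: a lock is taken for every message,
     * but released only after a successful production run.
     * Returns the locks that are still held once the batch is done.
     */
    static Set<String> runLeaky(List<String> messages, boolean productionRun) {
        Set<String> held = new HashSet<>();
        for (String msg : messages) {
            held.add(msg); // create db lock for the record
            if (productionRun) {
                boolean success = !msg.startsWith("bad"); // planted bad data fails
                if (success) {
                    held.remove(msg); // release the db lock
                }
            }
        }
        return held; // anything left in here has leaked
    }
}
```

Feeding it one bad message in a production run leaves exactly that record locked, and a non-production run leaves every record locked, which is the condition a stress test with planted bad data would need to provoke and check for.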

zelmak

@snoofle said:

... They would like to deploy by Mar 31 next year, if we can get it done with confidence by then. Mmmm-K.

No rush I see.

gobes

Blah, rebooting a server takes what? 10, 20 minutes? Not 5% of office hours. If they can't work without a working database, where is the world going?

snoofle

@gobes said:

Blah, rebooting a server takes what? 10, 20 minutes? Not 5% of office hours. If they can't work without a working database, where is the world going?

It's not so much the reboot; that happens in under 2 minutes. It's the reloading of the master caches that takes 2 hours (and yes, that too is a WTF).

boomzilla

@gobes said:

Blah, rebooting a server takes what? 10, 20 minutes? Not 5% of office hours. If they can't work without a working database, where is the world going?

Thank you for self identifying as TRWTF.