Patch all the live productions!

MathNerdCNU

Continuing the discussion from DoubleTake vs Clinical Systems:

Hello lovely people of tdwtf,

I have an issue with DoubleTake and know absolutely nothing about this high availability solution.

The situation is the following. I work in the cancer care sector on a radiotherapy treatment system and a customer of ours has, against our advice, put his main oncology database on a server mirrored with DoubleTake. We only validated for clustering. In preparation for a version upgrade the customer tried to perform some OS updates and SQL updates during clinical times while the application was running. We told them not to, but hey... customers. During this process a failover occurred that broke a lot of things within the radiotherapy SQL database.

I spent hours fixing thousands of lines of SQL by hand, which was well... fun.

Now, the original server has been rebuilt by their IT, and Windows updates and patches have been applied to it that are not on the current primary node. The customer would now like to fail back. From my understanding of things, as soon as that happens the second server will be overwritten with a copy from the current master node - however, I might be wrong.

Does anyone know if there is a way to fail back and retain the updates, or if that is even the default behaviour? I spent some time on the line with DoubleTake's official support today and they didn't seem to want to commit to an answer ... likely because an outage of service would mean a serious negative impact on patient care.

However, I would still like to know if it's doable and if so how. That particular client had a string of mistreatments as a result of database damage incurred during the initial failover, so I would like to avoid that particular level of hell spawn for a second time in the interest of the patients.

Anyone with any experience of that software here? I debated asking StackOverflow... but I don't think I'd trust any advice I'd get from there.

I don't even have words to describe the WTFery involved with in-place upgrading patient systems and not having rollback testing/systems/anything in place.

To be clear, the customers are the shit-monkeys that upgraded against advice/testing/any-rational-thought-that-may-have-existed. Fuck me.

swayde

I was going to comment in that thread, but f...
Doing the failover again without backups and plans etc. is just as bad. I'm getting all mad here.

@royal_poet said:

DoubleTake's official support today and they didn't seem to want to commit to an answer

Warning flags all around. How the fuck do they not know this? What fucking level of incompetence is this?! Escalate until you get someone that'll take some fucking responsibility for their damn product...

Weng

Yeah, we use DoubleTake for the DR site failover of our SQL cluster. We've never tried to exercise the round-trip. Strictly one way in our world (which is just as well, because if the original datacenter still physically exists it'd be quicker to fix the problem than to activate the DR processes).

blakeyrat

Come to think of it, what would be the point of purposefully failing-over back to the original DB server? Aren't the two identical? Like, isn't it kind of trivial which one just happens to be the "main" one at any given moment? Let sleeping dogs lie, man.

EDIT: I guess because it's not patched-up. But it seems 10,332,43243254,52352,42 times less risky to just patch the server without doing a fail-over first. Scheduling intentional downtime shouldn't be a big deal, since the unintentional downtime they just did didn't put them out of business. Even though it should have.

ben_lubar

@blakeyrat said:

what would be the point of purposefully failing-over back to the original DB server?

It's disaster recovery, so it might not be on as beefy hardware as the main production server.

royal_poet

You'd think that, except their incompetence made it not so. Certain clinical temp/cache files with extensions not known to their IT don't seem to be copied across. Which leads to certain medical images not being able to be opened.

Yes, it is also dumb product design to not auto-recreate those files when they are not found... but I get why it wasn't coded for. Nobody in their right mind ever messes with the actual storage location for medical images. Well, everyone except that DoubleTake installation.

royal_poet

In this case it is mostly being cheap. The servers are identical. However, they ordered a clinical system version upgrade and the upgrade has certain requirements on the Windows version and SQL version. So rather than waiting for scheduled downtime they thought they'd get this done on the fly - and, well, save the hassle of buying a new server to set up to spec. I mean, why do that and avoid risk to patients when you can go with the same old server you had for the last 5 years? Because obviously when you spend millions on new radiotherapy linear accelerators you want to save the 15000 for a decent up-to-spec server - or something. I guess. Maybe.

Can't claim I fully comprehend these people.

I should really write you a best of clinical WTFery. I see a lot every day.

Cursorkeys

@royal_poet said:

That particular client had a string of mistreatments as a result of database damage incurred during the initial failover

@royal_poet said:

linear accelerators

This is a patient management system, not a system controlling the treatment device? I've got a horrible image of radiation burns in my mind rather than simply missed appointments.

I didn't want to divert your other thread with questions but I'm really curious.

@royal_poet said:

I should really write you a best of clinical WTFery. I see a lot every day.

Please do!

gleemonk

@royal_poet said:

I should really write you a best of clinical WTFery. I see a lot every day.

And have us fear for our lives every time we set foot in a hospital? Count me in.

royal_poet

It's both. It does trivial stuff such as storing patient letters and appointments as well as recording and verifying the treatments.

In this particular case there wasn't an over-treatment, but an under-treatment.

When treating cancer patients - roughly - you take a CT and identify the tumour. You draw the outline of the tumour in something called a planning system and work out how many beams of radiation are good for treating it and what angles these beams should come from. Then you send this data (CT + plan) to the Oncology Information System (my bit). We store the CT and send the treatment information to the linear accelerator.

When the patient comes in for treatment they have a mini CT on the linear accelerator. They take that so they can compare this CT to the initial CT that came with the plan. It's to make sure that the patient lies in the same spot and the planned beams are hitting the tumour and not some unrelated healthy tissue.

Here... the failover caused the database to not record which patient the plan CTs were for when they arrived in the system. It did write some lines, but the lines were malformed and lacked some information.

So the patients came... they tried to treat the patients and the mini CT could not be compared to the plan CT because it was missing. They then could not give the patient any radiation because they could not verify it would hit the correct target.

A missed treatment might not sound so bad at first, but in radiotherapy any treatment that is missed lessens the chance of survival somewhat. To deal with a tumour you need a lot of small doses administered very regularly.

But all things considered these particular clients got kinda lucky... much worse than this could have just as easily happened.
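
For the curious, the kind of check that finds the malformed plan CT rows described above can be sketched roughly like this. This is only an illustration: the table and column names (`patient`, `plan_ct`, `patient_id`, `series_uid`) are made up for the example and are not the real oncology schema, and it uses an in-memory SQLite database rather than SQL Server so it can run anywhere.

```python
import sqlite3

# Hypothetical schema for illustration only; the real oncology database
# layout is not described in this thread.
SCHEMA = """
CREATE TABLE patient (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE plan_ct (
    id INTEGER PRIMARY KEY,
    patient_id INTEGER,   -- should point at patient.id
    series_uid TEXT       -- identifier of the stored CT series
);
"""


def find_malformed_plan_cts(conn):
    """Return ids of plan CT rows with no usable patient link or no series identifier."""
    rows = conn.execute(
        """
        SELECT c.id
        FROM plan_ct AS c
        LEFT JOIN patient AS p ON p.id = c.patient_id
        WHERE p.id IS NULL            -- patient_id is NULL or points at nothing
           OR c.series_uid IS NULL
           OR c.series_uid = ''
        ORDER BY c.id
        """
    ).fetchall()
    return [row[0] for row in rows]


def build_demo_db():
    """Create an in-memory database with one good row and three broken ones."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO patient (id, name) VALUES (?, ?)",
        [(1, "A"), (2, "B")],
    )
    conn.executemany(
        "INSERT INTO plan_ct (id, patient_id, series_uid) VALUES (?, ?, ?)",
        [
            (10, 1, "uid-10"),     # intact row
            (11, None, "uid-11"),  # patient link never written
            (12, 2, None),         # series identifier missing
            (13, 99, "uid-13"),    # points at a patient that does not exist
        ],
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    print(find_malformed_plan_cts(build_demo_db()))
```

Finding the rows is the easy part. Working out which patient each one should have belonged to is what took the hours of fixing by hand.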

MathNerdCNU

@swayde said:

I was going to comment in that thread,

That was my initial thought but I didn't want to shit up a General Help thread with what I posted.

Wait...does that mean...I used Discourse...as designed. And it worked?

https://www.youtube.com/watch?v=IOfMKv5fxgQ