Persuasive Maintenance of Benelux


  • mod

    Continuing the discussion from status status status status SNAKE!!!!!!!!!!!!!!!!!!!!!!!!!:

    @abarker said:

    I think this deserves to be a SideBar now ..

    The day started out good enough. Tasks were getting completed. People were getting their work done. @abarker was performing some final pre-release testing for a set of system updates that night.

    Pretty good start to a Monday, thought @abarker.

    Then came the first email from the production system.

    A transport error has occurred when receiving results from the server.

    And then another.

    Timeout expired. The timeout period elapsed prior to completion of the operation or the server is not responding.

    It was 11:00 in the morning, and all hell had broken loose.

    The emails started coming regularly. First one production system was affected, then two, then all of them. Investigation revealed the problem: the CIO had scheduled a third party contractor to perform maintenance on the VM hosts. During business hours. The maintenance had started at 10:00 and was to continue until 2:00.

    @abarker met with the network team to put together a plan. Since the VM hosts were in an offsite datacenter, and the maintenance was being handled by a third party technician, they could only come to one conclusion: send out an enterprise email about what was happening and hunker down until the storm passed.

    To dodge the massive wave of responses, the email was sent from the help desk's address, linked to an already open ticket. Since the ticket was not assigned to anyone, nobody was bothered by the flood of replies. The bunker was secure. Now it was noon, and there were only two more hours to go.

    Two o'clock rolled around, and the IT staff poked their heads out. Over 300 error emails had been received. Everyone in the company had been unable to do any real work. @abarker had not been able to complete his testing. Just one problem: it wasn't over yet. So the team took cover once again.

    An hour later, they heard from the technician:

    Good news! I just finished with the first host box! The next two should go much faster!

    He's an hour past schedule, and only 1/3 done, and that's good news‽ thought @abarker. We're belgium-ed.

    The IT team just trudged through the rest of the day, doing their best to avoid everyone else, hoping the nightmare would end. Just before leaving the office for the day, @abarker checked his email: 590 error emails, and counting.

    The next morning, @abarker had 743 error emails. There was also a new message from the technician, time-stamped at 8:00 PM the previous evening:

    I finished all the host boxes. There are a few more details I need to address in the morning, but everything should be working now.

    That's great, but I'm still getting error messages.

    Host 1 is unresponsive.

    So @abarker checked with James, the only network guy in that morning. They discovered that the VMs had not been put back on the correct hosts. They hadn't even been properly distributed across the hosts. When the two looked at the resource demands, this is what they saw:

    • Host 1: ~ 50% of all VM resources
    • Host 2: ~ 35% of all VM resources
    • Host 3: ~ 15% of all VM resources

    "Well no wonder host 1 isn't responding," said James, "It's massively over-utilized."
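    The redistribution James wanted is the classic load-balancing problem: place VMs so no host carries a disproportionate share. A minimal sketch of the greedy approach (largest VM first, onto the currently least-loaded host); the per-VM loads here are made up for illustration, not taken from @abarker's environment:

    ```python
    # Greedy rebalance: assign each VM (largest first) to the currently
    # least-loaded host. Loads are in arbitrary resource units.
    def rebalance(vm_loads, n_hosts):
        hosts = [0.0] * n_hosts
        placement = [[] for _ in range(n_hosts)]
        for load in sorted(vm_loads, reverse=True):
            i = hosts.index(min(hosts))   # pick the least-loaded host
            hosts[i] += load
            placement[i].append(load)
        return hosts, placement

    # A skewed 50/35/15-style demand, expressed as hypothetical per-VM loads:
    vms = [20, 15, 15, 12, 10, 8, 7, 7, 6]
    totals, _ = rebalance(vms, 3)
    # Each host now carries roughly a third of the total demand.
    ```

    Greedy placement isn't optimal in general, but for VM counts like these it gets within a few percent of an even split, which is all the hypervisor needed.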

    They immediately got in contact with the technician.

    "Oh, that shouldn't matter. I've got the hypervisor set up to load balance the hosts. You should be fine."

    "But doesn't that just balance the network traffic?" asked James.

    "Uh, yeah."

    "So shouldn't we be redistributing the VMs to balance the resource load?"

    "Huh?"

    At this point, James and @abarker decided to give up on the technician and take things to the CIO. That's when they discovered something really interesting: the technician had performed unauthorized work. He was only supposed to upgrade the BIOS on the host boxes. The hypervisor upgrades he performed were never requested.

    Now @abarker and his IT team are all left wondering: how long until this gets fixed?

    Edit: Added the morning count of error emails.


  • sockdevs

    Truly, the technician is a total Belgium



  • Doubt it. Unless @abarker 's off site location is across the ocean. But that would be the real WTF.


  • mod

    @Luhmann said:

    Doubt it. Unless @abarker 's off site location is across the ocean. But that would be the real WTF.

    Nah, just a few miles from our corporate office here in Phoenix. It's mainly off site for a better internet connection, so it can serve our other sites better. But that's a different WTF.


  • sockdevs

    Should I :triangular_flag_on_post: for whoosh? :stuck_out_tongue_winking_eye:


  • mod

    @RaceProUK said:

    Should I :triangular_flag_on_post: for whoosh? :stuck_out_tongue_winking_eye:

    Do I really need the additional stress right now? :tired_face: (how is that tired?)


  • sockdevs

    @abarker said:

    Do I really need the additional stress right now? :tired_face: (how is that tired?)

    Do you respond to flags? :stuck_out_tongue_winking_eye:

    I wasn't talking about flagging you anyway; I was talking about flagging @Luhmann ;)


  • mod

    @RaceProUK said:

    Do you respond to flags? :stuck_out_tongue_winking_eye:

    I wasn't talking about flagging you anyway; I was talking about flagging @Luhmann ;)

    Oh, misunderstood. Many apologies.

    BTW, let me know when you are ready to change avatars. I now have an assortment of hats to choose[1] from!

    [1] To be clear, I get to choose, not so much you. ;)


  • sockdevs

    Candidate for front page.


  • :belt_onion:

    @Arantor said:

    Candidate for front page.

    Needs more Hanzo.


  • sockdevs

    Change @abarker to Hanzo, job done, have a half day and celebrate with a pint?


  • Grade A Premium Asshole

    Sooooo, blacklisting words does not work on topic titles? I think we need to test this with all the English expletives over on meta.d. You know...just to verify...and see where the bugs are. Yeah...that's the reason.


  • Grade A Premium Asshole

    @Arantor said:

    have a half day and celebrate with a pint?

    You don't know him at all... ;)


  • sockdevs

    I did not specify who should have the half day or the pint for that matter.


  • Grade A Premium Asshole

    I don't think we should reward the tech who caused all of this... :stuck_out_tongue:


  • sockdevs

    @Polygeekery said:

    I don't think we should reward the tech who caused all of this... :stuck_out_tongue:

    Nor was I implying that this would be the case ;)


  • Grade A Premium Asshole

    So, you are saying that you need a half day and a pint?


  • :belt_onion:

    @Polygeekery said:

    So, you are saying that you need a half day and a pint?

    I'll take a day and half a pint. I'm humble that way.


  • Grade A Premium Asshole

    @Onyx said:

    half a pint

    That's blasphemy.

    (This is coming from a man who asks his friends, "Want to go to the pub and have a beer or twelve?")


  • mod

    @Arantor said:

    celebrate with a pint?

    Of A&W? Sounds good.


  • :belt_onion:

    @Polygeekery said:

    That's blasphemy.

    I said I'll take half a pint. Nowhere did I imply I won't buy more myself.


  • Grade A Premium Asshole

    Good man.



  • @abarker said:

    When the two looked at the resource demands, this is what they saw:

    Host 1: ~ 50% of all VM resources
    Host 2: ~ 35% of all VM resources
    Host 3: ~ 15% of all VM resources

    "Well no wonder host 1 isn't responding," said James, "It's massively over-utilized."


    If one of three hosts fails and its load fails over to the remaining two, the best case is each survivor carrying 50% of all VM resources. Since 50% is already too much for one host, everything stops working the moment any host goes down. So you effectively have negative fault tolerance.
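    That N+1 headroom rule can be checked mechanically: each host must have enough spare capacity to absorb its share of any failed peer's load. A hedged sketch, assuming a uniform per-host capacity and that a failed host's load spreads evenly over the survivors (neither of which is stated in the thread):

    ```python
    # N+1 headroom check: can the cluster survive any single host failure?
    # host_loads and capacity are in the same arbitrary units.
    def survives_one_failure(host_loads, capacity=100.0):
        for failed in range(len(host_loads)):
            survivors = [l for i, l in enumerate(host_loads) if i != failed]
            extra = host_loads[failed] / len(survivors)  # assume even spread
            if any(l + extra > capacity for l in survivors):
                return False
        return True

    # Three hosts at 60% each survive a failure (60 + 30 = 90 <= 100);
    # three hosts at 70% each do not (70 + 35 = 105 > 100).
    ```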



  • @Arantor said:

    Change @abarker to Hanzo, job done, have a half day and celebrate with a pint?

    Put it on the front page with one of the @abarker mentions changed to Hanzo and put an editor's note that says "Good news!" then edit it during the day while the article is live.


  • Grade A Premium Asshole

    I thought the same thing, but it is a matter of why you virtualize and whether or not you can tolerate some downtime. As this is a testing environment, my guess is a little downtime can be tolerated.



  • @Polygeekery said:

    this is a testing environment

    @abarker clearly said production systems were failing. My understanding is the three physical servers hosted both production and test VMs. Also, in the original, non-sidebar description, he stated that the servers were at or beyond capacity even when they were properly balanced.


  • mod

    @Jaime said:

    50% is the best that server will see if one fails. Since 50% is too much, that means you have three hosts and everything will stop working if one goes down and it's load fails over to the remaining two. So, you effectively have negative fault tolerance.

    This setup doesn't provide any sort of fault tolerance. Each VM is assigned to a specific host; if that host goes down, all the VMs on it go down. To provide fault tolerance, we would need to set up additional VMs to catch the failovers.


    @Polygeekery said:

    As this is a testing environment, my guess is a little downtime can be tolerated.

    Actually, it's testing and production. This setup is a relic from before I started. I've actually made some progress on this in recent months: I now have a host dedicated to the dev team. I'm working on migrating test VMs to the new server as I have time, but it's been slow.


  • Grade A Premium Asshole

    @HardwareGeek said:

    @abarker clearly said production systems were failing.

    Shit. You're right. Well, we found the first WTF that allowed this WTF. It is like an onion of WTF really. Many layers.


  • Grade A Premium Asshole

    @abarker said:

    Actually, it's testing and production.

    In that case, just get rid of the testing VMs and do all your testing in production. Problem solved. What's the worst that could happen? ;)



  • Has @Arantor been hatted yet?


  • sockdevs

    One of the people in the picture is already wearing a hat. Just not the kind of hat y'all thinking of.


  • sockdevs

    @Polygeekery said:

    So, you are saying that you need a half day and a pint?

    Always :smiley:


  • sockdevs

    @Polygeekery said:

    That's blasphemy.

    (This is coming from a man who asks his friends, "Want to go to the pub and have a beer or twelve?")

    First half of many?


  • Grade A Premium Asshole

    @Arantor said:

    First half of many?

    Maybe you could only get half at a time so that the beer does not get warm. I never seem to have that problem though. Pretty sure all the glasses I get have holes in them or something.



  • @Polygeekery said:

    I get have holes in them or something.

    A big one in the top hopefully :laughing:


  • Grade A Premium Asshole

    You know what I meant, you top-hatted dog. ;)


  • sockdevs

    All my glasses have holes in.

    As does the number of my newly regranted badge. It has a 0 in it. Following a 1, and followed by two small holes and a line.


  • Grade A Premium Asshole

    So you're a ten percenter? You privileged bastard.


  • sockdevs

    @Polygeekery said:

    So you're a ten percenter? You privileged bastard.

    I worked hard for my 10% dammit. Just like I worked hard to get back into the 25% club in the last few days.



  • Bit of @accalia grade spam and you'll be up here in the 5% crew in no time :laughing:


  • sockdevs

    Time continues to be an illusion. Dinnertime especially so.



  • I... er... someone I know had a similar story recently.

    My friend received an alert saying that one disk in a four disk RAID set had gone bad on a critical production server. The kind that handles lots of dollar signs and numbers with a bunch of zeroes in them. Really not a big deal, but further investigation revealed that three of the disks, including the failed one, had known issues and should all be replaced. Again, not a big deal since only one had failed, but we^H^H they procured three replacement disks and sent them off to the datacenter thinking that we could replace the failed disk, rebuild the array, and then swap the next two disks one at a time over the weekend until everything was all better and nothing bad could ever happen.

    Naturally this was an ultra-secure i-could-tell-you-what-goes-on-in-there-but-first-i-would-have-to-kill-you kind of data centre so my friend had to rely on the helpful technician they employed to do the work. Again, a bit awkward but really nothing to get excited about. You wait until after hours, the tech calls to tell you he's ready to start, you make sure that the server is shut down, explain exactly what needs to be done and then he does it. Just to be sure my friend sent detailed instructions complete with pictures showing exactly what was required, and turned on the flashing "Fix me!" light on the drive that needed to be replaced.

    Imagine m^Hhis surprise at receiving an email moments after the start of the maintenance window and being told "Yeah, I replaced those drives for you. You're good to go."

    The server in question was still up. My friend was still logged in and in the process of shutting down applications and ensuring that all of their data was safely on disk. And yet, after requerying the RAID controller, it did seem that the serial numbers of the first three drives had all changed.

    I know of only one kind of RAID configuration which can survive removing three out of four disks, and this server wasn't using it. In fact, moments later the poor thing started complaining that it was having trouble accessing parts of /usr, and perhaps this should be looked at soon.
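    The tolerance arithmetic behind that remark is worth spelling out. A simplified sketch of textbook RAID failure tolerance (standard levels only; this ignores nested layouts and says nothing about the actual server's configuration):

    ```python
    # Maximum simultaneous disk failures each standard RAID level survives.
    def max_failures(level, n_disks):
        if level == 0:
            return 0               # striping only: any loss is fatal
        if level == 1:
            return n_disks - 1     # n-way mirror: survives all but one disk
        if level == 5:
            return 1               # single distributed parity
        if level == 6:
            return 2               # double distributed parity
        raise ValueError("unknown RAID level")

    # Only a four-way RAID 1 mirror survives losing 3 of 4 disks;
    # RAID 5 on four disks tolerates exactly one failure.
    ```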

    A polite emailed reply was sent, explaining that this had not been the original plan and could the disks please be returned to their original places before it is too late.

    "Oh, really? I just assumed that the server was dead. After all there were three disks here so I just stuck them all in. Plus I forgot which slots the old drives were in so I can't really put them back. Is this really important?"

    Fortunately my friend still had the detailed status of the RAID array from earlier that afternoon and with excruciating politeness replied that yes indeed this was important and could everything please be put back where it came from, if that's not too much trouble.

    The drives were eventually returned to their original places but by then it was far too late. Attempts to rebuild the skewered array all met with disastrous failure and my increasingly less gruntled friend spent the rest of the evening rebuilding, re-installing and basking in the glory that is "Don't worry, we'll get all your data eventually" backup and restore procedures. By the next morning the server was back up and running on three new drives filled with data cherry-picked from several different days in the middle of the last week, and there was much rejoicing.

    Well, not really. I think somebody said "Finally!" and complained that the log files from Friday evening weren't all there, but we can pretend that there was rejoicing if it makes us feel better.

    On Monday, of course, there was the usual three hour long meeting to discuss everything that "we" did wrong and how to avoid doing it again. My friend's suggestion that we never ever allow the data centre technician to touch anything more complicated than a Pez dispenser was met with sage-like nodding, leading to a follow-up question.

    "So, I noticed in the summary that we only replaced three drives. Why didn't we replace all four? I mean, we didn't know there was a problem with the other three when they first went in and they turned out to be bad. How can we know that the same thing won't happen a fourth time?"

    "Good thinking. Let's ship out a fourth drive, call the datacentre and have their guy swap the last drive as soon as possible. Don't wait for the usual weekend maintenance window, we have to be proactive about this."

    What could possibly go wrong?


  • sockdevs

    @loopback0 said:

    @accalia grade spam

    i forget is that one step above or below weapons grade spam?



  • @Arantor said:

    As does the number of my newly regranted badge. It has a 0 in it. Following a 1, and followed by two small holes and a line.

    Huh, I seem to have received one of those too. Do you think it is defective?



  • Below. But judging by that last graph it's getting closer! :laughing:


  • sockdevs

    yesterday was a busy day for me.... ;-)


  • Grade A Premium Asshole

    @DCRoss said:

    Naturally this was an ultra-secure i-could-tell-you-what-goes-on-in-there-but-first-i-would-have-to-kill-you kind of data centre so my friend had to rely on the helpful technician they employed to do the work.

    This is a thing? Like, you have no access to your own servers?


  • mod

    Start your own damn topic. This one's mine!



  • @Polygeekery said:

    This is a thing? Like, you have no access to your own servers?

    None at all. It's a shared co-location facility where proximity to other people's networks is a selling point, but the price is heavy security. Allegedly this will prevent random idiots from walking in off of the street and sabotaging your servers, but somehow I'm not convinced that it's working.


  • Grade A Premium Asshole

    @accalia said:

    i forget is that one step above or below weapons grade spam?

    Ask the resident rat. :stuck_out_tongue:

