Slow it down



  • A large part of what I've been doing for the past several months is speeding up all aspects of our application: loading data from the DB, crunching said data, and saving the results back to the DB. I've finally gotten it to the point that it's fairly zippy.

    We ran lots of tests and saw that it was good.

    Sort of.

    You see, production machines (app and db) are very fast. Pre-production machines are less so. Development machines, while zippy, are downright slow by comparison. As such, I could only load the machines so much. While blocked on I/O, the CPUs could throttle down and cool off a bit.

    We begged for production quality hardware to perform full load tests but were refused because of the cost.

    OK: it's deployed and sanity tested; everything works.

    Then they kick off the first big job. And the second. And the third. Hardware failure! CPU overtemp warning. Box replaced. Restart job.

    During the post-mortem of the hardware failure, it turns out that a fan controller failed. Why? Dunno, but the CPUs basically cooked themselves. Why?

    Apparently, the previous incarnation of the app was so slow that the CPUs were never loaded at all. It turns out that when I sped up the application, and the DB access, I reduced the I/O lag so much that the CPUs no longer throttled down. They just screamed at high load. The heat built up in the box and the fan controller fried. 

    How can we prevent that from happening again? Um, redundant cooling? A/C? Better quality components? etc.

    No, you made the app faster and it caused this problem; slow it down!

    What?

    Slow it down!

    Well, we could run fewer threads or fewer instances of the app on each server, but that would require more servers to get the same level of throughput.

    No, we can't spend the money; slow the application down to where it was before!

    The solution that was rammed down my throat? Configured sleeps at key points in the application.
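    For the curious, the "configured sleeps" presumably look something like this minimal, hypothetical Python sketch (the app's actual language, config mechanism, and names weren't given; everything here is invented for illustration):

    ```python
    import time

    # Hypothetical illustration of the mandated "fix": sleeps whose durations
    # come from configuration, injected at key points in the processing path.
    THROTTLE_SECONDS = 0.05  # would be read from a config file in the real app

    def process_batch(records, throttle=THROTTLE_SECONDS):
        """Crunch each record, then deliberately pause -- the configured sleep."""
        results = []
        for record in records:
            results.append(record * 2)  # stand-in for the real data crunching
            time.sleep(throttle)        # artificially restore the old I/O lag
        return results
    ```

    The "configurable" part is the only saving grace: at least it can be dialed back to zero once someone finally buys the fan.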

    My boss almost cried. He begged and pleaded for them to fix the actual problem, but no dice.

    Wheee!

     



  • They should relocate their server room to Antarctica; that should keep the servers cool.



  • Awesome! That's even better than speed-up loops!

    Imagine how easy it will be to optimize and improve performance in the future!



  • @snoofle said:

    My boss almost cried. He begged and pleaded for them to fix the actual problem, but no dice.

    Presuming that your app is doing something business critical, a faster app means more productivity means more profit means worth investing a bit, no? Are you so short on cash that your beancounters can't even afford to buy that argument? Or are they just more shortsighted than a bat that's lost its glasses?



  • @Ibix said:

    @snoofle said:

    My boss almost cried. He begged and pleaded for them to fix the actual problem, but no dice.

    Presuming that your app is doing something business critical, a faster app means more productivity means more profit means worth investing a bit, no? Are you so short on cash that your beancounters can't even afford to buy that argument? Or are they just more shortsighted than a bat that's lost its glasses?

    The company is making pretty good money now. Specifically, we get paid to run the app for each customer, on a periodic basis. The more our customers want, the more runs we do, the more hardware we need. Speeding it up meant buying less hardware. We make more and spend less; double win.

    Unfortunately, we got bought out and the bean counters at the new parent are arguing with our bean counters over budgets, cost cutting, yada yada, so all purchases have been frozen. Apparently, production shutting down is not justification for an expense; even a $20 fan.

    Using common sense gets you nowhere with these people, so I just duck and let my boss deal with it.




  • Uh oh.

    You have finished the optimizations, deployed, then slowed it back down again?  Does this mean your job is done there?

    Please tell me this isn't so, I want to read so much more about the work you are barely able to actually get done.

     



  • @snoofle said:

    Apparently, production shutting down is not justification for an expense; even a $20 fan.
     

    Go to said bean counter's office, swap fan in their PC with broken fan from server. If they complain, tell them to count beans using their fingers instead of the PC.



  • @snoofle said:

    Unfortunately, we got bought out and the bean counters at the new parent are arguing with our bean counters over budgets, cost cutting, yada yada, so all purchases have been frozen. Apparently, production shutting down is not justification for an expense; even a $20 fan.

    Using common sense gets you nowhere with these people, so I just duck and let my boss deal with it.


    And senior people in the New Parent Company are happy with their shiny new acquisition being half-frozen while the accountants have a pissing contest?



  • The choices that I see available to get the hardware are:

    • Somehow make the issue the accountants' issue.  For example, route everything the accountants do through one of the dead servers, so they are unable to do any work.  This technique can be classified as a sort of blackmail.
    • Add more 0s to the cost.  Say that the fix will cost $20,000 rather than $20.  Create some insanely complicated reasoning for this that there is no chance the bean counters could understand or question.  They will think it is far more mission-critical and will be more likely to give you that $20,000.  Now you can fix the $20 issue and the other little issues that the bean counters were not willing to give you money for.


  • @Ibix said:

    And senior people in the New Parent Company are happy with their shiny new acquisition being half-frozen while the accountants have a pissing contest?
    Who knows if they even know about it...



  • @KattMan said:

    Uh oh.

    You have finished the optimizations, deployed, then slowed it back down again?  Does this mean your job is done there?

    Please tell me this isn't so, I want to read so much more about the work you are barely able to actually get done.

     

    Not even close. Even though it's slowed down, there are still a lot of other optimizations that I'm slated to work on over the next 6-10 months, plus whatever they come up with during that time. I think my time here is limited more by my patience... As long as my boss shields me from a lot of it, I'll probably stay for a while...



  • Wait what brand of servers is your company buying? Or are they all like 5 years old?



  • @snoofle said:

    @KattMan said:

    Uh oh.

    You have finished the optimizations, deployed, then slowed it back down again?  Does this mean your job is done there?

    Please tell me this isn't so, I want to read so much more about the work you are barely able to actually get done.

     

    Not even close. Even though it's slowed down, there are still a lot of other optimizations that I'm slated to work on over the next 6-10 months, plus whatever they come up with during that time. I think my time here is limited more by my patience... As long as my boss shields me from a lot of it, I'll probably stay for a while...

    One piece of advice which has held me in good stead:

    Do their checks clear?

    Yes...

    Then STFU!

    That was my mother who said that...  :(



  • @blakeyrat said:

    Wait what brand of servers is your company buying? Or are they all like 5 years old?
    Dunno what our production boxes are, but everything (laptops, monitors, keyboards, printers) in the office has an HP label on it, so.... I do know that they're running 64 bit Linux.

    @C-Octothorpe said:

    Do their checks clear?

    Yes...

    Then STFU!

    That was my mother who said that...  :(

    Your mother is wise!

     



  • I had one company I worked for whose checks didn't clear.  Then they got scarce and hard to locate.

    Found out they were going to have a booth at a local conference and showed up with the bad checks.

    Needless to say, after only a few words they went out and got cash and paid me on the spot so I would go home, which I did right afterwards.  The damage to their reputation was already done, though.



  • @snoofle said:

    You see, production machines (app and db) are very fast. Pre-production machines are less so.
     

    Have you pointed out the obvious: if the test environment doesn't accurately match production, then unexpected things are likely to happen in production, because all the tests are basically invalid?

    @Bender said:

    Go to said bean counter's office, swap fan in
    their PC with broken fan from server. If they complain, tell them to work slower so that it doesn't tax their PC.
     

    FTFY.

     



  • For a long time I thought you could have been an ex-colleague of mine, right up until the point where they refused to give you decent hardware. But then ... wow. That's a WTF story of great lineage. Stay sane! Don't let them drag you down!



  • @snoofle said:

    @KattMan said:

    Uh oh.

    You have finished the optimizations, deployed, then slowed it back down again?  Does this mean your job is done there?

    Please tell me this isn't so, I want to read so much more about the work you are barely able to actually get done.

     

    Not even close. Even though it's slowed down, there are still a lot of other optimizations that I'm slated to work on over the next 6-10 months, plus whatever they come up with during that time. I think my time here is limited more by my patience... As long as my boss shields me from a lot of it, I'll probably stay for a while...

     

    Just remember to undo all those other optimizations before they create more problems, not after. Be proactive for once!

     



  • @Ibix said:

    And senior people in the New Parent Company are happy with their shiny new acquisition being half-frozen while the accountants have a pissing contest?

     

    Was that a rhetorical question?

     



  • Never mind. In a few years, when snoofle's out of the company and nobody else remembers why those sleeps are there, we'll see how easy it is to fix them right here, at TDWTF.


  • @Anketam said:

    Add more 0s to the cost.  Say that the fix will cost $20,000 rather than $20.  Create some insanely complicated reasoning for this that there is no chance the bean counters could understand or question.  They will think it is far more mission-critical and will be more likely to give you that $20,000.  Now you can fix the $20 issue and the other little issues that the bean counters were not willing to give you money for.

    Are we back to bikeshedding again?



  • @Mcoder said:

     Don't mind. In a few years, when snoofle's out of the company, and nobody else remembers why those sleeps are there we'll see how easy it is to fix them right here, at TDWTF.
    He has a good point; better put comments in there as a legacy for future developers, so that when they post your code here and say "look at what this crazy idiot did", we can protect your reputation.



  • Did you tell them that downclocking the CPUs might be a temporary solution? (I know, I know: server hardware might not play nice.) There seem to be lots of other solutions more robust than sleep(x)...
    for instance http://cpulimit.sourceforge.net/
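    For context, cpulimit works by duty-cycling the target process with SIGSTOP/SIGCONT (e.g. `cpulimit -l 50 -p <pid>` to cap a process at roughly 50% of a core). A toy Python sketch of the same idea, POSIX-only, with illustrative numbers:

    ```python
    import os
    import signal
    import subprocess
    import time

    def duty_cycle(pid, limit=0.5, period=0.1, cycles=10):
        """Throttle a process the way cpulimit does: alternately SIGSTOP and
        SIGCONT it, so it only runs for `limit` fraction of each period."""
        for _ in range(cycles):
            os.kill(pid, signal.SIGSTOP)   # freeze the process
            time.sleep(period * (1 - limit))
            os.kill(pid, signal.SIGCONT)   # let it run again
            time.sleep(period * limit)

    # Toy demonstration: throttle a child process for about a second.
    child = subprocess.Popen(["sleep", "5"])
    duty_cycle(child.pid, limit=0.5, period=0.1, cycles=10)
    child.kill()
    child.wait()
    ```

    Unlike a hard-coded sleep inside the application, this caps CPU use from the outside and can be removed without redeploying anything.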



  • @Cassidy said:

    Have you pointed out the obvious: if the test environment doesn't accurately match production, then unexpected things are likely to happen in production, because all the tests are basically invalid?
    More times than I can count. They simply don't care.



  • @swayde said:

    ...thoughtful ideas...

    The sleep wasn't our idea; it was the brainchild of boss+2, and we were explicitly instructed to do it that way.

    I don't mind taking liberties with suggestions, but once I've lost the argument for common sense, direct orders I tend to obey, no matter how stupid they might be.



  • @snoofle said:

    How can we prevent that from happening again? Um, redundant cooling? A/C? Better quality components? etc.
    Hands up if you figured out the rest of the story when you read this...



  • @snoofle said:

    @Cassidy said:

    Have you pointed out the obvious: if the test environment doesn't accurately match production, then unexpected things are likely to happen in production, because all the tests are basically invalid?
    More times than I can count. They simply don't care.

     

    And yet when disasters like this happen, repeating the still-unaddressed reasons behind them should finally make the message sink in.

    I mean.. I refuse to believe someone there demands to know why these keep happening and yet completely forgets the reasons provided the last time it happened. I understand they want to do fuckall about it, but the message should cause embarrassment for several decision-makers, and sooner or later someone should see sense.

    I also know that you're in a position to keep your head down and milk it for all it's worth. I'm not jealous.

    Really. Honest.

     



  • @Cassidy said:

    I refuse to believe someone there demands to know why these keep happening and yet completely forgets the reasons provided the last time it happened.
    From now on, your answer when they ask for root-cause analysis should be "see my previous memo".  Be helpful.  Give them the reference date to make it easier to look up.  As that date recedes further and further into the past, they should be more and more sheepish about being reminded of it.



  • @da Doctah said:

    From now on, your answer when they ask for root-cause analysis should be "see my previous memo". 
     

    I have pulled this "recursive answer" trick via email in the past.

    1. Sales account girl Anna wants a copy of my CV for a customer.  I spend some time updating it with new skills and experience I've acquired.
    2. Line manager Bert wants a copy of my CV in case he's asked for it, so I
      embed the reply to Anna in my response, showing that perhaps she should/could
      have gone through him first (or that someone's beaten him to the punch).
    3. Marketing manager Carl says that he wants copies of everyone's CV because sales packs are being proactively constructed ready to issue upon demand. I reply by embedding my reply to Bert, showing that my line manager already has my CV.
    4. A few months later, marketing assistant Davina says she's reviewing the sales packs and wants copies of people's CV. She gets Carl's reply, showing that he had my copy all along.
    5. A few months after that, Bert lets the department know that HR are reviewing their records and need updated CVs. After clicking through several embedded emails he figures out he's already got mine.
    6. The following month, Bert reminds the whole department that he's getting pressure from HR and updated CVs must be submitted. You get the picture.
    At no point in the flow did anyone remark how useful a centralised repo of this information would be. Because then they'd be describing our rarely-used and oft-overlooked Sharepoint system.



  • @snoofle said:

    Apparently, the previous incarnation of the app was so slow that the CPUs were never loaded at all. It turns out that when I sped up the application, and the DB access, I reduced the I/O lag so much that the CPUs no longer throttled down. They just screamed at high load. The heat built up in the box and the fan controller fried. 

    How can we prevent that from happening again? Um, redundant cooling? A/C? Better quality components? etc.

    Given what you've said about your environment, I disbelieve the root cause you claim.

    I suspect that the real root cause was, the fan died months ago due to old age and additional wear and tear from dynamic cooling on a box that never really generated a lot of heat.  You all just didn't notice the fan's death because you don't have any system monitoring of CPU fans.

    The real solution is to get system monitoring of all of your CPU, motherboard, and power supply fans, your power supplies, your hard drives, and so forth.  Then, if a $20 part fails and the company won't spare the budget to fix it, have the manager responsible for the service shell out the money from his own pocket, as he's responsible for the service and so the downtime when the system overheats would almost certainly cost him more than the fan.

    And, yeah, I'm serious about dynamic cooling wearing fans down faster on light use than heavy use.  I've managed 8+ year old hardware, and our more idle boxes had at least twice the fan failure rate of our busier machines, on average.  Sure, there were a couple that never lost a fan right up until we replaced them, but there were also a couple of our busiest boxes that also never lost a fan - and we had more idle boxes than busy, due to required geographic redundancy.



  • @tgape said:

    @snoofle said:

    Apparently, the previous incarnation of the app was so slow that the CPUs were never loaded at all. It turns out that when I sped up the application, and the DB access, I reduced the I/O lag so much that the CPUs no longer throttled down. They just screamed at high load. The heat built up in the box and the fan controller fried. 

    How can we prevent that from happening again? Um, redundant cooling? A/C? Better quality components? etc.

    Given what you've said about your environment, I disbelieve the root cause you claim.

    I suspect that the real root cause was, the fan died months ago due to old age and additional wear and tear from dynamic cooling on a box that never really generated a lot of heat.  You all just didn't notice the fan's death because you don't have any system monitoring of CPU fans.

    The real solution is to get system monitoring of all of your CPU, motherboard, and power supply fans, your power supplies, your hard drives, and so forth.  Then, if a $20 part fails and the company won't spare the budget to fix it, have the manager responsible for the service shell out the money from his own pocket, as he's responsible for the service and so the downtime when the system overheats would almost certainly cost him more than the fan.

    And, yeah, I'm serious about dynamic cooling wearing fans down faster on light use than heavy use.  I've managed 8+ year old hardware, and our more idle boxes had at least twice the fan failure rate of our busier machines, on average.  Sure, there were a couple that never lost a fan right up until we replaced them, but there were also a couple of our busiest boxes that also never lost a fan - and we had more idle boxes than busy, due to required geographic redundancy.

    I totally agree with your opinion. I have witnessed similar situations and the same problem appears with huge storage pools used mostly for archives (such as MAID systems or even higher-end stuff like SONAS). Idle + dynamic or on-demand = death warrant. Like all those out-of-shape people who die of a heart attack when they shovel snow after an unexpected snowstorm.



  • @Anketam said:

    The choices that I see available to get the hardware are:

    • Somehow make the issue the accounts issue.  For example route everything the accounts do through one of the dead servers, so they are unable to do any work.  This technique can be classified sort of like blackmailing.
    • Add more 0s to the cost.  Say that the cost to fix will take $20,000 rather than $20.  Create some insanely complicated reasoning for this that there is no chance the bean counters could understand or question.  They will think it is far more mission critical and will be more likely to give you that $20,000.  Now you can fix the $20 issue and the other little issues that the bean counters were not willing to give you money for.

     

    Very well formulated. I agree: for small stuff, never tell the truth. For idiotic managers, cheap is a synonym for unimportant, and arguing with them is blasphemy, like trying to argue with a religious fanatic. So the solution is to bullshit, bullshit, bullshit as loudly as possible, wasting time you could have used for productive things, and in the end the company pays 10x as much and less work gets done.

     



  • So all of your work, and all of your boss's pleading, went for naught? That's funny. But hey... at least you got to look cool on company time for (by your own admission) several months. No doubt you felt smug, emerging from your little cocoon every few days to share some unintelligible little chestnut about "throughput" with your admiring colleagues. I'm sure all those awards you won (and groupies you scored with) during your little heyday made it all worthwhile. Of course, if you didn't win any awards, or score with any groupies, then you're just an idiot.



  • @bridget99 said:

    just an idiot
     

    You should be a motivational speaker!

