Computers need AC? Who would have thought.



  • Part of me hopes this is true and part of me hopes this was made up...

    http://community.livejournal.com/techsupport/1368038.html

    From: $user who for whatever reason came in on Monday when no one else was in the building.
    To: IT Dept.
    Re: A/C constantly running.

    Hi Guys,

    I came in today (Monday) to finish up a project I was working on before our big meeting with a potential client tomorrow, and I noticed that there were three or four large air conditioners running the entire time I was here. Since it's a three-day weekend and no one is around, why do we need to have the A/C running 24/7? With all the power that all those big computers in that room use, I doubt it is really eco-friendly to run those big units at the same time. And all computers have cooling fans anyway, so why put the A/C for the building in that room? I got a keycard from $facilitiesmanager's desk and shut off the A/C units. I'm sure you guys can deal with it being warm for an hour or two when you come in tomorrow morning. In the future, let's try to be a little more conscientious about our energy usage. Thanks.


    RESULT:

    Fatalities: Exchange Server, Domain Controllers, a few Sun boxes whose usage I'm not sure of.
    Near-Fatalities: Phone Switch, Apps Servers.

    Temperature of server room 7AM Tuesday Morning: 90 Degrees Fahrenheit.

    Status of Employee who sent the above e-mail: Terminated.
     



  • god damn, this is front page material!

     

    Please god say that the person that did this worked in a position that did not include using or being near computers.
     



  • I would think there isn't a whole lot of truth to this. Servers will run at 90 degrees. Trust me on this...

     
    One of my customers refuses to think long term. For well over a year, insufficient AC kept their server room close to 90 degrees. They finally replaced it with a unit that didn't work correctly: in winter it was too cold and the unit didn't work right, since it wasn't configured to draw in outside air, and in summer the sun hit the unit at 2pm and overheated it. This went on for years.

    They also have wiring closets with hordes of ATM gear, with no AC at all. In summer, these closets can reach 100 degrees. 



  • @jjeff1 said:

    I would think there isn't a whole lot of truth to this. Servers will run at 90 degrees. Trust me on this...


    One can suppose that the dying machines shut off and stopped heating the room, or that the temperatures actually peaked higher on that Monday afternoon (compared to the Tuesday morning), or that there was some lag between the opening of the server room door to fresher air and the temperature reading (perhaps occupied by actually dealing with the damage control), or that the heat distribution was less than perfectly uniform, or that the per-server cooling was inadequate for 90-degree temperatures and internal temperatures were much higher, or that there wasn't actually a real temperature reading (just an "OMG it must be 90 in there") or any number of things like that.



  • Well, the admins are at fault here too: the A/C units stopping, the temperature rising and the servers failing should have triggered alarms that would page the admin team! A confident admin is a bad admin.



  • @acne said:

    Well, the admins are at fault here too: the A/C units stopping, the temperature rising and the servers failing should have triggered alarms that would page the admin team! A confident admin is a bad admin.

    In a company where some (obviously? I hope!) non-IT guy has access to the server room, what makes you think that any alarm system would even be on the horizon?

    The biggest WTF here is that one of the following is presumably true:
    a) some random clueless guy has full access to the server room
    b) the A/C controls are outside the server room.
     



  • My wife works for a small company that is entirely dependent on one server that I look after on a kind of "mates-rates" basis.

    It's kept in a small, unventilated cupboard in the hottest room in the office.

    When the RAID crapped itself a few months ago, it took me two weeks to get it back up properly (hey, I work full time as well you know!)

    The experience of limping along precariously for that period has now brought home my recommendation for a decent server with some better redundancy and a rack, installed somewhere cooler...
     



  • @jjeff1 said:

    I would think there isn't a whole lot of truth to this. Servers will run at 90 degrees. Trust me on this...

    True as far as it goes, however: if it does not strain the CPU cooling to the point where the automatic cut-off shuts the whole thing down, then the hard drives are going to be running hot. Running drives over their temperature limit reduces their MTBF from about 3-5 years to about 3-5 months (because the lubricants stop working properly and the moving parts start to wear out quickly). They may not fail immediately if they're new, but you're taking huge amounts of time off their expected lifespan. The 'server' may survive, but that's not much help when one of its drives crashes.

    I have no difficulty in believing that running a bunch of relatively old servers this hot would cause a sudden spate of drive failures. You only need one disk out to take down the entire server.
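    The MTBF point above can be made concrete with a common rule of thumb: failure rates roughly double for every 10 degrees Celsius a drive runs above its rated temperature (an Arrhenius-style approximation, not a vendor spec). A minimal sketch, with purely illustrative numbers:

```python
# Rule-of-thumb sketch: expected lifespan halves for every
# `doubling_interval_c` degrees above the rated temperature.
# All figures here are illustrative assumptions, not vendor data.

def estimated_lifespan_years(baseline_years, rated_temp_c, actual_temp_c,
                             doubling_interval_c=10.0):
    """Halve the baseline lifespan for each doubling interval of excess heat."""
    excess = max(0.0, actual_temp_c - rated_temp_c)
    return baseline_years / (2.0 ** (excess / doubling_interval_c))

# A drive rated for 60 C internals with a 4-year expected life:
print(estimated_lifespan_years(4.0, 60.0, 60.0))  # at spec: 4.0 years
print(estimated_lifespan_years(4.0, 60.0, 80.0))  # 20 C over: 1.0 year
```

    Running 20 degrees over spec cutting a 4-year lifespan to roughly a year is consistent with the "3-5 years down to 3-5 months" scale of degradation described above, though the exact curve varies by drive.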
     



  • Depends on whether you believe google's recent white paper or not.

    Temperature didn't have too much effect - don't get me wrong, it had an effect, but not that pronounced.



  • @tster said:

    god damn, this is front page material!

     

    Please god say that the person that did this worked in a position that did not include using or being near computers.
     

     Given that the story is about him switching off the aircon in the server room, I think you should probably already have guessed that he would be "near" computers.  Either that, or you're assuming he has very long arms...

     



  • @RayS said:

    @acne said:

    Well, the admins are at fault here too: the A/C units stopping, the temperature rising and the servers failing should have triggered alarms that would page the admin team! A confident admin is a bad admin.

    In a company where some (obviously? I hope!) non-IT guy has access to the server room, what makes you think that any alarm system would even be on the horizon?

    The biggest WTF here is that one of the following is presumably true:
    a) some random clueless guy has full access to the server room
    b) the A/C controls are outside the server room.
     

      
    Well, if you wanted to know, you could always try... oh I dunno..... actually reading the story?  Look for the phrase "I got a keycard from $facilitiesmanager's desk" and see if you can work it out from there...

    The real WTF of course is that $facilitiesmanager has a keycard to the secure area and he just leaves it lying on his desk over the weekend.  What if this was really an inside-job mass data theft, and the guy just switched off the aircon to cover his tracks, destroy the logs and provide plausible deniability?  $facilitiesmanager should be, if not sacked, at least getting a couple extra assholes torn by the boss...




  • @valerion said:

    My wife works for a small company that is entirely dependent on one server that I look after on a kind of "mates-rates" basis.

    That's the type of basis where instead of getting paid, you get screwed?

         



  • @Bob said:

    Depends on whether you believe google's recent white paper or not.

    Temperature didn't have too much effect - don't get me wrong, it had an effect, but not that pronounced.

    Read the article a little more. Their conclusion that temperature didn't have an effect on life span was based on temperatures that varied within 70-80 degrees, which is within the operable temperature range. They did not include any results about hard drive temperatures in extreme ranges (either high or low).



  • @Bob said:

    Depends on whether you believe google's recent white paper or not.

    Temperature didn't have too much effect - don't get me wrong, it had an effect, but not that pronounced.

    I definitely believe google's recent whitepaper, and it said the same thing I did: high temperatures kill drives. I suspect that you read only the slashdot summary and not the actual paper.

    The paper introduced the interesting result that variations in temperature below the drive's operating limit don't have a significant effect on its lifespan; slashdot managed to summarise this as "drives operate fine at any temperatures, even if they're melting or on fire at the time" - which would be a WTF in its own right, if it wasn't the typical quality of editing that we've come to expect. There was previously a meme that lower temperatures were always better; the google staff just observed that all temperatures below a certain point are about the same, so over-cooling drives is a waste of money. Any temperatures above the operating limit still kill the drive in a hurry.



  • @DaveK said:

      
    Well, if you wanted to know, you could always try... oh I dunno..... actually reading the story?  Look for the phrase "I got a keycard from $facilitiesmanager's desk" and see if you can work it out from there...

    The real WTF of course is that $facilitiesmanager has a keycard to the secure area and he just leaves it lying on his desk over the weekend.  What if this was really an inside-job mass data theft, and the guy just switched off the aircon to cover his tracks, destroy the logs and provide plausible deniability?  $facilitiesmanager should be, if not sacked, at least getting a couple extra assholes torn by the boss...

    You know, I read it twice, but still somehow missed that detail. Oops.

    Anyway, no way was that some attempt at covering his tracks. It's just not even close to being a reliable way of destroying data, plus you'd have to be an absolute moron to then email people about it. You'd take a hammer to everything, zap the electronics (or just take everything), since any possible security log would show the other guy's card being used.



  • @jjeff1 said:

    I would think there isn't a whole lot of truth to this. Servers will run at 90 degrees. Trust me on this...

    There's a difference between operating servers in a 90 degree room, and having your servers heat a 72 degree room to 90 degrees.  The latter indicates that the internal temperatures are far, far higher.  If your servers are heating the room, you've got a problem.  If the room itself is hot, but the servers aren't really increasing the temperature, you're fine.

     



  • @merreborn said:

    @jjeff1 said:

    I would think there isn't a whole lot of truth to this. Servers will run at 90 degrees. Trust me on this...

    There's a difference between operating servers in a 90 degree room, and having your servers heat a 72 degree room to 90 degrees.  The latter indicates that the internal temperatures are far, far higher.  If your servers are heating the room, you've got a problem.  If the room itself is hot, but the servers aren't really increasing the temperature, you're fine.

     

     

    Yeah, I can second that. I used to work at a company with a main datacenter consisting of about 3,000 ProLiant 6400s (I can't remember what they call them now; they're always going to be Compaqs, and always model 6400, hehe). Anyway, when we'd need to shut the systems all down for maintenance or what have you, invariably a dozen or so just never came back up. This happened whether we left the AC systems on (which sucked, since it rapidly dropped down to 40 degrees in the server room - we had five 25-ton chillers running 24/7) or took them off too. Just seems like servers in general can't really handle being run at, say, 58 degrees for 2 years and then a rapid increase or decrease.



  • @Jeremy D. Pavleck said:

    Anyway, when we'd need to shut the systems all down for maintenance or what have you, invariably a dozen or so just never came back up. This happened whether we left the AC systems on (which sucked, since it rapidly dropped down to 40 degrees in the server room - we had five 25-ton chillers running 24/7) or took them off too. Just seems like servers in general can't really handle being run at, say, 58 degrees for 2 years and then a rapid increase or decrease.

    That's probably not quite the same problem (I'm betting that most of those were drive failures). Spinning up a drive is a very stressful process - the motor and mechanism are under far higher forces during spin-up than they experience during normal operation. A drive that is near the end of its life, which is rarely spun up (because it's kept spinning most of the time), is more likely to fail at this time than any other. If you have several thousand servers, like in this scenario, then there is a high probability that at any given time, a few of them are nearing the end of their lifespan - these drives are going to fail soon. If you reboot all of the servers at once, statistically you'd expect a handful of those to die in the process, and there's nothing you can do about it; they would have died in the next couple of months anyway, you just brought it forwards with the extra stress.

    You just don't notice it until you get up to that number of servers, because with less than a few hundred, the number of expected failures on reboot is less than 1 (so there's no pattern for you to easily see, without working out the correlation math). The same thing does apply to all the hot chips (CPU, main bridges, etc), but the failure rates for those are much lower, so only the largest installations will see it there - I don't know the numbers off the top of my head.

    The rest of the failures (ie, not drives) are usually caused by power spikes from the sudden increase in load overwhelming your power management kit (avoidable, but you have to do some work with test kit to figure out a safe sequence), or by hardware that had already failed, but which hadn't been exercised in the specific manner of its failure, so you hadn't noticed it yet. For example: a memory stick that had lost a few bytes in one particular region, and "normal operation" just happens to not visit that area of the stick, but the boot process puts a different kind of load on the server and trips the fault.

    Heat is not normally a problem so long as you always stay below the operating limits (typically about 80 degrees Celsius on-die for most chips, and about 60 degrees internal for most drives) and don't ever chill below the dew point (about 0 degrees Celsius if your dehumidifiers are working properly - if metal surfaces feel damp, you have a problem, otherwise you're fine). Variations within that range shouldn't ever cause new damage, although they may exacerbate existing (but unnoticed) damage.
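    The "expected failures on reboot is less than 1" observation above can be sketched numerically. The per-server failure probability below is a made-up figure for illustration, chosen only to show how the effect becomes visible at fleet scale:

```python
# Illustrative sketch: expected number of servers that fail a cold
# restart, assuming each has a small independent chance of dying
# (end-of-life drives, latent faults). The 0.4% figure is an
# assumption for illustration, not a measured rate.

def expected_reboot_failures(num_servers, per_server_failure_prob):
    """Expected failure count for a mass reboot (linearity of expectation)."""
    return num_servers * per_server_failure_prob

for fleet in (50, 300, 3000):
    print(fleet, expected_reboot_failures(fleet, 0.004))
```

    With a few hundred servers the expectation stays below one failure per reboot, so no pattern is visible; at 3,000 servers the same per-server rate predicts roughly a dozen casualties per mass restart, matching the anecdote above.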



  • @asuffield said:

    ...and don't ever chill below the dew point (about 0 degrees Celsius if your dehumidifiers are working properly ...

    Dehumidifier?  <sarcasm>What's that????</sarcasm>

    The server room in my office actually has a humidifier.  Seriously.  At times, it's a challenge to maintain over 20% relative humidity in my neck of the crispy, forest-fire-prone woods.



  • Yeah - low humidity and electronics = ouch



  • @GettinSadda said:

    Yeah - low humidity and electronics = ouch

    Okay, fair point - that's a real problem too, it just doesn't occur in my part of the world. 



  • Hmm...

    I always thought that in the geek's world, the abbreviation for Air Conditioning is A/C, because AC stands for Alternating Current.

    Which made the post appear even more WTFy to me (Computers need electricity? Who would have thought!)
     



  • @asuffield said:

    Heat is not normally a problem so long as you always stay below the operating limits (typically about 80 degrees Celsius on-die for most chips, and about 60 degrees internal for most drives) and don't ever chill below the dew point (about 0 degrees Celsius if your dehumidifiers are working properly - if metal surfaces feel damp, you have a problem, otherwise you're fine). Variations within that range shouldn't ever cause new damage, although they may exacerbate existing (but unnoticed) damage.

    Signed, and I will add two more points:

    • It's not even running drives at a warm temperature that necessarily reduces their lifespan (a drive at 100F is not badly off if it doesn't have too many platters, for example). It's large changes in temperature that you need to watch out for. Any kind of change - shock, air pressure, temperature, power cycling, off-axis rotation, etc. - is not good for hard drives.
    • A second problem you need to watch out for: rapid heating causes thermal expansion. Thermal expansion causes add-on cards, risers, memory heatsinks (you name it) to change shape and potentially lose electrical contact, with the usual consequences. Also, thermal fluctuations are hard on lead-free solder joints (common in new equipment), which can cause hard-to-diagnose (and difficult-to-repair) problems.
    So keep your server room AC running. And monitor the temperature at multiple points in the room, watching out for fluctuations and hotspots.
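    The multi-point monitoring suggestion above can be sketched as follows. `read_sensor` sourcing, sensor names, and both thresholds are assumptions for illustration; in practice you'd poll IPMI, SNMP, or dedicated environmental probes:

```python
# Sketch of multi-point server room monitoring: flag absolute hotspots
# and rapid swings between polls. Thresholds (Fahrenheit) are
# illustrative assumptions, not recommendations.

ALERT_TEMP_F = 80.0   # hotspot threshold (assumed)
ALERT_SWING_F = 10.0  # worrying change between two polls (assumed)

def check_room(readings, previous_readings):
    """Return alert strings for hotspots and rapid temperature swings.

    Both arguments map sensor location -> temperature in Fahrenheit.
    """
    alerts = []
    for location, temp in readings.items():
        if temp > ALERT_TEMP_F:
            alerts.append(f"hotspot at {location}: {temp}F")
        swing = abs(temp - previous_readings.get(location, temp))
        if swing > ALERT_SWING_F:
            alerts.append(f"rapid swing at {location}: {swing}F")
    return alerts

now = {"rack-1 intake": 72.0, "rack-4 exhaust": 91.0, "door": 70.0}
before = {"rack-1 intake": 71.0, "rack-4 exhaust": 78.0, "door": 70.0}
print(check_room(now, before))
# the hypothetical rack-4 exhaust trips both the hotspot and swing checks
```

    Comparing consecutive polls per sensor is what catches the fluctuation problem described in the bullets above, which a single absolute threshold would miss.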

