When will people learn to care for important data properly?



  • I've just been through a messy data recovery (I work at a company in downtown San Francisco that was affected by Tuesday's string of power outages).  I'm a software engineer, but for reasons I can't explain, I was put in charge of doing the Perforce backup, but only a small part of it (the database checkpoint, since it needs to be scripted). 

    The first thing that should be mentioned is that we store absolutely everything in Perforce (this includes management PowerPoints, Photoshop files and source files).

    The power flaps resulted in a chewed-up NTFS volume containing our Perforce data that wasn't salvageable.  On top of that, our parent company backs up the whole server over a WAN connection, and a full image of the Perforce server takes 3 days to make.  We had to restore from Thursday night (losing 2.5 days of revision control history), integrate targets, etc.

    Lesson 1: If your questionably acquired UPS suddenly dies, don't just unplug everything and wait for slow-boat purchasing to replace it.

    Lesson 2: If you're the IT person for a small company (~100 people) and you've been watching several months of bitching in the event log about a failing RAID controller, do something *RIGHT NOW*.  Don't let anyone argue that it can wait.

    Lesson 3: When a software engineer is asked to do sysadmin work, let him work.  Everyone was getting bitchy with me for usurping part of a Windows fileserver to put colinux on, so that I could do my stuff in peace (we were lucky that I disobeyed everyone and just did it, since we had a good checkpoint to restore the Perforce DB from).

    Lesson 4: If you have an expensive DLT changer and a tape gets stuck in the ejector, open it up and unstick the tape.  Since the tape drive was "broken", nobody even told me we had it, so I requested (and was denied) a cheap USB AIT drive (I don't have physical access to the server room, so I couldn't see for myself that this beast existed until I was let in to help rebuild Perforce).

    Hopefully someone here will benefit from these lessons.  It took 100 people ~3 days of lost productivity to recover, when we could have spent $500 on a cheap external tape drive and been done with most of the pain months ago.  Actually, fixing any one of the DLT, the UPS or the RAID array would have completely averted this mess.  The lesson is: don't let things sit.  Pay attention.  It took over a year for these failures to accumulate and come together.  (For the curious, a rough sketch of what the checkpoint script boils down to is below.)
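
    For anyone wondering what "scripting the checkpoint" actually involves: not much.  It boils down to one command; the flags below are the real p4d/p4 ones, but the path is made up for illustration.

        # On the server itself, as the user the Perforce service runs as:
        # -jc takes a checkpoint and rotates the journal, -z gzips the output.
        p4d -r /perforce/root -jc -z

        # Or remotely, from any machine with a p4 client and super access:
        p4 admin checkpoint -z

    Schedule either of those nightly and you've got something restorable; the hard part, as we found out, is making sure the result actually leaves the machine.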
     



  • Oh oh Mr. Kotter, I know the answer to the question posed in the topic!

     

    Is it never? 



  • @KattMan said:

    Oh oh Mr. Kotter, I know the answer to the question posed in the topic!

    Is it never? 

    No. The correct answer is: "When it is too late" 



  • @ammoQ said:

    @KattMan said:

    Oh oh Mr. Kotter, I know the answer to the question posed in the topic!

    Is it never? 

    No. The correct answer is: "When it is too late" 

    "...and only temporarily."

    We really need to go back to flaky hardware that throws people out the windows every few weeks. Then people might take backups seriously.

     - Tech who, right now, has a considerable quantity of business-critical data stored, and only stored, on one spindle. 



  • Ugh, don't get me started on crap UPS systems. As a tech support company we got a lot of units of questionable quality from our supplier and in turn passed them on to our clients. You might think that we'd get some decent quality APC systems but those cost three times as much as the crap brands. We're not stupid, you know. So we got the units and they were great... for a while. After a while we discovered that when the power switched off, they'd switch off as well. I mean, you can't expect electronics (and especially a UPS) to work with no power!

    We spent countless hours on RMAs to get rid of all that crap and replaced it with... you guessed it, new models of the same brand.  Because we're not stupid enough to spend extra cash on quality stuff like some other losers....
     



  • @robbak said:

    "...and only temporarily."

    We really need to go back to flaky hardware that throws people out the windows every few weeks. Then people might take backups seriously.

    Like, with a robotic arm and stuff.



  • Lesson 3: When a software engineer is asked to do sysadmin work, let him work.  Everyone was getting bitchy with me for usurping part of a Windows fileserver to put colinux on, so that I could do my stuff in peace (we were lucky that I disobeyed everyone and just did it, since we had a good checkpoint to restore the Perforce DB from).

    You installed colinux (experimental software) on a production machine? Why did you need colinux anyhow?



  • Because I'm hopeless at Windows sysadmining and gave up on trying to figure out how to mount the target share at the parent company so I could copy the checkpoint to it.  After getting my shell box up, it was as easy as cron+samba to mount the shares and copy the checkpoint (roughly the sketch at the end of this post). 

    The real WTF is that I (somebody whose only system administration experience is with Solaris 2.4, ten years ago) am anywhere near the production environment :-)

    So in short, I needed it because I knew I could get the job done with it, without disturbing anything else, and without trying to fly blind at 2am to figure out what ate my at job.  Somebody should really write a primer about how things work in the netherworld of the at service from the perspective of a system administrator.  Lots of this kind of literature exists for cron ("your stdin will be dup'd from /dev/null" is a good example), but almost none for Windows.  Basic things like 'Is HKCU connected, and do I have write permission in there? (And who is the current user, by the way?)' and 'What's the right way to temporarily mount a share as an ordinary domain user from inside an at job?'.

    Oh, edit ... you may have noticed this was all done on a budget, since I couldn't just get a tape drive and we never paid to repair the old one.  I could have asked for a new piece of hardware, but the request would hardly have gotten past the evil laughter. 
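
    For completeness, the cron job itself was nothing clever - roughly the sketch below.  Every share name, mount point and credentials file here is invented for illustration; it shows mount -t cifs, though the older smbmount does the same job.

        #!/bin/sh
        # Nightly from cron on the colinux box: mount the shares, copy the
        # newest checkpoint across, unmount.  All names below are made up.
        SRC=//p4server/checkpoints      # share where the checkpoints land
        DST=//parentco/p4backup         # target share at the parent company

        mkdir -p /mnt/src /mnt/dst
        mount -t cifs "$SRC" /mnt/src -o credentials=/root/.p4creds || exit 1
        mount -t cifs "$DST" /mnt/dst -o credentials=/root/.p4creds || { umount /mnt/src; exit 1; }

        # Newest gzipped checkpoint, as produced by p4d -jc -z.
        latest=$(ls -t /mnt/src/checkpoint.*.gz 2>/dev/null | head -n 1)
        [ -n "$latest" ] && cp "$latest" /mnt/dst/

        umount /mnt/dst
        umount /mnt/src

    The crontab entry was along the lines of "0 3 * * * /usr/local/sbin/ship-checkpoint.sh >/var/log/ship-checkpoint.log 2>&1" - exactly the sort of thing I never worked out how to express sanely as a Windows at job.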



  • @DOA said:

    Ugh, don't get me started on crap UPS systems. As a tech support company we got a lot of units of questionable quality from our supplier and in turn passed them on to our clients. You might think that we'd get some decent quality APC systems but those cost three times as much as the crap brands.

    APC used to make decent quality UPSes about ten years ago, but they don't any more. Their top-of-the-line models no longer rate anything above "mediocre"; they've gone low-budget all the way through their product line. The decent ones typically cost about twice as much as the equivalent APC model.



  • @asuffield said:

    APC used to make decent quality UPSes about ten years ago, but they don't any more. Their top-of-the-line models no longer rate anything above "mediocre"; they've gone low-budget all the way through their product line. The decent ones typically cost about twice as much as the equivalent APC model.
    Which brand do you recommend then? In my experience, APCs just work, unlike some of the cheaper brands that died on the first power failure.



  • @ender said:

    @asuffield said:
    APC used to make decent quality UPSes about ten years ago, but they don't any more. Their top-of-the-line models no longer rate anything above "mediocre"; they've gone low-budget all the way through their product line. The decent ones typically cost about twice as much as the equivalent APC model.
    In my experience, APCs just work, unlike some of the cheaper brands that died on the first power failure.

    Oh sure, the cheaper ones are even worse. APC just used to be really good, and now they're pretty unimpressive - they aren't worthless junk, but neither can you rely on them to soak up lightning strikes without tripping out, or smooth out noisy power to the point where your servers don't crash.

    Which brand do you recommend then?

    Sadly, I'm not aware of any brands that are reliable these days, in the sense that anything you buy from them will be decent. You have to research each individual model, and have some level of understanding of how they work; it's roughly as difficult as finding a good quality motherboard.



  • @asuffield said:

    APC just used to be really good, and now they're pretty unimpressive - they aren't worthless junk, but neither can you rely on them to soak up lightning strikes without tripping out, or smooth out noisy power to the point where your servers don't crash.

    I don't know if any UPS can take a lightning strike.  I mean, it's only about a million volts.  As for smoothing noisy power, you would need to at least get into the Smart UPS line for true sine-wave output.



  • @operagost said:

    @asuffield said:

    APC just used to be really good, and now they're pretty unimpressive - they aren't worthless junk, but neither can you rely on them to soak up lightning strikes without tripping out, or smooth out noisy power to the point where your servers don't crash.

    I don't know if any UPS can take a lightning strike.  I mean, it's only about a million volts.

    It's surprisingly simple to handle voltages on that scale. The basic construct is a widget known as a "spark gap". Basically, it's two heavy-duty metal prongs, where one is in the live feed, and the other is connected to earth. If an extreme voltage spike hits, over a couple of thousand volts or so, then the air between the prongs ionizes and the whole thing grounds out to earth across the gap. A little gets past, but it's only about a thousand volts, which the UPS then has to soak. Ionized air will support any voltage that lightning could carry in the first place (since it's the same thing), and it's easy to build the prongs to the point where they don't melt. They're commonly used by the power and telephone companies to protect their own equipment, which gets direct lightning strikes on a regular basis.

    It's actually more difficult to get rid of that last thousand volts than it is to ditch the first million. Heavy capacitors and trips are involved. 

    APC UPSes manage to protect the equipment behind them, but their input circuit usually dies in the process, so your server shuts down when the battery runs out. A really good UPS will reset and keep going.

    As for smoothing noisy power, you would need to at least get into the Smart UPS line for true sine-wave output.

    They still don't manage to remove harmonic noise from the utility power. The smoothers in them just aren't that good. Hook up an old electric drill to the source, and the UPS output voltage will wobble all over the place. It doesn't help that the "Smart UPS" line is mostly only line-interactive, and not dual-conversion (although even their dual-conversion models still let noise through).

