Well the backup did _work_



  • Recently major "data corruption connected with a RAID disc failure" affected "Home directories for usernames beginning S to Z" at a world-leading establishment. That in itself is not a WTF. But some of the emails flying around revealed where some WTFery was going on.

    "This morning we concluded we had no choice but to restore
    the volumes from tape backups. This being our first
    real use of our current tape system & software to carry out a restore,
    we have unfortunately found that the restore speed is painfully slow
    (it appeared that restoring each volume would take a whole day) and
    we have been investigating why."

    Can I say, test your restore procedure beforehand?

    For context, there were three volumes, each 16 GB I think.

    And the icing on the cake,

    "We regret that changes to files made after backups were made i.e. work done by users S to Z on the day of failure have been lost."

    That's possibly quite important work for a number of people, although I don't know how many use this system rather than one of the various others around (this is an establishment with over 10,000 people in many departments, with a number of different systems).



  • [quote user="m0ffx"]

    Recently major "data corruption connected with a RAID disc failure" affected "Home directories for usernames beginning S to Z" at a world-leading establishment. That in itself is not a WTF. But some of the emails flying around revealed where some WTFery was going on.

    "This morning we concluded we had no choice but to restore
    the volumes from tape backups. This being our first
    real use of our current tape system & software to carry out a restore,
    we have unfortunately found that the restore speed is painfully slow
    (it appeared that restoring each volume would take a whole day) and
    we have been investigating why."

    Can I say, test your restore procedure beforehand?

    For context, there were three volumes, each 16 GB I think.[/quote]

    That's a nice couple of WTFs there.  I'd describe our local tape backup system as "painfully slow", and it can restore a single 40GB volume in four hours, or our entire mailserver in eight.  And not testing the backup beforehand is just plain stupid.


    [quote user="m0ffx"]And the icing on the cake,

    "We regret that changes to files made after backups were made i.e. work done by users S to Z on the day of failure have been lost."

    That's possibly quite important work for a number of people, although I don't know how many use this system rather than one of the various others around (this is an establishment with over 10,000 people in many departments, with a number of different systems).

    [/quote]

    No WTF there.  It's almost never worthwhile to make continuously-updated backups except through RAID, so losing a day's work if the RAID array fails is normal.



  • Yeah I know that last bit isn't really a WTF. It's just annoying for people like me who use the system. At least I wasn't working on files on it on Monday.
     



  • [quote user="Carnildo"]And not testing the backup beforehand is just plain stupid.[/quote]That's all very well for large companies and such, but then: small companies... 

    How would you go about testing the restore procedure on the only server, which has the only tape drive? Of course you could re-install when the restore test fails, but the time spent on managing IT by the most knowledgeable clerk is very closely guarded by the manager, who wants to know why he's still busy on the damn machine instead of "doing his job" (i.e. accounting or whatever)...

    But then again: in most small organisations the IT is one big steaming heap of WTFs.



  • I'm interested that you consider an organisation with over 10,000 people "small". I'd understand that kind of thing from some of the smaller departments that handle their own IT systems, there are plenty of WTFs at that level. But this was the main IT department that handles all the used-by-everyone systems, like almost everyone's email, all the network routing, and lots more.



  • No, I was referring to companies closer to home: companies I have to
    work with day by day :) Most of them are small, in the 4-employee,
    half-a-server ballpark ;)



  • [quote user="pnieuwkamp"]

    [quote user="Carnildo"]And not testing the backup beforehand is just
    plain stupid.[/quote]That's all very well for large companies and such,
    but then: small companies... 

    How would you go about testing
    the restore procedure on the only server, having the only tapedrive?

    [/quote]

    Well, in the small company I work for, we kept an old server around and upgraded the hard drives so that in an emergency it could function as a replacement for any of the other servers (fileserver, mailserver, database server).  Backup testing is done by restoring to that computer.

    (Having three servers in this company is something of a WTF in and of itself: before I got here, nobody'd thought of doing any sort of load profiling on the servers, so whenever they had a new server task, they simply added another server.  The newest of the servers could handle the entire load by itself)
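
    A restore-to-a-spare-machine test like the one described above could be checked automatically along these lines. This is only a minimal sketch, not what the poster's company actually runs: the function names (`file_digest`, `tree_digests`, `compare_trees`) and the choice of SHA-256 are assumptions made for illustration, and it presumes both the original and the restored directory trees are reachable as local paths.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of one file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def tree_digests(root):
    """Map each file's path (relative to root) to its digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare_trees(original, restored):
    """Return (missing, extra, changed) lists of relative paths.

    missing: in the original but not in the restored copy
    extra:   in the restored copy but not in the original
    changed: present in both, but with different contents
    """
    a = tree_digests(original)
    b = tree_digests(restored)
    missing = sorted(set(a) - set(b))
    extra = sorted(set(b) - set(a))
    changed = sorted(k for k in a.keys() & b.keys() if a[k] != b[k])
    return missing, extra, changed
```

    Files modified on the live system since the backup was taken will show up as "changed", so a non-empty result needs a human look rather than being treated as an automatic failure.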



  • [quote user="pnieuwkamp"]

    [quote user="Carnildo"]And not testing the backup beforehand is just
    plain stupid.[/quote]That's all very well for large companies and such,
    but then: small companies... 

    How would you go about testing
    the restore procedure on the only server, which has the only tape drive? Of
    course you could re-install when the restore test fails, but the time
    spent on managing IT by the most knowledgeable clerk is very closely
    guarded by the manager, who wants to know why he's still busy on the
    damn machine instead of "doing his job" (i.e. accounting or
    whatever)...

    But then again: in most small organisations the IT is one big steaming heap of WTFs.

    [/quote]

    Anyone with half a clue in such an organization should ask themselves a simple question:  Which will cost me more?  A day testing my backup system, or losing all my data in a catastrophic failure?
     



  • Losing all your data due to a catastrophic failure when testing your backup system. If you don't have a spare machine to test it on, testing it on the production machine is possibly worse than not testing it at all.



  • [quote user="m0ffx"]Losing all your data due to a catastrophic failure when testing your backup system. If you don't have a spare machine to test it on, testing it on the production machine is possibly worse than not testing it at all.
    [/quote]

    You made me laugh out loud... but you are completely right. 



  • The backup system probably was tested, just with a small dataset.  That is, they tested and knew that they could restore backed-up data, but didn't test with a large enough set to realize that a full-size restore would take so long.

    It's often not immediately obvious from a scaled-down test just how the real deal is going to perform.
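
    For a rough sense of scale, the figures quoted earlier in the thread can be turned into throughput numbers. This is a back-of-the-envelope sketch only: the 16 GB volume size was itself given as "I think", the helper names are made up for illustration, and a naive linear extrapolation like this is exactly what a small test can get wrong, since it ignores per-file overhead, tape seeks and similar effects.

```python
GB = 1024 ** 3
MB = 1024 ** 2


def throughput_mb_per_s(size_gb, hours):
    """Average restore rate in MB/s for size_gb restored in the given hours."""
    return size_gb * GB / (hours * 3600) / MB


def estimated_hours(size_gb, rate_mb_per_s):
    """Hours needed to restore size_gb at a constant rate (linear assumption)."""
    return size_gb * GB / (rate_mb_per_s * MB) / 3600


# Figures from the thread: 16 GB per volume in about a day,
# versus a "painfully slow" local system doing 40 GB in four hours.
reported = throughput_mb_per_s(16, 24)
local = throughput_mb_per_s(40, 4)

print(f"reported restore: {reported:.2f} MB/s")
print(f"local restore:    {local:.2f} MB/s")
print(f"ratio:            {local / reported:.1f}x")
print(f"16 GB at the local rate: {estimated_hours(16, local):.1f} hours")
```

    On those numbers the reported restore works out to roughly 0.19 MB/s against about 2.8 MB/s for the system already called "painfully slow", a factor of about 15.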

     

