Linus RAID failure: suspense / horror thriller for IT professionals (video)



  • Well-known PC YouTuber Linus (from LinusTechTips - not Torvalds) suffers RAID failure on his company's server.

    With no backups.

    https://www.youtube.com/watch?v=gSrnXgAmK8k

    I've seen quite a few films over the holidays, but none had me in as much of a cold sweat as this one. Any IT professional who's ever been through an HDD failure will understand.


  • Winner of the 2016 Presidential Election Banned

    I was expecting Linus Torvalds when I saw that title, and I was expecting a hilarious story. I was disappointed. This is not a comedy. This is a tragedy.



  • @Fox said:

    I was expecting Linus Torvalds when I saw that title, and I was expecting a hilarious story. I was disappointed. This is not a comedy. This is a tragedy.

    I was literally sitting at the edge of my seat, going "Oh no! No!" under my breath.


  • ♿ (Parody)

    @Fox said:

    I was expecting Linus Torvalds when I saw that title

    Nah, his backup strategy is far more robust.


  • Winner of the 2016 Presidential Election Banned

    @boomzilla said:

    Nah, his backup strategy is far more robust.

    Yeah, which is why I was expecting a fantastically hilarious story involving a Kylo-Ren-esque tantrum.


  • FoxDev

    @cartman82 said:

    suffers RAID failure on his company's server.

    that's a thing that does happen on occasion.

    at least he didn't fall into the thought trap that RAID = backup

    @cartman82 said:

    With no backups.

    :headdesk:

    Repeat after me:

    RAID DOES NOT EQUAL A BACKUP!

    got that? good.

    now who can tell me the 3-2-1 minimum rule of backups?

    ah... not that guy it seems.
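
    For anyone who wants it spelled out: the 3-2-1 rule is usually stated as at least 3 copies of the data, on at least 2 different kinds of media, with at least 1 copy offsite. A minimal sketch of a checker follows; the `Copy` record, its fields, and the example setups are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    """One copy of the data. All fields are illustrative, not a real API."""
    name: str
    media: str      # e.g. "ssd-array", "hdd-array", "cloud"
    offsite: bool


def satisfies_3_2_1(copies):
    """Return True if the copies meet the 3-2-1 minimum."""
    enough_copies = len(copies) >= 3
    enough_media = len({c.media for c in copies}) >= 2
    has_offsite = any(c.offsite for c in copies)
    return enough_copies and enough_media and has_offsite


# A single RAID array is one copy on one kind of media, all onsite.
raid_only = [Copy("production array", "ssd-array", offsite=False)]

# Hypothetical setup that meets the minimum.
proper = [
    Copy("production array", "ssd-array", offsite=False),
    Copy("local backup NAS", "hdd-array", offsite=False),
    Copy("remote backup", "cloud", offsite=True),
]

print(satisfies_3_2_1(raid_only))  # False
print(satisfies_3_2_1(proper))     # True
```

    However many drives sit inside the array, it still counts as one copy.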



  • He didn't have offsite backups? He knows better than that 😕

    Actually, that sounds pretty much like Linus. Knows his shit but takes unnecessary risks sometimes.



  • I generally assume big-time tech channel youtubers actually know nothing about tech and instead just parrot it from blogs or people that do know a thing or two about tech.



  • He's been working on hardware and case reviews for years, though, originally for NCIX. I've only been following him for ~6 months.



  • I don't get it, is this fiction or did he film a real troubleshooting?


  • Winner of the 2016 Presidential Election

    …and that's why I always use software RAID.



  • @asdf said:

    …and that's why I always use software RAID.

    @accalia said:

    Repeat after me:

    RAID DOES NOT EQUAL A BACKUP!

    ITYM "offsite backup"


  • FoxDev

    @rc4 said:

    ITYM "offsite backup"

    it's not even an onsite backup.

    hardware RAID is more resilient to controller failure than software RAID, but if your RAID controller goes (and in the case of hardware raid you cannot get a replacement or have forgotten to export and back up your raid settings), your data is gone.

    RAID is redundancy, yes.

    and to a certain extent so are backups, yes.

    but RAID is not backups.

    ESPECIALLY IF YOU USE RAID 0!


  • Java Dev

    @rc4 said:

    ITYM "offsite backup"

    I think she meant backup. A backup protects against the user fat-fingering and deleting the wrong file. A RAID array does not.


  • :belt_onion:

    @accalia said:

    RAID 0

    Of course not. Everyone knows that the number of the raid indicates the number of backups...

    That's how that works, right?
    Right?
    ...



  • Well, in the case of RAID 0 and 1, you wouldn't be exactly wrong...


  • :belt_onion:

    I just wanted to point out the fact that the notification for your post is unread, despite me reading and liking it...



  • @accalia said:

    RAID DOES NOT EQUAL A BACKUP!

    Correct, raid is for debugging.

    https://www.youtube.com/watch?v=ugVaMg7GTbI


  • FoxDev

    @rc4 said:

    Well, in the case of RAID 0 and 1, you wouldn't be exactly right*...

    Raid 0 is striping

    Raid 1 is mirroring.

    :-D

    * in the case of 2 setups for RAID 1 only



  • @rc4 said:

    Well, in the case of RAID 0 and 1, you wouldn't be exactly wrong...

    Works for 0, works for 1, by induction...



  • @accalia said:

    Repeat after me:

    RAID DOES NOT EQUAL A BACKUP!

    OK, I get that in 99% of cases that's true, but

    the way I see it, there are 2 sources of data loss: hardware failure (disk goes poof, natural disasters...) and software failure (data corruption, malware, accidentally deleting files you didn't really want to delete).

    What if you could remove software failures? For example, instead of giving your computer direct access to your hard drives, you connect them to a different computer running a specifically designed trusted OS that offers them as a network service but does not allow any deletions, except with a 7 day waiting period or something.

    So now malware can't delete your files. If you eliminate hard drive failures with RAID, wouldn't that be almost completely safe, barring natural disasters?

    Yes, still needs some geographic redundancy, which makes the whole thing (heh) redundant. But food for thought.



  • What's the difference between deleting a file and overwriting a file? You can open an existing file and empty it without ever deleting it.
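
    A store like the one proposed would have to treat every overwrite as a new version, not only block explicit deletes. Here is a minimal in-memory sketch under that assumption. `DelayedDeleteStore` and its methods are hypothetical names, and timestamps are passed in explicitly so the behaviour is easy to follow.

```python
from datetime import datetime, timedelta

RETENTION = timedelta(days=7)  # the "7 day waiting period" from the proposal


class DelayedDeleteStore:
    """Hypothetical store: writes never destroy old data immediately."""

    def __init__(self):
        # name -> list of (written_at, superseded_at or None, content)
        self._versions = {}

    def write(self, name, content, now):
        history = self._versions.setdefault(name, [])
        if history and history[-1][1] is None:
            written_at, _, old = history[-1]
            history[-1] = (written_at, now, old)  # mark old version superseded
        history.append((now, None, content))

    def delete(self, name, now):
        # A delete is just an overwrite with "nothing"; old data is retained.
        self.write(name, None, now)

    def recover(self, name, as_of):
        """Return the content that was current at `as_of`, if still retained."""
        for written_at, superseded_at, content in self._versions.get(name, []):
            if written_at <= as_of and (superseded_at is None or as_of < superseded_at):
                return content
        return None

    def purge(self, now):
        """Drop versions that were superseded more than RETENTION ago."""
        for name, history in list(self._versions.items()):
            self._versions[name] = [
                v for v in history if v[1] is None or now - v[1] <= RETENTION
            ]


t0 = datetime(2016, 1, 1)
store = DelayedDeleteStore()
store.write("project.mov", b"footage", t0)
store.write("project.mov", b"", t0 + timedelta(days=1))  # file gets emptied
store.purge(t0 + timedelta(days=3))
print(store.recover("project.mov", t0))  # b'footage'
store.purge(t0 + timedelta(days=9))
print(store.recover("project.mov", t0))  # None
```

    The second lookup is the catch: once the retention window has passed, the emptied file is the only version left.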



  • I've been there and felt all the same panic. Well not a multi-TB server, but I know the feeling of panic when you think you may have lost lots and lots of hard work. (The first time was in the 7th grade when my science fair project had only one copy of an Apple II 140K floppy that got a corrupted directory. My father and I spent 3 hours with a sector editor to rebuild the directory so we could access the files.)

    That being said, I can't believe they had such a large server without an automated backup system in place.

    I'm also surprised that they were striping across three RAIDs. Is it really that critical to keep everything on one logical volume? Would it really be that horrible to let the three RAIDs act as three separate file systems?

    @accalia said:

    RAID DOES NOT EQUAL A BACKUP!

    It's hard to believe how many people don't understand this concept.

    RAID will protect against individual drive failure, but that's about it. There's no protection against (as we saw here) RAID controller or motherboard failure. There's no protection against multiple-drive failures (some RAID topologies will protect against more than one drive at a time, but there's always a limit.) There's also no protection against malware, a software glitch/bug, a malicious user, a stupid user, or just a dumb mistake trashing your data when all the hardware is working perfectly.
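
    To make the single-drive point concrete, here is a toy sketch of RAID-5-style parity using XOR. It models one stripe with byte strings standing in for drives, and the function names are made up; it is not how any particular controller lays out data.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild(stripe, parity):
    """Rebuild a stripe in which missing blocks are marked as None."""
    missing = [i for i, block in enumerate(stripe) if block is None]
    if len(missing) > 1:
        raise ValueError("single parity cannot rebuild more than one lost block")
    stripe = list(stripe)
    if missing:
        survivors = [b for b in stripe if b is not None]
        stripe[missing[0]] = xor_blocks(survivors + [parity])
    return stripe


data = [b"AAAA", b"BBBB", b"CCCC"]   # three data blocks in one stripe
parity = xor_blocks(data)            # parity block, stored on a fourth drive

print(rebuild([b"AAAA", None, b"CCCC"], parity))  # one drive lost: recoverable
try:
    rebuild([None, None, b"CCCC"], parity)         # two drives lost
except ValueError as err:
    print(err)
```

    Parity only ever answers the question "what was on the one missing drive?". It has nothing to say about a deleted file, because the delete is faithfully written to every drive.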

    It also doesn't protect against a disaster where fire, flood, tornado, earthquake, whatever trashes the entire building and physically destroys your server.

    All of these are covered by a proper backup strategy (that includes off-site storage.) If Linus had a recent backup, they could've just replaced the failed motherboard and RAID card, reformatted the array and restored from the backup, losing, maybe, a day's worth of work. As opposed to this scenario where they had to enlist data recovery specialists and were really lucky they didn't end up losing everything.

    @anonymous234 said:

    For example, instead of giving your computer direct access to your hard drives, you connect them to a different computer running a specifically designed trusted OS that offers them as a network service but does not allow any deletions, except with a 7 day waiting period or something.

    Stuff like this can lower the odds of catastrophic data loss, but it can't eliminate it. That trusted server could still fail catastrophically - its RAID controller or motherboard could die. It could get infected by malware (don't ever assume something is invulnerable). And something could delete/overwrite infrequently-accessed files such that the corruption is not discovered until after your 7 day waiting period.

    Ultimately, there is no substitute for making backups, and keeping some of them off-line when they're not actively in use.



  • @anonymous234 said:

    Yes, still needs some geographic redundancy, which makes the whole thing (heh) redundant. But food for thought.

    Indeed. You could call it a "Filer" and fill it with waffles.



  • @anonymous234 said:

    wouldn't that be almost completely safe

    No.

    Cryptolocker.


  • Grade A Premium Asshole

    @boomzilla said:

    Nah, his backup strategy is far more robust.

    Not as robust as blakey's. He keeps his entire machine backed up to Git.


  • Grade A Premium Asshole

    @anonymous234 said:

    the way I see it, there are 2 sources of data loss: hardware failure (disk goes poof, natural disasters...) and software failure (data corruption, malware, accidentally deleting files you didn't really want to delete).

    You forgot #3, user failure. You sort of covered it, but not explicitly.


  • Fake News

    @David_C said:

    I'm also surprised that they were striping across three RAIDs. Is it really that critical to keep everything on one logical volume? Would it really be that horrible to let the three RAIDs act as three separate file systems?

    I was also wondering WTH they were thinking, though it seems it's got a name at least:

    Doing it on SSDs without backup though... When those things die, they die properly.


  • ♿ (Parody)

    @Polygeekery said:

    @boomzilla said:
    Nah, his backup strategy is far more robust.

    Not as robust as blakey's. He keeps his entire machine backed up to Git.

    Yeah, same deal. Except Linus' is replicated in thousands (at least) of geographically separate places.


    Filed Under: Who whooshed here?


  • Discourse touched me in a no-no place

    @boomzilla said:

    Who whoosed here?

    Who what there?


  • ♿ (Parody)

    @loopback0 said:

    Who what there?

    Don't make fun of speech impediments.



  • @Polygeekery said:

    Not as robust as blakey's. He keeps his entire machine backed up to Git.

    That's actually my backup strategy as well.

    @blakeyrat said:

    I don't get it, is this fiction or did he film a real troubleshooting?

    It looks real to me.



  • Real.



  • It was RAID 5 + 0. I assume he was aware that he didn't have a backup, he just wasn't caring enough to get that set up.



  • @JazzyJosh said:

    It was RAID 5 + 0. I assume he was aware that he didn't have a backup, he just wasn't caring enough to get that set up.

    My impression is, they were moving offices. Offsite backup was in the plans, but they figured it wasn't a priority.

    They figured wrong.



  • Yeah, moving offices has been happening for the past few months.


  • FoxDev

    @JazzyJosh said:

    It was RAID 5 + 0. I assume he was aware that he didn't have a backup, he just wasn't caring enough to get that set up.

    :rolleyes:

    nota bene: i didn't watch the video, i was commenting purely out of personal experience with people who don't understand raid



  • 3:10 in the video - new RAID controller saw the drives as unconfigured.

    What kind of stupid RAID controller doesn't store the config on the drives? Seriously! A few kilobytes cut off the end of each drive will hold almost any possible configuration, and allow for a raid card failure to be a simple problem.

    If I were in his place, I'd make LSI pay for the data recovery.



  • @cartman82 said:

    My impression is, they were moving offices. Offsite backup was in the plans, but they figured it wasn't a priority.

    From my experience, a workplace saying "we're moving offices in 3 months" actually means, "we'll be here another year and a half at least." One company I was with claimed to be "about to move offices" so long (3+ years) that I could only come to the conclusion it was a bald-faced lie.



  • @blakeyrat said:

    From my experience, a workplace saying "we're moving offices in 3 months" actually means, "we'll be here another year and a half at least."

    But every now and then that works in reverse.

    I was once hired to babysit an old VMS installation "just for another six months or so, while we migrate the applications onto Windows". I accepted the job, started a few weeks later, and found the last server sitting at an SRM prompt while all of the company data was happily hosted on Windows[1]. The only time I actually fulfilled my original job description was two months later during an audit when I had to press the "B" key to boot it up again.

    [1] (Okay, it was on Oracle, on Windows, so it was probably unhappy twice over, but at least it was there.)


  • Winner of the 2016 Presidential Election

    @accalia said:

    hardware RAID is more resilient to controller failure than software RAID

    ?

    If there's no RAID controller that can fail, then there's one less piece of hardware that can corrupt your data when it fails. And I've heard a lot of bad stories, especially about cheap RAID controllers.



  • @Circuitsoft said:

    If I were in his place, I'd make LSI pay for the data recovery.

    IIRC it seemed to be a motherboard issue, not a card issue, but since moving hardware around could make things worse, he didn't want to mess with it.



  • Well, their old office was a house, and they've moved out of it, so...


  • Notification Spam Recipient

    To be fair, he was in the process of backing it up to the backup array when it failed (it's almost a given that this scenario is the most likely to happen).
    I believe there was an older array that had backups, but it was destroyed during the creation of the new one.



  • Things I haven't seen mentioned yet...

    • He has a 24 (as in 3x8!!!!) drive array of 960GB SSDs. That's almost $10,000 just in storage!!!
    • Tagged into a whitebox server, because saving all that money on not buying a Dell is going to make the difference here
    • He was MIXING hardware and software RAID (3 RAID-5s, striped together within Windows)
    • Because he used cheap cheap RAID controllers that could only handle 8 drives at a time, rather than, you know, an expander-capable card & backplane so that one RAID controller could handle all the drives.
    • The original problem was the server restarting / crashing when under load. So he started troubleshooting by "remaking" his backup :wtf: and then "witnessed the RAID card giving up the ghost".
    • His initial troubleshooting included transferring all the drives to another box / backplane!!! and mucking around putting a "bigger, better (desktop) power supply" in to see if maybe it was just underpowered.
    • And then he decides to rebuild the array by pulling drives out willy-nilly
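
    Rough numbers for that layout, as a sketch. It assumes decimal gigabytes, ignores filesystem overhead, and models only drive failures in RAID 5 groups that are striped together; the function names are made up.

```python
DRIVE_GB = 960
GROUPS = 3
DRIVES_PER_GROUP = 8


def usable_gb(groups, drives_per_group, drive_gb):
    """Each RAID 5 group gives up one drive's worth of space to parity."""
    return groups * (drives_per_group - 1) * drive_gb


def volume_survives(failed_per_group):
    """A stripe over RAID 5 groups is lost once any group loses 2+ drives."""
    return all(failed <= 1 for failed in failed_per_group)


print(usable_gb(GROUPS, DRIVES_PER_GROUP, DRIVE_GB))  # 20160
print(volume_survives([1, 1, 1]))  # True: one failure in each group
print(volume_survives([2, 0, 0]))  # False: two failures in the same group
```

    So roughly 20 TB of data sits on a volume that two badly placed drive failures can take out entirely.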

    I mean, yikes.

    This whole thing should be one really large RGE that results in this guy being permanently consigned to working at McDonalds. Because he shouldn't be allowed anywhere near system admin work.



  • And honestly, if you think that a software RAID-0 striping of three hardware RAID-5s can be recovered by manually recovering each RAID-5 one at a time, well...

    (Not to mention, "let's recover it onto a 8TB 7200RPM desktop grade Seagate SATA drive!")
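
    For what it's worth, here is a toy model of why no single member of a stripe is readable on its own. It assumes plain round-robin striping with a fixed chunk size; `stripe` and `unstripe` are hypothetical helpers, not what Windows actually does on disk.

```python
CHUNK = 4  # bytes per stripe chunk; real arrays use much larger chunks


def stripe(data, members, chunk=CHUNK):
    """Distribute data round-robin across `members` arrays, RAID 0 style."""
    arrays = [bytearray() for _ in range(members)]
    for n, start in enumerate(range(0, len(data), chunk)):
        arrays[n % members] += data[start:start + chunk]
    return [bytes(a) for a in arrays]


def unstripe(arrays, chunk=CHUNK):
    """Reassemble the volume; needs every member, in the right order."""
    out = bytearray()
    offset = 0
    while any(offset < len(a) for a in arrays):
        for a in arrays:
            out += a[offset:offset + chunk]
        offset += chunk
    return bytes(out)


volume = b"The quick brown fox jumps over the lazy dog"
members = stripe(volume, 3)
print(members[0])                    # every third chunk only
print(unstripe(members) == volume)   # True
print(unstripe(members[::-1]) == volume)  # False: wrong member order
```

    Getting each RAID-5 healthy again is only half the job: the stripe order and chunk size have to be right too before the volume means anything.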



  • And you don't come out of a disastrous disaster recovery like that and say "hey, we'll have better backup in the future". You get the damn backup running properly BEFORE you tell the users it's back online. And then you check the damn backup with a test restore. And then you think about telling the users.
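
    A test restore can be checked mechanically. Here is a minimal sketch that compares SHA-256 digests of the source tree against the restored tree. The helper names are made up, the demo uses throwaway directories as stand-ins for the server and the restore target, and it reads whole files into memory, which a multi-TB video server would not want.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def tree_digests(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*")) if p.is_file()
    }


def verify_restore(source, restored):
    """Return relative paths that are missing or differ after a test restore."""
    want, got = tree_digests(source), tree_digests(restored)
    return sorted(path for path in want if got.get(path) != want[path])


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "source"
    restored = Path(tmp) / "restored"
    source.mkdir()
    (source / "video.mov").write_bytes(b"raw footage")
    (source / "notes.txt").write_bytes(b"edit notes")

    shutil.copytree(source, restored)          # stand-in for backup + restore
    print(verify_restore(source, restored))    # []

    (restored / "notes.txt").write_bytes(b"")  # simulate a bad restore
    print(verify_restore(source, restored))    # ['notes.txt']
```

    An empty list is the only result that should let anyone tell the users it's back online.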



  • Where I work, when setting up our server, I started with stacked RAID - two mirrored RAID5s, but then I realized that RAID6 will give me as much reliability and more space, so I went to straight RAID6.

    Still, it's across 6 drives: 2 each 6TB drives from 3 manufacturers, with one hot spare from each manufacturer, ready to be swapped in if mdadm deems it necessary.
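
    The space side of that trade-off, as quick arithmetic. This is a sketch that assumes the stacked layout was two 3-drive RAID 5 groups mirroring each other and that the hot spares are not counted; both are my assumptions, and the function names are made up.

```python
DRIVE_TB = 6
DRIVES = 6


def raid6_usable(drives, drive_tb):
    """RAID 6 spends two drives' worth of space on parity."""
    return (drives - 2) * drive_tb


def mirrored_raid5_usable(drives, drive_tb):
    """Two equal RAID 5 groups mirroring each other: one group's data capacity."""
    per_group = drives // 2
    return (per_group - 1) * drive_tb


print(mirrored_raid5_usable(DRIVES, DRIVE_TB))  # 12
print(raid6_usable(DRIVES, DRIVE_TB))           # 24
```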

    Backups are also mirrored to our other office.


  • Notification Spam Recipient

    My house lives on the wild dangerous side, with only RAID-Z2 (double parity across four disks).
    Eventually I'll sign up for some kind of off-site backup for my "critical" stuff (like photos that nobody will ever access), but for now, 🍹 Here's to hoping the disks don't die.



  • @blakeyrat said:

    From my experience, a workplace saying "we're moving offices in 3 months" actually means, "we'll be here another year and a half at least." One company I was with claimed to be "about to move offices" so long (3+ years) that I could only come to the conclusion it was a bald-faced lie.

    In my experience, 90% of anything they say is a lie, I just ignore it until I see it happening.

