Linus RAID failure: suspense / horror thriller for IT professionals (video)


  • Discourse touched me in a no-no place

    @blakeyrat said:

    From my experience, a workplace saying "we're moving offices in 3 months" actually means, "we'll be here another year and a half at least."

    Of course. “Moving offices” means “creating a committee to consider the plan for moving people to new offices”. The actual move itself is only tangentially related and might get cancelled if the committee orders the wrong type of cookies when the VP of Office Relocation comes visiting…



  • I had written a program to rescue a hardware RAID myself, and I think I'd have given up at the moment I saw the "corrupted but not completely broken" video.

    These guys are amazing.



  • Knowing how resilient MPEG streams are, I immediately thought from that video that they had got the striping close, but not quite right.



  • @asdf said:

    that's why I always use software RAID.

    That's why I always want to use software RAID.

    Not always feasible. I found out to my disgust that the RAID controller in the server we'd bought and paid for - the one we had to keep in order for Acer to honor the server's hardware warranty - simply didn't support JBOD mode; the only way to make the attached drives even visible to the BIOS was by using the inbuilt tools to format them as some kind of RAID array first. Took the path of least resistance and offloaded RAID 6 to the controller, crossed fingers, hoped for the best.

    Got the worst instead, naturally, and ended up needing to recover some vital and not-yet-backed-up stuff from an NTFS filesystem inside a QCOW2 disk image inside an Ext4 filesystem on top of LVM2 on top of this hardware-managed RAID6 disk set, after a disk-corrupting server-killing power supply failure, with replacement hardware at least a week away (we're out in the sticks).

    It wasn't pretty. Ended up learning more about Linux software RAID than I wanted to, but did eventually manage to make it talk to the raw disks off this thing in a manner close enough to what the hardware RAID card was doing as to be able to get what I needed off it.
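
    For the curious, the general shape of the trick looked something like the sketch below. It is only a sketch, not the exact incantation: the image paths, member order, chunk size, layout and metadata version are stand-ins you'd have to work out for your own controller by trial and error, and you only ever point it at ddrescue copies, never the original drives, because mdadm --create writes new superblocks onto whatever you hand it.

    ```python
    # Re-create (approximately) what the hardware controller was doing, using
    # Linux md over loop devices backed by *copies* of the member disks.
    # Every parameter below is a guess that has to be matched to the real layout.
    import subprocess

    copies = [f"/mnt/spares/disk{i}.img" for i in range(8)]   # ddrescue images, not originals

    loop_devs = [
        subprocess.run(["losetup", "--find", "--show", img],
                       check=True, capture_output=True, text=True).stdout.strip()
        for img in copies
    ]

    subprocess.run(
        ["mdadm", "--create", "/dev/md42",
         "--level=6", f"--raid-devices={len(loop_devs)}",
         "--chunk=64", "--layout=left-symmetric", "--metadata=1.0",
         "--assume-clean",   # critical: no resync, so nothing gets "repaired"
         "--run",            # don't stop to ask about old signatures on the copies
         *loop_devs],
        check=True)

    # If the guesses line up, the LVM / ext4 / qcow2 / NTFS stack appears on
    # /dev/md42 and everything above it can be activated and mounted read-only.
    ```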



  • @asdf said:

    a lot of bad stories, especially about cheap RAID controllers

    Cheap being the operative word here. Something that can trash your entire company's IT infrastructure when it fails is not something you want to cut corners on.

    @izzion said:

    This whole thing should be one really large RGE that results in this guy being permanently consigned to working at McDonalds. Because he shouldn't be allowed anywhere near system admin work.

    I think this guy owns the company, so that might be hard to pull off, but yeah, there was a lot of WTF going on here, and I think you spelled it out really well.


  • Grade A Premium Asshole

    @flabdablet said:

    I found out to my disgust that the RAID controller in the server we'd bought and paid for simply didn't support JBOD mode; the only way to make the attached drives even visible to the BIOS was by using the inbuilt tools to format them as some kind of RAID array first. Took the path of least resistance and offloaded RAID 6 to the controller, crossed fingers, hoped for the best.

    You could have also mounted all of the drives as single drive RAID-0 arrays to get them visible to the OS. We have frequently done that with PERC controllers that we did not want to use hardware RAID for.


  • Grade A Premium Asshole

    @David_C said:

    Cheap being the operative word here. Something that can trash your entire company's IT infrastructure when it fails is not something you want to cut corners on.

    Almost everything on the network can fall under that description. This was just a case of putting all his eggs in one basket and the basket was too flimsy to handle the load.

    Filed under: Cue @blakeyrat to come in and complain about the idiom of eggs and baskets and how it does not apply to the modern world and "Oh noes!11! I don't understands!!eleventy!"



  • @Circuitsoft said:

    What kind of stupid RAID controller doesn't store the config on the drives?

    The new, replacement RAID controller was probably running firmware incompatible with the old controller's config. Or the config got corrupted. Or the new controller re-ordered the drives. Or any number of weird-ass hardware-RAID failure modes.

    I really, really dislike hardware RAID. For the last ten years, even the least capable server CPUs available have been orders of magnitude faster than the embedded processors on RAID cards; offloading all that parity nonsense no longer makes computational sense.
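
    Even a naive, pure-userspace parity loop makes the point. This is just an illustration (the kernel's optimised SIMD code is far faster again):

    ```python
    # XOR parity across a 7-data-disk stripe: the core of RAID 5's "P" calculation.
    # (RAID 6 adds a Galois-field "Q" syndrome, but the flavour is the same.)
    import os
    import time

    DATA_DISKS = 7
    CHUNK = 64 * 1024          # bytes per disk per stripe
    ITERATIONS = 500

    stripe = [os.urandom(CHUNK) for _ in range(DATA_DISKS)]

    def parity(blocks):
        acc = 0
        for blk in blocks:
            acc ^= int.from_bytes(blk, "little")
        return acc.to_bytes(CHUNK, "little")

    start = time.perf_counter()
    for _ in range(ITERATIONS):
        parity(stripe)
    elapsed = time.perf_counter() - start

    processed_mb = DATA_DISKS * CHUNK * ITERATIONS / 1e6
    print(f"{processed_mb:.0f} MB of stripe data XORed in {elapsed:.2f}s "
          f"(~{processed_mb / elapsed:.0f} MB/s, in interpreted Python)")
    ```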


  • Grade A Premium Asshole

    @flabdablet said:

    The new, replacement RAID controller was probably running firmware incompatible with the old controller's config. Or the config got corrupted. Or the new controller re-ordered the drives. Or any number of weird-ass hardware-RAID failure modes.

    This.

    @flabdablet said:

    I really, really dislike hardware RAID. For the last ten years, even the least capable server CPUs available have been orders of magnitude faster than the embedded processors on RAID cards; offloading all that parity nonsense no longer makes computational sense.

    You are comparing a processor that is purpose-built for parity calculations against a general-purpose CPU. They are entirely different animals. Hardware RAID is still worth it for the performance increase, when you get to the limit, if you get to the limit, and if you need it. Lots of people don't need it, though. And hardware RAID for boot drives is not a bad idea, no matter the size of the enterprise.

    But to say that it makes no difference is a bit...incorrect. It would be like saying that TCP/IP offload engines don't have any benefits. They do, it is just that most people don't need them.



  • @Polygeekery said:

    You could have also mounted all of the drives as single drive RAID-0 arrays to get them visible to the OS.

    Yeah, but you still end up with disk layouts that aren't exactly the same as you'd get from raw drives; the RAID controller will steal a few blocks for metadata, which can mean your software RAID metadata (or even, at worst, partition tables and whatnot) ends up in the wrong place after transplanting the drives into a different box.
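
    A quick way to see whether that has happened is to scan the start of the raw member disk for partition-table signatures and check whether they sit where the OS expects them. A rough sketch (device name and scan range are just examples):

    ```python
    # Look for the MBR boot signature (0x55AA at bytes 510-511 of a sector) and
    # the GPT "EFI PART" magic within the first few MiB of a raw device, to see
    # whether controller metadata has shifted everything down.
    import sys

    DEVICE = sys.argv[1] if len(sys.argv) > 1 else "/dev/sdb"   # example device
    SECTOR = 512
    SCAN_SECTORS = 8192        # first 4 MiB

    with open(DEVICE, "rb") as dev:
        for lba in range(SCAN_SECTORS):
            dev.seek(lba * SECTOR)
            sector = dev.read(SECTOR)
            if len(sector) < SECTOR:
                break
            if sector[510:512] == b"\x55\xaa":
                print(f"MBR-style signature at LBA {lba}")
            if sector[:8] == b"EFI PART":
                print(f"GPT header at LBA {lba}")
    ```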



  • @Polygeekery said:

    You are comparing a processor that is purpose built for parity calculations against a general purpose CPU. They are entirely different animals.

    That's true. But the performance differences are now nowhere near what they used to be when hardware RAID cards first became a thing. We're talking single-digit percentage differences here, and not always in hardware RAID's favour.



  • @flabdablet said:

    Or the new controller re-ordered the drives

    Given that he admitted to moving the drives to another chassis (and back), and changing the RAID cards around, and the segment where he's trying to recover and "OMG we lost another RAID-5???" because he forgot to plug the damn cables back in...

    I'm gonna go with the ID-10-T re-ordered the drives, corrupting the configurations.


    Though, in fairness to him, I guess there is sort of a technical reason to split the drives across multiple RAID cards, if you're going to blow all the money on 24 SSDs to begin with - an 8 drive SSD RAID will be looking at ~4GBps max throughput (less RAID overhead, but close enough for estimate work) and a PCIe 2.0 x8 slot maxes out at 4GBps (or 8GBps if it's a PCIe 3.0 x8 slot), so you definitely wouldn't get full performance from all 24 drives on a single PCIe bus anyway.
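
    The back-of-envelope numbers, for anyone who wants to check them (all figures approximate, assuming ~500 MB/s SATA SSDs):

    ```python
    # Rough bandwidth arithmetic for SSDs behind a single PCIe x8 RAID card.
    SATA_SSD_MBPS   = 500              # ~500 MB/s per SATA SSD
    DRIVES_PER_CARD = 8
    TOTAL_DRIVES    = 24
    PCIE2_X8_GBPS   = 8 * 0.5          # PCIe 2.0: ~500 MB/s per lane
    PCIE3_X8_GBPS   = 8 * 0.985        # PCIe 3.0: ~985 MB/s per lane

    per_card = SATA_SSD_MBPS * DRIVES_PER_CARD / 1000
    all_drives = SATA_SSD_MBPS * TOTAL_DRIVES / 1000

    print(f"8-drive SSD set: ~{per_card:.1f} GB/s raw")
    print(f"PCIe 2.0 x8: ~{PCIE2_X8_GBPS:.1f} GB/s, PCIe 3.0 x8: ~{PCIE3_X8_GBPS:.1f} GB/s")
    print(f"All {TOTAL_DRIVES} drives at once would want ~{all_drives:.0f} GB/s")
    ```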

    But still. If you're gonna spend at least $10k on storage (and that's at today's prices, for empty drives with no data on them yet -- so back when he bought them he probably spent at least twice that), why are you skimping on the server?



  • @izzion said:

    I'm gonna go with the ID-10-T re-ordered the drives, corrupting the configurations.

    I'm gonna go with panicked technician felt under too much pressure to tell all his users to fuck off and wait while he made block-for-block forensic copies of every single drive onto spares before doing anything the slightest bit complicated.

    The fact that I had done exactly that as Step 0, and then again as Step 1, for dealing with my own crash is the only reason I was able to recover anything at all. I lost count of how many times and in how many ways I managed to completely shit up the RAID config, and how much data got accidentally overwritten as RAID and LVM subsystems tried to "repair" data with shat-up RAID config, before finally working out all the things I needed to do to jury-rig a truly read-only just-compatible-enough software RAID 6 array.

    Even so, the whole process took less than the 14 hours this guy said he'd spent online looking for data recovery support.

    The key is to own lots of disk drives.


  • Grade A Premium Asshole

    @izzion said:

    This whole thing should be one really large RGE that results in this guy being permanently consigned to working at McDonalds. Because he shouldn't be allowed anywhere near system admin work.

    Every video I've seen by the guy gave me the impression that he's a YouTube personality first and a tech guy a distant fifteenth or twentieth.



  • Backing up your files is simple. Here's a step-by-step process:

    1. Have someone else back up your files.

    See? That's why I log into my browser and store my open source code on GitHub. Because someone else has to deal with backing it up. And the resources GitHub or Google put into data redundancy are a lot more than what I can afford to do on my own.



  • In my experience, software RAID is more reliable than hardware RAID.

    I've hit 4 hardware RAID failures over 4 years (including mid-range RAID cards like the Promise SX8000), and not once have I had a software RAID failure.

    IMO, unless your CPU has more important things to do (like running a SQL server), software RAID is preferred.



  • And let's not even get started on accidental changes to data, like UPDATE/DELETE SQL statements without a WHERE clause.

    There is no way for an external program to know whether you meant it or not. So any realtime backup solution that only keeps the latest copy (and therefore RAID 1 too) does not function as a backup.

    It can work if you make it a multiple-copy backup, though. (I know admins who take a disk out of a RAID 1 daily as a backup and swap in another hard disk for a rebuild. It can work as long as you make sure the disks are in good condition. The disk being taken out is also a good opportunity for a full disk scan, although lots of admins I know don't bother :sadface: )
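
    The difference between a mirror and a backup is history. Even something as dumb as keeping dated copies of a nightly dump covers the fat-fingered DELETE case. A rough sketch (paths are made up):

    ```python
    # Keep timestamped copies instead of overwriting one "latest" copy, so last
    # week's bad UPDATE is still recoverable. Paths are invented for the example.
    import shutil
    from datetime import datetime
    from pathlib import Path

    SOURCE = Path("/srv/exports/payroll-dump.sql")      # e.g. a nightly DB dump
    DEST_DIR = Path("/mnt/backup/payroll")

    DEST_DIR.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    shutil.copy2(SOURCE, DEST_DIR / f"payroll-{stamp}.sql")

    # Prune to the newest 30 copies; the zero-padded timestamp keeps sort order right.
    for old in sorted(DEST_DIR.glob("payroll-*.sql"))[:-30]:
        old.unlink()
    ```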

    @JBert said:

    Doing it on SSDs without backup though...

    As for the SSDs... due to the properties of flash memory, when you buy SSDs from the same production batch to form a RAID, if any of them starts to fail, you can expect the other disks to fail within a short time (like a week or two).

    If you want to build a RAID out of SSDs to store important data, you'd better pay close attention to any I/O errors thrown in the event log or the corresponding logs of your monitoring software, and make sure all backups can really be restored.


  • Winner of the 2016 Presidential Election

    @cheong said:

    and not once have I had a software RAID failure

    Same here, but it sometimes re-builds my RAID 5 for no apparent reason. I think one of my HDDs might finally be failing, although the SMART data looks OK.

    Edit: Hm, just checked, and it seems like one of my drives has an older firmware than the others. Maybe that's the problem then and I should update the firmware.
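
    If it keeps doing that, the attributes worth watching are the reallocated/pending sector counts and the interface CRC error count (the latter usually points at cabling rather than a dying disk). Something like this, assuming smartmontools is installed (device name is an example):

    ```python
    # Print the handful of SMART attributes that usually foreshadow trouble.
    import subprocess

    DEVICE = "/dev/sda"                 # example device
    WATCH = ("Reallocated_Sector_Ct", "Current_Pending_Sector",
             "Offline_Uncorrectable", "UDMA_CRC_Error_Count")

    out = subprocess.run(["smartctl", "-A", DEVICE],
                         capture_output=True, text=True, check=False).stdout
    for line in out.splitlines():
        if any(attr in line for attr in WATCH):
            print(line)
    ```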


  • Discourse touched me in a no-no place

    @Tsaukpaetra said:

    To be fair, he was in the process of backing it up to the backup array when it failed (it's almost a given that this scenario is the most likely to happen).

    He was backing up then, because the server had been dropping out over the past couple of days.



  • @asdf said:

    Maybe that's the problem then and I should ~~update the firmware~~ make sure all of my data is backed up before I do anything regrettable.

    Yes, that sounds like a good idea.



  • @cheong said:

    I know admins who take a disk out of a RAID 1 daily as a backup and swap in another hard disk for a rebuild.

    That's scary. It means you have no redundancy until the rebuild completes. I wouldn't want to do that unless the system is keeping two mirrors. (Is there a name for that? RAID1+1?)


  • Fake News

    I believe it's still named RAID 1, it's just that it has more than two drives in the array. The only big hurdle is that all drives need to be of the same size as each is an exact mirror copy.

    If you then have enough mirror disks, pulling out one disk and inserting a fresh disk should give you an identical copy.
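
    With Linux md, at least, that really is just RAID 1 with more members. A sketch (device names are examples, and the member disks get overwritten):

    ```python
    # Create a three-member mirror so that pulling one disk for "backup" still
    # leaves a redundant pair behind while the replacement rebuilds.
    import subprocess

    subprocess.run(
        ["mdadm", "--create", "/dev/md0",
         "--level=1", "--raid-devices=3",
         "/dev/sdb1", "/dev/sdc1", "/dev/sdd1"],
        check=True)
    # mdadm asks for confirmation interactively, so run this from a terminal.
    ```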



  • @JBert said:

    The only big hurdle is that all drives need to be of the same size as each is an exact mirror copy.

    Negative ghostrider. Capacity has to be >= the others.



  • @ben_lubar said:

    Because someone else has to deal with backing it up.

    You hope.
    And the restore process may take several weeks. SourceForge.



  • I've had good experiences with Google and GitHub maintenance. Usually either the site keeps working (possibly in read-only mode) or the site is down for a small number of hours and no data is lost.


  • :belt_onion:

    I was about to post exactly that.

    You better be damn sure you're not gonna have data corruption in the meantime, or you'll need that backup sooner than you thought 😑


  • :belt_onion:

    @sloosecannon said:

    sure you're not gonna have data corruption

    For the joke impaired/pedants:

    Yes, that's the point. You can never be sure you won't have data corruption.



  • And worse than that, you're thrashing the bejeezus out of the other drive in the RAID, reading all the data off of it to write it to the new drive. So you're increasing the chance of a disk failure on the drive staying behind by an order of magnitude.

    Plus, if you get exceptionally brillant and decide to just naively rotate drives on a fixed rotation (say, once per week) and forget to blank last week's drive before you stick it back in, you might get "lucky" and have the RAID decide the newly reinserted (one week old) drive is the right drive, and it needs to rebuild onto the other drive (RIP current data).



  • @izzion said:

    ... naively rotate drives on a fixed rotation (say, once per week) and forget to blank last week's drive before you stick it back in, you might get "lucky" and have the RAID decide the newly reinserted (one week old) drive is the right drive, and it needs to rebuild onto the other drive

    I would like to think that RAID software would have some mechanism to prevent this, but I also know enough to not make any such assumption in the absence of additional information. And even if there is a mechanism in place, software can have bugs...
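
    Linux md does have such a mechanism, for what it's worth: every member superblock carries an event counter, and at assembly the member with the lower count is treated as stale and rebuilt from the fresh ones rather than the other way round. Whether a cheap onboard fakeRAID BIOS bothers with anything equivalent is another question. A quick way to eyeball the counters before re-adding anything (device names are examples):

    ```python
    # Compare the "Events" counters recorded in each member's md superblock.
    import re
    import subprocess

    MEMBERS = ["/dev/sdb1", "/dev/sdc1"]     # example member devices

    for dev in MEMBERS:
        out = subprocess.run(["mdadm", "--examine", dev],
                             capture_output=True, text=True, check=False).stdout
        match = re.search(r"Events\s*:\s*(\S+)", out)
        print(dev, "events:", match.group(1) if match else "no md superblock found")

    # The member with the lower count is the stale one; it should be re-added
    # and rebuilt *from* the others, never the other way around.
    ```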



  • @cheong said:

    And let's not even get started on accidental changes to data, like UPDATE/DELETE SQL statements without a WHERE clause.

    Mysql fixed that



  • @anonymous234 said:

    does not allow any deletions, except with a 7 day waiting period or something.

    I once made rm superuser-only on a server I was helping my school set up. It broke about a thousand different things and made admin stuff a hassle, but it honestly saved a few man-months of taking it offline to recover some file someone had put on the desktop and another teacher thought was fine to delete.
    Don't judge me too much, I was 14 at the time.


  • Notification Spam Recipient

    @Husky said:

    a server I was helping my school set up.

    @Husky said:

    I was 14 at the time.

    Ladies and gentlecolts, I give you The Genius!


  • 🚽 Regular

    @David_C said:

    I've been there and felt all the same panic.

    Yep, I had a RAID-5 full of iSCSI LUNs for our virtualised Linux servers crap itself. One disk was stone-cold dead and missing from BIOS and another was failing SMART. No big problem, I had full backups. I went to check them and it went:

    {company}VMSRV01
    {company}VMSRV03
    {company}VMSRV04
    ...

    Where the fuck is VMSRV02 ?!

    It was a critical LOB server too, with man-years of work on it. That's about the only time I've felt physically sick over an IT issue.

    About 5 minutes later I figured out that the backup script had a typo in the folder name but not in the name on the reporting email, and it had been backing up correctly all along, just under another CIFS share.
    The RAID actually rebuilt itself too, at the speed of drying paint, after I loaded the two new drives in one by one.
    I made a physical paper check-sheet after that to, hopefully, catch all contingencies when configuring backups.
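
    The script version of that check-sheet is almost trivial, and it looks at the destination share itself instead of trusting the report email. A sketch (names and paths are invented):

    ```python
    # Verify that every expected LUN image actually exists in the backup share.
    from pathlib import Path

    EXPECTED = ["VMSRV01", "VMSRV02", "VMSRV03", "VMSRV04"]     # invented names
    BACKUP_SHARE = Path("/mnt/backups/vm-luns")

    present = {p.stem for p in BACKUP_SHARE.glob("*.img")}
    missing = [name for name in EXPECTED if name not in present]

    if missing:
        print("MISSING FROM BACKUP SHARE:", ", ".join(missing))
    else:
        print("All expected LUN images present.")
    ```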



  • @David_C said:

    @izzion said:
    ... naively rotate drives on a fixed rotation (say, once per week) and forget to blank last week's drive before you stick it back in, you might get "lucky" and have the RAID decide the newly reinserted (one week old) drive is the right drive, and it needs to rebuild onto the other drive

    I would like to think that RAID software would have some mechanism to prevent this, but I also know enough to not make any such assumption in the absence of additional information. And even if there is a mechanism in place, software can have bugs...

    I have definitely seen the aforementioned problem from someone that was relying on the onboard RAID of their desktop motherboard and doing RAID-1 "backups" by rotating drives. And it was catastrophic.

    That said, yes, I would hope the actual LSI controllers would be smarter. But given how many RAID-5's I've seen fail because someone figured removing and reinserting the offline drive would be fine, I'm not going to hold my breath there, either...



  • Actually you need to start with the smallest disks for this to work. If the final capacity of the newly inserted disk turns out to be just a block or two smaller than the set you already have, the mirror won't rebuild.



  • RAID 5 is the devil. If you don't think that, you've never had 2 disks fail w/in 5 minutes of each other and had your IT chucklefucks go "we don't back that up because it takes up too much space"...



  • Still, that won't fix situations like "accidentally recalculate payroll before locking the previous payroll session in the current month", etc. 😛



  • @PJH said:

    He was backing up then, because the server had been dropping out over the past couple of days.

    And that, friends, is TRWTF.

    If you have a mission-critical, single-point-of-enterprise-failure, not-properly-backed-up server (which, granted, you never should have, but we've all seen it happen) and it starts falling over, the first thing you do is not keep it running while putting extra load on it with a backup job. You don't know what's wrong with it yet; until you do, you have no idea what it's about to do to your data.

    What you do do is shut it down clean, then pull every drive, one by one, labelling each one with a Sharpie as it comes out so you know which one it is, and then use ddrescue on some completely other machine to make block-for-block backups, and you label all the backup drives as well. And then you do that again, and you label all the second backups properly too, and then you put them away in a box and don't touch them again unless the originals and the first backup set have all caught fire.

    If you have a heap of workstations, you can borrow them to do all this work in parallel.

    And only once you have two backups of every single disk block that was in your server the last time you managed to shut it down clean: only then do you turn your attention to the task of trying to fix it.

    And if people give you grief about all the downtime, you apply your LART and tell them to STFU, because they should never have put you in this position in the first place.
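
    Scripted, the imaging step itself is not much more than this. A sketch only: it assumes GNU ddrescue is installed, the device and path names are invented, and in real life each copy goes to a physically separate spare drive:

    ```python
    # Make two block-for-block images of each labelled original, recording a
    # ddrescue mapfile alongside each image so reads can be resumed or retried.
    import subprocess

    SOURCES = ["/dev/sdb", "/dev/sdc", "/dev/sdd"]      # the labelled originals
    DEST_DIR = "/mnt/spares"                            # stand-in for the spare drives

    for n, src in enumerate(SOURCES):
        for copy in (1, 2):
            image = f"{DEST_DIR}/copy{copy}-disk{n}.img"
            mapfile = f"{image}.map"
            subprocess.run(["ddrescue", "-d", "-r3", src, image, mapfile],
                           check=True)
    ```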



  • Then you hit the italics button randomly in each of your sentences while lecturing a forum full of people who did not make this stupid backup mistake.



  • @smallshellscript said:

    If you don't think that, you've never had 2 disks fail w/in 5 minutes of each other

    The thing I don't get about RAID controllers is hearing stories about how one failed sector can trigger an array rebuild, which then exposes other failed sectors that nobody noticed before, and now the whole array is burnt.

    It seems to me that a failed sector should normally just get rewritten by itself after being reconstituted from the corresponding sectors on the remaining drives. In most cases this will cause the drive whose sector failed to spare it out, and then everything can just carry on as normal.

    And if an array rebuild is in progress and encounters failed sectors, it should just be able to work on past them just like ddrescue does. So yes, you might end up with a sprinkling of failed sectors across the array and need to do some recovery from backups for a handful of files. But this idea that an entire array should instead become instantly unusable because of one unrecoverable sector seems to be pretty common, and I really don't get why.
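
    Concretely, the reconstruction itself is nothing exotic: for RAID 5 the missing chunk is just the XOR of the surviving chunks and the parity. A toy illustration:

    ```python
    # Rebuild one lost chunk of a RAID 5 stripe from the survivors and parity.
    import os

    def xor(*blocks):
        acc = bytes(len(blocks[0]))
        for blk in blocks:
            acc = bytes(a ^ b for a, b in zip(acc, blk))
        return acc

    data = [os.urandom(512) for _ in range(3)]   # three data chunks in a stripe
    parity = xor(*data)

    # Say drive 1 returns an unrecoverable read error for its chunk...
    rebuilt = xor(data[0], data[2], parity)
    assert rebuilt == data[1]
    # ...so the sector could simply be rewritten (letting the drive spare it out)
    # instead of the whole array being declared dead.
    ```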

    Sure, sometimes a drive will indeed fail in a way that takes it entirely offline; I've had a couple of Seagate drives do that. I've also recovered all but a small number of sectors from those very drives: power-cycle them to get them back online, then avoid re-reading the sectors that triggered the shutdown until after you've got everything else off. I don't understand why a RAID controller couldn't at least try that.

    It's rare for a spinny drive to go completely tits-up without warning. Most drive failures I've seen have been foreshadowed months to weeks in advance by the sparing-out of a few hundred to a few thousand sectors. The nightmare cascading-failure RAID death scenarios I keep reading about should, given proper controller design, be quite extraordinarily rare.

    Maybe there are just not enough SSDs in my life.



  • I'll have you know that every single one of those italicized phrases was lovingly handcrafted using manually-inserted <em> tags.

    WYSIWYG web text entry is for squids.



  • @cheong said:

    capacity of the newly inserted disk turns out to be just a block or two smaller...

    Isn't that what I wrote?

    @brianw13a said:

    Capacity has to be >= the others.



  • @brianw13a said:

    Isn't that what I wrote?

    Yup; what I added is that you should start the setup with the smallest disks, the rest is basically what you wrote.

    Is there a problem?



  • Doesn't it depend on the disk type?

    I once had a server, not properly backed up, which started to show disk read errors in the system log. I managed to create a backup while the server was still running, but after a reboot (and a spin-up cycle) the hard disk was gone. If I had followed your advice, I would have lost my data.

    So isn't it better, at least with conventional drives where a spin-up cycle also puts quite a bit of stress on them, to reduce the load as much as possible but keep the system running, and to try to figure out what is wrong before doing anything at all?



  • @Grunnen said:

    If I had followed your advice, I would have lost my data.

    Fair point.

    On the other hand, what you were dealing with there was not a server spontaneously falling over, but a server continuing to run and showing disk errors in the log.

    My opinionated opinion really only applies to servers like the one described in the OP, which have started to show a pattern of just crashing at random.





  • @flabdablet said:

    The nightmare cascading-failure RAID death scenarios I keep reading about should, given proper controller design, be quite extraordinarily rare.

    I used to think that. Then I ran into hardware designed and sold in the real world.

    Suppose, for the moment, that it's 2001. The real one, with Tiger Woods and Pope John Paul II, not the one with Stanley Kubrick and Richard Strauss. You have a small cluster of two machines built and configured by a major server vendor which had been bought out by a minor desktop computer vendor which was then bought out by a company which makes Huge Printers, and it is designed to be COMPLETELY UNSTOPPABLE. You spend months having meetings with sales engineers who show you that there is no possible way that a hardware fault can ever slow this cluster down, and they even show diagrams demonstrating that you could fire a 50 cal bullet through the rack without taking out enough of the redundant systems to cause a problem. There are two independent pairs of RAID controllers each connected to multiple shelves full of mirrored drives, and the RAID sets themselves are mirrored in software, and the whole mess is managed by a Tru-ly amazing Cluster system that can handle any failure ever imagined.

    Naturally, you don't believe any of this, since you are over the age of six and have met a salesperson before, but the design still looks good and it holds up to testing. Sure, the same guys who told you the system had 100% redundancy also plugged the power supplies for all four RAID controllers into the same power bar, but that's why you check all of these things yourself.

    Eventually, you agree that the design is sound and that it is at least better than the system it was replacing, so one night you do the big switch-over to run all of the company's operations off of the new cluster. Everything goes smoothly, ridiculous certificates of appreciation are sent around to all of the wrong people, and your managers' manager's name gets mentioned at a board meeting. Yay.

    Then, late one night, there's a tiny software hiccup in one of the RAID controllers, so it tries to restart itself and fails. But not to worry, there is actually a pair of controllers so the other one will seamlessly take o-- Oh, wait. It's running the same software, received exactly the same inputs and just hit exactly the same bug. So it tried to restart too, with the same result. Suddenly half of the RAID sets have disappeared, but that's okay because they are all mirrored with RAID sets on the other controllers, right?

    Um, yeah. About that...

    There's a Cluster suite monitoring them, and it noticed a loss of connectivity to some of its disks. And now, even though it still has network connectivity to every member of the cluster, it's concerned that some mysterious new cluster member, which has never before been seen, might now be writing to those missing disks so it invokes its standard split-brain defense system.

    It disables all cluster resources which include the missing disks, and then tries to restart them.

    And that fails because even though the logical volumes required are all visible, half of the physical volumes are (gasp) still unreachable.

    So it fails to restart the cluster services. Several times. Naturally, this causes the cluster manager software to phone Roy in IT, who suggests a perfectly reasonable solution to the problem.

    The entire cluster turns itself off and then on again.

    Upon rebooting, each member of the cluster sees that half of its physical disks are still missing, and continues to assume that some evil twin cluster node has escaped from the Phantom Zone and is happily writing to the invisible disks so it quite naturally refuses to start up, leaving every node of the cluster stuck part-way through the boot process.

    Naturally, none of this came out until we had time to analyze the logs later on. All that we knew at the time was that our entire unstoppable cluster had simply shut itself down in the middle of the night and that we needed it working again now. This involved someone seeing that one of the RAID controller pairs had flashing red lights on it and power cycling it, followed by doing the same to every other piece of hardware in the rack until they started working.

    An elite team of sales engineers from Huge Printers came by later that week to help resolve the situation. They brought plenty of T-shirts, pens and coffee cups to help satisfy everyone that they were very serious about resolving our biggest concerns, and they sat in many dimly lit boardrooms to show that they meant business.

    Their final answer, once all of the information had come out and the source of each of the cascading problems was identified?

    "Um, yeah. It's supposed to do that. What did you expect?"



  • @DCRoss said:

    An elite team of sales engineers from Huge Printers came by later that week to help resolve the situation. They brought plenty of T-shirts, pens and coffee cups to help satisfy everyone that they were very serious about resolving our biggest concerns, and they sat in many dimly lit boardrooms to show that they meant business.

    Can't like this paragraph hard enough.


  • 🚽 Regular

    @DCRoss said:

    But not to worry, there is actually a pair of controllers so the other one will seamlessly take o-- Oh, wait. It's running the same software, received exactly the same inputs and just hit exactly the same bug.

    Oh yes. That's what happened to the redundant power supplies in one of my Dell servers. Until the tech arrived onsite (in an hour 👍) I hadn't realised the power supplies had firmware. A reflash of that and the 'life cycle controller' and they worked again.

    Also, please do a front page article. That was great.


  • area_deu

    @Cursorkeys said:

    the power supplies had firmware

    :WTF:

