@flabdablet said:
The nightmare cascading-failure RAID death scenarios I keep reading about should, given proper controller design, be quite extraordinarily rare.
I used to think that. Then I ran into hardware designed and sold in the real world.
Suppose, for the moment, that it's 2001. The real one, with Tiger Woods and Pope John Paul II, not the one with Stanley Kubrick and Richard Strauss. You have a small cluster of two machines built and configured by a major server vendor which had been bought out by a minor desktop computer vendor which was then bought out by a company which makes Huge Printers, and it is designed to be COMPLETELY UNSTOPPABLE. You spend months having meetings with sales engineers who show you that there is no possible way that a hardware fault can ever slow this cluster down, and they even show diagrams demonstrating that you could fire a .50 cal bullet through the rack without taking out enough of the redundant systems to cause a problem. There are two independent pairs of RAID controllers, each connected to multiple shelves full of mirrored drives, the RAID sets themselves are mirrored in software, and the whole mess is managed by a Tru-ly amazing Cluster system that can handle any failure ever imagined.
Naturally, you don't believe any of this, since you are over the age of six and have met a salesperson before, but the design still looks good and it holds up to testing. Sure, the same guys who told you the system had 100% redundancy also plugged the power supplies for all four RAID controllers into the same power bar, but that's why you check all of these things yourself.
Eventually, you agree that the design is sound and that it is at least better than the system it was replacing, so one night you do the big switch-over to run all of the company's operations off of the new cluster. Everything goes smoothly, ridiculous certificates of appreciation are sent around to all of the wrong people, and your manager's manager's name gets mentioned at a board meeting. Yay.
Then, late one night, there's a tiny software hiccup in one of the RAID controllers, so it tries to restart itself and fails. But not to worry, there is actually a pair of controllers so the other one will seamlessly take o-- Oh, wait. It's running the same software, it received exactly the same inputs, and it just hit exactly the same bug. So it tries to restart too, with the same result. Suddenly half of the RAID sets have disappeared, but that's okay because they are all mirrored with RAID sets on the other controllers, right?
Um, yeah. About that...
There's a Cluster suite monitoring them, and it noticed a loss of connectivity to some of its disks. And now, even though it still has network connectivity to every member of the cluster, it's concerned that some mysterious new cluster member, which has never before been seen, might now be writing to those missing disks, so it invokes its standard split-brain defense system.
It disables all cluster resources which include the missing disks, and then tries to restart them.
And that fails because even though the logical volumes required are all visible, half of the physical volumes are (gasp) still unreachable.
So it fails to restart the cluster services. Several times. Naturally, this causes the cluster manager software to phone Roy in IT, who suggests a perfectly reasonable solution to the problem.
The entire cluster turns itself off and then on again.
Upon rebooting, each member of the cluster sees that half of its physical disks are still missing, and continues to assume that some evil twin cluster node has escaped from the Phantom Zone and is happily writing to the invisible disks, so it quite naturally refuses to start up, leaving every node of the cluster stuck part-way through the boot process.
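The whole deadlock boils down to one decision the cluster makes over and over. Here's a toy sketch of it in Python; every name here is invented for illustration, and the real cluster suite's logic was obviously far more elaborate:

```python
# Hypothetical sketch of the split-brain defense described above.
# All names are made up; this is not the real cluster suite's code.

def can_start_services(visible_disks, expected_disks, peers_reachable):
    """Decide whether a node may bring cluster services online."""
    missing = expected_disks - visible_disks
    if not missing:
        # Every disk accounted for: safe to proceed.
        return True
    # Some disks are missing. The defense assumes an unknown node
    # might be writing to them and refuses to start. Note that
    # peers_reachable is never consulted: being able to see every
    # known cluster member changes nothing, which is exactly the
    # behavior in the story.
    return False

# The failure mode: all peers answer, but half the physical disks
# are gone, so every node refuses to boot its services.
stuck = not can_start_services(
    visible_disks={"pv1", "pv2"},
    expected_disks={"pv1", "pv2", "pv3", "pv4"},
    peers_reachable=True,
)
print(stuck)
```

The point of the sketch is the ignored parameter: a defense keyed only on disk visibility can't distinguish "a controller pair died" from "a phantom node stole my disks", so it treats them identically.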
Naturally, none of this came out until we had time to analyze the logs later on. All that we knew at the time was that our entire unstoppable cluster had simply shut itself down in the middle of the night and that we needed it working again now. This involved someone seeing that one of the RAID controller pairs had flashing red lights on it and power cycling it, followed by doing the same to every other piece of hardware in the rack until they started working.
An elite team of sales engineers from Huge Printers came by later that week to help resolve the situation. They brought plenty of T-shirts, pens and coffee cups to help satisfy everyone that they were very serious about resolving our biggest concerns, and they sat in many dimly lit boardrooms to show that they meant business.
Their final answer, once all of the information had come out and the source of each of the cascading problems was identified?
"Um, yeah. It's supposed to do that. What did you expect?"