Mass outage. I've heard of those. We just had one of those.
At the same time, we had a mini outage.
Did you know that you can have spaghetti servers? The jobs that went down were on the IBM mainframe. They were FTP jobs, writing files to a server that we will call Samwise.
Samwise runs Linoma GoAnywhere, which is one of those connectivity products that allow you to FTP, SFTP, HTTP, or do just about anything with files. Maybe you can do too many things with files. Maybe you can do things with too many servers. Maybe you can do too many things with GoAnywhere.
So why did the mainframe jobs fail? That was the question. The host was connecting to Samwise, so call GoAnywhere team.
No, Samwise was working fine, but it returned an error when we tried to PUT a file. Maybe we need to know what user it's connecting to on Samwise? Okay, the user was FUDGE. I like fudge.
Maybe we need GoAnywhere guy to look and see how FUDGE is set up. Look at the GoAnywhere profile. Interesting, the home directory for FUDGE is //Gollum/files/in. Seriously, WTF, it's dropping files in a home folder...on another server?!
So, WTF is wrong with Gollum? Gollum is not responding when we try to connect to the file folder. Is it a problem with Gollum's hardware?
That is a simple question, easily answered, right? By the team that handles Gollum, right? No, this is the game of Spaghetti Servers. Gollum is a virtual server that runs on the server Mirkwood, which runs many other virtual servers. Is Mirkwood down?
That's another team. No, Mirkwood is working just fine. Gollum team says Gollum seems to be working just fine. Except for the file folder.
Do we have a hard drive failure? What drive is //Gollum/files on? You're kidding, right? We don't do hardware, we do foggy cloud. It's on the SAN (Storage Area Network)!!! Of course it is. Should have thought of that, it only makes good sense.
Okay, so where is it located on the SAN? Call SAN team on-call wizard, have wizard consult SAN profiles. Wizard is testy, yells something about bigger problems, but complies. And now we're getting somewhere, because that is the server Mordor. Now we understand, because Mordor is down! It's been down for hours! Major company outage.
So WhyTF are you wasting the time of 5 on-call teams figuring this out, when it's been related to the Mordor problem the entire time? You should have made that connection right away... it's not like there are any layers of obfuscation...
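For the record, here is the chain we dug out by hand, sketched as a small Python lookup. The server names come from the story. The dictionary layout, the "SAN volume" label, and the function name are my own inventions for illustration, not anything the five teams actually had on hand. If a map like this had existed, the question "what does Mordor being down break?" might have taken seconds instead of hours.

```python
# Hypothetical dependency map, reconstructed from the story above.
# Each key depends on everything in its list. "SAN volume" is a made-up
# label for wherever //Gollum/files actually lives on the SAN.
DEPENDS_ON = {
    "mainframe FTP jobs": ["Samwise"],
    "Samwise": ["FUDGE home directory"],
    "FUDGE home directory": ["//Gollum/files/in"],
    "//Gollum/files/in": ["Gollum", "SAN volume"],
    "Gollum": ["Mirkwood"],
    "SAN volume": ["Mordor"],
    "Mirkwood": [],
    "Mordor": [],
}


def find_broken_paths(start, down, graph=DEPENDS_ON):
    """Return every dependency path from `start` that ends at a component in `down`."""
    paths = []
    stack = [[start]]
    while stack:
        path = stack.pop()
        node = path[-1]
        if node in down:
            # Reached something known to be down: record the whole chain.
            paths.append(path)
            continue
        for dep in graph.get(node, []):
            if dep not in path:  # guard against accidental cycles
                stack.append(path + [dep])
    return paths


if __name__ == "__main__":
    for path in find_broken_paths("mainframe FTP jobs", down={"Mordor"}):
        print(" -> ".join(path))
```

Run with Mordor marked as down, this prints the single chain from the mainframe jobs through Samwise, FUDGE, and the Gollum share to Mordor. Of course, the hard part was never the lookup. It was that nobody had the map written down in one place.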