Schrodinger's Server



  • One of our production servers suddenly became unreachable, and some customers complained. Not to customer support. Not to sales. But to the CEO. The CEO stomped on the CTO, the two of them jointly stomped on the Production Manager, and then all three went to stomp on the production SA du jour.

    Unfortunately, the guy scheduled to be on duty was out sick, and a brand new junior SA was holding down the fort in his absence.

    C**: Why is server <X> down? Customers are threatening to leave if we don't keep the systems alive and available! Log in and see what went wrong!

    JSA: Sure, do you know the password?

    C**:  No, don't YOU know it?

    JSA: No, I just started working here two days ago. I know some of the passwords, but not for this box. The usual SA is unreachable.

    C**: Why aren't all the passwords for the master login the same?

    JSA: That would seem to be a sensible policy; I don't know why it's like this. If you like, I can change the application master password and use that login, but it might break scripts that rely on it.

    C**: Do it!

    JSA: <writes a script to stream edit the yp password and changes EVERY login's password to: "welcome">

    PM: <too late> NOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO!

    C**: What happened?

    PM: He just changed EVERY password - it's going to break everything!

    C**: Roll it back!

    JSA: You can't roll back a scripted stream edit.

    C**: So is the server alive or dead?

    JSA: I can ping it, so it must still be up, but none of the applications on that box seem to be responsive, and now I'm starting to get alarms from the other servers - applications seem to be crashing all over the place; I think we may have bigger problems.

    At least it's Friday.

     



  • Great, now there's coffee all over my desk and monitor. Thanks a lot.



  • OK the JSA shouldn't have offered to do that obviously, but I still feel a little bad for him getting fired.


  • Winner of the 2016 Presidential Election

    I've learned from too much experience that if you say to a business user, "we can do that, but [fire and brimstone and death and destruction]", they will only hear the first part. Now I just say, "no, it can't be done."



  • I give the reason it can't be done FIRST. It's like writing a newspaper article, you lead off with the most important bit of information, "it would completely break all our apps, but..."

    I wouldn't expect a junior guy who's only been in the office 2 days to know that, but if he's quick he's already learned it. If he's not then we'll get lots more Snoofle in the future!


  • Winner of the 2016 Presidential Election

    @blakeyrat said:

    I give the reason it can't be done FIRST. It's like writing a newspaper article, you lead off with the most important bit of information, "it would completely break all our apps, but..."

    I tried that, too. You must have wiser business users than I. They completely tune out everything but the yes or no.



  • Yeah...



    That draconian process that the CEO and the CTO came up with, way back in the day, is only meant to be followed by the idiots they employ... obviously the guys at the top who came UP with the process shouldn't be beholden to it, as they clearly know what they're doing.




    I like how they can't actually blame anyone but themselves... so I look forward to hearing all about how they blame everyone but themselves.



  • @locallunatic said:

    OK the JSA shouldn't have offered to do that obviously, but I still feel a little bad for him getting fired.

    Who said he was fired? He was the only SA on duty and they needed him to revert all the passwords - one at a time - to get everything back up and running. Fortunately, a few of us were around (I'm not supposed to know the production passwords, but I do) to help out.

    By random chance, I happened to know that the only thing running on the flaky box was an FTP server, so if it died, the only impact was customers wouldn't be able to use a certain application to retrieve documents (that particular app was the only one still functioning). It turned out that management told the junior SA to check on processes that weren't supposed to be running on that box, so of course they didn't respond.

    After some digging, the whole "glitch" turned out to be caused by an end user at the client who forgot his new password and couldn't log in, so he bucked it up the chain that he couldn't log in, and the folks above assumed our stuff was down and bitched up the chain.

    Idiots.

     



  • @snoofle said:

    After some digging, the whole "glitch" turned out to be caused by an end user at the client who forgot his new password and couldn't log in, so he bucked it up the chain that he couldn't log in, and the folks above assumed our stuff was down and bitched up the chain.

    Idiots.

     

    LMFAO!!!!

    You should've taken that other job xD



  • @blakeyrat said:

    I wouldn't expect a junior guy who's only been in the office 2 days to know that, but if he's quick he's already learned it. If he's not then we'll get lots more Snoofle in the future!

    Yes. But I also give a lot of allowance for a junior guy (2 days!?...who thought it was a good idea to leave him effectively in charge) with multiple C-level execs yelling at him. There're pretty much no good outcomes from that sort of situation.



  • @boomzilla said:

    @blakeyrat said:
    I wouldn't expect a junior guy who's only been in the office 2 days to know that, but if he's quick he's already learned it. If he's not then we'll get lots more Snoofle in the future!

    Yes. But I also give a lot of allowance for a junior guy (2 days!?...who thought it was a good idea to leave him effectively in charge) with multiple C-level execs yelling at him. There're pretty much no good outcomes from that sort of situation.

    Not to mention that it depends on the types of scripts you're using. Sometimes it only takes a tiny typo to create disaster - like the difference between "rm -rf ./*" and "rm -rf /*" (though the --preserve-root option should mitigate such nowadays)



  • @Rhywden said:

    @boomzilla said:
    @blakeyrat said:
    I wouldn't expect a junior guy who's only been in the office 2 days to know that, but if he's quick he's already learned it. If he's not then we'll get lots more Snoofle in the future!

    Yes. But I also give a lot of allowance for a junior guy (2 days!?...who thought it was a good idea to leave him effectively in charge) with multiple C-level execs yelling at him. There're pretty much no good outcomes from that sort of situation.

    Not to mention that it depends on the types of scripts you're using. Sometimes it only takes a tiny typo to create disaster - like the difference between "rm -rf ./*" and "rm -rf /*" (though the --preserve-root option should mitigate such nowadays)

    I've since taught this new-hire SA the iterative-script-development technique I use, running the script at each stage to verify it's doing what you want and no more: 1) do <whatever> to get the data, 2) filter the data and verify only the rows of interest are getting through, 3) massage the data as required, 4) only after you're sure your output is desired do you add the code to make the actual change.
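
    A minimal sketch of those four stages, assuming a made-up account list in a scratch directory. The file name, the `appmaster` login, and the function names are all hypothetical, and the stage 4 "change" is a harmless append to a scratch log rather than a real password reset.

```shell
#!/bin/sh
# Hypothetical sample data standing in for a real account map.
work=$(mktemp -d)
cat > "$work/accounts.txt" <<'EOF'
alice:x:1001:staff
appmaster:x:2001:apps
bob:x:1002:staff
EOF

# Stage 1: get the data and just look at it.
get_data() { cat "$work/accounts.txt"; }

# Stage 2: filter, and verify only the rows of interest get through.
filter_rows() { grep '^appmaster:'; }

# Stage 3: massage the rows into the form the change needs (login name only).
massage() { cut -d: -f1; }

# Stage 4: added only once stages 1-3 print exactly what you expect.
# The "change" here only appends to a scratch log.
apply_change() {
    while IFS= read -r login; do
        echo "would reset password for $login" >> "$work/changes.log"
    done
}

# Dry run of stages 1-3: read the output before going any further.
get_data | filter_rows | massage

# Only now wire in the stage that changes something.
get_data | filter_rows | massage | apply_change
```

    Run the pipeline after adding each stage; if stage 2 ever lets through more than the rows you meant, you find out before anything has been modified.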


  • @snoofle said:

    @Rhywden said:

    @boomzilla said:
    @blakeyrat said:
    I wouldn't expect a junior guy who's only been in the office 2 days to know that, but if he's quick he's already learned it. If he's not then we'll get lots more Snoofle in the future!

    Yes. But I also give a lot of allowance for a junior guy (2 days!?...who thought it was a good idea to leave him effectively in charge) with multiple C-level execs yelling at him. There're pretty much no good outcomes from that sort of situation.

    Not to mention that it depends on the types of scripts you're using. Sometimes it only takes a tiny typo to create disaster - like the difference between "rm -rf ./*" and "rm -rf /*" (though the --preserve-root option should mitigate such nowadays)

    I've since taught this new-hire SA the iterative-script-development technique I use, running the script at each stage to verify it's doing what you want and no more: 1) do <whatever> to get the data, 2) filter the data and verify only the rows of interest are getting through, 3) massage the data as required, 4) only after you're sure your output is desired do you add the code to make the actual change.

    I make a lot of use of "set -x" during the initial stages of developing any remotely risky or complex script. And echoing rather than executing any command that writes/deletes/updates data. When it generates the right commands as output on stdout, pipe it into another shell.
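
    Roughly what that looks like, as a sketch against throwaway files in a scratch directory. The `gen_commands` name and the `.tmp` files are made up for illustration, and the naive single-quoting assumes paths that contain no single quotes.

```shell
#!/bin/sh
# Scratch directory with throwaway files (all names are hypothetical).
work=$(mktemp -d)
touch "$work/a.tmp" "$work/b.tmp" "$work/keep.txt"

# Print the destructive commands instead of running them.
gen_commands() {
    for f in "$work"/*.tmp; do
        [ -e "$f" ] || continue   # skip the literal pattern if nothing matches
        echo "rm -- '$f'"
    done
}

set -x          # trace each command on stderr while developing
gen_commands    # step 1: read the generated commands and check them
set +x

gen_commands | sh   # step 2: only now hand them to a shell to execute
ls "$work"          # only keep.txt is left
```

    Nothing is deleted until the last pipeline, so a bad glob shows up as a wrong line of output rather than a missing file.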




  • @DaveK said:

    And echoing rather than executing any command that writes/deletes/updates data.  When it generates the right commands as output on stdout, pipe it into another shell.

    Interesting approach.... I like that!



  • @snoofle said:

    @locallunatic said:

    OK the JSA shouldn't have offered to do that obviously, but I still feel a little bad for him getting fired.

    Who said he was fired? He was the only SA on duty and they needed him to revert all the passwords - one at a time - to get everything back up and running.

    Sorry, where I am the JSA would have been shown the door so that the higher ups could claim that it was his mistake that caused the big issue but tell the customer that they don't need to worry cause he is gone.



  • @snoofle said:

    @DaveK said:

    And echoing rather than executing any command that writes/deletes/updates data.  When it generates the right commands as output on stdout, pipe it into another shell.

    Interesting approach.... I like that!

    And if auditing/reproducibility is a concern, you can tee the commands into a logfile on the way.
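
    Something like this, again only a sketch on scratch files. The `gen_commands` name, the `.dat` files, and the `commands.log` path are hypothetical.

```shell
#!/bin/sh
# Scratch directory with throwaway files (all names are hypothetical).
work=$(mktemp -d)
touch "$work/old1.dat" "$work/old2.dat"

# Emit the commands as text; assumes paths without single quotes.
gen_commands() {
    for f in "$work"/*.dat; do
        [ -e "$f" ] || continue
        echo "mv -- '$f' '$f.bak'"
    done
}

# tee keeps a copy of exactly what was sent to the shell.
gen_commands | tee "$work/commands.log" | sh

cat "$work/commands.log"   # the audit trail: one line per command executed
```

    The log holds the literal commands that were piped to the shell, so it can be reviewed afterwards or replayed on another box.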




  • @joe.edwards said:

    @blakeyrat said:
    I give the reason it can't be done FIRST. It's like writing a newspaper article, you lead off with the most important bit of information, "it would completely break all our apps, but..."

    I tried that, too. You must have wiser business users than I. They completely erase everything but the yes or no when they forward emails to blame IT for bad decisions.

    FTFY so it looks more like things I deal with.



  • @joe.edwards said:

    @blakeyrat said:
    I give the reason it can't be done FIRST. It's like writing a newspaper article, you lead off with the most important bit of information, "it would completely break all our apps, but..."

    I tried that, too. You must have wiser business users than I. They completely tune out everything but the yes or no.

    Face it, the business people hear what they want to hear and not a word more. When I told one "the front end will be done by the end of next week, and the rest of the application should be ready to roll out in ninety days with the next scheduled release", memos started coming back from every direction that the whole app would be ready by the end of next week.

    In fact, I strongly suspect that if you simply answered every question with "toilet purple kittycat nightgown pastrami", they'd somehow hear that as "your idea is brilliant; go right ahead with it".  And then they'd quote you on that.



  • @snoofle said:

    At least it's Friday.

    So now everyone gets to work on the weekend to fix this mess!



  • @OhNoDevelopment said:

    @snoofle said:
    At least it's Friday.

    So now everyone gets to work on the weekend to fix this mess!

    Nah - already fixed (there aren't that many production logins or applications, so it was fairly easy to get the list, and decrypt the passwords from the run-scripts).

    Mind you, I didn't fix it to help these idiots (who created a problem where none actually existed); I'm just really bored and it was something to do.

