Schrodinger's Server



  • One of our production servers suddenly became unreachable, and some customers complained. Not to customer support. Not to sales. But to the CEO. The CEO stomped on the CTO, who stomped on the Production Manager, and then all three went to jointly stomp on the production SA du jour.

    Unfortunately, the guy scheduled to be on duty was out sick, and a brand new junior SA was holding down the fort in his absence.

    C**: Why is server <X> down? Customers are threatening to leave if we don't keep the systems alive and available! Log in and see what went wrong!

    JSA: Sure, do you know the password?

    C**:  No, don't YOU know it?

    JSA: No, I just started working here two days ago. I know some of the passwords, but not for this box. The usual SA is unreachable.

    C**: Why aren't all the passwords for the master login the same?

    JSA: That would seem to be a sensible policy, I don't know why it's like this. If you like, I can change the application master password and use that login, but it might break scripts that rely on it.

    C**: Do it!

    JSA: <writes a script to stream edit the yp password and changes EVERY login's password to: "welcome">

    PM: <too late> NOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO!

    C**: What happened?

    PM: He just changed EVERY password - it's going to break everything!

    C**: Roll it back!

    JSA: You can't roll back a scripted stream edit

    C**: So is the server alive or dead?

    JSA: I can ping it so it must still be up, but none of the applications on that box seem to be responsive, and now I'm starting to get alarms from the other servers - applications seem to be crashing all over the place; I think we may have bigger problems

    At least it's Friday.

     



  • Great, now there's coffee all over my desk and monitor. Thanks a lot.



  • OK the JSA shouldn't have offered to do that obviously, but I still feel a little bad for him getting fired.


  • Considered Harmful

    I've learned from too much experience that if you say to a business user, "we can do that, but [fire and brimstone and death and destruction]", they will only hear the first part. Now I just say, "no, it can't be done."



  • I give the reason it can't be done FIRST. It's like writing a newspaper article, you lead off with the most important bit of information, "it would completely break all our apps, but..."

    I wouldn't expect a junior guy who's only been in the office 2 days to know that, but if he's quick he's already learned it. If he's not then we'll get lots more Snoofle in the future!


  • Considered Harmful

    @blakeyrat said:

    I give the reason it can't be done FIRST. It's like writing a newspaper article, you lead off with the most important bit of information, "it would completely break all our apps, but..."

    I tried that, too. You must have wiser business users than I. They completely tune out everything but the yes or no.



  • Yeah...



    That draconian process that the CEO and the CTO came up with, way back in the day, is only meant to be followed by the idiots they employ... obviously the guys at the top who came UP with the process shouldn't be beholden to it, as they clearly know what they're doing.




    I like how they can't actually blame anyone but themselves... so I look forward to hearing all about how they blame everyone but themselves.



  • @locallunatic said:

    OK the JSA shouldn't have offered to do that obviously, but I still feel a little bad for him getting fired.

    Who said he was fired? He was the only SA on duty and they needed him to revert all the passwords - one at a time - to get everything back up and running. Fortunately, a few of us were  around (I'm not supposed to know the production passwords, but I do) to help out.

    By random chance, I happened to know that the only thing running on the flaky box was an FTP server, so if it died, the only impact was customers wouldn't be able to use a certain application to retrieve documents (that particular app was the only one still functioning). It turned out that management told the junior SA to check on processes that weren't supposed to be running on that box, so of course they didn't respond.

    After some digging, the whole "glitch" turned out to be caused by an end user at the client who forgot his new password and couldn't log in, so he bucked it up the chain that he couldn't log in, and the folks above assumed our stuff was down and bitched up the chain.

    Idiots.

     



  • @snoofle said:

    After some digging, the whole "glitch" turned out to be caused by an end user at the client who forgot his new password and couldn't log in, so he bucked it up the chain that he couldn't log in, and the folks above assumed our stuff was down and bitched up the chain.

    Idiots.

     

    LMFAO!!!!

    You should've taken that other job xD


  • ♿ (Parody)

    @blakeyrat said:

    I wouldn't expect a junior guy who's only been in the office 2 days to know that, but if he's quick he's already learned it. If he's not then we'll get lots more Snoofle in the future!

    Yes. But I also give a lot of allowance for a junior guy (2 days!?...who thought it was a good idea to leave him effectively in charge) with multiple C-level execs yelling at him. There're pretty much no good outcomes from that sort of situation.



  • @boomzilla said:

    @blakeyrat said:
    I wouldn't expect a junior guy who's only been in the office 2 days to know that, but if he's quick he's already learned it. If he's not then we'll get lots more Snoofle in the future!

    Yes. But I also give a lot of allowance for a junior guy (2 days!?...who thought it was a good idea to leave him effectively in charge) with multiple C-level execs yelling at him. There're pretty much no good outcomes from that sort of situation.

    Not to mention that it depends on the types of scripts you're using. Sometimes it only takes a tiny typo to create disaster - like the difference between "rm -rf ./*" and "rm -rf /*" (though the --preserve-root option should mitigate such nowadays)



  • @Rhywden said:

    @boomzilla said:
    @blakeyrat said:
    I wouldn't expect a junior guy who's only been in the office 2 days to know that, but if he's quick he's already learned it. If he's not then we'll get lots more Snoofle in the future!

    Yes. But I also give a lot of allowance for a junior guy (2 days!?...who thought it was a good idea to leave him effectively in charge) with multiple C-level execs yelling at him. There're pretty much no good outcomes from that sort of situation.

    Not to mention that it depends on the types of scripts you're using. Sometimes it only takes a tiny typo to create disaster - like the difference between "rm -rf ./*" and "rm -rf /*" (though the --preserve-root option should mitigate such nowadays)

    I've since taught this new-hire SA the iterative-script-development technique I use, running the script at each stage to verify it's doing what you want and no more: 1) do <whatever> to get the data, 2) filter the data and verify only the rows of interest are getting through, 3) massage the data as required, 4) only after you're sure your output is desired do you add the code to make the actual change.
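
For anyone who wants to see the four stages spelled out, here is a rough sketch in plain sh. The login list, the `:app` filter and the `APPLY` switch are all made up for illustration; this is not the script from the story, just one way the staged approach might look.

```sh
#!/bin/sh
# Stage 1: get the data. Hypothetical login:role pairs, not real accounts.
get_data() {
    printf '%s\n' 'appmaster:app' 'ftpsvc:service' 'alice:user' 'bob:user'
}

# Stage 2: keep only the rows of interest (here: the single app login).
filter_data() {
    grep ':app$'
}

# Stage 3: massage each surviving row into a description of the action.
massage_data() {
    cut -d: -f1 | sed 's/^/would reset password for: /'
}

# Run each stage on its own and check the output before adding the next one.
stage1=$(get_data)
stage2=$(printf '%s\n' "$stage1" | filter_data)
stage3=$(printf '%s\n' "$stage2" | massage_data)

printf 'stage 1 rows: %s\n' "$(printf '%s\n' "$stage1" | wc -l | tr -d ' ')"
printf 'stage 2 rows: %s\n' "$(printf '%s\n' "$stage2" | wc -l | tr -d ' ')"
printf '%s\n' "$stage3"

# Stage 4: the real change goes here, and only once stages 1-3 look right.
# APPLY is a hypothetical opt-in switch; the default is a dry run.
if [ "${APPLY:-0}" = 1 ]; then
    echo "apply step would run here"
else
    echo "dry run: nothing changed"
fi
```

If stage 2 had printed four rows instead of one, that would have been the moment to stop, well before anything was changed.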


  • @snoofle said:

    @Rhywden said:

    @boomzilla said:
    @blakeyrat said:
    I wouldn't expect a junior guy who's only been in the office 2 days to know that, but if he's quick he's already learned it. If he's not then we'll get lots more Snoofle in the future!

    Yes. But I also give a lot of allowance for a junior guy (2 days!?...who thought it was a good idea to leave him effectively in charge) with multiple C-level execs yelling at him. There're pretty much no good outcomes from that sort of situation.

    Not to mention that it depends on the types of scripts you're using. Sometimes it only takes a tiny typo to create disaster - like the difference between "rm -rf ./*" and "rm -rf /*" (though the --preserve-root option should mitigate such nowadays)

    I've since taught this new-hire SA the iterative-script-development technique I use, running the script at each stage to verify it's doing what you want and no more: 1) do <whatever> to get the data, 2) filter the data and verify only the rows of interest are getting through, 3) massage the data as required, 4) only after you're sure your output is desired do you add the code to make the actual change.
    I make a lot of use of "set -x" during the initial stages of developing any remotely risky or complex script.  And echoing rather than executing any command that writes/deletes/updates data.  When it generates the right commands as output on stdout, pipe it into another shell.
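
A minimal sketch of the echo-first approach, assuming a scratch directory and a made-up rename job (the file names and the `gen_commands` helper are hypothetical). The script only prints commands; a second shell runs them once they look right.

```sh
#!/bin/sh
# Scratch directory with hypothetical files, so nothing real is touched.
workdir=$(mktemp -d)
touch "$workdir/a.log" "$workdir/b.log" "$workdir/keep.txt"

# Emit the commands on stdout instead of executing them.
gen_commands() {
    for f in "$workdir"/*.log; do
        echo "mv '$f' '$f.old'"
    done
}

# Pass 1: review the generated commands.
gen_commands

# Pass 2: once they look right, pipe the same output into another shell.
# "sh -x" traces each command as it runs, much like "set -x" in a script.
gen_commands | sh -x

ls "$workdir"
```

The single quotes in the generated commands are a shortcut that assumes the paths contain no single quotes of their own.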




  • @DaveK said:

    And echoing rather than executing any command that writes/deletes/updates data.  When it generates the right commands as output on stdout, pipe it into another shell.
    Interesting approach.... I like that!



  • @snoofle said:

    @locallunatic said:

    OK the JSA shouldn't have offered to do that obviously, but I still feel a little bad for him getting fired.

    Who said he was fired? He was the only SA on duty and they needed him to revert all the passwords - one at a time - to get everything back up and running.

    Sorry, where I am the JSA would have been shown the door so that the higher ups could claim that it was his mistake that caused the big issue but tell the customer that they don't need to worry cause he is gone.



  • @snoofle said:

    @DaveK said:

    And echoing rather than executing any command that writes/deletes/updates data.  When it generates the right commands as output on stdout, pipe it into another shell.
    Interesting approach.... I like that!

    And if auditing/reproducibility is a concern, you can tee the commands into a logfile on the way.
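
Roughly like this, assuming a scratch directory and a hypothetical `commands.log` (the generated commands are placeholders too):

```sh
#!/bin/sh
workdir=$(mktemp -d)
logfile="$workdir/commands.log"   # hypothetical log location

# Stand-in for whatever script generates the real commands.
gen_commands() {
    echo "echo first > '$workdir/one.txt'"
    echo "echo second > '$workdir/two.txt'"
}

# tee appends a copy of every generated command to the log,
# then passes the same stream on to the shell that executes it.
gen_commands | tee -a "$logfile" | sh

cat "$logfile"
```

Using `tee -a` appends rather than overwrites, so repeated runs build up a history in the same file.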




  • @joe.edwards said:

    @blakeyrat said:
    I give the reason it can't be done FIRST. It's like writing a newspaper article, you lead off with the most important bit of information, "it would completely break all our apps, but..."

    I tried that, too. You must have wiser business users than I. They completely erase everything but the yes or no when they forward emails to blame IT for bad decisions.

    FTFY so it looks more like things I deal with.



  • @joe.edwards said:

    @blakeyrat said:
    I give the reason it can't be done FIRST. It's like writing a newspaper article, you lead off with the most important bit of information, "it would completely break all our apps, but..."
    I tried that, too. You must have wiser business users than I. They completely tune out everything but the yes or no.
    Face it, the business people hear what they want to hear and not a word more.  When I told one "the front end will be done by the end of next week, and the rest of the application should be ready to roll out in ninety days with the next scheduled release", memos started coming back from every direction that the whole app would be ready by the end of next week.

    In fact, I strongly suspect that if you simply answered every question with "toilet purple kittycat nightgown pastrami", they'd somehow hear that as "your idea is brilliant; go right ahead with it".  And then they'd quote you on that.



  • @snoofle said:

    At least it's Friday.

    So now everyone gets to work on the weekend to fix this mess!



  • @OhNoDevelopment said:

    @snoofle said:
    At least it's Friday.

    So now everyone gets to work on the weekend to fix this mess!

    Nah - already fixed (there aren't that many production logins or applications, so it was fairly easy to get the list, and decrypt the passwords from the run-scripts).

    Mind you, I didn't fix it to help these idiots (who created a problem where none actually existed); I'm just really bored and it was something to do.



  • @Rhywden said:

    rm -rf ./*

    Why would you ever, ever, ever write this that way? It's like you want to make a typo..



  • @morbiuswilters said:

    @Rhywden said:
    rm -rf ./*

    Why would you ever, ever, ever write this that way? It's like you want to make a typo..

    I wouldn't. Just an extreme example. I'd rather use the already suggested approach of "data gathering/checking first - then executing potentially destructive commands on the data"



  • @snoofle said:

    After some digging, the whole "glitch" turned out to be caused by an end user at the client who forgot his new password and couldn't log in, so he bucked it up the chain that he couldn't log in, and the folks above assumed our stuff was down and bitched up the chain.

    Idiots.

    I was actually expecting the story to be "they thought it was down, they acted as if it was down, it wasn't down" before this part.

    This reminds me of a case when I was working at Massive Insurance Co. Somebody had reported malware on a server and the top guys ordered all the sites to be shut down. I don't remember the details, but I remember we certainly didn't need to shut down all the websites to protect the users or fix the real problem (which might have involved real malware).

    Yeah... your story is better.



  • @da Doctah said:

    I strongly suspect that if you simply answered every question with "toilet purple kittycat nightgown pastrami"

    Oh great. Now I have to change my password.



  • @joe.edwards said:

    @blakeyrat said:
    I give the reason it can't be done FIRST. It's like writing a newspaper article, you lead off with the most important bit of information, "it would completely break all our apps, but..."

    I tried that, too. You must have wiser business users than I. They completely tune out everything but the yes or no.

    Don't give them an actual "yes". I borrow a phrase from Jeeves: "I couldn't advise it".



  • @pjt33 said:

    I borrow a phrase from Jeeves: "I couldn't advise it"

    +1. Better than my Jeeves quote, which is a lot more egotistical.



  • @mikeTheLiar said:

    @pjt33 said:
    I borrow a phrase from Jeeves: "I couldn't advise it"
    +1. Better than my Jeeves quote, which is a lot more egotistical.
    Probably not advisable in this sort of context, but I've always loved the letter HL Mencken used to send to crackpots who asked him for his advice about incoherent insane ramblings that he could make no sense of (as a newspaperman, he tended to attract quite a lot of these):

    Dear Sir (or Madam),

    You may be right.

    Yours sincerely,

    Henry Louis Mencken

    It's my understanding that he had a supply of postcards preprinted with this response.



  • @snoofle said:

    JSA: That would seem to be a sensible policy, I don't know why it's like this. If you like, I can change the application master password and use that login, but it might break scripts that rely on it.
     

    @snoofle said:

    JSA: <writes a script to stream edit the yp password and changes EVERY login's password to: "welcome">

    Someone's scripting skillz aren't up to much.

    @snoofle said:

    C**: Why aren't all critical passwords recorded somewhere safe for situations such as these?

    FTFY.

    @snoofle said:

    JSA: I can ping it so it must still be up

    (a) I don't see how that action requires a password change first

    (b) If this was a supportable production server, why the hell isn't any monitoring software providing an early warning system?

    (c) Another case for centralised logging to provide analysis separate from the server.



  • @joe.edwards said:

    I've learned from too much experience that if you say to a business user, "we can do that, but [fire and brimstone and death and destruction]", they will only hear the first part.
     

    True.dat

    @joe.edwards said:

    Now I just say, "no, it can't be done."

    /bin/false. I'd have them sign off so that they know the consequences and still agreed to it.  Telling them it can't be done is plainly wrong, and you'll lose credibility when someone else shows that it can be done.

    @blakeyrat said:

    I give the reason it can't be done FIRST

    That. I read them the riot act before handing over the baton (and, in some cases, ensure there are witnesses/evidence so they can't claim ignorance later).

    @blakeyrat said:

    I wouldn't expect a junior guy who's only been in the office 2 days to know that

    I would expect a system administrator, no matter their level of experience, to think and act like a system administrator. It's a different matter if they were an apprentice/trainee but if their title is JSA then they'll have been through some interview process to ascertain their capability and aptitude for the job. A sysadmin unfamiliar with a new corporate culture I can understand; a loose cannon is unforgivable.

     



  • @snoofle said:

    It turned out that management told the junior SA to check on processes that weren't supposed to be running on that box, so of course they didn't respond.
     

    Poor configuration management? I'd like to think there's some diagram documenting what services are running on which box, along with their importance (which could help people like new Junior SAs) but knowing WTF Inc it's all stuck in people's heads.

    @snoofle said:

    After some digging, the whole "glitch" turned out to be caused by an end user at the client who forgot his new password and couldn't log in, so he bucked it up the chain that he couldn't log in, and the folks above assumed our stuff was down and bitched up the chain.

    Incident escalation via ignorance. It's one thing for a customer to do it, it's another thing to self-LART. Didn't anyone along the chain actually ask for a clear incident report? Have people in the chain understood how much money has been wasted by adding another layer of hysteria for each communication channel it passes through?



  • @snoofle said:

    I've since taught this new-hire SA the iterative-script-development technique I use
     

    .. I wouldn't expect scripters to know the basics of the SDLC but it doesn't hurt to take them through some fundamentals, if only as a risk reduction measure (assuming that the scripts may be running with elevated privs).

    I'd certainly advocate testing them out.

    @DaveK said:

    I make a lot of use of "set -x" during the initial stages of developing any remotely risky or complex script.  And echoing rather than executing any command that writes/deletes/updates data.  When it generates the right commands as output on stdout, pipe it into another shell.

    +1 for "set -x", or "noexec".  I also use the echo trick to show what is about to be executed, but the final output sometimes doesn't match the actual command (especially in the case of quoted text or metacharacters).
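
One way to narrow that gap, sketched under assumptions: quote each argument before echoing it, so the printed command is exactly what a shell would re-read. The `quote` helper and the file name below are hypothetical; bash users could get a similar effect from `printf '%q'`.

```sh
#!/bin/sh
# Wrap a string in single quotes so a shell re-reads it as one word.
# Embedded single quotes become '\''. Trailing newlines in the argument
# are lost to command substitution, so this is not fully general.
quote() {
    printf "'%s'" "$(printf '%s' "$1" | sed "s/'/'\\\\''/g")"
}

workdir=$(mktemp -d)
cd "$workdir" || exit 1
name='two words.txt'   # hypothetical awkward filename

# Naive echo: looks plausible on screen, but a shell would split it
# into two separate arguments.
naive="touch $name"
# Quoted version: what you see is what the shell will run.
safe="touch $(quote "$name")"

echo "naive: $naive"
echo "safe:  $safe"

# Only the quoted form is piped into another shell.
echo "$safe" | sh
ls
```

Metacharacters such as `*`, `$` and `;` are also inert inside single quotes, which is why single-quoting is used here rather than double-quoting.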


  • @da Doctah said:

    Face it, the business people hear what they want to hear and not a word more.
     

    Communication is a two-way thing: if they're not hearing it right then perhaps we're not speaking it right.

    Yeah, they hear what they want to hear, but that doesn't mean we can't put it in terms to make them listen more intently.

    @da Doctah said:

    "the front end will be done by the end of next week, and the rest of the application should be ready to roll out in ninety days with the next scheduled release"

    "Given no unexpected issues, the finished product can be released in 90 days. If all goes according to plan, next week we'll be in a position to show you a non-working prototype if you wish to see what it looks like."

    @da Doctah said:

    "toilet purple kittycat nightgown pastrami"

    "fizzbuzz dildo horse battery staple. By tomorrow, if possible."



  • @da Doctah said:

    Dear Sir (or Madam),

    You may be right.

    Yours sincerely,

    Henry Louis Mencken

    It's my understanding that he had a supply of postcards preprinted with this response.

     

    .. which were returned by pedantic dickweeds that pointed out the salutation used requires "Yours faithfully".

     


  • Garbage Person

    @Cassidy said:

     

    .. which were returned by pedantic dickweeds that pointed out the salutation used requires "Yours faithfully".

    Wait, what? I love arcane pedantic bullshit like that, and I've never heard of any sort of link between salutation and closing. CITE THIS IMMEDIATELY.

     



  • @Cassidy said:

    I would expect a system administrator, no matter their level of experience, to think and act like a system administrator. It's a different matter if they were an apprentice/trainee but if their title is JSA then they'll have been through some interview process to ascertain their capability and aptitude for the job. A sysadmin unfamiliar with a new corporate culture I can understand; a loose cannon is unforgivable.

    You know damn well that a lot of employees, especially new employees, think the C**-level guys are like perfect luminous beings who can do no wrong. Or alternatively, they're scared shitless that they'll be fired for disobeying an order from them. (Protip: the opposite is true, if you can articulate the reason why you didn't do it, which is why once again for the 3,342,436th time, communication is by far the most important job skill.) I attribute this to them learning from sitcoms, which are always like this. (Think: Mr. Slate in Flintstones, Mr. Burns, that triceratops guy in Dinosaurs who gets a special mention because he literally ate his subordinates, etc.) Also watching too many military shows and failing to realize that corporations are nothing like a military.

    It takes a few years of realizing your management makes dumb decisions all the fucking time under your belt.



  • @Cassidy said:

    Yeah, they hear what they want to hear, but that doesn't mean we can't put it in terms to make them listen more intently.

    @da Doctah said:

    "the front end will be done by the end of next week, and the rest of the application should be ready to roll out in ninety days with the next scheduled release"

    "Given no unexpected issues, the finished product can be released in 90 days. If all goes according to plan, next week we'll be in a position to show you a non-working prototype if you wish to see what it looks like."

    Which gets cut down to "...the finished product can be released...next week..." and that's what actually goes out to the Greater Western World, with your name on it as the author.

     



  • @Weng said:

    CITE THIS IMMEDIATELY.
     

    Your google-fu escapes you?

    TL;DR version: I was taught (at a young age) "Dear Sir != Yours Sincerely" and "Dear Friend != Yours Faithfully" - the "mix, don't match" rule. Something as stupid as that stuck with me for years.



  • @blakeyrat said:

    You know damn well that a lot of employees, especially new employees, think the C**-level guys are like perfect luminous beings who can do no wrong.
     

    Admittedly, yeah. I was reading "junior" in terms of years of service, rather than age, so would have expected a bit more gumption behind a JSA. Evidently not.

    @blakeyrat said:

    It takes a few years of realizing your management makes dumb decisions all the fucking time under your belt

    I learned that lesson pretty quickly, but hoped there were some ulterior reasons behind it. The older I grew, the more my hope eroded into cynicism that my original fears were founded. And at this stage of life, I sadly recognise that management don't make dumb decisions, they make them again.

    I don't know what it is - as though there's some window of stupidity that management are forced to traverse during their escalation through the ranks. For some, they recognise the quagmire around their ankles and quickly step out before they sink too deep but learn from the foot stench. Others remain stationary, loudly demanding to know "why are you all growing tallerBlubBlubBlubble?"


  • Garbage Person

    @Cassidy said:

    @Weng said:

    CITE THIS IMMEDIATELY.
     

    Your google-fu escapes you?

    TL;DR version: I was taught (at a young age) "Dear Sir != Yours Sincerely" and "Dear Friend != Yours Faithfully" - the "mix, don't match" rule. Something as stupid as that stuck with me for years.

    Who said I tried to Google it? And it looks to be one of those dumb British things, like inserting u's where they aren't necessary. Also, I now know the word 'valediction', which will be the Word of the Day at work tomorrow (I'm a bit infamous for using really snarky and borderline inappropriate ones on internal emails)

     



  • @blakeyrat said:

    that triceratops guy in Dinosaurs who gets a special mention because he literally ate his subordinates
    ... which was dumb, because triceratops were herbivores.



  • @Anonymouse said:

    @blakeyrat said:
    that triceratops guy in Dinosaurs who gets a special mention because he literally ate his subordinates
    ... which was dumb, because triceratops were herbivores.

    Yeah it's shocking because you'd expect a show like this would be 100% scientifically accurate.



  •  

    And it looks to be one of those dumb British things, like inserting u's where they aren't necessary.

    The Language is English, therefore we set the rules (& make them so complicated and contradictory that no-one can ever be 100% correct)

    it is really annoying when you guys drop letters or replace s's with z's unnecessarily



  • @Weng said:

    And it looks to be one of those dumb British things, like inserting u's where they aren't necessary.

    Apparently they were necessary at one point, in order to try and represent their pronunciation in the original Old French.  Hey, I learned something today.

    Personally I like the way that the spelling 'colour' makes clear that it doesn't sound like 'colon' would with an r instead of the final n.




  • @blakeyrat said:

    this

    And people think America isn't the greatest nation ever.

