Yet Another Production Incident


  • Garbage Person

    I walk into my office one Monday morning. "PRODUCTION IS DOWN!". This is not abnormal for a Monday. In fact, the rate at which things successfully come back from the weekend update reboots is approaching "seriously fucking alarming".

    I quickly ascertain which of our cruddy dependencies didn't come back. It was the shitty, but appealingly cheap small business software we use for a specific domain task. Since it's small business desktop software being run in a fully automated enterprise environment, we had to abuse the "automation" features built into the product. Basically, it watches folders for incoming files and runs a script that tells it what buttons to click in the UI. We have a service that starts the app in the background and just lets it run on a detached desktop. This has the appealing effect of being able to kill that process and launch the client on a real desktop and actually spy on what it's doing (by watching it automagically click on shit and fill out forms). Turns out it was getting hung up on the Open File dialog. Periodically it'll forget that it needs to paste in the filename that it detected appearing in the watch directory. This despite the fact that the script quite clearly calls for it. This is always fixed by deleting and recreating the scripts. So I poke it with that particular stick, and everything starts processing again. Disaster is averted!

    Naturally, this had been going on unchecked for 2 days, so our job scheduler was full of timed out jobs that needed to be reprocessed. Handily, the web UI has a "Try that again!" button that tries the last step again.

    As our ops team clicked that button, something untoward happened: The same job got resumed twice. Actually achieving this takes a nigh-irreproducible race condition, and both copies actually SUCCEEDING (because they are then in a series of race conditions wherein they are both editing the same database state) is basically impossible. Completely unprecedented. It happened. Both jobs successfully produced output. The downstream system (an ordering system) successfully accepted both outputs. This created an impossible state where individual orders were associated with more than one production batch, which did unholy things to the manufacturing output. It was caught on the production floor when someone said "HEY THIS PILE OF STUFF LOOKS EXACTLY LIKE THIS PILE OF STUFF. ALSO BOTH THESE PILES OF STUFF ARE FULL OF DUPLICATES THAT MAKE NO SENSE!"

    So we worked with manufacturing to have them throw away the duplicates (a grand total of $33 once you factor in the materials, costs and PROFIT MARGIN). No big deal. Except the idiot who is in charge of that program thought she'd seen the same thing before (similar symptoms in manufacturing are caused by a whole class of fuckup, and we'd recently had a spate of them wherein customer service wankers attempted to manually manipulate things that are not to be manually manipulated and caused mass order duplication, and customers fed in data that caused mass order duplication, and so on and so forth. None of those are our problem. But we admitted to this one, so somehow we were at fault for all the others.)

    She escalated, and we had to do the dreaded ROOT CAUSE ANALYSIS AND CORRECTIVE ACTION PLAN! Naturally, it was dripping in lovely statistics about the earth-shattering $33 writeoff figure (I actually offered to pay for it out of pocket to not have to do the god damned paperwork) and a frank analysis of the race conditions, and a promise to block that condition from occurring in a future revision of the system, and a note that while we can block it at the UI level for a single user, there's nothing we can do with our current insane architecture to prevent two users clicking the same button to trigger it. Major million-dollar refactor job, that. Anyway, we also took pains to point out that no human being would have ever been touching that process if the shitty Rube Goldberg software hadn't gone wrong. I suggested its complete replacement.
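
    As a minimal sketch of what "blocking that condition" usually means for a retry button: make the state change itself an atomic compare-and-set, so that only one of two simultaneous clicks can win. This assumes the job state lives in a single transactional database, which the post suggests is exactly what the real architecture lacks. The `jobs` table, its `state` values, and `try_claim_retry` are hypothetical names for illustration, and SQLite is only a stand-in.

```python
import sqlite3


def try_claim_retry(conn: sqlite3.Connection, job_id: int) -> bool:
    """Atomically move a job from 'timed_out' to 'retrying'.

    Returns True only for the single caller whose UPDATE matched the row.
    (Hypothetical schema: jobs(id, state).)
    """
    with conn:  # commits on success, rolls back on exception
        cur = conn.execute(
            "UPDATE jobs SET state = 'retrying' "
            "WHERE id = ? AND state = 'timed_out'",
            (job_id,),
        )
    # rowcount is 1 if this call flipped the state, 0 if someone else already did
    return cur.rowcount == 1


# Demo: two "clicks" on the same timed-out job.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, state TEXT NOT NULL)")
conn.execute("INSERT INTO jobs (id, state) VALUES (1, 'timed_out')")
conn.commit()

first_click = try_claim_retry(conn, 1)   # wins the claim
second_click = try_claim_retry(conn, 1)  # finds the job already claimed
```

    The second caller gets `False` back and can be told the job is already being retried. An Are You Sure dialogue does nothing comparable, since both users can still confirm at the same moment.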

    Days passed, and I got a call from the VP. I MUST add an Are You Sure dialogue to the button. I pointed out that this just moves the race condition to the Yes button instead. But I MUST add an Are You Sure dialogue or else. So I accepted the change request, dashed off a note to the operations team telling them I was going to make their lives miserable by adding a pointless click, and forwarded a JIRA ticket to the web guy.

    There is already an Are You Sure dialogue on that button.

    I was also told that we were going to add additional capacity to the Shitty Software environment by purchasing more licenses and more servers. When I pointed out that this would not help the fact that it's an unreliable piece of shit and that half the time it stops working it's because the Phone Home server died and thus it wouldn't even be isolated to a single instance, I was reminded that this product costs hundreds of dollars. The nearest enterprise-grade player in the same space is tens of thousands of dollars annually.

    When I pointed out that we sell it as a $1000-a-pop feature to hundreds of clients, I was basically told to shut up.



  • Why do you still work there?

    A company with that much incompetence doesn't deserve competent employees.



  • @Weng said:

    There is already an Are You Sure dialogue on that button.

    Well then, what's the problem?

    Problem solved!


  • Discourse touched me in a no-no place

    @EvanED said:

    Well then, what's the problem?

    Obviously you need an Are You Sure You're Sure? button.



  • Yet another reason I much prefer being my own boss. If things go FUBAR, I have nobody to blame but myself. Saves a lot of stress and frustration when you can just rm -rf and start over without someone going "WTF ARE YOU DOING"


  • Garbage Person

    At this point, I'm just riding it out for two reasons:

    1. There's a good chance of managerial reorganization putting me back in a saner structure
    2. The pay is ABSURDLY good with fair to middling chances of getting even better.

    If neither of those happens, I'm out. If only one of those things happens, it had better be far enough in excess of what is expected to cover the difference.



  • A fairly predictable outcome, but still entertaining to read. I'm guessing the fact that the dialog was already there indicates this had happened before. Only last time, it cost the company $16.


  • Garbage Person

    Nah, my team wrote that front end from scratch over the past 10 months.

    No prior incidents of this type.



  • @Weng said:

    I was basically told to shut up.

    I hate politics. It gets in the way of engineering far too often.



  • @Weng said:

    it watches folders for incoming files and runs a script that tells it what buttons to click in the UI

    Not reading any more ewww ewwwwww ewwwwwwwwwww



  • @Weng said:

    I pointed out that this just moves the race condition to the Yes button instead. But I MUST add an Are You Sure dialogue or else.

    Couldn't help myself. Trying not to read further was like trying not to smell the finger after wiping it up the arse crack.

    So what you need to do here is cast the rightmost 64 bits of the workstation ID to a uint64, reduce that mod 1000, and delay acting on a Yes until the result matches the current NTP milliseconds value.

    When the problem is Rube Goldberg architecture, only a Rube Goldberg solution will do.
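
    Taken at face value, the time-slot scheme above might be sketched like this. Assumptions not in the post: the workstation ID is a 128-bit UUID, its bytes are read big-endian, and the caller supplies an NTP-synced clock reading in milliseconds. `slot_for` and `delay_until_slot_ms` are hypothetical names.

```python
import uuid


def slot_for(workstation_id: uuid.UUID) -> int:
    """Rightmost 64 bits of the ID as an unsigned integer, reduced mod 1000."""
    low64 = int.from_bytes(workstation_id.bytes[-8:], "big")  # big-endian is an assumption
    return low64 % 1000


def delay_until_slot_ms(slot: int, now_ms: int) -> int:
    """Milliseconds to wait until the millisecond-of-second equals `slot`."""
    return (slot - now_ms % 1000) % 1000


# Demo with a made-up workstation ID whose low 64 bits are 1234567.
example_id = uuid.UUID(int=1234567)
example_slot = slot_for(example_id)
```

    Two workstations whose IDs reduce to the same slot would still fire in the same millisecond, so this only lowers the odds of a collision rather than removing it, which is rather in keeping with the spirit of the suggestion.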


  • kills Dumbledore

    @FrostCat said:

    Obviously you need an Are You Sure You're Sure? button.

    https://www.youtube.com/watch?v=8Jc-QpgmcTI

    Edit, that was meant to start at 9 seconds



  • The way that I see it is that if they're overpaying you, it's probably to keep your mouth shut.


  • I survived the hour long Uno hand

    We had a Monday Morning production incident too. Total outage time: 20 minutes. Staff time trying to get at the root cause: half a day for each of a dozen people, some of which are far above my pay grade. End result: an email this morning stating "The issue was in proc foo. Some combination of X, Y, and Z resolved the issue; none of A, B, or C did. Director Gamma has put a bounty out for anyone who can figure out what exactly about condition bar caused this outage in proc foo."

    If that much information can't narrow it down enough for a room full of developers, I'm betting proc foo is a steaming pile of WTF. Bar is a condition we supposedly planned for, but in actual fact, our new client uses it in some way our older clients don't.


  • Garbage Person

    @Yamikuronue said:

    Staff time trying to get at the root cause: half a day for each of a dozen people, some of which are far above my pay grade.

    My favorite part of the RCA/CA process is that it requires an exact sequence of events. And it has a time limit to be filled out. The time limit is too short to figure out anything particularly tricky, never mind go through a good engineering planning process on how to fix it.

    The time limit? The attention span of whoever asked for it. Or, more formally, how long they're willing to wait before escalating. I've met fruit flies willing to wait longer.


  • I survived the hour long Uno hand

    Our traffic light is nice, because when people are freaking out about a problem, being able to point to a GIANT wall sign that reflects their state of panic helps them realize we are, in fact, taking this seriously. People get more patient if they think you're handling it.


  • Garbage Person

    @chubertdev said:

    The way that I see it is that if they're overpaying you, it's probably to keep your mouth shut.

    The way I see it, there are 3 broad pay buckets. "Do as you're told, n00b", "You are really good at what you do. Please give us your professional advice." and "Hush money."

    I'm still in the second bucket.


  • Garbage Person

    @Yamikuronue said:

    People get more patient if they think you're handling it.

    Unfortunately, we're distributed. Our users/internal customers live in far off manufacturing sites. Our bosses live in far off office complexes. Of the handful of people capable of walking into our offices, one used to be our boss and still thinks he is (and he was BAD. And is worse now.) The rest think his shit doesn't stink.

    Nobody can see us working on their problem - they can only take our word for it. And every time anyone else in this company makes assurances about anything, they're lying. So we get painted with that brush.

    Edit: This just reminded me of another really good one! It's going to be tough as hell to anonymize, I'll work on it today.


  • FoxDev

    @Weng said:

    It's going to be tough as hell to anonymize, I'll work on it today

    -drool- the ones that are hard to anonymize are the best WTFs!



  • @Weng said:

    No prior incidents of this type.

    Well, you guys just planned for disaster then, didn't you?



  • Cripes, it sounds like the manufacturing management software you use is worse than InfoR Visual which my company uses.


  • Garbage Person

    Failure is inevitable. The details vary.

