What could possibly go wrong?



  • I'm a web developer for a small ISP, and in the course of doing business we expose limited functionality from our internal Network Management System to our customers. Sadly, this NMS provides no functional API that I'm aware of, so everything we do with it boils down to a semi-hack. Not a big deal; it still works fairly well most of the time, but recently our admin made a change on the NMS server that broke one of the pieces of functionality we provide (a very simple network monitoring service that lets the customer enter a usage threshold and receive an email when a circuit exceeds that percentage, plus a second email when the circuit falls back below it; a rough sketch of that logic is at the end of this post). After doing some assessment, I laid out what would be needed to fix it: the data structure we were originally using wouldn't work, so there would need to be modifications to a couple of database tables, to the scripts that import and process the circuits, and to the code that saves what percentage the customer wants to be alerted at. Fairly simple stuff, perhaps a couple of days' worth of work, though made more complicated by the fact that I don't have a test environment that will let me fully test the edge cases of the implementation (a WTF in itself). Since I'll be off work all next week, I let the admin know that we probably won't be able to have a fix in place until after the first of the year, since we want to make sure everything works before we push any kind of update out.

    The admin, unbeknownst to me, decided to just go off and write a script to solve this whole problem himself, despite the fact that when I was explaining the process to him yesterday, it took three tries just to get him to understand the simple step of correlating a circuit to a given customer email and threshold percentage, and he's never coded a day in his life. This morning, he IMs me asking for help on his perl script (never mind that I'm still waiting for specifications on which pieces of data we need to import and process in order to make the changes he's requested), and the thing is a total mess. I tell him not to do it, but instead to let this go through the proper channels so we can fix things correctly rather than institute a hack workaround, since hacks like this are guaranteed to break over the holidays, when better than half our development team will be out (including the only person who properly understands how all of this works: me). He informs me that it "has to be done today", because this customer has been "calling for weeks" (yet the first ticket for it wasn't opened until Wednesday).

    My only regret is that I'm not going to be here next week to watch this thing blow up.
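    For what it's worth, the alert behaviour described at the top of this post (one email when a circuit crosses the customer's threshold, a second when it drops back below) is basic per-circuit hysteresis. Here's a rough Python sketch of that logic only; the helper names (get_utilization, send_email) and the data shapes are hypothetical stand-ins, not anything from the actual NMS, tables, or import scripts:

    def check_circuits(circuits, alerted, get_utilization, send_email):
        # circuits: dict of circuit_id -> (customer_email, threshold_pct)
        # alerted: set of circuit_ids currently over their threshold; this is
        #          the state that has to survive between runs
        # get_utilization / send_email: hypothetical helpers, not real NMS calls
        for circuit_id, (email, threshold_pct) in circuits.items():
            usage_pct = get_utilization(circuit_id)  # current utilization, 0-100
            if usage_pct >= threshold_pct and circuit_id not in alerted:
                send_email(email, "Circuit %s at %.1f%%, over your %s%% threshold"
                                  % (circuit_id, usage_pct, threshold_pct))
                alerted.add(circuit_id)
            elif usage_pct < threshold_pct and circuit_id in alerted:
                send_email(email, "Circuit %s back below %s%% (now %.1f%%)"
                                  % (circuit_id, threshold_pct, usage_pct))
                alerted.discard(circuit_id)
        return alerted

    The only interesting state is the "already alerted" set, which is what needs to persist between runs and why the database table changes come into the fix at all.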



  • @SituationSoap said:

    My only regret is that I'm not going to be here next week to watch this thing blow up.

    Personally, I love it when I get to skip the fallout, because the mess usually lands in my lap somehow. Just make sure your mobile's turned off.


  • Discourse touched me in a no-no place

    I came to the site for this post expecting a better-laid-out rant than the unreadable screed presented to me in email.



    I know that code snippets, for example, break when the notifications end up in Gmail, but work out fine when I click the link and read them on CS.



    An occasional, but necessary, interruption to my enjoyment of TDWTF via email.



    Sadly, the only difference here was one paragraph break.



  • @PJH said:

    Sadly, the only difference here was one paragraph break.

    So your expectation was one break and the original post has two?

    This sounds like we will be hearing more about this WTF, and I'm sure the writing style will be sufficient to get the basic WTFness across.



  • @SituationSoap said:

    He informs me that it "has to be done today", because this customer has been "calling for weeks" (yet the first ticket for it wasn't opened until Wednesday).

    This reminds me of a problem from several years back.  I released a change about a week before I was going to be out for a two-week vacation.  The day before I left, someone called in a bug report caused by the change.  They said they needed the problem fixed immediately, because it had their whole operation down, and they'd been "calling about it for months".

    "Months"?  While I was able to easily reproduce their problem with the updated code, I couldn't reproduce it using the old code.  I checked the logs, there was no apparent activity from their system in the span of a year, apart from that day.

    They ran an existing service that the whole company used.  How could they not have been in the log files?  They were considered quite important, and so had been on the notification list for the change from the very beginning (you know, the message that said, "I've deployed this update on test box foo, please point your test environment there and test it to make sure it works for you").  Their test box hadn't talked to any of our test boxes in the same time period.

    I didn't want to update our code that day, because deploying a code update the day before a vacation is asking for problems.  The update was working for a bunch of people who had tested early and thanked us for the new functionality.  As such, I was reluctant to back out the change.  They were really pushing to get a fix out that day, so one of my coworkers tried working up a fix.  Unfortunately, he wasn't familiar with that code, and his fix would've broken things for a lot of other people.  Actually, that wasn't surprising; I didn't think *I* could've worked up a real fix in the time given, either.

    Fortunately, just before I left for the day, somebody else from the team that was complaining about the issue responded to the thread, saying that the code that was having an issue with the update had *supposedly* been deployed months ago, but he had checked the system and found it had actually been deployed that morning, without any change control process whatsoever: it hadn't even graced their test box.  He fixed the issue by issuing a revert command on their version control system, as the offending code hadn't even been checked in.

    The original person complaining about the issue replied, saying that the guy who 'fixed' the issue had just lost man-months of work, because that code didn't exist anywhere else and he'd already exited out of his editor.  The end of the story got even better, as that reply email helped justify getting rid of the guy.  (No, the code wasn't actually lost; the fixer made *four* copies of it before reverting: one to their test environment, one to his home directory, one to his laptop, and one to his USB drive.)  The bug in my code was fixed about a month later, over six months before anybody actually used it for real.



  • @tgape said:

    This reminds me of a problem from several years back.

    Front page material, IMO.  It hits squarely on the fact that we're not always at fault for issues, yet we're suddenly responsible for fixing them.

