'Compared to how these normally go, that rollout wasn't that bad'


  • Garbage Person

    The subject is a quote, spoken by me to my carpool-mate (who is only a 6-month veteran of the team and has never seen our silly-season before).

     

    Our silly-season kicked off with a vengeance this week, with the much-anticipated launch of one of my team's programs. This program's original go-live date was in May. It was pushed back repeatedly due to hardware difficulties in manufacturing (in short, the 20 year old machine they were trying to use was giving up the ghost. Ultimately, the machine blew the hell up during light-duty pre-production testing and a replacement was ordered) 

     

    Rewind to the previous week:

    Monday -  A frantic call from manufacturing comes in telling us that the next program to go live will have to go live in a partial DR failover mode, because the shiny new machine they just bought wasn't working. The DR for this failure mode involves having another facility do the initial manufacturing step and overnight the output to the primary site for subsequent steps. Not a huge deal.

    Thursday - A frantic call from manufacturing comes in telling us they 'just heard about how we were going to be delivering output' and that 'their equipment can't deal with that'. After a quick check, it turns out they're talking about the program that goes live in two weeks. Alright. Fine. NBD. We do some digging and find out that another development group, working on another project producing a similar product, had told them flat out that it was impossible. Some further digging revealed that we do similar things all the time; the page ordering is just a little bit different. They didn't believe us, so we scheduled a roadtrip for Friday to go up there and show them.

    Friday - We go on our roadtrip and make the other development group look REALLY BAD by walking in, giving a machine operator a notepad and asking him to draw their needs for us, and sitting in a conference room for 4 hours before walking out and telling the supervisor 'run this'. Runnification happened, and satisfactory output was produced. There was much rejoicing and much consumption of corporate-sponsored pizza and beer lunches.

    Friday, 3:30PM - We're sitting in traffic. We get a call from the PM on the project due to go live on Monday.  "We have to go live in stage 2 failover - the customer never approved the stage 1 failover site's security. Sales just told us this 10 minutes ago. I am livid." Stage 2 failover involves a facility that does related production and is actually security approved doing the first several manufacturing steps, stuffing the output in a box and overnighting it to the primary site for final assembly and packaging. The stage 2 facility is highly automated and very busy and physically can not get output to the primary site in less than 72 hours from receipt of our output. We have a 24 hour window to do pre-processing in IT before it even lands in manufacturing. The scheduled batch jobs to meet the needs of the stage 2 facility don't take place until 12 hours after that. That's 108 hours. The customer is scheduled to visit the primary site on Wednesday. We will not receive input data until 8AM Monday morning.
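
    For anyone who wants the schedule arithmetic spelled out, here is a minimal sketch. The three durations come from the post; the calendar date is an arbitrary Monday picked only so the weekdays line up, and the function and constant names are made up for illustration.

```python
from datetime import datetime, timedelta

# Durations taken from the post, in hours.
PREPROCESSING_HOURS = 24       # IT pre-processing window
BATCH_WAIT_HOURS = 12          # wait for the scheduled stage 2 batch jobs
STAGE2_TURNAROUND_HOURS = 72   # stage 2 facility: receipt to primary site


def earliest_arrival(input_received):
    """Return (total hours, earliest time output can reach the primary site)."""
    total = PREPROCESSING_HOURS + BATCH_WAIT_HOURS + STAGE2_TURNAROUND_HOURS
    return total, input_received + timedelta(hours=total)


# Assumption: 2024-01-01 is used only because it happens to be a Monday.
monday_8am = datetime(2024, 1, 1, 8, 0)
total_hours, arrival = earliest_arrival(monday_8am)

# End of the Wednesday on which the customer visits (midnight into Thursday).
end_of_wednesday = datetime(2024, 1, 4, 0, 0)

print(total_hours)                 # total lead time in hours
print(arrival.isoformat())         # earliest arrival at the primary site
print(arrival > end_of_wednesday)  # True means the normal schedule misses the visit
```

    On the normal schedule the output would not reach the primary site until Friday evening, well after the Wednesday visit, which is why everything downstream had to run early.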

     Friday, 4:00PM - The rollout window begins.

    Friday, 5:00PM - The rollout window ends when the last op goes home. (Our procedures require a dev-lead to shoulder-check the ops when they're doing a rollout with this much value).

    Friday, 5:30PM - We give up and stop for dinner and beer.

     

    Monday, 7:20AM - I roll in, and blast into the ops office. "You. Rollout. Now. God's personal authority."

    Monday, 7:30AM - I eyeball the customer's data feed and notice some peduliarities. Following up reveals the customer had IT issues on their end that caused the vast bulk of the records to reject. They'll try that again tomorrow

    Monday, 7:35AM - Sitting at the op's desk, I dial the infrastructure development supervisor's desk to make a very shouty bug report. I don't get an answer.

    Monday, 7:36AM -  "SQL console. Type exactly what I say... "DELETE FROM tbl... WHERE id NOT IN (SELECT MAX(id) FROM.... GROUP BY orderid)"
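
    The cull dictated above follows a common pattern: keep the newest row per order and delete the rest. A minimal sketch using an in-memory SQLite database follows; the table name `tbl_records`, its columns, and the sample rows are hypothetical stand-ins, since the real names are elided in the post.

```python
import sqlite3


def cull_duplicates(conn):
    """Delete every row except the highest-id row per orderid.

    Returns the number of rows deleted.
    """
    cur = conn.execute(
        "DELETE FROM tbl_records "
        "WHERE id NOT IN (SELECT MAX(id) FROM tbl_records GROUP BY orderid)"
    )
    conn.commit()
    return cur.rowcount


# Hypothetical schema and data: orders A and B were written twice
# because part of the job ran twice; order C was written once.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tbl_records (id INTEGER PRIMARY KEY, orderid TEXT, payload TEXT)"
)
conn.executemany(
    "INSERT INTO tbl_records (orderid, payload) VALUES (?, ?)",
    [
        ("A", "first run"),
        ("B", "first run"),
        ("A", "second run"),
        ("B", "second run"),
        ("C", "second run"),
    ],
)

deleted = cull_duplicates(conn)
remaining = conn.execute(
    "SELECT id, orderid FROM tbl_records ORDER BY id"
).fetchall()
print(deleted, remaining)
```

    Keeping `MAX(id)` assumes the later copy is the one worth keeping; if the first run were the good one, `MIN(id)` would be the choice instead.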

    Monday, 7:40AM - I dial the PM's cell phone. She's three time zones back, it's 4:40AM. "Infrastructure bug. Part of job ran twice. Culled duplicates. Client will receive duplicate reports. Other IT group will receive duplicate files. Need to kill jobs in other IT group. BTW, good morning sunshine!"

    Monday, 8:00AM - Conf call with sales and executives.

    Monday 8:30AM - Having failed to kill the jobs, the other IT group sends a bunch of duplicated files over. I cull the dupes and tell them to clean up the results for the ones I culled. I resume processing from their return files.

    Monday, 8:31AM - Process failure notifications hit my inbox. The other IT group's output files contain only one record each - this failed sanity checking versus what we passed them.

    Monday, 8:45AM - "Oh, it overwrites the previous record's output each time it iterates." No word on how it worked just fine during the extensive months of testing, including several tests in the prod environments.

    Monday, 9:00AM - Revised outputs from other IT group delivered. Failure notifications. The header record is duplicated once for each record.

    Monday, 9:01AM - HEADDESK

    Monday, 9:03AM - I feed manually fixed files to my system and processing resumes. 
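
    The two failures above (one record per file, then one header per record) are exactly what a record-count sanity check is for. A rough sketch of such a check is below; the file layout, with header lines starting with `H` and data lines starting with `D`, is an assumption made up for illustration and not the real format.

```python
def sanity_check(lines, expected_records):
    """Compare a returned file against what was sent out.

    Assumed layout (hypothetical): header lines start with "H",
    data lines start with "D". Returns a list of problems found;
    an empty list means the file passes.
    """
    headers = [line for line in lines if line.startswith("H")]
    records = [line for line in lines if line.startswith("D")]

    problems = []
    if len(headers) != 1:
        problems.append(f"expected 1 header, found {len(headers)}")
    if len(records) != expected_records:
        problems.append(
            f"expected {expected_records} records, found {len(records)}"
        )
    return problems


# First delivery: each record overwrote the previous one.
overwritten = ["H|batch", "D|rec3"]
# Second delivery: header repeated once per record.
repeated_header = ["H|batch", "D|rec1", "H|batch", "D|rec2", "H|batch", "D|rec3"]
# What should have come back.
good = ["H|batch", "D|rec1", "D|rec2", "D|rec3"]

print(sanity_check(overwritten, 3))
print(sanity_check(repeated_header, 3))
print(sanity_check(good, 3))
```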

    Monday, 10:00AM - Processing passes to yet another external IT group. Miraculously, nothing goes wrong (this is unprecedented in my entire employment history - this is NEVER configured right the first time)

    Monday, 11:00AM - Processing returns to my system. Everything runs fine.

    Monday, 11:30AM - Conf call with stage 2 DR facility to confirm their needs. They have some suggestions to tweak the process to help them get their processes expedited. We promise nothing, but immediately hand the change to a developer.

    Monday, 12:00PM - Developer delivers the requested modifications. I stage for emergency rollout (skipping QA because the process is trivial and timeline critical)

    Monday, 12:15PM - I storm into the ops office and hijack yet another innocent op.

    Monday, 12:17PM - Rollout complete, we execute the DR process, 32 hours early... Which goes hilariously haywire.

    Monday, 12:19PM - I reboot an appserver in our cluster to prevent it from lunching the entire shared environment.

    Monday,  12:25PM - I fix the missing WHERE clause at the ops desk and issue another rollout.

    Monday, 12:30PM - The DR package is delivered for production -  31.5 hours ahead of schedule.

    I  dumped my cell on my desk, grabbed a cigar and the admin assistant and went for an hour-long walk in the park.  We discussed our mutual disdain for humanity and contemplated murdering the ducks in the pond. To put them out of their misery, see? That time went under 'teambuilding'.

    Monday, 1:30PM - I resume work on the insane, short-notice, hell-program due the following Monday (for which we had made the ridiculous roadtrip on Friday)

    Monday, 4:00PM -  I am reminded that the customer also made a hash of things, and advise the ops people to shut off the automated processing of inbound data for this program - we're going to want to put eyes on it in the morning. And we're still in stage 2 DR until further notice, pending shipment of technicians and materials from a country not normally known for high-tech exports or even low cost production.

    Monday, 4:30PM - I make the statement in the title.

     

    I wasn't kidding. That was a good one.

     

     



  • Sounds like fun. More front page stories should be written in this style.



  • @Weng said:

    the next program to go live will have to go live in a partial DR failover mode, because the shiny new machine they just bought wasn't working.
    This happens so often you wonder whether anyone's ever thought of making all new machines non-shiny to see if it would make a difference.


  • Garbage Person

     @da Doctah said:

    @Weng said:

    the next program to go live will have to go live in a partial DR failover mode, because the shiny new machine they just bought wasn't working.
    This happens so often you wonder whether anyone's ever thought of making all new machines non-shiny to see if it would make a difference.

    Actually, that was anonymization. The old one actually had some really elegant looking features to it and it was decidedly shiny. The new one is just clad in black and dark gray plastic. Looks like an HP desktop computer writ large (and has the observed reliability of one)

     



  • @Weng said:

    I notice some peduliarities
     

    As do I.


  • ♿ (Parody)

    Reading this was like watching an episode of Burn Notice.

    You know managers...bunch of bitchy little girls.



  • @boomzilla said:

    Reading this was like watching an episode of Burn Notice.

    You know managers...bunch of bitchy little girls.

    My name is Weng. I used to be a manager, until . . . . .

    "We've got a Failover Notice on you"

     

     



  • @Weng said:

    It was pushed back repeatedly due to hardware difficulties in manufacturing (in short, the 20 year old machine they were trying to use was giving up the ghost.

     

    Are you from Germany, or is that saying really usable in english?

     



  • @fire2k said:

    @Weng said:
    It was pushed back repeatedly due to hardware difficulties in manufacturing (in short, the 20 year old machine they were trying to use was giving up the ghost.
     

    Are you from Germany, or is that saying really usable in english?

     

    "Giving up the ghost" is a common saying in English.

    Or maybe you were referring to "hardware difficulties in manufacturing" 

     



  • @Weng said:

    Monday, 7:20AM - I roll in, and blast into the ops office. "You. Rollout. Now. God's personal authority."

    I call bullshit. No developer worth their salt arrives at work before noon.


  • Garbage Person

    @El_Heffe said:

    @boomzilla said:

    Reading this was like watching an episode of Burn Notice.

    You know managers...bunch of bitchy little girls.

    My name is Weng. I used to be a manager, until . . . . .

    "We've got a Failover Notice on you"

    My technical PMs and I actually routinely compare our daily lives to Burn Notice. It's amazing how applicable that show is to... Whatever the hell it is we do all day.

     

    Day 2 - Tuesday.

    8:00AM - I finish putting gas in the conspicuously brand new, hella-shiny, reasonably priced import sports coupe, I nod at the pretty girl who is my on-again/off-again/oh-god-on-again love interest in the passenger seat and, after a closeup of my mashing the keyless start button (fancy!), a second closeup as I notch it into first gear and drive off with a tiny tire chirp in a mildly aggressive manner intended to show off the maneuverability and performance of the car, along with its well-appointed standard features, all the while voiceovering some paff about choosing the right car and evading a pursuer without looking like you're onto them (DAMNED SALESMEN!). The whole thing comes off REALLY product-placey, with at least four gratuitous closeups of the badging on the car.

    8:20AM - Late as hell, I walk in and drop my laptop on the dock, smack the power button, flip on the hyper-exclusive dual $80 21" monitors (Close-up on the nameplate as I push the power buttons), toss my energy drink in the fridge, my breakfast burrito and cell phone on my desk, and go wander into the admin assistant's cube and talk about her adorable kitty. We bond for several minutes over yogurts.

    8:30AM - I wander over to my technical PM's cube and get the morning chaos sitrep. Situation Abnormal: Nothing's Fucking Wrong.

    8:31AM - I wander over to my cube and click the Gmail icon on my desktop. Pause on a closeup of the icon for conspicuously obvious product placement.  I read an email from the project PM. We have 3 sorts of PMs on my team. First is a tech PM attached to the team. He knows what we technically can and can not do and is in charge of our resource allocations and really should be the group manager. Then we have a project PM attached to a project or group of projects - their purpose is to aggregate and act as a gatekeeper preventing the unwashed masses from the business from ensuring we get nothing done. They write the specs in collaboration with the tech PM and business PMs. And then there are a wide array of Business PMs who don't understand how this process works despite having been attached to at least a dozen other very similar projects because they have the attention spans of goldfish. This particular project PM is attached to all this client's projects passing through my team - the cream of the crop and she and I have developed a brilliant synergy. I can say stuff like "That won't work because of... uh, that thing that does the stuff to the thinger" and she correctly knows I'm talking about the module that reformats the contract PDFs for electronic delivery. She says she looked at the day's data feed and that it looks good, and to go ahead and process it.

    8:32AM - I crack open an energy drink. Closeup on the logo.

    8:33AM - I pull up the production control console and click the "Manual process start" buttons in the right sequence.

    8:40AM - Everything seems to have processed fine. I haul my breakfast burrito off to the toaster oven.

    9:00AM - I settle down to my desk and check the job status. I mention to the project PM via IM that a certain processing step has been reached. I pop some quick SQL queries into the production server console (I have read-only access to prod) and retrieve a summary of the reject file we sent to the client. I send that data over to her.

    9:15AM - I get an IM. "The record counts don't match."

    9:45AM - After the panicked call between the project PM, technical PM and me, we have identified the problem - the records that had failed out on the customer's end the previous day had failed on ours today. Because of a subtle issue with the way the client's jobs are scheduled (manifesting itself as some files being timestamped for one day and some other files being timestamped for another, despite having actually generated on the same day, which is inconsistent with how it always worked in production, when timestamps were always day-consistent), the production system was stomping on records depending on data in the mismatched files. A second bug caused those not to be reported back as failed records. We've also developed a plan for remedying the situation without scuttling the time-crunch DR production date. The records that failed MUST get produced and shipped from the DR facility TODAY. The customer will be on-site tomorrow at the primary site and MUST audit these particular records because they depend on the mismatched resource files.

    9:46AM - I assign the bugfixes to a developer and head up to the ops office.

    9:48AM - "SQL Console. Type exactly what I say. DELETE FROM tblFailureLog WHERE........ "

    10:15AM - Revised feed files drop to one of the external teams.

    10:30AM -  I wander back to my desk and grab the source to a bunch of stored procedures and start hacking at the where clauses...

    11:00AM - IM to the op I'd kidnapped for the day: "Here's a script. Untested. Run it in prod."

    11:05AM - New feed files specifically containing the mislaid records drop to another external group.

    12:45PM (15 minutes until the production deadline) - The aforementioned external group returns its data. 

    12:50PM - Processing completes. I get a count of the number of records returned and ask the program PM to verify my count. Long minutes of silence pass.

    12:55PM - First angry emails about missed deadlines fly. My phone starts ringing with numbers corresponding to senior sales management.

    12:56PM - "Verified." I smack the "Build stage2 DR file" job's go-button.  Long seconds pass.

    12:57PM - "On the FTP. Filename is xxxxx.zip"

    12:58PM - The tech PM sends an email to the production coordinator at the DR site with the filename.

    12:59:15PM -  Boss+3's number appears on my caller ID.

    12:59:30PM - I pick up the phone. It's a conference call.  The full name and jingle of the teleconference service, along with their website are clearly enunciated over the headset. Then the call itself comes on. People are shouting.

    12:59:45PM - The GloboRegional UnterVP of Sales starts ranting about how many millions of dollars in future sales the company just lost. It has the word 'thousand' before 'million', identifying the man as an innumerate jerk.

    12:59:59PM - "Oh, hey. There's the file. We'll get that out to production."

    1:00PM - I slam down the receiver, smash "Windows-L" on the keyboard. The Windows logo pops up on my screen. Closeup and hold for product placement.

    1:30PM - The admin assistant and I reach the far point of the path around the lakes at the park and circle back towards the office, yogurts in hand.

    3:30PM - Q2 results call. My team is specifically called out for outstanding, drama-free performance. We're told there will be dividends paid, but no merit raises; revenues and profits are up several hundred percent, and benefits will be slashed but somehow no one on the call would be affected.

    5:00PM - Fiona slaps me for 'hanging out with that whore', and then kisses me passionately after a tense moment of eye contact. We toss our assault rifles into the back of the car, I slam the trunk (conveniently bringing the manufacturer's logo to center screen for a just-slightly-too-long product placement hold) and drive off into the sunset with the mightiest burnout the car can achieve.

     Roll credits on the right third of the screen, promos on the left two-thirds.


  • Garbage Person

    @Soviut said:

    @Weng said:
    Monday, 7:20AM - I roll in, and blast into the ops office. "You. Rollout. Now. God's personal authority."

    I call bullshit. No developer worth their salt arrives at work before noon.

    1) My carpool partner is a contractor and paid by the hour, so being reasonably on time is a plus.
    2) I never said I did any actual WORK before noon (except in times of crisis)
    3) I hesitate to call myself a developer at all anymore. I'm some sort of team lead/project manager/just-plain-manager hybrid in practice. I only write code when I literally can not trust anyone else to do it and get it right and on time.



  • You didn't blow anything up.

    I am disappoint.


  • Garbage Person

     Oops. I forgot.

    1:31PM: A car parked on the adjacent roadway erupts in a completely implausible low velocity gasoline explosion.


  • Discourse touched me in a no-no place

    @fire2k said:

    @Weng said:

    It was pushed back repeatedly due to hardware difficulties in manufacturing (in short, the 20 year old machine they were trying to use was giving up the ghost.

     

    Are you from Germany, or is that saying really usable in english?

     

    It's a perfectly cromulent English sentence.



  • @Weng said:

    We toss our assault rifles into the back of the car,
     

    wait what


  • Garbage Person

    @PJH said:

    @fire2k said:

    @Weng said:

    It was pushed back repeatedly due to hardware difficulties in manufacturing (in short, the 20 year old machine they were trying to use was giving up the ghost.

     

    Are you from Germany, or is that saying really usable in english?

     

    It's a perfectly cromulent English sentence.
    Yeah, how dare you doubt the splendiferousness of my English. I am, after all, a native speaker. A native speaker with an undying love for complex sentence structure and multi-level parentheticals and a massive hate-boner for proofreading.

     


  • Garbage Person

    @dhromed said:

    @Weng said:

    We toss our assault rifles into the back of the car,
     

    wait what


     In retrospect, it's usually only Fiona with a rifle. And it's usually some manner of sharpshooter's rifle rather than an automatic jobber. It's been awhile since I saw any Burn Notice.



  • @Soviut said:

    @Weng said:
    Monday, 7:20AM - I roll in, and blast into the ops office. "You. Rollout. Now. God's personal authority."

    I call bullshit. No developer worth their salt arrives at work before noon.

    7:20 AM isn't before noon.  It's after midnight.

     



  • @Weng said:

    It's been awhile since I saw any Burn Notice.
    If you've seen one episode, you've seen them all.  Seriously.  In the beginning I kept watching because it seemed like the show had good potential, but eventually I gave up because every episode is the same.

