Can't do planned rollout


  • mod

    It should be apparent from my previous Side Bars, but I am currently working on a project we are doing with another company. Unfortunately, the other company is utilizing a development team in India. To help keep everything coordinated, we have teleconferences every Tuesday and Thursday morning. This story begins with our most recent meeting last Thursday.

    The conference call had been going well, no major issues. This made Hanzo uneasy, as his turn to take the floor neared. If no issues were raised soon, luck was bound to not be with him.

    "Hanzo," spoke Paul, Hanzo's boss. "Did you have anything to discuss?"

    "Yes. I pushed some changes to the ResolveTask service in the test environment a few weeks ago. I was wondering if the India team was ready for us to push those changes to the production environment."

    "That shouldn't be a problem," said Shiva. "We'll just need 15 minutes to push the changes on our side."

    "Great," said Paul. "How about we plan for 7:00 am Pacific time tomorrow?"

    "Works for me," said Hanzo, ready to take charge. "Shiva, I'll email you when I'm ready to start on my side. I'll send you another email when I've finished with my changes. After that, just let me know when you've completed the changes on your side."

    "Sounds good," agreed Shiva.

    "Anything else?" queried Miles, Shiva's state-side boss. He was greeted with a chorus of "No".

    As had been standard practice for the past month, Hanzo quickly saved the meeting minutes - after making sure that the plans for tomorrow were crystal clear - and emailed them out to everyone in both companies.


    The next morning, Hanzo fired off his first email at 7:06 am. A few minutes late, but traffic had been a bitch. He immediately got to work on the planned promotion, which took all of 2 minutes, and then sent off the confirmation email.

    Half an hour later, Hanzo started to worry. No word from the team in India. Not a peep. Maybe they ran into a snag? Well, just keep waiting.

    8:00 am, still nothing. Something's wrong. Time to fire off an email.

    **To:** Shiva **CC:** Paul, Miles **Subject:** Promotion

    What's your current status? It's been an hour since we promoted the update on our side, and you guys said you only needed 15 minutes. What's going on?

    Hanzo

    Another half hour went by with no response, so Hanzo did the only reasonable thing. He rolled back the changes.

    **To:** Shiva **CC:** Paul, Miles **Subject:** RE: Promotion

    It's now been 90 minutes since the scheduled promotion and I haven't heard anything from you. I'm rolling the changes back on our side. We'll try again next week.

    Hanzo


    Unfortunately for our brave hero, the story doesn't end there. On Saturday, Hanzo finally got a response:

    **From:** Shiva **To:** Hanzo **CC:** Paul, Miles **Subject:** RE: Promotion

    Sorry. We are ready to do promotion. We will push our changes Monday morning.

    Shiva

    Before Hanzo could respond, another email rolled in:

    **From:** Miles **To:** Hanzo, Shiva **CC:** Paul **Subject:** RE: Promotion

    Shiva,

    Do not plan on being able to apply these changes Monday. Hanzo needs to be able to coordinate with the users to schedule a downtime. Wait for further instructions.

    Miles

    And that was it, until Monday morning.

    When Hanzo got in Monday morning, he found out that Paul had arranged some downtime at 7:00 am Pacific time. He hurried and performed his promotion, and notified the India team. A few minutes later, this email came through:

    **From:** Shiva **To:** Hanzo **CC:** Paul, Miles **Subject:** RE: Promotion **Attached:** ServiceRequest.XML

    We were testing in production and got an error. Please verify.

    Shiva

    "What‽ Testing in production? What the hell were they doing when it was in test for three weeks?" Hanzo was grateful for his private office as he started yelling at no one in particular.

    After he calmed down a little, Hanzo finally gathered the courage to look at the attachment that Shiva had provided. Not knowing what to expect, he was not really surprised to find that they weren't including the new parameter in the request.

    **To:** Shiva **CC:** Paul, Miles **Subject:** RE: Promotion

    Your request doesn't include the new required element that you were informed about when these changes were made to the test environment almost 4 weeks ago. That is why you are getting the error.

    Hanzo

    An hour later, Hanzo was in Paul's office for the IT team's weekly meeting.

    "Hanzo, how are we looking for that promotion?"

    "I don't know. The last I heard from the India team, they had screwed up and not even done their part right. They were testing in production. I told them they didn't account for all the changes, and I haven't heard from them since."

    "Can we roll back?"

    "Sure, but since I don't know what changes the India team did account for, I don't know how functional ResolveTask will be. It would be just as effective to leave things as they are."

    "All right, I'll call Miles after we're done here. This is not acceptable."

    Hanzo sat through the rest of the meeting, wondering where else this adventure would lead …



  • So why were the components [Hanzo's and India's] not put into a SINGLE package which would be deployed to TEST, and then validated. Only after an ATOMIC update in test was successful should the SAME PACKAGE have been deployed to production.....


  • mod

    @TheCPUWizard said:

    So why were the components [Hanzo's and India's] not put into a SINGLE package which would be deployed to TEST, and then validated. Only after an ATOMIC update in test was successful should the SAME PACKAGE have been deployed to production.....

    Because Hanzo's components and India's components run in separate, yet cooperative, systems.



  • That must be another definition of "cooperative" that I was not previously aware of.



  • I see none of the little commentary notes that is normal for a story staring Hanzo.


  • sockdevs

    this should be good.


  • mod

    @locallunatic said:

    I see none of the little commentary notes that is normal for a story staring Hanzo.

    CBA


  • Discourse touched me in a no-no place

    @locallunatic said:

    a story staring Hanzo.

    Sometimes when Hanzo stares into the story long enough, the story stares back.



  • @locallunatic said:

    stalking through the night, references to the book of five rings, etc.

    He also normally works at a university, unless he finally made the jump.



  • @abarker said:

    Because Hanzo's components and India's components run in separate, yet cooperative, systems.

    It seemed a single system ["working on a project..." - note the singular] with different servers within the system being developed by different vendors...if so....

    Then why was the Open/Closed principle not followed so that the API was properly versioned and either party could switch between the versions independently????


  • mod

    @TheCPUWizard said:

    with different servers within the system being developed by different vendors...if so....

    Different servers within the system? Did you miss the part about a joint project with another company? ~~My~~Hanzo's part of the project lives on one company's servers. The Indian team's part of the project lives on their servers. The two shall never be on the same network. They shall only communicate via web services.

    @TheCPUWizard said:

    Then why was the Open/Closed principle not followed so that the API was properly versioned and either party could switch between the versions independently????

    • Only developer
    • This upgrade is one of 4 dozen other tasks which are all URGENT and due NOW
    • The project is "officially" in a UAT phase, not yet released
    • The Indian team wouldn't know the Open/Closed principle if it bit them in the ass

    One or two of these could probably answer your question on their own, but all of them together …



    1. Multi-Site, Multi-Server, Multi-Network et al. updates can be done in an atomic manner. There are distributed tools for exactly this purpose [and it is becoming much more common]
    2. My projects range from "sole developer" to largish teams (100+)...same principles apply to all of them.
    3. "URGENT and due NOW", my professional reputation is more important. Give me the time I need or find someone else to do the job.
    4. Versioning API's only need "Hanzo's Part" and would have been done from day one (well before UAT)

    That being said, I see this type of situation all of the time. A good chunk of my income derives from finding these situations (or they find me) and remediating them. So I do understand the pain, but I prefer to treat the cause rather than the symptom.


  • Discourse touched me in a no-no place

    @TheCPUWizard said:

    Multi-Site, Multi-Server, Multi-Network et al. updates can be done in an atomic manner. There are distributed tools for exactly this purpose

    Multi-company updates are something else. The problem isn't that it's technically hard to get a coordinated update out, it's that it requires two sets of developers and two sets of management (and so on…) to actually agree to really do things together. This basically doesn't fucking happen even when people are in the same country and in the same timezone, let alone half a world away. Too many organisational things to go wrong.

    It's much easier (even if that's a very abused term here) to have one side unilaterally update at some point and then for the other side to catch up. If the people involved are really keen on collaborating, a test server might be put up by one side to allow the other to get ready ahead of the production switch, but there's really no way to be certain that that's going to happen, or that what ends up in production is what you tested against.

    (Yes, I've done collaborative work for a long time. It's a PITA until it actually works, then it's actually a bit awesome.)


  • mod

    @TheCPUWizard said:

    4) Versioning API's only need "Hanzo's Part" and would have been done from day one (well before UAT)

    Unfortunately, API versioning was not a viable option. This was due to a decision made above ~~my~~Hanzo's head.

    The database is managed by Paul. The API change was coordinated with a stored proc change. The change to the stored proc broke the previous version of the API by introducing a new parameter that needed to be provided by the API consumer. There is literally no other way to get the information for the new parameter, and providing default data could have negative consequences. In fact, it is the same parameter that Shiva's team was omitting in their "test".

    I hope that one day Hanzo's team will have a true DBA with whom he can coordinate such niceties as deprecation. Of course, it would also be nice for Hanzo's team to be more than Hanzo and the boss.
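The failure mode described above (a new required parameter with no safe default) is usually easier to diagnose when the service rejects an incomplete request with an explicit message instead of guessing a value. A minimal sketch, assuming the request is XML like the `ServiceRequest.XML` attachment; the element names `TaskId` and `NewParameter` are hypothetical, since the thread never names the real parameter:

```python
import xml.etree.ElementTree as ET

# Hypothetical element names; the real ones are not given in the thread.
REQUIRED_ELEMENTS = ("TaskId", "NewParameter")


def missing_elements(request_xml: str) -> list:
    """Return the required element names that are absent or empty."""
    root = ET.fromstring(request_xml)
    missing = []
    for name in REQUIRED_ELEMENTS:
        node = root.find(name)  # looks at direct children of the root only
        if node is None or not (node.text or "").strip():
            missing.append(name)
    return missing


def validate_request(request_xml: str) -> None:
    """Reject the request outright rather than substituting default data."""
    missing = missing_elements(request_xml)
    if missing:
        raise ValueError(
            "request is missing required element(s): " + ", ".join(missing)
        )
```

Run against a captured request before promotion, a check like this would report the omitted element by name rather than leaving the consumer to interpret a generic error.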



  • I just have to say that it's typical of developers to blame the OP for not following some high-reaching principles which in theory could have prevented the WTF but in practice wouldn't change a thing. Thanks OP for posting this.



  • And then?



  • @abarker said:

    The change to the stored proc broke the previous version of the API by introducing a new parameter that needed to be provided by the API consumer. There is literally no other way to get the information for the new parameter, and providing default data could have negative consequences.

    Wasn't the system in place and functioning before the parameter was introduced? Proper versioning would entail the old version of the call continuing to function the way it always had until that version is retired. Perhaps the retirement will come quickly, but the progression should be:

    1. Version 1 put in production.
    2. Version 2 introduced, Version 1 still functions.
    3. Consumers move to Version 2.
    4. Version 1 is retired.

    This way, there is never a "we both have to do this at the same time" event. Sure, Paul's actions took the possibility of versioning off the table for your team, but that doesn't mean that the fault is entirely on the offshore team. Paul put the company in a situation where you were putting more responsibility on offshore than they can handle. Part of successfully offshoring is figuring out what they can handle and offloading that work to them.
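The four-step progression above can be sketched as a small version registry. This is only an illustration: `VersionedApi`, `resolve_task_v1`, `resolve_task_v2` and `new_param` are made-up names, and it assumes the old call can keep working beside the new one, which the thread says was not possible here because of the stored proc change.

```python
from typing import Callable, Dict


def resolve_task_v1(request: dict) -> dict:
    # Original contract: only task_id is needed.
    return {"version": 1, "task_id": request["task_id"]}


def resolve_task_v2(request: dict) -> dict:
    # New contract: the extra parameter is mandatory and is never defaulted.
    if "new_param" not in request:
        raise ValueError("v2 requires 'new_param'")
    return {
        "version": 2,
        "task_id": request["task_id"],
        "new_param": request["new_param"],
    }


class VersionedApi:
    """Keeps several versions of one call available at the same time."""

    def __init__(self) -> None:
        self._handlers: Dict[int, Callable[[dict], dict]] = {}

    def publish(self, version: int, handler: Callable[[dict], dict]) -> None:
        # Steps 1 and 2: adding a version does not touch existing ones.
        self._handlers[version] = handler

    def retire(self, version: int) -> None:
        # Step 4: done only after consumers have moved (step 3).
        self._handlers.pop(version, None)

    def call(self, version: int, request: dict) -> dict:
        handler = self._handlers.get(version)
        if handler is None:
            raise LookupError(f"version {version} is not available")
        return handler(request)
```

With both versions published, each side can switch on its own schedule, and a consumer that gets v2 wrong can fall back to v1 until it is retired.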


  • mod

    @Jaime said:

    Sure, Paul's actions took the possibility of versioning off the table for your team, but that doesn't mean that the fault is entirely on the offshore team. Paul put the company in a situation where you were putting more responsibility on offshore than they can handle. Part of successfully offshoring is figuring out what they can handle and offloading that work to them.

    If it were up to Paul, the offshoring would have been onshored months ago. However, the offshore team is handling the code for the other company. Let me distill out the basics here:

    1. Two companies involved, with separation of responsibilities.
    2. Product is in final stages of v1 UAT.
    3. Due to issues outside my control, proper API versioning is not an option[1].
    4. Instead of blindly promoting the changes to production, they were pushed to a shared test environment. The changes were enumerated in detail to the India team at that time.
    5. The India team was asked if they were ready to accommodate the API changes. They confirmed they were ready.
    6. Promotion scheduled. India team did not follow through. Promotion rolled back.
    7. Promotion attempt 2 scheduled. India team did testing in production that showed they only accounted for some of the changes.
    8. When the India team reported an error on their side (encountered because they didn't account for all the changes), they clocked out for the day instead of waiting 5 minutes for a response. This left their production systems in a state that would not work with v1 or v2 of the API.

    Would this all have gone more smoothly with proper API versioning? Possibly, except for point 8. Is proper API versioning something I want to implement? Yes. Is it something that I will likely be able to implement any time soon? Not likely.

    Oh, one other point I left out:

    3.5. India team requested an additional input parameter on one of our APIs. Interestingly, this happens to be the input parameter they failed to account for in point 8.


    [1] I'll get there one day. This place has drastically improved in the five years I've been here. Proper versioning is on my To-Do list.
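Point 6 in the list above (promote, wait for the other side, roll back when nothing arrives) can be written down as a promotion window with a fixed number of checks. A rough sketch only: `run_promotion` and its callable parameters are hypothetical stand-ins for whatever the real promotion, confirmation and rollback steps are.

```python
from typing import Callable


def run_promotion(
    promote: Callable[[], None],
    partner_confirmed: Callable[[], bool],
    rollback: Callable[[], None],
    wait: Callable[[], None],
    max_checks: int,
) -> str:
    """Promote, poll for the partner's confirmation, roll back on silence."""
    promote()
    for _ in range(max_checks):
        if partner_confirmed():
            return "promoted"
        wait()  # e.g. sleep for a few minutes between checks
    # No confirmation inside the agreed window: undo our side.
    rollback()
    return "rolled back"
```

Agreeing on the window up front gives both sides the same deadline. It does nothing for point 8, where the other side's production system was left broken with nobody reachable.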



  • Well, with proper API versioning, you wouldn't have deprecated the V1 API until you were certain on your end it was no longer being called...
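One way to be "certain it was no longer being called" is to record the last call seen for each version and retire a version only after it has been quiet for an agreed period. A minimal sketch; `VersionUsage` and the quiet-period rule are assumptions for illustration, not something described in the thread.

```python
from datetime import datetime, timedelta
from typing import Dict


class VersionUsage:
    """Remembers when each API version was last called."""

    def __init__(self) -> None:
        self._last_seen: Dict[int, datetime] = {}

    def record(self, version: int, when: datetime) -> None:
        # Call this from the request handler for every incoming call.
        self._last_seen[version] = when

    def safe_to_retire(
        self, version: int, now: datetime, quiet_period: timedelta
    ) -> bool:
        last = self._last_seen.get(version)
        # A version that was never called counts as quiet.
        return last is None or now - last >= quiet_period
```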


  • mod

    @PleegWat said:

    Well, with proper API versioning, you wouldn't have deprecated the V1 API until you were certain on your end it was no longer being called...

    What's your point? The problem is they were trying to call v2 of the API, but they were neglecting the one change that they requested (there were other changes that they successfully accounted for). They weren't calling v1. In either scenario, v1 was dead.



  • Well, in that situation they would've been able to roll back.

    Granted it seems everyone on your side of the divide is already convinced it's their fault so no big change there...



  • We all know they are morons. Versioning allows you to write software that continues to run properly until the morons get their shit together. It doesn't fix the fact that their new code doesn't work, but it does make it no-one's problem except theirs.

    Versioning is a great solution if you are on the upper half of the competence curve. If you are on the lower half, then your only hope is to try to badger your partner into not screwing up the shared deployment. You seem to be trying to use the latter solution, when the former is much less stressful.


  • mod

    @PleegWat said:

    Well, in that situation they would've been able to roll back.

    Granted it seems everyone on your side of the divide is already convinced it's their fault so no big change there...

    Except they worked themselves into a corner where they couldn't roll back. After promoting:

    1. They tested in production. Why didn't they test in the test environment?
    2. They encountered an error of their own making. They did no checking on their side and assumed the error was not their fault and sent us an email. We responded that the error was in their request and asked if we could do a rollback to give them an opportunity to correct the problem.
    3. After sending the email in #2, they left the office with a broken system in production, with no provision for contacting them.

    All this was mentioned in the OP. Now, to apply this to your post:

    How does the ability to roll back help a team like one that matches the points above? If you are going to test in production, find an error, automatically assume that it isn't your error, and then leave the office, what good does the ability to roll back do you?


    @Jaime said:

    It doesn't fix the fact that their new code doesn't work, but it does make it no-one's problem except theirs.

    Normally I would agree with you. However, in this situation you would be wrong. This project is unusual. Think of it this way: Company A and Company B agree to work together to build a product. Due to areas of specialty, the two companies divide the product into components which each of them will be responsible for. For reasons which are not explained to the developers, each company's portion of the project will be wholly owned by that company. However, each component relies somewhat on the other components to function, so a series of internal APIs are required to allow the entire product to work.

    Now, let's say Company A upgrades an API using proper versioning. Company B now needs to switch to the new version of the API to make sure that the product is taking advantage of the full functionality. Only Company B screws up and consumes some weird hybrid version of the API that doesn't actually work, and they push that change into production. Well, both companies have their names on the product. The product is broken. That makes it everyone's problem.

    @Jaime said:

    Versioning is a great solution if you are on the upper half of the competence curve. If you are on the lower half, then your only hope is to try to badger your partner into not screwing up the shared deployment. You seem to be trying to use the latter solution, when the former is much less stressful.

    Fortunately, I've been able to convince my boss "Paul" that API versioning is worth looking into for our next update. This fiasco provided the leverage I needed.



  • @abarker said:

    This fiasco provided the leverage I needed.

    If you do not leverage the fiasco - the fiasco will leverage you. :hanzo:

