Politics and government subcontracting



  • I was reminiscing with some old friends and something that happened at work 25 years ago came up. I'm not sure if I've ever posted this, but it's worth telling...

    We had this program that took electrical connection requirements (e.g.: 2A @ 10V, etc) and would look up the least electrically expensive path through a relay switch for the power requirements of the connection. To this end, it would loop, and loop, and loop... and it took *forever* to finish. Typically 14 hours for a full run. This meant that there was no way to kick off a second run before the end of the day.
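
    Purely to illustrate the kind of search involved (the post never says how the original worked beyond looping), finding the least-cost path through a switch matrix is the classic shortest-path problem. A minimal sketch in C using Dijkstra's algorithm; the matrix size and costs below are invented:

    ```c
    #include <limits.h>
    #include <stdio.h>

    #define NODES 6          /* hypothetical relay-matrix size */

    /* cost[i][j]: made-up "electrical expense" of the relay path i -> j;
     * 0 means no direct connection */
    static const int cost[NODES][NODES] = {
        {0, 4,  2, 0,  0, 0},
        {4, 0,  1, 5,  0, 0},
        {2, 1,  0, 8, 10, 0},
        {0, 5,  8, 0,  2, 6},
        {0, 0, 10, 2,  0, 3},
        {0, 0,  0, 6,  3, 0},
    };

    /* Dijkstra's algorithm: least total cost from src to dst
     * (returns INT_MAX if dst is unreachable) */
    static int cheapest_path(int src, int dst)
    {
        int dist[NODES], done[NODES] = {0};
        for (int i = 0; i < NODES; i++)
            dist[i] = (i == src) ? 0 : INT_MAX;

        for (int n = 0; n < NODES; n++) {
            int u = -1;   /* unfinished node with smallest tentative cost */
            for (int i = 0; i < NODES; i++)
                if (!done[i] && (u < 0 || dist[i] < dist[u]))
                    u = i;
            if (dist[u] == INT_MAX)
                break;    /* whatever is left is unreachable */
            done[u] = 1;

            for (int v = 0; v < NODES; v++)   /* relax edges out of u */
                if (cost[u][v] && !done[v] && dist[u] + cost[u][v] < dist[v])
                    dist[v] = dist[u] + cost[u][v];
        }
        return dist[dst];
    }

    int main(void)
    {
        printf("cheapest path 0 -> 5: cost %d\n", cheapest_path(0, 5));
        return 0;
    }
    ```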

    We were contracted to supply this software along with a huge electronic testing station to the government. The government then gave this station and software to another contractor who would write programs in ATLAS (with statements like: Apply current at 10 amps and 10 millivolts to pin 12345 with waveform x) to be used to test parts on fighter aircraft. The idea was that some 18-year-old could diagnose what was wrong with the jet and be told: replace part x.

    Unfortunately, the subcontractor underbid their part of the job, and so blamed the government-furnished equipment (i.e.: our stuff) for the delays, to justify the cost overruns.

    My mission was to speed the software up.

    Long story short, most functions in the program were called in proportion to the number of connections. One function, however, which compared strings, was called billions of times and accounted for 50% of the total run time. It turned out that it took a generic 255-byte string argument, and under the hood the assembly would blindly copy all 255 bytes to the stack regardless of what was actually passed. Rounding the size up to an even number of bytes (e.g.: 256) at least let it copy 128 words instead of 255 individual bytes. But the longest parameter ever passed was only 9 bytes, so I changed the size to 16 and got it to run a whole lot faster. 50% faster, to be exact.
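
    The original would have been Fortran-era code and assembly, so this is only a rough C analogue of the copy cost involved (the names and sizes are hypothetical): a compare function taking a 256-byte buffer by value versus one taking 16.

    ```c
    #include <string.h>

    /* Pass-by-value structs: every call copies the entire buffer onto
     * the stack, however short the string inside it actually is - the
     * C analogue of the generic 255-byte string argument in the story. */
    struct str256 { char buf[256]; };
    struct str16  { char buf[16];  };  /* longest real argument: 9 bytes */

    /* ~256 bytes copied to the stack per argument, per call */
    static int compare_slow(struct str256 a, struct str256 b)
    {
        return strcmp(a.buf, b.buf);
    }

    /* ~16 bytes copied per argument: identical result for these inputs,
     * at a small fraction of the per-call copy cost */
    static int compare_fast(struct str16 a, struct str16 b)
    {
        return strcmp(a.buf, b.buf);
    }
    ```

    Multiplied across billions of calls, that per-call copy is the whole story: exactly the kind of cost a profiler finds and a code read-through misses.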

    Joy, right? Wrong!  Our next contract for upgrades stipulated that the software could not run any slower. Period.

    As such, I was ordered to hold back the fix and never tell the other subcontractor.

    One day, I had to go onsite at the other subcontractor to fix an unrelated problem, and had to run this software as part of the setup. Naturally, I had no patience to wait for a 14 hour job to finish, so I put in my patch and got done by 4:30. The other guys just knew I had a fix, but they also knew I wasn't allowed to talk about it.

    And so the political war over cost overruns continued.

    Five years later, I left the company. The fix had never been deployed.




  • @snoofle said:

    Our next contract for upgrades stipulated that the software could not run any slower. Period.

    Wait, what?  So the fix to make it *faster* wasn't done because you couldn't make it *slower*?  What's the problem? You're making it faster, not slower, unless you meant you weren't allowed to make it faster...

    Man, I sure wish the day would go faster...



  • Good ol' ATLAS.  I just shipped an ATLAS diagnostic program a few months back. 

    Speaking of blind buffer copying, we recently dealt with a bug from a subcontractor that caused intermittent crashes of the main program with memory access violations.  The root cause was several dozen calls to strcpy() with a destination buffer several times too small.
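
    A minimal illustration of that failure mode, and the usual bounded fix (the buffer size here is invented):

    ```c
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        const char *src = "a string several times longer than the buffer";
        char dst[8];

        /* The bug: strcpy() writes strlen(src)+1 bytes no matter how big
         * the destination is, trampling whatever lies beyond 'dst'.
         * Sometimes that corrupts something harmless, sometimes it takes
         * down the process - hence intermittent access violations. */
        /* strcpy(dst, src);    <-- undefined behaviour, do not do this */

        /* A bounded fix: snprintf() truncates and always NUL-terminates. */
        snprintf(dst, sizeof dst, "%s", src);
        printf("copied: \"%s\"\n", dst);
        return 0;
    }
    ```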



    Weirdly, I was reading back over a couple of old stories today and that was one of them. From Jun '09 or so...




  • @C-Octothorpe said:

    @snoofle said:

    Our next contract for upgrades stipulated that the software could not run any slower. Period.

    Wait, what?  So the fix to make it *faster* wasn't done because you couldn't make it *slower*?  What's the problem? You're making it faster, not slower, unless you meant you weren't allowed to make it faster...

    Man, I sure wish the day would go faster...

    No, the next contract said they couldn't *ever* make it run slower than before, so he wasn't allowed to make the software run faster *now*, because that would reduce acceptable time for future runs.


  • @C-Octothorpe said:

    @snoofle said:

    Our next contract for upgrades stipulated that the software could not run any slower. Period.

    Wait, what?  So the fix to make it *faster* wasn't done because you couldn't make it *slower*?  What's the problem? You're making it faster, not slower, unless you meant you weren't allowed to make it faster...

    Man, I sure wish the day would go faster...

    I had to make it faster so that it wouldn't get slower with the forthcoming enhancements. The speed-up fix was held back pending the enhancement contract, which never came.

  • ♿ (Parody)

    @snoofle said:

    I had to make it faster so that it wouldn't get slower with the forthcoming enhancements. The speed-up fix was held back pending the enhancement contract, which never came.

    Still, you'd think they'd let you cut it down a little for now: take the speedup from going from 255 to 200, then slowly ratchet it down as needed until you'd milked everything out of that little change. Another triumph of optimizing based on profiling (even if it never actually got deployed).



  • This was previously posted as a feature article....



  • Ok ... SOP. I work for ... a gov't "entity". That's how we do things. Pay yer fucking taxes and shut up. jk


    I want out of here sooo bad.


    Anyone hiring?




  • @snoofle said:

    Long story short, most functions in the program were called proportional to the number of connections. However, one function, to compare strings, was called multiple billions of times, accounting for 50% of the total run time. It turns out that it took a generic 255 byte string argument. Under the hood, the assembly would blindly copy 255 bytes to the stack, regardless of what was actually passed. Changing it to an even number of bytes (e.g.: 256) caused it to copy 128 words. But the longest parameter ever passed was only 9 bytes. I changed it to 16, and got it to run a whole lot faster. 50% faster to be exact.

    I did something similar once.  Back in the mid-90s I was working for a large audio equipment manufacturer which, for reasons of anonymisation and because that's what all the graffiti in the toilets called it, we shall refer to as "Soundcrap", developing a large computer-controlled audio mixer which for reasons of anonymisation we shall refer to as "Widestreet".  It used separate digitally-controlled audio-processing racks and automated flying-fader control surfaces, all linked by a simple custom token-ring network protocol (to enable isochronous bandwidth guarantees) implemented over 10BASE2 Ethernet.

    At power-on, all the audio racks and control surfaces would register all their audio processing elements and user interface controls in a central object registry, and the automation controller would then look them up and hook up the control elements to the processing elements according to a user-defined configuration.  There were typically thousands of these objects in the registry, think (one VCA + one pre/post switch) per crosspoint on a 40x40 bus mixer and you'll get some idea.

    Booting up the system took somewhere between fifteen minutes and half an hour, depending on the number of racks and surfaces involved.

    After a bit of browsing through the code paths involved in this whole process, it all seemed fairly simple: receive a network message containing details of an object to be registered or connected, create a new object in the registry, and copy a handful of parameters into it from the message; generally fairly simple straight-line code, and not obviously much optimisable.  But then I found the object-name lookup routine, and realised it was doing a straight sequential strcmp search through the entire list of registered objects.

    I mentioned it at the team meeting that day, but the rest of the team expressed skepticism that it could make much difference; after all, how slow can a few strcmps be compared to all the hardware-twiddling and network messaging that the rest of the system had to do?  Still, I went ahead and replaced the sequential strcmps with a simple hash table (sketched below).

    My test rig went from booting in around twenty-three minutes to booting in two and a half.

    See, it wasn't just that most of the strcmp calls were unnecessary; it was that there were enough of them that when the controller node received an incoming message and had to do a lookup, processing took just long enough that it couldn't send the reply while it still had the network token, so every single message exchange took one or more extra token-ring cycle periods to complete.  If the whole thing had been running on one node it would still have been slow and inefficient, but not nearly so much; it was the interaction with the network timing that made it awful.
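
    None of the actual code appears in the thread, but the shape of the change is a textbook one. A minimal sketch in C, with all names hypothetical:

    ```c
    #include <string.h>

    struct object {
        char name[32];
        struct object *next;   /* chain within one hash bucket */
        /* ... processing-element / control-surface fields ... */
    };

    #define NBUCKETS 1024
    static struct object *bucket[NBUCKETS];

    /* djb2-style string hash; any reasonable hash does the job */
    static unsigned hash(const char *s)
    {
        unsigned h = 5381;
        while (*s)
            h = h * 33 + (unsigned char)*s++;
        return h % NBUCKETS;
    }

    static void register_object(struct object *o)
    {
        unsigned h = hash(o->name);
        o->next = bucket[h];
        bucket[h] = o;
    }

    /* Before: one strcmp per registered object, thousands per lookup.
     * After: hash once, then strcmp only the handful of names sharing
     * the bucket - fast enough to reply within one token-hold window. */
    static struct object *lookup(const char *name)
    {
        for (struct object *o = bucket[hash(name)]; o; o = o->next)
            if (strcmp(o->name, name) == 0)
                return o;
        return NULL;
    }
    ```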

    My story ends on a happier note than snoofle's, for two reasons: 1) I was indeed able to check the fix in, and it made all the devs' lives a hell of a lot easier and more productive, and 2) it wasn't nearly so long as five years before I was able to escape from the flaming WTF-pit that was that job.  But that's a story for another day.




  • @DaveK said:

    Back in the mid-90s I was working for a large audio equipment manufacturer which for reasons of anonymisation

     I know exactly the company, and the specific console. You see, during the same time frame, I was working for a competitor (who could not call their automated faders "flying faders")...

     It does bring to mind a good story. To provide for "dynamic labeling" of the channels we used gas-plasma displays (bright, fairly good readability). The system was also made up of modules that were bolted together on a rail (making it much easier to install since it was not a big monolithic console).

    During a demo at a NAB conference, one of the resistor packs for the displays shorted, allowing excessive current to the display, which got hot... hot enough to ignite the Lexan that provided the static graphics on the surface...

     The person presenting was not looking at the console as little (about 1") flames were coming from the one module. Another engineer and I jumped up, powered down the one module, ripped it off the rail, grabbed a spare, mounted it, and powered it on.

    During the entire time, the system kept playing a mix (with all the automation running) and did not miss a beat.

     Since "zero downtime" is critical in live situations (such as Audio Broadcasting aka Radio), the viewers were amazed, and a number of sales were made based on this event.

    It was suggested by one "pointy hair" that we stage fires at future events - fortunately, he was tied up and thrown in a closet (wishful thinking, but at least his suggestion was never acted on).


  • Garbage Person

    @TheCPUWizard said:

     It was suggested by one "pointy hair" that we stage fires at future events - fortunately, he was tied-up and thrown in a closet (Wishful thinking, but at least his suggestion was never acted on).
    I wouldn't have dismissed his suggestion entirely - I wouldn't have gone so far as to set equipment on fire, but a hot swap like that is certainly impressive and worth demonstrating, especially if it's a proven selling point.



    Hot swaps were indeed part of the "personal" demonstrations even before this incident, and were added to all public showings.



  • @TheCPUWizard said:

    Hot swaps were indeed part of the "personal" demonstrations even before this incident, and were added to all public showings.

    So what's the problem? They just wanted to make the hot-swap a little bit hotter. :)

