"I swear to you, I did exactly as you told me......"


  • Grade A Premium Asshole

    Last night around 1am my phone rings. My phone is set so that only a select list of numbers will actually make it ring at that hour, so this has to be important. I glance at the Caller ID and it is one of the people that works with me. He had volunteered for some after-hours work, for which I had given him explicit instructions to help avoid disaster, and I had told him that it could be done literally anytime after 10pm Monday-Thursday.

    Shit. This probably isn't good. I answer the phone.

    polygeekery "I am going to assume you are not calling me with good news like you found out a billionaire died and I am the only next of kin?"
    👨 "I swear to you, I did exactly as you told me, but now all of the (important system) devices here at Initrode are down."
    polygeekery "See, that is impossible. If you had done as I told you there is no possible way that all of the (important system) devices could be down. I gave you explicit instructions that included testing along the way and not updating all of the devices at once so if they are all down you did not follow my instructions."
    👨 "Well, they aren't all down. Three of them are working. The three I did the final testing on are working fine, but the rest are completely fucked."
    polygeekery ".......well that's interesting. Fucking hell. Let me grab my stuff and I will be right there. I'm 40 minutes out."
    👨 "Is there anything that I can do in the meantime?"
    polygeekery "Go buy some Red Bull. We're going to be there a while. I will pay you back."

    Okay, now that we have the in medias res intro out of the way let's back up.

    Initrode has devices that they use which are fairly vitally important to their line of work. I don't really want to say what they are as it could possibly doxx me pretty well. Let's say that they are wireless inventory scanners. That seems roughly as important as these are and I think that I can explain all the shit that led to this in a way that would make sense. I recently read something that had the COMDEX trade show as part of it, so we will call them Comdex devices.

    A few weeks ago I received an email from Comdex saying that their system has a CVE in the release we are running. Specifically it is in the administration web interface. This is no major surprise as the admin website of the onsite server literally only works in Internet Explorer. I decided it was not a huge issue, as long ago I assumed it was roughly as secure as a screen door, so we only allow access to it from three IP addresses which correspond to our onsite monitoring and administration machine and the two employees that need to access it. I ask for volunteers to do the upgrade off hours to minimize impact (the afterhours premium on billing was never a thought, perish the thought). It pays well and usually the hassle is minimal for this type of work, so there is never an issue getting someone to put their head on the chopping block, er, do the work.

    So, the Comdex system is a bit of a hacked together piece of shit. Ya know, like most things in software and IT in general. When the Comdex scanners boot up they have a configuration in their system. It tells them how to connect to the wireless, what the IP address of the server is, wireless options, etc. This config is updated from the server: there is a plain text file exposed on the web server, and after a device boots and finds the server it will ingest that file, compare values, and if anything is different update itself to the new config. The devices do this absolutely blindly. No validation on their end. If you put a trailing space in a wireless password the device will absolutely shit itself and not fall back to previous values. You have to factory reset the device and start out fresh. This is an absolute pain in the ass.
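
    Since the devices validate nothing, any checking has to happen before the file is published. The following is a minimal Python sketch of such a pre-publish check, not anything Comdex ships: it assumes the config is simple `key=value` lines (the real format is not stated here), and the function name `validate_config` and the example keys are made up for illustration.

```python
def validate_config(text):
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    seen = set()
    for lineno, raw in enumerate(text.splitlines(), start=1):
        if not raw.strip() or raw.lstrip().startswith("#"):
            continue  # blank lines and '#' comments are skipped (assumed syntax)
        if "=" not in raw:
            problems.append(f"line {lineno}: no '=' separator")
            continue
        key, value = raw.split("=", 1)
        name = key.strip()
        # The failure described above: whitespace the device would take literally.
        if key != name or value != value.strip():
            problems.append(f"line {lineno}: stray whitespace around {name!r}")
        if not value.strip():
            problems.append(f"line {lineno}: empty value for {name!r}")
        if name in seen:
            problems.append(f"line {lineno}: duplicate key {name!r}")
        seen.add(name)
    return problems


if __name__ == "__main__":
    candidate = "ssid=Initrode\nwifi_password=hunter2 \n"  # note the trailing space
    for problem in validate_config(candidate):
        print(problem)
```

    Run against a candidate file before it goes on the web server, a trailing space in the wireless password shows up as one reported problem instead of a pile of factory resets.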

    In order to factory reset the device you have to hold down three of the interface buttons while simultaneously inserting the battery. It is not impossible to do by yourself, but it sure as fuck is not easy to do. You then have to keep those buttons depressed until a specific point in the boot process. When a device is factory reset it loads in the firmware that it came from the factory with out of what I presume to be factory programmed NVRAM. This version may be quite old, depending on when the device was manufactured. At this point it is tabula rasa. No config. No way to connect to the wireless. The device only has a ~2" square display and 6-8 buttons on it. Thankfully we do not have to attempt to configure it via this limited UI. But it does necessitate a separate machine. When they are booting from a tabula rasa state they look for a specific SSID of "COMDEX". They will then assign themselves a random IP of 10.X.X.X with a subnet mask of 255.0.0.0 and look for a machine running their configuration manager application at an IP address of 10.0.0.1. There is literally a laptop onsite for this purpose and this purpose only with an access point and they are only powered on for this purpose when needed. The device grabs its config, reboots, connects to the wifi and looks for its server.

    Okay, so now it is connected to its server. It then compares its firmware version with the one the server expects it to run. If they are equal (in a tabula rasa state they will not ever be) it boots up and starts doing its thing. If they are not (tabula rasa, with firmware that is potentially years old) it then goes through an upgrade sequence. As you can imagine, there is an upgrade path that it has to follow. Depending on the age of the device this can be up to maybe a half dozen reboots.
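
    The version check and upgrade path described above can be sketched roughly as follows. This is an illustration only: the release list is invented, and the idea that every intermediate release must be installed in order, one reboot each, is an assumption about how the path works.

```python
# Hypothetical ordered list of firmware releases; the real version numbers are unknown.
RELEASES = ["1.0", "1.4", "2.0", "2.3", "3.0", "3.1"]


def upgrade_path(current, target, releases=RELEASES):
    """Return the versions a device must install, in order, to get from current to target."""
    start = releases.index(current)
    end = releases.index(target)
    if start > end:
        raise ValueError("downgrades are not modelled here")
    # An empty list means the device already runs what the server expects.
    return releases[start + 1:end + 1]


if __name__ == "__main__":
    steps = upgrade_path("1.0", "3.1")
    print(f"{len(steps)} reboots: {' -> '.join(steps)}")
```

    Under these assumptions a freshly reset device on the oldest release needs five install-and-reboot cycles, and an interruption during any one of them leaves it somewhere in the middle of the path.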

    I told you all of that to tell you this: You DO NOT want to fuck up the config. If you do, you want to do it on a small scale. For any system changes we take the system entirely offline, pull the battery from all of the devices (that we can find, they are frequently misplaced or not turned in), make changes, validate changes (manually, because of course they do not have any way for the server to validate them), revalidate them, maybe validate them again, then and only then will you power up a few devices and triple check that everything is working properly before bringing the rest of them online.

    Many years ago I learned this lesson the hard way when I knocked the system offline for ~8 hours during normal business hours. That was a fun day.

    rolling-eyes.gif

    So I arrive onsite a bit before 2am. As soon as I see my technician:

    👨 "I swear to you, I did exactly as you told me."
    polygeekery "I trust you, but that doesn't matter. It is fucked and now we need to unfuck it."

    Sure enough, there are three Comdex devices booted up and ready to go. I put batteries in a few of the others and they immediately get stuck in a reboot loop. They boot up and some cannot find the server so they reboot and others cannot even connect to wifi and reboot. Some of them are throwing other errors and I assume that their firmware may be fucked.

    polygeekery "Did you change anything in the configuration file?"
    👨 "I never even touched it. There wasn't anything in the upgrade notes about any changes to be made."
    polygeekery "I assume that the three that are working are from your test and the rest aren't?"
    👨 "Yep. I did exactly as you said and tested a few. I then put batteries in the rest and they started upgrading and this was the result."

    Over a hundred devices completely offline and unusable. Fucking hell.

    Worth noting is that the client pays for enhanced support or whatever the hell they call it at Comdex. I give them a call. They tell me to factory reset all of the devices and go through all of the OOBE config procedure and if that doesn't work to call them back. That's real fucking helpful. Thanks.

    So we fire up the config laptop and access point and start resetting them all. One by one. With our fingers contorted like we are making some weird gang sign to do the factory reset process.

    One thing that I forgot earlier is that after it loads the factory firmware you have to pull the battery and do a hard reset. So we are doing this all in batches and this is where I believe that I found where things went sideways. If we worked in small batches and allowed things to finish it all went fine. But on the first round we did a batch of 20 or so and the first couple finished fine but the rest failed and ended up in the same sorts of failed states. From that point on it was batches of 10 and letting them finish and come online before doing the next 10. This took over 5 hours. We finished shortly before the main production workers clocked in this morning. Fun times. I had to explain to the client this morning why this particular line item on the bill is going to be a few multiples of what I had estimated. They were fine with it.
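
    The batch procedure described above can be sketched roughly like this. It models the manual process only: `bring_online` and `is_healthy` are hypothetical stand-ins for "put the battery in" and "wait until the device has finished upgrading and is talking to the server", and nothing here reflects a real Comdex API.

```python
def roll_out_in_batches(devices, batch_size, bring_online, is_healthy):
    """Bring devices online one batch at a time; stop at the first batch with a failure.

    Returns (done, failed): devices confirmed healthy so far, and the failures
    from the batch that stopped the rollout (empty if everything succeeded).
    """
    done = []
    for start in range(0, len(devices), batch_size):
        batch = devices[start:start + batch_size]
        for device in batch:
            bring_online(device)
        # Let the whole batch finish before touching the next one.
        failed = [d for d in batch if not is_healthy(d)]
        if failed:
            return done, failed
        done.extend(batch)
    return done, []


if __name__ == "__main__":
    powered = []
    done, failed = roll_out_in_batches(list(range(25)), 10, powered.append, lambda d: d != 13)
    print(f"healthy: {len(done)}, failed: {failed}, powered on: {len(powered)}")
```

    The point of stopping early is that a bad batch costs at most `batch_size` factory resets instead of the whole fleet.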

    Time to add another line or two to that SOP.



  • @Polygeekery Oh, great, an Arbitrary Batch Size Limit. Those are fun.


  • BINNED

    @Polygeekery said in "I swear to you, I did exactly as you told me......":

    wireless inventory scanners

    Or infrared fire detectors


  • I survived the hour long Uno hand

    @Luhmann said in "I swear to you, I did exactly as you told me......":

    @Polygeekery said in "I swear to you, I did exactly as you told me......":

    wireless inventory scanners

    Or infrared fire detectors

    Wouldn't that be like the mob selling protection against your front counter getting smashed in? :thonking:


  • Grade A Premium Asshole

    @izzion said in "I swear to you, I did exactly as you told me......":

    @Luhmann said in "I swear to you, I did exactly as you told me......":

    @Polygeekery said in "I swear to you, I did exactly as you told me......":

    wireless inventory scanners

    Or infrared fire detectors

    Wouldn't that be like the mob selling protection against your front counter getting smashed in? :thonking:

    We may or may not have been contracted 2-3 years ago to redo the networks in many local metro fire stations. I assume that enough time has now passed that it would not be easy to trace that back to me. So yeah, amusingly we installed and still manage the local networks for a multitude of metropolitan fire stations.

    You have no idea how badly I wanted to tell the entire forums when we even submitted a bid on that job, let alone when we got it.


  • Grade A Premium Asshole

    @Rhywden said in "I swear to you, I did exactly as you told me......":

    @Polygeekery Oh, great, an Arbitrary Batch Size Limit. Those are fun.

    Once we hit 10 and it was working we rolled with it. With the way things were going you could be certain that if we had tried for 12 that 11 would have failed.



  • @Rhywden said in "I swear to you, I did exactly as you told me......":

    @Polygeekery Oh, great, an Arbitrary Batch Size Limit. Those are fun.

    I'm wondering if the server bogged down with too many reset devices trying to interface with it and upgrade, stuff starts timing out, and nothing was designed to notice if that happens; failing halfway through the upgrade path the device continues blithely on and reboots, leaving it borked.


  • I survived the hour long Uno hand

    @Watson said in "I swear to you, I did exactly as you told me......":

    @Rhywden said in "I swear to you, I did exactly as you told me......":

    @Polygeekery Oh, great, an Arbitrary Batch Size Limit. Those are fun.

    I'm wondering if the server bogged down with too many reset devices trying to interface with it and upgrade, stuff starts timing out, and nothing was designed to notice if that happens; failing halfway through the upgrade path the device continues blithely on and reboots, leaving it borked.

    Probably a time of day based session limit configured to comply with some Euroland regulation about working outside of normal hours. If you do the upgrades in the middle of first shift, you can do as many as you want with no issues 🏆



  • @Polygeekery said in "I swear to you, I did exactly as you told me......":

    @Rhywden said in "I swear to you, I did exactly as you told me......":

    @Polygeekery Oh, great, an Arbitrary Batch Size Limit. Those are fun.

    Once we hit 10 and it was working we rolled with it. With the way things were going you could be certain that if we had tried for 12 that 11 would have failed.

    Almost sounds like

    They will then assign themselves a random IP of 10.X.X.X

    isn't very random and maybe those devices ran into IP conflicts.


  • Notification Spam Recipient

    @Watson said in "I swear to you, I did exactly as you told me......":

    trying to interface with it and upgrade

    It really sounds like an arbitrary port-half-open limit set somewhere....



  • I may have another idea why the limit is 10...

    @Polygeekery said in "I swear to you, I did exactly as you told me......":

    So, the Comdex system is a bit of a hacked together piece of shit.

    Microsoft said in the Windows Workstation EULA:

    You may permit a maximum of ten (10) computers or other electronic devices (each a “Device”) to connect to the Workstation Computer to utilize the services of the Product solely for File and Print services, Internet Information Services, and remote access (including connection sharing and telephony services).


  • Grade A Premium Asshole

    @TwelveBaud said in "I swear to you, I did exactly as you told me......":

    I may have another idea why the limit is 10...

    @Polygeekery said in "I swear to you, I did exactly as you told me......":

    So, the Comdex system is a bit of a hacked together piece of shit.

    Microsoft said in the Windows Workstation EULA:

    You may permit a maximum of ten (10) computers or other electronic devices (each a “Device”) to connect to the Workstation Computer to utilize the services of the Product solely for File and Print services, Internet Information Services, and remote access (including connection sharing and telephony services).

    That actually isn't it, although it could be at another location. We are given the software to install on any Windows machine we wish and we installed it on a Server 2016 machine.

    The software in question is built upon Java so you are welcome to look there for reasons this situation got fuckticated.


  • Grade A Premium Asshole

    @dcon said in "I swear to you, I did exactly as you told me......":

    @Polygeekery said in "I swear to you, I did exactly as you told me......":

    @Rhywden said in "I swear to you, I did exactly as you told me......":

    @Polygeekery Oh, great, an Arbitrary Batch Size Limit. Those are fun.

    Once we hit 10 and it was working we rolled with it. With the way things were going you could be certain that if we had tried for 12 that 11 would have failed.

    Almost sounds like

    They will then assign themselves a random IP of 10.X.X.X

    isn't very random and maybe those devices ran into IP conflicts.

    Almost 17M addresses and your suspicion is IP address conflict? I mean, it isn't impossible but I would have to work pretty hard to come up with an algorithm where this would be the contention that caused it.
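
    For a rough sense of scale, the odds of a collision with truly uniform picks can be worked out with the usual birthday-problem product. A small Python sketch, assuming all 2^24 addresses in 10.0.0.0/8 are equally likely (the function name is just for illustration):

```python
def collision_probability(n_devices, n_addresses=2**24):
    """Probability that at least two of n_devices pick the same address uniformly at random."""
    p_unique = 1.0
    for k in range(n_devices):
        # The (k+1)-th device must avoid the k addresses already taken.
        p_unique *= 1 - k / n_addresses
    return 1 - p_unique


if __name__ == "__main__":
    for n in (10, 20, 100):
        print(f"{n:3d} devices: {collision_probability(n):.2e}")
```

    For a batch of 20 this comes out to about 1.1e-5, roughly 1 in 88,000, so a collision in practice would say more about the quality of the randomness than about the size of the address space.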

    That being said, I think that I am pretty shit at being a code monkey and I have to entertain the possibility that I am on the other side of the Dunning-Kruger curve than I think that I am. But I think that I am almost entirely shit and that I depend on the people that I hire to keep me from doing retarded things. So adjust for windage. Flip a coin and then assume that I am too retarded to know what I don't know.


  • Notification Spam Recipient

    @TwelveBaud Oh damn, I forgot about that!

    Because I don't run software intended to be used as a server to other machines on "client mode" systems.

    And if I did, I definitely would not use IDS!


  • Notification Spam Recipient

    @Polygeekery said in "I swear to you, I did exactly as you told me......":

    The software in question is built upon Java so you are welcome to look there for reasons this situation got fuckticated.

    Different fuck, more 'tard, I suppose.


  • Considered Harmful

    @Polygeekery said in "I swear to you, I did exactly as you told me......":

    @dcon said in "I swear to you, I did exactly as you told me......":

    @Polygeekery said in "I swear to you, I did exactly as you told me......":

    @Rhywden said in "I swear to you, I did exactly as you told me......":

    @Polygeekery Oh, great, an Arbitrary Batch Size Limit. Those are fun.

    Once we hit 10 and it was working we rolled with it. With the way things were going you could be certain that if we had tried for 12 that 11 would have failed.

    Almost sounds like

    They will then assign themselves a random IP of 10.X.X.X

    isn't very random and maybe those devices ran into IP conflicts.

    Almost 17M addresses and your suspicion is IP address conflict? I mean, it isn't impossible but I would have to work pretty hard to come up with an algorithm where this would be the contention that caused it.

    The overall sophistication of the thing sounds a lot like they'd initialize their RNG only with the system time. If that was measured in ticks of 50/60 Hz and you'd have a realistic 200 ms of random boot-time variation between devices until the point where it has to choose an IP, you'd end up with about 10 different addresses.
    Next time you try a batch of 20, run tcpdump :tro-pop:
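
    The arithmetic behind that guess is easy to simulate. This is a Python sketch of the hypothesis only, not of the actual firmware: it assumes each device seeds its generator with nothing but a whole-tick count at 50 Hz, and that boot-time jitter is spread uniformly over 200 ms. The helper names are invented.

```python
import random


def pick_ip(seed):
    """Pick a 10.x.x.x address from a generator seeded with the given value."""
    rng = random.Random(seed)
    return "10.{}.{}.{}".format(rng.randrange(256), rng.randrange(256), rng.randrange(1, 255))


def simulate_batch(n_devices, jitter_s=0.2, tick_hz=50, boot_seed=1234):
    """Simulate devices whose only entropy is the tick count when they choose an IP."""
    jitter = random.Random(boot_seed)  # models boot-time variation, not the device RNG
    ips = []
    for _ in range(n_devices):
        elapsed = jitter.uniform(0, jitter_s)
        ticks = int(elapsed * tick_hz)  # whole ticks since power-on
        ips.append(pick_ip(ticks))  # same tick count means same seed, hence same address
    return ips


if __name__ == "__main__":
    ips = simulate_batch(20)
    print(f"{len(ips)} devices, {len(set(ips))} distinct addresses")
```

    With 200 ms of jitter at 50 Hz there are only about 10 possible seeds (about 12 at 60 Hz), so a batch of 20 is guaranteed to contain duplicates under this model, while a batch of 10 at least has a chance of getting away with it.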


  • Grade A Premium Asshole

    @Tsaukpaetra said in "I swear to you, I did exactly as you told me......":

    @Polygeekery said in "I swear to you, I did exactly as you told me......":

    The software in question is built upon Java so you are welcome to look there for reasons this situation got fuckticated.

    Different fuck, more 'tard, I suppose.

    Presumably.


  • Discourse touched me in a no-no place

    @Polygeekery said in "I swear to you, I did exactly as you told me......":

    The software in question is built upon Java so you are welcome to look there for reasons this situation got fuckticated.

    Then I'd guess that the limit is something that could be overridden in a config file, if only you could discover how. Thread pool sizes are usually total guesses.


  • Grade A Premium Asshole

    @LaoC I get what you're saying but chances are near 100% that before

    @LaoC said in "I swear to you, I did exactly as you told me......":

    Next time you try a batch of 20

    I will end up retiring. I am on the wrong side of the "Sherlock Holmes this is (fun to figure out/this pisses me off further)" curve. I am well past the "Let's drink more and extend the curve" point in my life.


  • Grade A Premium Asshole

    @dkf said in "I swear to you, I did exactly as you told me......":

    Thread pool sizes are usually total guesses.

    In this case I imagine it is a guess based upon the capabilities of the random idiot in charge of this point release. SOP got updated to 5 units per iteration. Our ass is covered and we can use client choices to justify this.

    Shit like this is why stop loss provisions in service contracts exist.


  • 🚽 Regular

    @Polygeekery said in "I swear to you, I did exactly as you told me......":

    I would have to work pretty hard to come up with an algorithm where this would be the contention that caused it.

    ipAddress = "10.0.0." + random.GetNextInt(1, 10);
    

    Hire me, Comdex! /s/s/s/s/s/s/s


  • ♿ (Parody)

    @Polygeekery said in "I swear to you, I did exactly as you told me......":

    Almost 17M addresses and your suspicion is IP address conflict? I mean, it isn't impossible but I would have to work pretty hard to come up with an algorithm where this would be the contention that caused it.

    But how hard is Kevin willing to work to accomplish this feat?


  • Considered Harmful

    @Rhywden said in "I swear to you, I did exactly as you told me......":

    @Polygeekery Oh, great, an Arbitrary Batch Size Limit. Those are fun.

    Haha, it's not arbitrary, it's probably 1. Sounds like the kind of concurrency issues you get when a web dev has never heard of concurrency but has heard of globals.


  • Considered Harmful

    @Polygeekery said in "I swear to you, I did exactly as you told me......":

    @TwelveBaud said in "I swear to you, I did exactly as you told me......":

    I may have another idea why the limit is 10...

    @Polygeekery said in "I swear to you, I did exactly as you told me......":

    So, the Comdex system is a bit of a hacked together piece of shit.

    Microsoft said in the Windows Workstation EULA:

    You may permit a maximum of ten (10) computers or other electronic devices (each a “Device”) to connect to the Workstation Computer to utilize the services of the Product solely for File and Print services, Internet Information Services, and remote access (including connection sharing and telephony services).

    That actually isn't it, although it could be at another location. We are given the software to install on any Windows machine we wish and we installed it on a Server 2016 machine.

    The software in question is built upon Java so you are welcome to look there for reasons this situation got fuckticated.

    Servlet Instance Variables, aka SIV.


  • Considered Harmful

    @Polygeekery said in "I swear to you, I did exactly as you told me......":

    I am well past the "Let's drink more and extend the curve" point in my life.

    You're obviously lying, but if you weren't, that would be sad.



  • @Polygeekery said in "I swear to you, I did exactly as you told me......":

    I mean, it isn't impossible but I would have to work pretty hard to come up with an algorithm where this would be the contention that caused it.

    xkcd: int random() { return 4; }
    Which, with this site, means they probably initialize the generator with 0 each time they want a number.

    But reality is probably far worse... (for sanity, do not look behind the curtain!)


  • Grade A Premium Asshole

    New version of the Comdex devices recently dropped. They went from:

    @Polygeekery said in "I swear to you, I did exactly as you told me......":

    The device only has a ~2" square display and 6-8 buttons on it.

    To no display and 4 buttons. Oh, and this:

    @Polygeekery said in "I swear to you, I did exactly as you told me......":

    In order to factory reset the device you have to hold down three of the interface buttons while simultaneously inserting the battery.

    Went to holding down two buttons while inserting the battery. One point that I cannot remember if I mentioned in the last one is that holding down the three buttons and inserting the battery on the older models entered you into a boot options screen. You then had to go through a couple of menus, select "Factory reset", press a button and then the screen would have two options.

    Cancel
    Confirm Factory Reset

    You had to move the cursor to finally execute the factory reset. This is all a very deliberate action. Essentially impossible to accidentally factory reset the device.

    Now that the display is gone and with the new hardware revision you only have to hold down two buttons while it boots up. The device will then speak:

    ⚡ "Maintenance mode. Press (button) to reset device."

    If you then press the button it says, it will then say:

    ⚡ "Press (button) to confirm."

    Press that button again and:

    ⚡ "Resetting device."

    A minute or so later it is booted back up, tabula rasa, no display to say anything, just a flashing light until the config is loaded.

    The two buttons to initiate the reset are basically where a person's hand could be, depending on how they are holding the device when they insert the battery or how they usually carry it.

    So let's say that you're an employee of Initrode. You come in to work, grab your Comdex device and a battery, pop the battery in and then Bob from accounting starts talking to you about last night's sportsball game, or Jane from HR starts talking to you about.......whatever the fuck women talk about. A few minutes pass, you're done chatting and you press the big button in the middle of the device like you do every morning. It says something, but doesn't really sound like it normally does, or maybe you don't even notice because of background noise or someone else starts talking to you. You press the button again. More garbled speech synthesizer stuff that you either ignore or cannot make out. Then you go to use your device and it is just sitting there flashing, not doing anything.

    This now happens pretty regularly. This is now a weekly checklist item for this client, to reconfigure all of the devices that became tabula rasa since we were last there.

    The way of resetting the older devices was overly complicated and dumb, and they did make that easier. Unfortunately they went too far and now users are resetting them by accident.

    Seriously, what the fuck is wrong with a pinhole and a tact switch on the PCB?



  • @Polygeekery said in "I swear to you, I did exactly as you told me......":

    what the fuck is wrong with a pinhole

    It is not a square hole?


  • Discourse touched me in a no-no place



  • @Polygeekery said in "I swear to you, I did exactly as you told me......":

    Seriously, what the fuck is wrong with a pinhole and a tact switch on the PCB?

    That adds an additional 0.3¢ per unit. Developer time for new, exciting reset sequences is (probably) salaried op-ex, so "free". Customer disservice is also similarly "free". Free beats 0.3¢ every day.



  • Also, since a customer stabbed themselves with a paperclip and tried to sue the manufacturer, their lawyers got twitchy.



  • @Zerosquare What? A suicide attempt with poor innocent clippy?


  • Notification Spam Recipient

    @Polygeekery said in "I swear to you, I did exactly as you told me......":

    Press that button again

    You would think they would alter the sequence so "simple" button presses were out of the question to confirm.

    but I suppose any sequence that uses easy-to-push buttons is doomed to break the fool...
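
    One way to make the confirmation harder to hit by accident can be sketched as a small state machine. This is purely a hypothetical design in Python, not how the Comdex firmware works: the class name, the five-second hold, and the ten-second timeout are all made-up values.

```python
class ResetGuard:
    """Confirm a factory reset only on a deliberate long hold, and time out otherwise."""

    def __init__(self, hold_required_s=5.0, timeout_s=10.0):
        self.hold_required_s = hold_required_s
        self.timeout_s = timeout_s
        self.armed_at = None  # time maintenance mode was entered, or None

    def enter_maintenance(self, now):
        """Called when the two-button boot combination is detected."""
        self.armed_at = now

    def button_released(self, now, held_for_s):
        """Return True only if this press should trigger the reset."""
        if self.armed_at is None:
            return False
        if now - self.armed_at > self.timeout_s:
            self.armed_at = None  # prompt expired: fall back to normal operation
            return False
        if held_for_s < self.hold_required_s:
            return False  # an ordinary tap is ignored
        self.armed_at = None
        return True


if __name__ == "__main__":
    guard = ResetGuard()
    guard.enter_maintenance(now=0.0)
    print(guard.button_released(now=1.0, held_for_s=0.2))  # quick tap
    print(guard.button_released(now=8.0, held_for_s=5.5))  # deliberate hold
```

    With a timeout like this, someone who pops in a battery, chats for a few minutes, and then taps the big button would find the device back in normal operation instead of halfway through a reset.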



  • @Polygeekery said in "I swear to you, I did exactly as you told me......":

    Seriously, what the fuck is wrong with a pinhole and a tact switch on the PCB?

    We had that, on a device that's impossible to reprovision from factory without opening it up and connecting via serial debug line.

    We asked a customer to power cycle the bridge (unplug and plug back in). They instead managed to push the unmarked factory reset button. :facepalm:



  • @TwelveBaud said in "I swear to you, I did exactly as you told me......":

    Free beats 0.3¢ every day.

    Further cost savings: measure voltage over the speaker, so it can act as a (terrible) microphone. Remove one of the buttons in favour of the user screaming at the device.


  • Grade A Premium Asshole

    @Tsaukpaetra said in "I swear to you, I did exactly as you told me......":

    You would think they would alter the sequence so "simple" button presses were out of the question to confirm.

    You'd think so, wouldn't you?

    You would be wrong.


  • ♿ (Parody)

    @cvi said in "I swear to you, I did exactly as you told me......":

    @TwelveBaud said in "I swear to you, I did exactly as you told me......":

    Free beats 0.3¢ every day.

    Further cost savings: measure voltage over the speaker, so it can act as a (terrible) microphone. Remove one of the buttons in favour of the user screaming at the device.

    I feel like I would be willing to pay more for this feature.

