Who watches the watchers?



  • I really need to find a new place to work. Or, perhaps, a new line of work.

    We've been having some spectacular database issues lately. This is likely to get ranty.

    (1) Our building lost power; all the UPS and backup generator stuff saved the computer and networking equipment ... but apparently something happened to the power to the HVAC bits. Heat escalated quickly. Luckily, the power outage happened on a weekday (as far as I know, it's a lights-out facility after hours/weekends) and some folks were around for the ops-floor folks to call around to all the server owners (they're kinda like a colo -- they provide network, HVAC, power, etc., for people with needs and means) to let them know what happened so that they could take care of their own stuff. By the time our DBA/Admins arrived on scene to shut down their systems (seeing as how someone or something cut off remote access), it was past 100F in the room.

    Later, when our systems came back up, something was wrong with the DB server. It took hours to get it righted (again, I'm not the DBA and I'm not privy to the sysadmin side of things ... unless they decide to ask me for help.) After this bad day, stuff started working weird.

    (2) We had recently added a deferred constraint to a database table to ensure we didn't insert duplicate data (:sigh:). We have soooooo many duplicate records there that it would take a long damn time to delete all the dupes, so, instead, we have a deferred constraint (no new data can duplicate, but we don't care about the old.) Are we working on removing the extraneous duplicates? No.

    However, after we added this constraint, our data load procedure slowed to a crawl. Like, 1000 loaded records every 10 minutes. We disabled the constraint but the slowness still existed. It wasn't until the constraint was dropped that everything seemed to return to normal.

    (3) We've recently been getting deadlock exceptions from the database in the area of our work queues. These cause some of our processes (which I discussed in another post) to lose their minds and just basically spin and no longer work. I think they're waiting for the database to return, based on profiling the active processes (jvisualvm is a pretty damned nice tool to have.) So, work gets log-jammed for database reasons and the only indication we get is when the queues grow.

    So, the real WTF is that, supposedly as a stop-gap measure, I've been asked to come up with a 'smart-script' which monitors the various queues and, when some threshold is reached, to actively kill the running processes, do some cleanup and restart them.

    I asked if I had to write something which ensures that the smart-script is running properly...
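
    For illustration only: a minimal Python sketch of the decision half of such a script, assuming each queue can report its depth and how long it has been since a worker last pulled an item. QueueSample, is_jammed, watchdog_pass, the threshold values, and the restart callback are all hypothetical names, not anything from the actual system. It only flags a queue when the backlog is deep AND nothing is draining it, so workers that are merely busy don't get killed.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class QueueSample:
    """One observation of a work queue (all fields are hypothetical)."""
    name: str
    depth: int                    # items currently waiting
    seconds_since_dequeue: float  # time since a worker last took an item


def is_jammed(sample: QueueSample,
              max_depth: int = 1000,
              max_idle_seconds: float = 600.0) -> bool:
    """True when the backlog is at/over the threshold and nothing is draining it."""
    return (sample.depth >= max_depth
            and sample.seconds_since_dequeue >= max_idle_seconds)


def watchdog_pass(samples: Iterable[QueueSample],
                  restart: Callable[[str], None],
                  max_depth: int = 1000,
                  max_idle_seconds: float = 600.0) -> List[str]:
    """Run one monitoring pass; call restart(name) for each jammed queue.

    The caller supplies restart, which would do the actual kill, cleanup,
    and relaunch. Returns the names of the queues that were restarted.
    """
    restarted = []
    for sample in samples:
        if is_jammed(sample, max_depth, max_idle_seconds):
            restart(sample.name)
            restarted.append(sample.name)
    return restarted
```

    Keeping the kill/cleanup/restart step behind a callback means the threshold logic can be tested without touching real processes.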

     

     



  • I just commented to a co-worker ... "I guess I finally get to create the 'killer app' I've always wanted to..."



  • Your someone or something that cut off remote access may have been the switch room going down, taking out network connectivity. Does your building happen to have one or more colocation rooms that didn't lose power and HVAC? If so, I think I'm familiar with the outage you referenced. 



  • I hope your DBAs ran DBCC CheckDB.



  • @zelmak said:

    So, the real WTF is that, supposedly as a stop-gap measure, I've been asked to come up with a 'smart-script' which monitors the various queues and, when some threshold is reached, to actively kill the running processes, do some cleanup and restart them.

    .. as opposed to getting a DBA to investigate and fix the underlying cause...?

    @zelmak said:

    I asked if I had to write something which ensures that the smart-script is running properly...

    "no, we've got a brillant coder who's done that already.. but we don't fully trust the quality of her code. Could you write some monitoring utility that..."



  • @RichP said:

    Your someone or something that cut off remote access may have been the switch room going down, taking out network connectivity. Does your building happen to have one or more colocation rooms that didn't lose power and HVAC? If so, I think I'm familiar with the outage you referenced. 

    I don't know. Again, I'm not privy to what happened (nor, as a lowly (hah!) programmer, do I really care in my current position), but my sysadmin background leads me to guess that yes, that's the case.

    And I'm not trying to 'diss' the sysadmins ... this was just a cascade of errors/faults which happened all at the same time creating the 'perfect storm.'

    I've been there, done that -- in fact, I foresaw one, but my boss wouldn't spend the money to have someone come out and check the 'ops-floor' UPS. Until the following week, after its complete failure. The batteries ended up needing to be replaced. According to the built-in log, it maintained power for 10 seconds before it failed.



  • Update:

    Not only does it have to detect broken processes, it also has to detect when additional processes would help alleviate the backlog, then detect when those processes are no longer needed and kill them.
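
    A rough Python sketch of how the scale-up/scale-down part could be decided, assuming queue depth is the only signal. The function name desired_workers and every threshold below are hypothetical. The gap between the two depth thresholds is there so the script doesn't start a worker on one pass and kill it on the next.

```python
def desired_workers(current: int,
                    depth: int,
                    scale_up_depth: int = 500,
                    scale_down_depth: int = 50,
                    min_workers: int = 1,
                    max_workers: int = 8) -> int:
    """Return how many workers should be running after this pass.

    Adds one worker when the backlog is at or above scale_up_depth,
    removes one when it is at or below scale_down_depth, and otherwise
    leaves the count alone. The result always stays within
    [min_workers, max_workers].
    """
    if depth >= scale_up_depth:
        return min(current + 1, max_workers)
    if depth <= scale_down_depth:
        return max(current - 1, min_workers)
    return current
```

    Changing the count by one worker per pass is a deliberately cautious choice; a real script might step faster when the backlog is far over the threshold.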

     



  • @zelmak said:

    Not only does it have to detect broken processes, it also has to detect when additional processes would help alleviate the backlog, then detect when those processes are no longer needed and kill them.
     

    Sounds like you're rewriting Oracle's SMON and PMON.

    This *nix or Win platform? I'm guessing from your sysadmin background you know of monitoring tools that can watch for such events (but possibly not respond to them, other than just alert).



  • @Cassidy said:

    @zelmak said:

    Not only does it have to detect broken processes, it also has to detect when additional processes would help alleviate the backlog, then detect when those processes are no longer needed and kill them.
     

    Sounds like you're rewriting Oracle's SMON and PMON.

    This *nix or Win platform? I'm guessing from your sysadmin background you know of monitoring tools that can watch for such events (but possibly not respond to them, other than just alert).

    *nix for the time being ... I can't imagine how much fun it will be if/when we move to Windows ... or implement distributed processing the way it was originally intended.

    Knowing of the existence of possible tools? Yep. Getting them approved for use? G'luck, mate.



  • @Cassidy said:

    @zelmak said:
    I asked if I had to write something which ensures that the smart-script is running properly...

    "no, we've got a brillant coder who's done that already.. but we don't fully trust the quality of her code. Could you write some monitoring utility that..."

    Just integrate the monitoring application monitoring the monitoring application into the monitored application. (No, not into the monitoring application. That would be silly.) It can be shown by induction that now all applications are monitored. The proof is left as an exercise for the reader.



  • @zelmak said:

    Knowing of the existence of possible tools? Yep. Being asked to write one yourself instead of taking COTS that fulfils the task? Fuck cheapskate management decisions.

    FTFY. I don't envy your position; I've been in a similar one way back when.

