Another Sunday at work: Let's see who screwed up this time!


  • Garbage Person

    Stopped in the office on the way to a social engagement to do a component deployment (in other words "I had to shit really badly in the way that's rude to do to other people's toilets")

    While I was busy reading these forums on my phone and taking care of business, I got a text message from our on-call. "We have a problem. I can't get to the prod fileserver, and neither can the apps."

    So much for social engagements.

    So I cleaned up and walked down to my office. After an interminable period waiting for the lovechild of Symantec Endpoint Encryption and The World's Slowest 4200RPM Hard Disk to let me into my goddamn OS, I verified that, in fact, I could get to the prod fileserver just fine.

    Let's drill into this a bit deeper:

    • Our "prod fileserver" is a Windows box with a share on it. Yay!
    • I can remote to that box just fine, and go directly to its share via UNC path.
    • That share is accessed by the applications (and all non-me users) through a DFS mountpoint (which is TRWTF, because it doesn't have any replicas, and if it did all hell would break loose because replication isn't fast enough for the speed at which we hand off file handles across the network anyway)
    • I can get to the DFS mountpoint.
    • If I remote in to a server in the datacenter, I CANNOT get to the DFS mountpoint (see the quick check sketched after this list).
    • DFS is managed by Domain Controllers, and subject to replication latency.
    • Our on-call is at home, remoted in through VPN. This would put her "local" domain controller as the one in the datacenter, rather than the one in my office a thousand miles away.
    • The Windows patch window was last night.
    • This seems to have started shortly after the closure of the patch window.
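
    (For the record, the by-hand check above boils down to something like this rough Python sketch. The server and namespace names are made up; you'd run it once from an office box and once from a datacenter box. If the raw UNC share answers but the DFS path doesn't, the fileserver itself is fine and it's the DFS referral that's broken.)

        # Probe the raw share and the DFS namespace path separately, so
        # "the fileserver is down" and "DFS referrals are broken" don't get
        # conflated. Names below are hypothetical - substitute your own.
        import os

        PATHS = {
            "direct UNC share": r"\\PRODFILE01\appshare",        # hypothetical fileserver + share
            "DFS namespace":    r"\\corp.example.com\dfs\prod",  # hypothetical DFS mountpoint
        }

        for label, path in PATHS.items():
            reachable = os.path.isdir(path)  # False (eventually) if the path can't be resolved
            print(f"{label:18} {path:40} {'OK' if reachable else 'UNREACHABLE'}")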

    That stinks like "Something or other to do with those ratbastard sysadmins", so I opened a ticket with the Wintel admin team.

    The problem has since been resolved, and nobody seems to want to talk to me about what the problem actually was. Seriously. I didn't even get a call to my desk phone telling me they were looking into it. Just the automated "Someone is looking at it" email, and then "All done! Please verify!"

    I have 8 hours of downtime to answer for here.... Do I really have to reach for the Root Cause Analysis paperwork?



  • Go Hanzo on their ass.



  • Sic the Book of Five Rings on their ass.


  • Garbage Person

    Root Cause Analysis is a fate worse than death. It's basically a WMD, as it ensures everyone in your chain of command from the bottom all the way up to the CIO gets told "Something broke and these people touched it! Here are their meager excuses!"

    Anyway, I got a response. It's very wibbly-wobbly, basically "We patched it last night. It rebooted, and some driver didn't load. Rebooted it again and everything was fine." except they used twice as many words to convey equally little content.

    No mention of what "It" is. Or what driver didn't load. The implication is that it was my server, but that doesn't fit with the "It works from my office" symptom. I suspect the answers are "The primary domain controller" and "DFS".

    They did reboot my server (I got the monitoring ticket screaming "OH SHIT YOUR SERVER REBOOTED OFF SCHEDULE!", because we all know application teams are always the ones responsible for unscheduled reboots), but it was back up half an hour before things actually started working. And the eventlog is empty.



  • Ugh.



  • @Weng said:

    Symantec Endpoint Encryption

    Hooray for Symantec's vice-like grip on corporate IT. Get into work, start computer, log in, lock computer, go make a coffee and a sandwich, drink coffee, eat sandwich, talk to boss for 20 minutes about bullshit, go back to desk, try to start VS, go get another coffee, talk to boss some more, go back to desk, work (maybe).

    @Weng said:

    No mention of what "It" is. Or what driver didn't load. The implication is that it was my server, but that doesn't fit with the "It works from my office" symptom. I suspect the answers are "The primary domain controller" and "DFS".

    Sounds like your system is a house of cards waiting for a breeze.


  • Garbage Person

    The older iterations of it are, yes.

    In fact, I've been screaming my head off at the boss about the fact that Windows 2003 support is ending and that CorpIT isn't going to let us use 2008 32bit - which means that all those AnyCPU compiled apps we have with 32bit dependencies are going to die a horrible death if we put them anywhere near a 64bit OS. And recompiling them ain't gonna help and/or be possible, because the idiot cowboys who built this thing originally weren't particularly conscientious about putting things in Visual Source Safe. We need to migrate clients off to a newer version NOW. Not later. NOW. Nobody's particularly interested in giving me the ungodly amounts of resources necessary to do that, though.

    (Yeah, I can edit corflags on the binaries.)
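
    (Roughly what that looks like, since someone will ask: corflags just flips the 32BIT header flag so an AnyCPU assembly loads as a 32-bit process on a 64-bit OS, which keeps the 32-bit native dependencies happy. A sketch, assuming corflags.exe from the Windows SDK is on PATH and using a made-up deployment folder:)

        # Force legacy AnyCPU binaries to run 32-bit by setting the 32BIT corflag.
        # Paths are hypothetical; corflags.exe ships with the Windows SDK.
        import subprocess
        from pathlib import Path

        APP_DIR = Path(r"D:\apps\legacy")  # hypothetical deployment folder

        for exe in APP_DIR.glob("*.exe"):
            # "corflags <assembly> /32BIT+" marks the managed binary as 32-bit-required
            subprocess.run(["corflags", str(exe), "/32BIT+"], check=True)
            print(f"flagged {exe.name} as 32-bit")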

    The second-newest iteration is almost OK. We at least reasonably think we have source for that, and my team has made progress towards making it less house-of-cardsy. It's enterprise-scalable (add more disk and more nodes and it'll go basically forever), but not enterprise-grade (this is the one that fell over today because some fuckwit screwed up an unrelated Windows patch on an unrelated server).

    The newest version, which we rebuilt mostly from scratch, keeping only the good parts, is probably* enterprise-grade. It's got a lot more designed-in resiliency and way fewer external dependencies (it should stay working as long as its component servers are up and DNS works).

    *As in "we designed it to be, but haven't really tested it at scale because nobody wants to pay for that"



  • @Weng said:

    (Yeah, I can edit corflags on the binaries.)

    Damn it, you're not supposed to reveal secrets like that! Now anyone who comes across this topic can suggest that as a way to dodge upgrading!


  • Garbage Person

    Alright! It went back down last night! So they rebooted it again. And then it went down five minutes later.

    They are now blaming the antivirus (SEP).


  • Discourse touched me in a no-no place

    @trithne said:

    Get into work, Start computer, unlock Endpoint Encraption, wait 5 mins for laptop to start, log in, wait 5 more mins for laptop to login, lock computer, go make a coffee

    FTFY.

    One of the many reasons I don't use my company issue laptop.


  • Grade A Premium Asshole

    Symantec is utter crap. We migrate our clients off of it every time it comes up for upgrade. There is not a single thing they do that someone else does not do better, and usually cheaper.


  • Discourse touched me in a no-no place

    We use McAfee Endpoint Encryption, but it's still annoying.


  • Grade A Premium Asshole

    Blah. That is just as bad.


  • Garbage Person

    Is endpoint protection even the correct product for a server install?


  • Discourse touched me in a no-no place

    On a server?!
    I'm thinking of the wrong product, then; I was thinking of endpoint encryption, which encrypts the hard drive.


  • Garbage Person

    I was griping about encryption in the OP. Administrators are now blaming SEP.


  • Grade A Premium Asshole

    @Weng said:

    I was griping about encryption in the OP. Administrators are now blaming SEP.

    That's the way we roll around here.


  • Java Dev

    How WTF are those? I believe we are currently transitioning from McAfee to Symantec encryption, but wouldn't know because Linux laptops are exempt.

