Warning: run-on sentences and nested parentheses ahead. tl;dr version: placebo them.
@pbean said:
He was so used to the rebooting that he completely freaked out
Ah, that brings back painful memories.
Back in the mid-90s, I had the "pleasure" of maintaining a few SCO "UNIX" servers. At that time, OpenSewer (sorry - it's been so long, I forget what the correct term was) was available, and had been for long enough that there had been several updates to *that*. The upgrade to OpenSewer would have been cheaper than our maintenance support for the ancient version we were running, but some PHB decided that we couldn't upgrade those boxes except by transitioning the application to Solaris. We couldn't transition the application to Solaris, because the guy who wrote it didn't keep (or, more likely, never made) a requirements document, and we couldn't get the users to assist us in making one.
Anyway, one of the quirks of the system was that, intermittently, logging out of a terminal wouldn't release the TTY. Not a big deal, as security was handled by the camera on the pole behind the user, combined with two badge-reader turnstile doors just to get *to* the terminals, and a small-town mentality that meant every new face was greeted by, on average, about six people who wanted to know who you were and why you were there before you could get that far into the facility. As such, people usually didn't log out of the terminal - sessions would last for days. Unfortunately, the problem seemed to strike more often when the session was really old, and some people insisted on logging out at the end of their shift (despite the fact that the login and password were the same, and printed on a label stuck to each machine). It still wouldn't have been a big deal, except that SCO sold their products under N-terminal licenses. The company had found a sweet spot with 16 terminal licenses, which let them run four terminals on one box for about a month before rebooting.
Naturally, that meant two nights after I assumed responsibility for the system, I got a page at around 2AM (shift change) because one of the systems needed to be rebooted. The computer room people could do it - they just needed the system maintainer to approve it. I hadn't had time to go over the docs for the system, but the techs seemed confident it would work, so I approved it, waited for them to verify it was up and the line operators were 'happy' again, and went back to bed, with a mental note to check on it in the morning. Two or three nights later, the same thing happened, on a different machine.
Now, I'd talked with the guy who 'maintained' those boxes before he left, and he'd never mentioned anything about that - but I also recalled he had a very pessimistic attitude towards spending time on code reliability. He hadn't written the software for that system either - he didn't even know the language it was written in (C) - so I hadn't taken that as an indication I had to go over the code immediately on assuming responsibility. (I had with the app he did write - and that was a completely different and much worse can of worms, but at least I'd started diving into that code *before* assuming responsibility, so I had a half-dozen patches ready to go when problems arose. I'd tried to get approval to just deploy them, but management wouldn't let me do any updates, except for break fixes, until I "had time to get familiar with the code" - so I had most of that crap fixed in a couple of weeks.) I was really disappointed to learn it wasn't a problem with the app, but rather the OS.
About a week later, I managed to find a command-line option to a standard utility that promised to release stuck TTYs. I set up a cron job on one of the boxes that hadn't rebooted in a few weeks to run the command once a week, and set an uptime monitor to compare that box's uptime to the others. It stayed up for 45 days, then 15 days, and then went back to being rebooted every 30 days. Before giving approval for the fourth reboot after my adjustment, I actually connected in remotely to see what was up. I hadn't done this earlier, having been told downtime on those machines counted at around $90,000 per hour. I connected remotely without issue. The box was idle, with no remote terminal connections. However, just the fact that I could ssh *in* meant the box wasn't completely hosed yet (though if it only allowed three terminals, that was still $22,500 per hour, assuming all four machines were usable). It also meant I'd installed unapproved software for that OS, but I didn't let that bother me. ssh was standard on all of the newer systems, and was only missing from the SCO boxes because it "couldn't" be installed there. Bah. A quick check on available terminals indicated 12 were open. I used 'wall' and 'kill' to simulate a system reboot, and had the computer room guy tell the operators the system was rebooted. I then set something up so that whenever the system had been up for at least a week and all four terminals were disconnected at the same time (that is, no production work running on the box), it would automatically simulate a reboot, picking off any sessions that weren't using ssh to connect, plus the terminal connections for the four machines. When the box reached an uptime of 60 days, I implemented the TTY clear and the simulated auto-reboot on all of the machines.
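The session-picking half of that simulated reboot can be sketched along these lines - a hedged reconstruction, since the original script is long gone; the `who` field layout and the 'list_serial_ttys' helper name are my assumptions, not the author's code:

```shell
#!/bin/sh
# Hypothetical sketch of the "simulated reboot" target selection.
# Serial-terminal logins show up in `who` with ttyNN names, while ssh
# sessions show up as pts/NN - so non-ssh sessions are easy to pick out.

list_serial_ttys() {
    # Read `who` output on stdin; print the tty column for serial lines.
    awk '$2 ~ /^tty/ { print $2 }'
}

# Demo on canned `who` output instead of the live system:
who_output="op1  tty1a  Jan 10 06:02
op2  tty1b  Jan 10 06:03
me   pts/0  Jan 10 02:11"

printf '%s\n' "$who_output" | list_serial_ttys
```

On the real box the script would presumably first broadcast a shutdown banner with 'wall', pause a few seconds the way a real shutdown does, then signal every process attached to each tty the helper lists.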
That worked like gangbusters until about a year later, when the terminal connection problem resurfaced - and this time it hit the computer room team, so they couldn't remotely reboot the machine. Apparently, the operators at three of the four machines were running product and refused to log out (especially since the box had just "rebooted"), and nobody could log in from the computer room without one of them freeing up a TTY. I logged in remotely (thanks to 'screen' - I had a connection open from my development box; I just needed to remote in to there to use it), checked, and found that there were stale processes attached to 12 of the TTYs. So the command wasn't complete magic - but it didn't take much to free up the system again, and not much more to figure out a way to identify the stale processes and auto-kill them before running the TTY clear.
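Identifying the stale processes boils down to comparing the processes that hold a TTY against the sessions `who` still knows about. A sketch under modern assumptions - 'find_stale' is a name I made up, and the real job on SCO would have used whatever `ps` flags that box offered:

```shell
#!/bin/sh
# Hedged sketch: a TTY is "stale" when a process still holds it but
# `who` no longer shows a login session on it.

# find_stale: reads "pid tty" pairs on stdin (as from `ps -eo pid=,tty=`)
# and takes a space-separated list of live ttys (from `who`) as $1;
# prints the pids whose tty has no live session behind it.
find_stale() {
    live="$1"
    while read -r pid tty; do
        case " $live " in
            *" $tty "*) ;;        # session still logged in - leave it
            *) echo "$pid" ;;     # no session behind it - stale
        esac
    done
}

# Demo on canned data: three processes holding ttys, one live session.
procs="101 tty1a
102 tty1b
103 tty2a"
printf '%s\n' "$procs" | find_stale "tty1a"
```

Here pids 102 and 103 come out as stale; the cleanup job would kill those pids and then run the TTY-clearing command.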
The final point to this story came years later, after I'd moved on to another job, when I ran into one of the computer room people, who desperately wanted to know how I'd managed to get those systems to reboot in ten seconds, compared to the normal five minutes. He admitted they'd decommissioned all of those boxes; he just felt the trick might be usable on other servers as well.
Note: I can't actually remember if SCO OpenSewer had this same issue or not. It had enough problems that it didn't really matter whether that particular bug was fixed. It probably wasn't; my real point about the OpenSewer thing was that the OS on those boxes was ancient. I don't know if it was XENIX ancient, but I'm certain it wasn't MS XENIX ancient. I think it was SCO OpenSomethingOtherThanSewer. Thinking about it longer, I think the bug was still there, but the pricing had changed such that 64 terminal licenses were affordable.
Note 2: While I did (and do) know C, I wasn't able to port the application to Solaris any more than anyone else, because the program was written as a single routine named 'main', whose last statement was basically 'execve(argv[0])'.
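That restart-via-exec structure can be mirrored in a few lines of shell - a toy illustration of the pattern only, with a demo-only pass counter bolted on so it actually terminates (the original, of course, had no such guard):

```shell
#!/bin/sh
# Toy sketch of the restart-by-exec pattern: the program has no loop;
# its last statement replaces the running process with a fresh copy of
# itself, so the "main loop" is just the process starting over.

tmp=$(mktemp)
cat > "$tmp" <<'EOF'
#!/bin/sh
PASS=${PASS:-0}
echo "pass $PASS of the application logic"
[ "$PASS" -ge 2 ] && exit 0     # demo-only guard so the demo ends
export PASS=$((PASS + 1))
exec sh "$0"                    # the equivalent of execve(argv[0])
EOF

# Run it with PASS cleared so the counter starts at zero.
output=$(PASS= sh "$tmp")
printf '%s\n' "$output"
rm -f "$tmp"
```

The real thing was this pattern in C - execve on argv[0] as the final statement of one enormous main() - which is why porting it to Solaris meant rewriting it from scratch.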