Reboot mandatory, not optional

pbean

The server application we develop is kind of hungry for memory and seems to have some problems with its caches. As a result it has to be rebooted on average every day, but sometimes it's able to run for 2 days without having to be rebooted. The guys at the server department monitor the memory usage and exceptions in the logs, and when they reach certain numbers they'll restart the servers.
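A minimal sketch of that kind of restart rule, for illustration only. The post says only that the servers are restarted when memory usage and logged exceptions "reach certain numbers", so the threshold values, the names, and the choice to restart when either limit is hit are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class NodeStats:
    memory_used_fraction: float  # 0.0 to 1.0, share of available memory in use
    exceptions_logged: int       # exceptions seen in the logs since the last restart


# Hypothetical thresholds: the post does not give the real numbers.
MEMORY_LIMIT = 0.90
EXCEPTION_LIMIT = 500


def needs_restart(stats: NodeStats,
                  memory_limit: float = MEMORY_LIMIT,
                  exception_limit: int = EXCEPTION_LIMIT) -> bool:
    """Return True when either monitored number has reached its limit (assumed rule)."""
    return (stats.memory_used_fraction >= memory_limit
            or stats.exceptions_logged >= exception_limit)
```

Under a rule like this, a node that stays below both limits is simply never restarted, which is what the week-long uptime in the story amounts to.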

So the past few weeks I've been kind of monitoring the rate of restarts of the servers and it was going steady with a reboot about every 1-2 days. Then suddenly, after we deployed some refactorings, I noticed one of the server nodes wasn't restarted after 3 days. And after 5 days, it was still running. So today, after some more days, I went to the tech lead and told him the server hadn't been restarted for a full week. I wanted to congratulate him on the accomplishment.

However, he panicked, wondered why the procedures weren't followed, and ran off to the server guys to inquire what was wrong. He was stupefied to hear that the memory numbers and such hadn't reached critical levels yet. So the only thing he could imagine was that the application was somehow malfunctioning. After a batch of tests he confirmed it was actually functioning correctly.

He was so used to the rebooting that he completely freaked out and couldn't believe it even after seeing it with his own eyes. He hasn't been quite the same since then. His eyes are dreamy, almost as if filled with hope.

Renan

You guys could get a dummy server online for him to reboot every other day. Would keep his rebooting cravings at bay.

Cassidy

@pbean said:

The server application we develop is kind of hungry for memory and seems to have some problems with its caches. As a workaround, we reboot servers on average every day, but sometimes it's able to run for 2 days without having to be rebooted.

@pbean said:

He was so used to the rebooting that he completely freaked out and couldn't believe it even after seeing it with his own eyes. He hasn't been quite the same since then. His eyes are dreamy, almost as if filled with hope.

Only call the men in white coats when he shows signs of curling up into a ball in the corner, rocking gently with "works... it works..."

I knew of one firm that bounced their AIX server every Friday afternoon as a way of freeing hogged memory from a leaky application (that a vendor had convinced them to buy, then convinced them that weekly reboots were acceptable for "optimum performance"). But it was a version of AIX with a self-tuning kernel that optimised its settings after a week of use, so it had to relearn performance demands at the start of the next week... until it was rebooted... etc...

DOA

@pbean said:

His eyes are dreamy, almost as if filled with hope.

You gave him hope, but you know soon enough there will be some new kind of fail and his hopes will be squashed. And with them another little part of his soul will shrivel up and die.

It is a terrible thing you have done. A terrible, terrible thing...

MarkJ

@Renan said:

You guys could get a dummy server online for him to reboot every other day. Would keep his rebooting cravings at bay.

Do they have a 12-step program for that?

RickWeston

If yes then please post it here?

erikal

@Cassidy said:

<pedantic mode="dickweed">FTFY</pedantic>

After having seen the term pedantic dickweed being mentioned several times now I really start to wonder. Who coined the term "dickweed" and under which circumstances was it done? Whatever I imagine, it ain't purty.

Cassidy

AIUI... I believe it was our beloved Blakeyrat, in circumstances when people focussed upon picking apart minutiae in his details and overlooking the bigger picture (ie: the point of his post).

The pedant bit is obvious; the dickweedery bit implies that nothing of real value nor consequence is actually being added to the discussion.

Blakey (or another long-term poster) should be along later to correct me if I'm wrong (which, as recent posts have shown, I am wont to be), and someone may be able to identify the original post which started the whole WTFmeme.

I use it to identify circumstances where I'm rephrasing someone's post, substituting colloquialisms, jargon, opinion or feeling for factual accuracy, lest someone get the wrong idea. It's usually akin to an annoying turd interrupting group discussions with their schoolteacher-like corrections. I often think of the "Reg wants to be a woman" script in Life Of Bryan as a good example.

PJH

@erikal said:

After having seen the term pedantic dickweed being mentioned several times now I really start to wonder. Who coined the term "dickweed" and under which circumstances was it done? Whatever I imagine, it ain't purty.

The last time this question was asked was here.

erikal

Nice! People here ask the questions I care about before I do.

Cad_Delworth

@Cassidy said:

It's usually akin to an annoying turd interrupting group discussions with their schoolteacher-like corrections. I often think of the "Reg wants to be a woman" scene in The Life Of Brian as a good example.

You mean, like this? ^

frits

@erikal said:

Who coined the term "dickweed" and under which circumstances was it done?

This link may prove useful. It looks like it first came into usage in 1984 and was made popular by Master's Bill and Ted during some sort of adventure.

Cassidy

@Cad Delworth said:

You mean, like this? ^

There! That's the fellah! Sorted![1]

[1] ascending and only on an indexed column, naturally, for those database devs.

Zylon

@frits said:

@erikal said:
Who coined the term "dickweed" and under which circumstances was it done?
This link may prove useful. It looks like it first came into usage in 1984 and was made popular by Master's Bill and Ted during some sort of adventure.

Master's WHAT?

Mason_Wheeler

@pbean said:

So the past few weeks I've been kind of monitoring the rate of restarts of the servers and it was going steady with a reboot about every 1-2 days. Then suddenly, after we deployed some refactorings, I noticed one of the server nodes wasn't restarted after 3 days. And after 5 days, it was still running. So today, after some more days, I went to the tech lead and told him the server hadn't been restarted for a full week. I wanted to congratulate him on the accomplishment.

That doesn't sound like refactorings to me. Refactorings are only supposed to deal with code cleanliness, not change behavior. Sounds like you deployed some bugfixes.

nexekho1

@Mason Wheeler said:

That doesn't sound like refactorings to me. Refactorings are only supposed to deal with code cleanliness, not change behavior. Sounds like you deployed some bugfixes.

If there was a lot of repeated code it's a lot more likely that there were lots of minor bugs present and the refactor could have easily got rid of them.
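A made-up illustration of that point, not the code from the story: two copies of the same caching logic, one of which never evicts anything, and a refactor that folds both into one helper. Every name and the size limit here are hypothetical.

```python
MAX_ENTRIES = 3  # hypothetical cache size limit


def put_bounded(cache: dict, key, value, limit: int = MAX_ENTRIES) -> None:
    """Shared helper: insert, then evict the oldest entries beyond the limit."""
    cache[key] = value
    while len(cache) > limit:
        # dicts keep insertion order, so the first key is the oldest one
        del cache[next(iter(cache))]


def handle_user_before(cache: dict, user_id: int) -> None:
    # Copy 1 of the duplicated logic: remembers to evict.
    cache[user_id] = f"user:{user_id}"
    while len(cache) > MAX_ENTRIES:
        del cache[next(iter(cache))]


def handle_order_before(cache: dict, order_id: int) -> None:
    # Copy 2: same idea, but the eviction loop never made it into this copy.
    cache[order_id] = f"order:{order_id}"


def handle_order_after(cache: dict, order_id: int) -> None:
    # After the refactor every call site goes through the helper.
    put_bounded(cache, order_id, f"order:{order_id}")
```

Strictly speaking the second copy changes behaviour once it uses the helper, so this is a bugfix that arrived as a by-product of the cleanup rather than a pure refactoring.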

Cassidy

Either way, Mason's right in that observation (and so are you, looking at it): it wasn't actually refactoring that eradicated the memory leak, it was some fixes deployed as a result of the refactoring.

I only hope some manager doesn't equate refactoring = reducing memory leaks. I suppose a side effect is that some manager sees the refactoring exercise as cost-justifiable optimisation and plans in some more to clean up the codebase.

frits

@Zylon said:

@frits said:

@erikal said:
Who coined the term "dickweed" and under which circumstances was it done?
This link may prove useful. It looks like it first came into usage in 1984 and was made popular by Master's Bill and Ted during some sort of adventure.

Master's WHAT?

Did you just dickweed my dickweedery about dickweed?

Watson1

@Cassidy said:

I only hope some manager doesn't equate refactoring = reducing memory leaks. I suppose a side effect is that some manager sees the refactoring exercise as cost-justifiable optimisation and plans in some more to clean up the codebase.

I've had managers who equated refactoring with adding new features. "We're going to refactor this so that users can frob their whatsits".

Actual refactoring came under "if it works don't fix it."

Cassidy

Isn't that Cargo Cult behaviour? When someone conflates the results with the operation and believes that repeating the operation will yield the same results (irrespective of how they are connected)...?

tgape

Warning: run-on sentences and nested parentheses ahead. tl;dr version: placebo them.

@pbean said:

He was so used to the rebooting that he completely freaked out

Ah, that brings back painful memories.

Back in the mid 90's, I had the "pleasure" of maintaining a few SCO "UNIX" servers. At that time, OpenSewer (Sorry - it's been so long, I forget what the correct term was) was available, and had been for long enough there had been several updates to *that*, and the upgrade to OpenSewer was cheaper than our maintenance support of the ancient version we were running, but some PHB decided that we couldn't upgrade those boxes except by transitioning the application to Solaris. We couldn't transition the application to Solaris, because the guy who wrote it didn't keep the (or more likely, didn't make a) requirements document and we couldn't get the users to assist us in making one.

Anyway, one of the quirks of the system was, intermittently, logging out of a terminal wouldn't release the TTY. Not a big deal, as security was handled by the camera on the pole behind the user, combined with two badge reader turnstile doors to even get *to* the terminals, and a small-town mentality that meant every new face was greeted by, on average, about 6 people who wanted to know who you were and why you were there before you could get that far into the facility. As such, people usually didn't log out of the terminal - sessions would last for days. Unfortunately, it seemed to be less intermittent of a problem when the session was really old, and some people insisted on logging out at the end of their shift (despite the fact that the login and password were the same, and printed on a label stuck to each machine). It still wouldn't have been a big deal, except that SCO sold their products under N terminal licenses. The company found a sweet spot with 16 terminal licenses, which let them run four terminals on one box for about a month before rebooting.
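The license arithmetic behind that "sweet spot" can be sketched as follows. This is a rough model only: it assumes every stuck TTY keeps holding one license until the next reboot, and it reads "about a month" as 30 days. The function names are made up for illustration.

```python
def spare_slots(licenses: int, terminals: int) -> int:
    """Licenses left over once every terminal on the box has a session."""
    return licenses - terminals


def implied_stuck_logouts_per_day(licenses: int, terminals: int,
                                  days_until_reboot: float) -> float:
    """Average rate of stuck logouts that would use up the spare slots in that time."""
    return spare_slots(licenses, terminals) / days_until_reboot


spare = spare_slots(16, 4)                         # 12 stuck TTYs can be absorbed
rate = implied_stuck_logouts_per_day(16, 4, 30)    # 0.4 stuck logouts per day
```

On that reading, a box with 16 licenses and four terminals can absorb 12 stuck TTYs, which works out to roughly one stuck logout every two and a half days.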

Naturally, that meant two nights after I assumed responsibility for the system, I got a page at around 2AM (shift change) because one of the systems needed to be rebooted. The computer room people could do it - they just needed the system maintainer to approve it. I hadn't had time to go over the docs for the system, the techs seemed confident it would work, I approved it, waited for them to verify it was up and the line operators were 'happy' again, and went back to bed, with a mental note to check on it in the morning. Two or three nights later, the same thing happened, on a different machine.

Now, I'd talked with the guy who 'maintained' those boxes before he left, and he'd never mentioned anything about that - but I also recalled he had a very pessimistic attitude towards spending time on code reliability. He hadn't written the software for that system either - he didn't even know the language it was written in (C) - so I hadn't taken that as an indication I had to go over that code immediately on assuming responsibility. (I had with the app I got that he did write - and that was a completely different and much worse mess of worms, but at least I'd started diving into that code *before* assuming responsibility, so I had a half-dozen patches ready to go when problems arose (I'd tried to get approval to just deploy them, but management wouldn't let me do any updates, except for break fixes, until I "had time to get familiar with the code"), and thus had most of that crap fixed in a couple of weeks.) I was really disappointed to learn it wasn't a problem with the app, but rather the OS.

About a week later, I managed to find a command-line option to a standard utility that promised to release stuck TTYs. I set up a cron job on one of the boxes that hadn't rebooted in a few weeks to run the command once a week, and set an uptime monitor to compare that box's uptime to the others. It stayed up for 45 days, then 15 days, and then went back to being rebooted every 30 days. Before giving approval for the fourth reboot after my adjustment, I actually connected in remotely to see what was up. I hadn't done this earlier, due to being told downtime on those machines counted at around $90,000 per hour. I connected remotely without issue. The box was idle, with no remote terminal connections. However, just the fact I could ssh *in* meant that the box wasn't completely hosed yet (but if it only allowed three terminals, that was $22,500 per hour, assuming all four machines were usable.) It also meant I'd installed unapproved software for that OS, but I didn't let that bother me. Ssh was standard for all of the newer systems, and was only not on the SCO boxes because it "couldn't" be installed there. Bah. A quick check on available terminals indicated 12 were open. I used wall and kill to simulate a system reboot, and had the computer room guy tell the operators the system was rebooted. I then set something up so that whenever the system was up for at least a week, and all four terminals were disconnected at the same time (that is, no production work running on the box), it would automatically simulate a reboot, picking on any sessions that weren't using ssh to connect plus the terminal connections for the four machines. When the box reached an uptime of 60 days, I implemented the TTY clear and the simulated auto-reboot on all of the machines.
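Two bits of reasoning in that paragraph can be written down as a short sketch: the per-terminal downtime cost, and the condition for the simulated reboot. The dollar figure, the one-week minimum, and the four terminals come from the post; the function names are invented, and reading the $90,000 as the cost of all four terminals being down is one interpretation.

```python
DOWNTIME_COST_PER_HOUR = 90_000  # figure quoted in the post
TERMINALS_PER_BOX = 4            # four terminals per box, per the post


def cost_per_blocked_terminal(total_per_hour: float = DOWNTIME_COST_PER_HOUR,
                              terminals: int = TERMINALS_PER_BOX) -> float:
    """Hourly cost of losing one terminal, assuming the total splits evenly."""
    return total_per_hour / terminals


def should_simulate_reboot(uptime_days: float, connected_terminals: int,
                           min_uptime_days: float = 7) -> bool:
    """True when the box has been up at least a week and no terminal is connected."""
    return uptime_days >= min_uptime_days and connected_terminals == 0
```

With the quoted numbers, one blocked terminal out of four costs 90,000 / 4 = 22,500 dollars per hour, which matches the figure in the post. The actual wall and kill steps are left out, since the post does not spell them out.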

That worked like gang-busters, until about a year later, when the terminal connection problem resurfaced - and this time, it hit the computer room team, so they couldn't remotely reboot the machine. Apparently, three of the four machines were running product, and they refused to log out (especially since the box had just "rebooted"), and nobody could log in from the computer room without one of them freeing up a TTY. I logged in remotely (thanks to screen - I had a connection from my development box, I just needed to remote in to there to use it), checked, and found that there were stale processes attached to 12 of the TTYs. So the command wasn't complete magic - but it didn't take much to free up the system again, and not that much more to figure out a way to identify the stale processes and auto-kill them before running the TTY clear.

The final point to this story came years later, after I'd moved on to another job, when I ran into one of the computer room people, who wanted desperately to know how I managed to get those systems to reboot in ten seconds, compared to the normal five minutes. He admitted they'd decommissioned all of those boxes, he just felt it might be usable on other servers as well.

Note: I can't actually remember if SCO OpenSewer had this same issue or not. It had enough problems it didn't really matter whether or not that particular bug was fixed. It probably wasn't; my real point about the OpenSewer thing was the OS on those boxes was ancient. I don't know if it was XENIX ancient, but I'm certain it wasn't MS XENIX ancient. I think it was SCO OpenSomethingOtherThanSewer. Thinking about it longer, I think the bug was still there, but the pricing had changed such that 64 terminal licenses were affordable.

Note 2: While I did (and do) know C, I wasn't able to port the application to Solaris any more than anyone else, because the program was written as a single routine named 'main', whose last statement was basically 'execve(argv[0])'.
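For anyone who hasn't met the pattern, here is a rough Python analogue of a program that loops by re-executing itself. The original was C and ended in something like 'execve(argv[0])'; this sketch uses Python's os.execv instead, and the function names are made up for illustration.

```python
import os
import sys


def restart_command(argv=None):
    """Return the (executable, args) pair that would re-run this script."""
    args = sys.argv if argv is None else argv
    return sys.executable, [sys.executable, *args]


def restart_self():
    """Replace the current process with a fresh copy of the same script.

    os.execv does not return on success, so nothing after this call runs
    and no state is carried over except what the new process re-reads.
    """
    exe, args = restart_command()
    os.execv(exe, args)
```

Calling restart_self() as the last statement of a script gives the same shape as the program described: one long body and no loop, with all state thrown away on each pass.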

Zemm

@Zylon said:

@frits said:
@erikal said:
Who coined the term "dickweed" and under which circumstances was it done?
This link may prove useful. It looks like it first came into usage in 1984 and was made popular by Master's Bill and Ted during some sort of adventure.

Master's WHAT?

This guy:

His

and

Cassidy

@tgape said:

At that time, OpenSewer (Sorry - it's been so long, I forget what the correct term was)

It was "OptServer", because pretty much everything was under /opt then symlinked to other SVR4-compliant directories. And it sucked harder than a crack-addled whore convinced you've pocketfuls of rocks.

@tgape said:

We couldn't transition the application to Solaris, because the guy who wrote it didn't keep the (or more likely, didn't make a) requirements document and we couldn't get the users to assist us in making one

"we can't fulfill management's transition requirements due to management's incompetence at controlling the development process..."

I had similar experiences with SCO's excrement. Finding that changing network settings in "scoadmin" didn't persist, scoadmin ignoring any defaults set up for new user creation, deviations from SVR4 standards like their behaviour for the "ls" command (although admittedly all flavours of Unix/Linux deviate to some extent). It certainly prepared me for the notion that paid-for products being developed and supported by a large organisation is not an indication they're actually any good.