How to "fix" a broken fileserver.

HAX

Here in the Netherlands we have this very popular website: http://fok.nl. It's a big community with every kind of news, forums, weblogs and more. There are around 280,000 registered users, and many more people visit the site every day.

A bit more than a week ago, they started to move all their servers (during daytime) from one datacenter to another. While they did this, they also replaced a few old servers with new ones (1st WTF, doing this at the same time). That's when the problems started.

The site came online again and seemed very fast. But it only lasted an hour before the site was unreachable: "Kernel Panic" on the fileserver. They put a downtime page online on another server, saying they couldn't reboot the server remotely because they hadn't written down which power plug the server was connected to (2nd WTF). Also, everyone was already at home, and they didn't want to go to the datacenter anymore that day, so a day passed.

The next day, they did the usual server maintenance: replace some memory, look through the logs and try to disable or change some settings. They put the site online again, went home and bam. Site offline again. This story repeated itself for a few days. Everything was done every day by only 1 guy, an unpaid volunteer.

A few days ago they finally tried to get help from Dell (where they got their servers from). I don't know exactly what Dell did, but at least the server could now work for 10 hours straight! Hooray. But the problem wasn't over yet.

Now they came up with a *great* solution. They created a little script which reboots this server from time to time (main WTF!). Yes, it works. Now the site is only offline two or three times a day, for around 10 minutes each time. But it seems like they gave up on it and are just keeping this solution. They're even refusing most help.

So I guess we just have to deal with this maintenance page a few times a day:
FOK maintenance page

Of course there are many more WTFs, in this story and in the site itself. A little while ago a script kiddie also 'hacked' the server and got their hands on all the passwords. They were stored in plain text in the database! These are the kinds of WTFs you'd expect from a site with only 100 users, not 280,000.

Just, wtf..

BTW. Sorry for my bad English!

viraptor

While they did this, they also replaced a few old servers with new ones (1st WTF, doing this at the same time).

Not sure about this one... It seems much easier (if you have a fast enough link), to synchronise the data first and then simply switch off the old server in datacenter A and switch on the new ones in datacenter B. You also minimise the downtime (theoretically). I think it was a good decision to do it at the same time.

Also, if you don't earn loads of money and simply run a community site, there's not much you can do if you have a hardware problem that causes a kernel panic... if it only occurs under high load and after many hours of normal operation, everything might have worked well during testing. They dealt with that as well as they could (especially since they don't have a professional admin on the team - only volunteers).

Random problems, cheap service - sure. But it's not like they're going to lose all the popularity overnight (think twitter crashes). Dunno... everyone can experience some downtime once in a while - if they don't have people responsible for proper uptime, it's not a big wtf when some crashes happen.

RogerWilco

Well, running sites of that size gets even more professional organisations into problems (just ask Wizards of the Coast, for example). At some point, things that are trivial at home or in a small business become a pain on a larger scale. Still, I'm not a fan of Fok!, so I can't be too sorry for them, especially hearing that they don't store passwords properly. To each their own, I suppose.

dhromed

Everybody knows the Dutch don't know squat about server management.

I suppose creating dry land is a valuable skill also.

morbiuswilters

@dhromed said:

I suppose creating dry land is a valuable skill also.

What's the point of creating dry land if you're just going to fill it with broken, hacked servers and your genetically-inferior, Dutch (I repeat myself) children? Worst use of windmills, ever.

dhromed

@morbiuswilters said:

What's the point of creating dry land if you're just going to fill it with broken, hacked servers and your genetically-inferior, Dutch (I repeat myself) children? Worst use of windmills, ever.

Look, I'm sorry.

That's all.

bstorer

@morbiuswilters said:

Worst use of windmills, ever.

What, in your mind, constitutes the best use of windmills ever? Turning a millstone to grind corn? Target for Don Quixote? The promise of an easier life ahead for the animals toiling on the farm while you live in the farm house and drink whiskey?

astonerbum

@viraptor said:

While they did this, they also replaced a few old servers with new ones (1st WTF, doing this at the same time).
Not sure about this one... It seems much easier (if you have a fast enough link), to synchronise the data first and then simply switch off the old server in datacenter A and switch on the new ones in datacenter B. You also minimise the downtime (theoretically). I think it was a good decision to do it at the same time.
Also, if you don't earn loads of money and simply run a community site, there's not much you can do if you have a hardware problem that causes a kernel panic... if it only occurs under high load and after many hours of normal operation, everything might have worked well during testing. They dealt with that as well as they could (especially since they don't have a professional admin on the team - only volunteers).
Random problems, cheap service - sure. But it's not like they're going to lose all the popularity overnight (think twitter crashes). Dunno... everyone can experience some downtime once in a while - if they don't have people responsible for proper uptime, it's not a big wtf when some crashes happen.

The OP was asking why they would move their servers to B and, at the same time, upgrade some of A. The smart thing to do is to eliminate variables by getting cluster B running first. If B fails, keep A going so that, worst case, the site still works. Once you know B is stable, then you can upgrade A. See, that's called semi-intelligent management.

The fact that passwords are stored in plain text means very simply: // todo: encrypt data here.
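Strictly speaking, passwords are usually hashed with a salt rather than encrypted, so that a database leak doesn't hand over the originals. Something like the following would do as a rough sketch. It uses only Python's standard library (PBKDF2 via hashlib); the function names, the storage format and the iteration count are my own assumptions for illustration and have nothing to do with what the site actually runs.

```python
import hashlib
import hmac
import os

# Assumed parameters for illustration only; tune them for your own hardware.
ITERATIONS = 200_000
SALT_BYTES = 16


def hash_password(password: str) -> str:
    """Return a string holding algorithm, iterations, salt and digest."""
    salt = os.urandom(SALT_BYTES)  # fresh random salt per password
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, ITERATIONS
    )
    return f"pbkdf2_sha256${ITERATIONS}${salt.hex()}${digest.hex()}"


def verify_password(password: str, stored: str) -> bool:
    """Recompute the digest with the stored salt and compare in constant time."""
    algorithm, iterations, salt_hex, digest_hex = stored.split("$")
    if algorithm != "pbkdf2_sha256":
        return False
    candidate = hashlib.pbkdf2_hmac(
        "sha256",
        password.encode("utf-8"),
        bytes.fromhex(salt_hex),
        int(iterations),
    )
    return hmac.compare_digest(candidate, bytes.fromhex(digest_hex))
```

The database would then hold only the string returned by hash_password, and a login check calls verify_password with whatever the user typed.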

DaveK1

@bstorer said:

@morbiuswilters said:
Worst use of windmills, ever.
What, in your mind, constitutes the best use of windmills ever? Turning a millstone to grind corn? Target for Don Quixote? The promise of an easier life ahead for the animals toiling on the farm while you live in the farm house and drink whiskey?

Toothpicks for Godzilla to get all those stray bits of Tokyo out of his mouth.

badcaseofspace

Good soap. When is John de Mol jumping into this?
"Reality TV from the A/C Catacombs"

I could defend the honour of the Nederlandse Systeembeheerder, but I think it's better this way.

viraptor

@astonerbum said:

The OP was asking why they would move their servers to B and, at the same time, upgrade some of A. The smart thing to do is to eliminate variables by getting cluster B running first. If B fails, keep A going so that, worst case, the site still works. Once you know B is stable, then you can upgrade A. See, that's called semi-intelligent management.

I may be misunderstanding it, but I read it as: after the move they had no more servers in the first datacenter and a mix of A/B servers in the second, but only a full set of A could actually run the site... I'm not sure what their grand plan was - we may have too little information to say that reverting to the old setup was possible. I say it's a bit foolish, but innocent, until proven wtf'y.
