Reason #7927 to always use virtualization


  • I survived the hour long Uno hand

    So, a phone server I deployed four weeks ago just started having lots of problems with call quality out of the blue.

    Customer is a 20Mbps fiber connection, not maxing that...

    Phone system is up and working, routing calls correctly... but man, this SSH session is a bit draggy. My terminal server connection is fine for other things, so maybe it's the actual SSH session... let's run top... wooooah, a load of 18 on a box that has way fewer than 18 cores.

    What're all these power_saving/X processes running maxed out on CPU?

    :facepalm:

    And that, boys and girls, is why you don't bare metal servers. Because at least VMWare fixes their shit and doesn't leave it un-backported from upstream for months and months and months and months on end.
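
    For anyone who wants to retrace the triage, it boiled down to roughly the following (a rough sketch; the exact thread names depend on your kernel, power_saving/N in this case):

        # rough sketch of the triage described above, run over the same draggy SSH session
        uptime                                    # load average was ~18 on a box with far fewer than 18 cores
        top -b -n 1 | head -n 25                  # batch mode: the power_saving/N threads sit at the top, pegged on CPU
        ps -eo pid,comm,pcpu --sort=-pcpu | head  # same view without curses, sorted by CPU usage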


  • SockDev

    @izzion said:

    dell

    And that's all you need to know about this :wtf: ;)



  • Link says it's a RedHat bug.


  • area_deu

    He should switch to Ubuntu?



  • Notice how all the comments on that bug are from three years ago? Notice how 2.6.32 was EOL last year?

    Your hopelessly out-of-date server is TRWTF.



  • RHEL (and its clone CentOS) is always hopelessly out of date.

    Because old is apparently "stable." Never you mind that we're all the way up to version 4.0.1 of the kernel now and tons of bugfixes have gone into the kernel in the last 5+ years...



  • Kernel bug, actually. However, the kernel in question was released in 2009... and the 2.6 line was in maintenance mode for years prior to that.

    This is the equivalent of running Windows XP and complaining about how it has bugs on modern hardware.



  • I like how we need an extra layer of protection between the thing that's supposed to manage the system and run programs (the OS) and the actual system because the OS always ends up doing weird shit.


  • I survived the hour long Uno hand

    Well, yes, the bug is 3 years old. And as referenced in the very last comment on the Dell forums link, CentOS 6.5 (basically "Windows 7", in terms of recency of CentOS builds, about 4 months of patch revisions back from "now") still doesn't have the fix applied. And I can confirm from my own personal trial and error that the current release, CentOS 6.6, still doesn't have it either.

    I'm just more frustrated that the bug could be easily mitigated by having an abstraction (virtualization) layer in between my hardware and my OS, so that the virtualization layer handles all the messy ACPI stuff and presents synthetic hardware to the OS that wouldn't cause this problem. And that RHEL/CentOS haven't backported such an important fix into the supported-line kernel versions.
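
    For what it's worth, the bare-metal stopgap (assuming the pegged threads really do come from the acpi_pad module, which is where power_saving/N kernel threads normally come from) would be to unload and blacklist that module until a fixed kernel shows up:

        # assumption: acpi_pad is the module spawning the power_saving/N threads
        lsmod | grep acpi_pad                 # confirm it's actually loaded

        modprobe -r acpi_pad                  # unload it now to get the CPUs back

        # keep it from loading on the next boot (the .conf filename is arbitrary)
        echo "blacklist acpi_pad" > /etc/modprobe.d/acpi_pad.conf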



  • Hence the main difference between RHEL and OL is the kernel version. Though calling your kernel 'Unbreakable' rings alarm bells with me as well...



  • @izzion said:

    And that RHEL/CentOS haven't backported such an important fix into the supported-line kernel versions.

    They could spend effort fixing it... or they could just release RHEL 7 / CentOS 7 instead. Which is what they did.


  • I survived the hour long Uno hand

    I'm so glad that Microsoft doesn't take that attitude toward their server and desktop operating systems.



  • Let me put it to you another way: Windows XP has components newer than the latest supported RHEL/CentOS 6 does.

    That's really all that needs to be said, isn't it?


  • Grade A Premium Asshole

    @izzion said:

    And that, boys and girls, is why you don't bare metal servers.

    Bullshit. Hypervisors have their place, but making blanket statements that running on bare metal is a bad idea is just idiotic. There are lots and lots of reasons to still run bare metal.



  • The problem here seems to be running an ancient (and low quality) OS, it has nothing to do with virtualization.


  • Grade A Premium Asshole

    Exactly.

    Run that same OS on VMWare and you are still going to have problems. You can just have lots of those problems on a single set of hardware.


  • I survived the hour long Uno hand

    I guess I don't understand the claim that an OS released in 2011 and under regular support through 2018 and LTS through 2021 is "ancient".

    RHEL making a poor decision to not backport that kernel fix, I could see as a criticism.

    But y'all're basically calling Server 2008 R2 an ancient operating system here...


  • Grade A Premium Asshole

    @izzion said:

    I guess I don't understand the claim that an OS released in 2011 and under regular support through 2018 and LTS through 2021 is "ancient".

    I guess I don't understand how any hypervisor is going to fix the issue you are seeing.


  • I survived the hour long Uno hand

    My expectation is that virtualization would mask the ACPI stuff and prevent the bug's effect.

    But mostly I just needed to rant after I got woken up from my post-maintenance-window afternoon nap because the consultant for this customer has been a lazy COMPLAIN for the last few weeks and gave notice this week so he's showing even less work ethic than normal.


  • Grade A Premium Asshole

    Fair enough, but virtualization is not a panacea. I lost a consult with a potential client the other day because they became incredulous when I even mentioned the possibility of moving them off of virtualization. Their environment was nowhere near complex enough for VMWare and they had absolutely no redundancy.

    I understand needing to bitch though. Keep on keeping on. I empathize on having to work with pains in the ass.



  • It is easy to add redundancy with virtualization.
    Can you pass the client to me and I'll help them out? :wink:


  • SockDev

    @Nprz said:

    It is easy to add redundancy with virtualization.

    you'll still need at least two physical servers for proper redundancy.

    redundant virtual servers on the same host are a step up from no redundancy, but they're still not proper redundancy because they'll both fall over on issues with the host.



  • @accalia said:

    redundant virtual servers on the same host are a step up from no redundancy, but they're still not proper redundancy because they'll both fall over on issues with the host.

    You can have all the virtuals you want, but if you don't have physically separated redundant physicals, you'll still get screwed when the cleaner's vacuum stalls and trips the feeder breaker for the server room because nobody bothered to selectively coordinate the lighting circuits.



  • Better move off planet, too, in case of asteroid storm.

    Hmmm....might need to go extra solar system. You never know when the next local gamma ray burst will show up.



  • Make sure to pack an ansible; the I/O latency from reliable transplanetary replication will kill your performance otherwise.


  • SockDev

    Well, it's a matter of balancing risk/reward/cost.

    if you need to have redundancy then you want at least two bare metal servers, that way you can have host upgrades/repairs without losing your "redundancy"

    of course just two physical servers isn't full redundancy either, but depending on your needs it may be sufficient for you.


  • SockDev

    @accalia said:

    of course just two physical servers isn't full redundancy either, but depending on your needs it may be sufficient for you.

    Would three physical servers be enough? It does seem to me that triple redundancy is considered Doing It Right™



  • @boomzilla said:

    Hmmm....might need to go extra solar system. You never know when the next local gamma ray burst will show up.

    For that one, 3 data centers equidistant on the equator should suffice - a gamma ray burst will cook only the Earth's side towards the burst :)



  • @Rhywden said:

    For that one, 3 data centers equidistant on the equator should suffice - a gamma ray burst will cook only the Earth's side towards the burst :)

    Right, but: ASTEROID STORM



  • I even have VMware servers with a single VM on them, simply because it's convenient for backup, snapshotting, or migrating to other machines. I'd even do that to my personal system if I felt like jacking around with PCI Passthrough to get my graphics card and USB ports accessible.



  • +1
    It's fucking annoying that (most) consumer processors do not have vt-d.



  • @Rhywden said:

    For that one, 3 data centers equidistant on the equator should suffice - a gamma ray burst will cook only the Earth's side towards the burst

    Depends on how long it lasts.



  • @swayde said:

    +1 It's fucking annoying that (most) consumer processors do not have vt-d.

    Oh yeah, forgot about that. I have an i5 2500k, and the k-series CPUs don't have vt-d so no PCI passthrough for me. :frowning:

    Most non-k Intel CPUs have vt-d now, or at least the i5's and i7's do. Not sure about the lower-end models.



  • @mott555 said:

    Oh yeah, forgot about that. I have an i5 2500k, and the k-series CPUs don't have vt-d so no PCI passthrough for me. :frowning:

    Most non-k Intel CPUs have vt-d now, or at least the i5's and i7's do. Not sure about the lower-end models.

    They don't? I haven't poked around in my BIOS options that much, but I thought my 2600k did have that option.



  • It will have VT-x, but it will not have VT-d. They're different. Also, the 2600 has VT-d, but the 2600k does not.
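
    If you'd rather check from a running Linux box than dig through Intel ARK, roughly: VT-x shows up as a CPU flag, while VT-d only proves itself once the kernel actually finds an IOMMU, which needs CPU, chipset, and BIOS support all at once:

        # VT-x / AMD-V: look for the vmx (Intel) or svm (AMD) CPU flag
        grep -E -o 'vmx|svm' /proc/cpuinfo | sort -u

        # VT-d / IOMMU: DMAR/IOMMU lines only show up if CPU, chipset and BIOS all cooperate
        dmesg | grep -i -e dmar -e iommu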


  • SockDev

    @RaceProUK said:

    Would three physical servers be enough?

    depends on your needs.

    do you need triple redundancy?

    does the cost/benefit analysis indicate that you would get a net benefit from triple redundancy?



  • It won't help on the road to six nines, but it'll definitely help on the road to nine sixes! :smirk:


  • SockDev

    When designing the Saturn V rocket, Goddard told Von Braun and his scientists that he wanted the rocket to have five nines of reliability, and indeed he got five neins.


  • SockDev

    @accalia said:

    he wasted the rocket

    @accalia said:
    he got five niens

    :rolleyes:

    INB4 no flag



  • @accalia said:

    indeed he got five niens.

    As a result of which, all the efforts on the rocket were wasted?

    I actually thought it must be something like that for some time before I realised it was a typo. How do you even substitute an S for an N?



  • awww, the pun was amusing enough I'd let the misspellings go, but you just had to kill it...


  • SockDev

    corrected the first one; the second one was spelled correctly. ;-)


  • SockDev

    @CarrieVS said:

    How do you even substitute an S for an N?

    by being @accalia
    :-P



  • @accalia said:

    the second one was spelled correctly.

    No it wasn't.


  • SockDev

    @CarrieVS said:

    How do you even substitute an S for an N?

    @accalia said:

    corrected the first one

    :smile:

    @accalia said:

    the second one was spelled correctly

    *awaits flag for whoosh (not from @accalia though)*

    @TwelveBaud said:

    No it wasn't

    …or not…


  • Discourse touched me in a no-no place

    @powerlord said:

    They don't? I haven't poked around in my BIOS options that much, but I thought my 2600k did have that option.

    I just checked Intel's site and, for example, the 3570k doesn't have it.



  • @Rhywden said:

    3 data centers equidistant on the equator

    No good if gamma ray bursts are from due north or due south. You want four data centres at the vertices of an equilateral tetrahedron, minimum.


  • SockDev

    @TwelveBaud said:

    No it wasn't

    yes it was

    /me holds a paw over the orange edit pencil


  • Grade A Premium Asshole

    @Nprz said:

    It is easy to add redundancy with virtualization.

    Thank you, Mr. Obvious. But virtualization also carries a high cost when you are only dealing with a few guests. Couple that with a VM that is a file server serving ~2TB of files in an office of 15 people, and they would be better off in every respect on bare metal with a quick recovery strategy.

    They also needed some of their functions coalesced, as they had separate VMs for a standalone domain controller (no other roles installed), terminal services, Exchange, a Blackberry server that is not doing anything anymore, and some other VM that was no longer used but was just tying up resources.

    So, my proposal was to: move the file server to its own bare metal install running on cheap SATA drives instead of the ten 300GB 15K SAS drives; move the domain controller to a bare metal install, again on cheap SATA drives, and use it as a replication point to back up the file server to for quick recovery; move to hosted email, as their internet connection is really flaky; and take one of the VMWare boxes and run a few VMs on it for the very few people who work out of town to connect to through a simple port forward for RDP.

    Simpler, cheaper, easier to back up, easier to recover in the event of failure, and there would have been money left over in the budget to install a backup target for onsite backups to VHD over iSCSI, while still saving them money overall.

    @Nprz said:

    Can you pass the client to me and I'll help them out?

    So, do you still think that you know better than I do, now that we have gone from a sentence or two to a fuller explanation? ;)



  • @Polygeekery said:

    So, do you still think that you know better than I do

    If he weren't the sort of person to think that, he wouldn't have created his account here.

