Outage scheduling.


  • Discourse touched me in a no-no place

    "I need an 8 hour outage to migrate servers".

    "Weekdays are unacceptable. "
    "Business hours for any North American or European timezone are unacceptable, even on weekends".
    "The last weekend of the month is not acceptable. "
    "The first weekend of the month is not acceptable. "
    "May and June are peak season for business type A and therefore unacceptable. "
    "September through December is peak season for business type B and therefore unacceptable "
    "January through April are peak season for business type C and therefore unacceptable. "
    "The 14th 15th and 16th are unacceptable due to mid-month reporting"
    "Support staff are unavailable on Sundays. "
    "Midnight to 2am Saturday morning is autopatching and therefore maintenance cannot be performed then."

    So. Uh. That leaves me July 11, July 18, August 8, and August 22. At null o'clock, because eastern Europe's business hours and Hawaii's are exactly opposed, and the only non-overlap is 12 to 3am. On local Sunday. Which isn't long enough anyway.

    And this migration is to get off of Windows 2003, which makes July 11th the last possible date.
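
    For fun, the surviving dates can be checked mechanically. This is a throwaway sketch (Python, with the thread's constraints hard-coded; the time-of-day rules like the Saturday autopatching window aren't modeled):

```python
from datetime import date, timedelta

def candidate_saturdays(year):
    """Saturdays only: weekdays are out, and Sundays have no support staff."""
    peak_months = {1, 2, 3, 4,        # business type C
                   5, 6,              # business type A
                   9, 10, 11, 12}     # business type B
    out = []
    d = date(year, 1, 1)
    while d.year == year:
        if d.weekday() == 5 and d.month not in peak_months:
            first_weekend = d.day <= 7                            # first weekend of month
            last_weekend = (d + timedelta(days=7)).month != d.month  # last weekend of month
            mid_month = d.day in (14, 15, 16)                     # mid-month reporting
            if not (first_weekend or last_weekend or mid_month):
                out.append(d)
        d += timedelta(days=1)
    return out

print([d.strftime("%B %d") for d in candidate_saturdays(2015)])
# ['July 11', 'July 18', 'August 08', 'August 22']
```

    Which matches the four dates above.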


  • I survived the hour long Uno hand

    @Weng said:

    get off of Windows 2003

    We're doing that now. Random outages throughout the day in demo. Yay us.

    "I spent three hours trying to figure out why the code I deployed didn't get served up until I realized you'd taken the server I deployed to out of the pool and replaced it with a new 2008 server. Warning would have been nice."



  • What you need to do is tell one of your bosses/clients/whoever the month that you'll be migrating, and another one the date, and write down some possibilities on a piece of paper. By the time they figure it out, you will have performed the migration.



  • "There will be an 8 hour outage. Either I will decide when it is, taking your suggestions into account, or the hackers will, when they destroy the servers. Choose."



  • Is whatever you're migrating not resilient enough that you could do half of the service in one batch, then the other half at a different time?



  • It doesn't take that many moving parts to dig this kind of hole. At my last job, business ran from 7am to 11pm every day except Sunday. Someone put in a business-critical nightly job from midnight to 5am, and a weekly job that ran most of the day on Sunday.

    Some people don't get that maintenance periods are blocks of time in which no business-critical stuff is supposed to occur, not a convenient time to run batch jobs.


  • Discourse touched me in a no-no place

    SQL version bump mixed in with this one. Plus various license managers for third party apps are moving. And those apps don't run without license servers.

    The actual app servers are resilient, and the SQL boxes are resilient, but they need SQL up to do anything useful. So if you offline SQL to do a version bump, you're offline.



  • Or just conveniently have a server failure during business hours, but don't mention the why.



  • @Weng said:

    So if you offline SQL to do a version bump, you're offline.

    Ah, bugger.



  • Big meeting of the principals. Tell them they have to pick a date... or, if they like, they can wait until Microsoft kills the servers on July 11. Tell them they can leave once they've selected the best compromise outage period.

    Edit: Oh... and after this, try to come up with an architecture that won't have all your eggs in a solitary basket.



  • @Yamikuronue said:

    until I realized you'd taken the server I deployed to out of the pool and replaced it with a new 2008 server.

    Why would you go for 2008 instead of straight to 2012r2? Isn't that just setting yourself up for the next round of pain to happen five years earlier than you need it to?

    This is a serious question - I'll need to move a couple of school servers off 2003 this year and if there's a good reason to avoid 2012r2 I'd like to know about it.



  • @CoyneTheDup said:

    wait until Microsoft kills the servers on July 11

    Is this hyperbole, or are there reasons other than the unavailability of patches why 2003 won't work after July 11?


  • Discourse touched me in a no-no place

    Internal compliance wonks. Who have actually set the deadline at June 1st.

    But it's hyperbole if you aren't in a closely regulated line of business.



  • Can you perhaps get your internal compliance wonks fighting directly with your unreasonable uptime wonks, then do the migration while they're both distracted?


  • Impossible Mission Players - A

    If you're using primarily Microsoft technologies and solutions, I would say that 2012R2 is strictly better than either version of 2008 anyway. Much better toolkit for centralized remote management of servers, and easier to install most of the MS roles & features.

    If you have a legacy line-of-business app that pitches three kinds of fit about Windows 8, you're probably stuck with Server 2008 (Vista) or 2008R2 (7).



  • @hungrier said:

    What you need to do is tell one of your bosses/clients/whoever the month that you'll be migrating, and another one the date, and write down some possibilities on a piece of paper. By the time they figure it out, you will have performed the migration.

    You missed the part where you get them to argue over whether the Windows 8 logo is blue and white, or black and gold.



  • I have found the solution for you! You don't need an eight hour migration, just a one minute upgrade.

    Simply replace the patch cables in the back of your Windows 2003 servers with these beauties. They will become completely immune to all forms of malware and gain so much clarity and depth that your compliance wonks will think you're running at least Windows 2023.


  • I survived the hour long Uno hand

    @flabdablet said:

    Why would you go for 2008 instead of straight to 2012r2?

    I have no idea. That's the middleware and server teams; I rarely interact with them. We do have a slight fear of things being "too new", but 2012 is probably stable enough.

    The bulk of our web servers are moving to Linux instead, so I have to assume there are some legacy apps or something preventing us from going to 2012 on the servers that use AD authentication (which is the primary reason they're staying on Windows; it's the internal-facing servers).



  • @immibis_ said:

    What you need to do is tell one of your bosses/clients/whoever the month that you'll be migrating, and another one the date, and write down some possibilities on a piece of paper. By the time they figure it out, you will have performed the migration.

    You missed the part where you get them to argue over whether the Windows 8 logo is blue and white, or black and gold.

    Or the part where they can't decide if the upgrade is going down the stairs or up.


  • Discourse touched me in a no-no place

    The reason for 2008 is easy. It's the last 32bit Windows.

    This iteration of the platform is compiled anycpu, but with 32bit dependencies. So it won't run on 64bit. And we don't have source to half of it to do a proper build.

    And that's before you get to the 16bit dependencies.



  • @Weng said:

    anycpu, but with 32bit dependencies

    Ouch. What's the point of anycpu unless you can do the dependencies properly? Did that happen just because anycpu is the default?

    @Weng said:

    we don't have source to half of it

    Ow ow owie ow ouch! Still, as long as you can recompile the main exes as x86 you should eventually be good on WOW64... unless, heaven forbid,

    @Weng said:

    the 16bit dependencies

    IN 2015?

    Fuck.



  • @Weng said:

    This iteration of the platform is compiled anycpu, but with 32bit dependencies. So it won't run on 64bit. And we don't have source to half of it to do a proper build.

    If it's not signed or strongnamed, you can use corflags to make it 32bit.

    @Weng said:

    And that's before you get to the 16bit dependencies.

    ... oh. Damn.


  • Fake News

    My current client does it right. They have two of every type of server (DB, Web, whatever else), as well as load balancers in front of them. At deployment time, the load balancers gradually shut off traffic to one set of servers. Those servers are then updated. The load balancers then begin to direct traffic to that first set, while also shutting off traffic to the other set. The other set is then updated and brought back into service. I guess the major tricky part is ensuring data integrity between both sets of data store servers while schemata and such are being updated... but they have this shit down cold, such that customers never see a service interruption.
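
    The drain/update/restore dance described above can be sketched in a few lines. This is a toy Python model, not the client's actual tooling; the pool names and callbacks are made up:

```python
def rolling_update(pools, in_service, update):
    """Update one pool at a time; the load balancer (modeled as the
    in_service set) always has at least one pool taking traffic."""
    for pool in pools:
        in_service.discard(pool)   # LB gradually shuts off traffic here
        update(pool)               # patch/upgrade while drained
        in_service.add(pool)       # LB directs traffic back

live = {"blue", "green"}
updated = []

def update(pool):
    assert live, "the other pool must still be serving"
    updated.append(pool)

rolling_update(["blue", "green"], live, update)
print(updated, sorted(live))   # ['blue', 'green'] ['blue', 'green']
```

    The invariant worth testing is the one in the `update` callback: at no point is the in-service set empty, which is why customers never see an interruption.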


  • Discourse touched me in a no-no place

    @lolwhat said:

    I guess the major tricky part is ensuring data integrity between both sets of data store servers while schemata and such are being updated...

    It's not hard, as long as you're very disciplined. So… it's hard I guess. 😃
    500 storm: try… err shit I lost count



  • @Weng said:

    This iteration of the platform is compiled anycpu, but with 32bit dependencies. So it won't run on 64bit. And we don't have source to half of it to do a proper build.

    And that's before you get to the 16bit dependencies.

    Well there's your main undefined there! 🤦 When you're a development org, losing the source is like Coke losing the formula for their soda.


  • I survived the hour long Uno hand

    @lolwhat said:

    load balancers

    We have load balancing in demo->prod, but not in dev; and in demo there are only two per platform, so it's common to bypass the load balancer when testing things.


  • Discourse touched me in a no-no place

    Historically, we weren't viewed as a development group, but instead an 'automation' team within operations. Only in the past 2 years have my colleagues and I managed to expel enough of the rot to start bringing in concepts like proper source control and testing and an SDLC. We get resistance at every single baby step because doing software right is seriously fucking expensive (in people's minds).

    We are buried under 20 years of legacy technical debt. Nobody will commit to modernizing any of it. Ever. The fuckers were convinced they could sweep Win2003 EOL under the rug until six months ago, and then they just held meetings about planning every 2 weeks while making no forward progress until literally the very last moment.


  • :belt_onion:

    didn't we just solve one of these? what's that bitch cheryl hiding now?



  • @lolwhat said:

    They have two of every type of server (DB, Web, whatever else), as well as load balancers in front of them

    You can't just slap a load balancer in front of a pair of database servers. You'd either need something to keep the databases in sync or have your data sharded. If you sync them, a major version upgrade would mean taking both offline. If you shard, then half of your data is inaccessible at any time during the upgrade.

    The best I could imagine with MS SQL Server is this:

    1. Build new server.
    2. Take down app.
    3. Shut down old server.
    4. Dismount the drives from the old server and mount on new server.
    5. Attach databases to new server - the internal database structure will be upgraded as it comes online.
    6. Change the DNS entry so traffic goes to new server.
    7. Bring the app back online.

    You could probably get steps 2 through 7 down to a few minutes with some practice and automation.
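
    Sketched as a tiny runbook runner (hypothetical Python; each step is reduced to a label, and a real version would also carry a rollback action for every completed step):

```python
def run_cutover(steps, execute):
    """Run steps in order; stop at the first failure so the rollback
    plan only has to undo what actually happened."""
    done = []
    for step in steps:
        if not execute(step):       # each step reports success/failure
            return done, step       # completed steps, and where it broke
        done.append(step)
    return done, None

CUTOVER = [
    "build new server",                                        # done ahead of the window
    "take down app",
    "shut down old SQL server",
    "move drives from old server to new",
    "attach databases (engine upgrades internal structures)",
    "repoint DNS at new server",
    "bring app back online",
]

done, failed_at = run_cutover(CUTOVER, lambda step: True)
print(failed_at)   # None: all seven steps ran
```

    The point of automating it this way is exactly the one above: with the human decisions moved out of the window, steps 2 through 7 shrink to minutes.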


  • Discourse touched me in a no-no place

    We actually did the upgrade today. This is the procedure we used.

    Took a few hours. The rest of the window was qualifying that everything still works.


  • Impossible Mission Players - A

    Though the Always On Availability Group feature in SQL 2012+ works really well for the "load balancer & something to keep the databases in sync" option. As long as both servers are actually on the same LAN. And the one doesn't have some sort of weird data drive issue that's possibly VMWare driver related and seems to result in 100-200% increases in transaction commit times when it's active. And as long as that "slow" server doesn't somehow magically make itself active right before your 4 hour monthly billing maintenance script. Every. Damn. Time.

    🍟 👀



  • "There will be an outage in 15 minutes. If you press this button, the outage will be postponed to 1 hour from now. The button may be used multiple times, each time resetting the time to outage to 1 hour. Have fun!"


  • Discourse touched me in a no-no place

    I imagine that big DB version updates would be some of the most painful things to contemplate, and would involve finding the quietest time of the year and a lot of meetings.

    Schema changes would be simpler provided you can do them as one of these:

    • Changing the stored procedures (nearly instantaneous)
    • Adding columns (might be cheap, depending on DB, but at least it doesn't require major jiggery-pokery)
    • Removing unused columns (caveat from above applies).

    Maintaining multi-decadal ABIs has similar challenges.
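
    The "adding columns might be cheap" point is easy to demo with `sqlite3` from the Python standard library (a toy illustration; engines differ on whether a default backfills rows or stays metadata-only):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
con.executemany("INSERT INTO orders(total) VALUES (?)", [(9.5,), (12.0,)])

# Adding a column with a constant default doesn't rewrite existing rows
# in SQLite; old rows simply report the default when read.
con.execute("ALTER TABLE orders ADD COLUMN currency TEXT DEFAULT 'USD'")

print(con.execute("SELECT id, total, currency FROM orders").fetchall())
# [(1, 9.5, 'USD'), (2, 12.0, 'USD')]
```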



  • I don't know about MSSQL, because we don't use any Microsoft products where we work (it's all Linux, thankfully), and there it's rather easy to do a database upgrade without any downtime. Basically you do the following:

    • Set up replication
    • Upgrade replication server
    • Wait for it to catch up
    • Change the master host in the application

    Since our database servers all run from a SAN, this is doable in a matter of minutes. You just shut down the replication server that's there for backups, clone the volume, and mount it on the new database server. The whole upgrade takes around 10 minutes tops and can be done during business hours, while customers use the application.
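
    The "wait for it to catch up" step is the whole trick: the application only switches masters once the upgraded replica has replayed everything. A toy Python model (log positions as plain integers; real replication streams continuously and the master keeps writing):

```python
def wait_for_catch_up(master_pos, replica_pos, replay):
    """Replay outstanding log entries until the replica reaches the
    master's position; only then is it safe to repoint the app."""
    while replica_pos < master_pos:
        replica_pos = replay(replica_pos)
    return replica_pos

# pretend the upgraded replica is 750 transactions behind
caught_up = wait_for_catch_up(
    master_pos=1000,
    replica_pos=250,
    replay=lambda pos: min(pos + 100, 1000),  # replay in batches of 100
)
print(caught_up)   # 1000: safe to change the master host in the application
```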


  • Discourse touched me in a no-no place

    Which DB engine is this?


  • Fake News

    @dkf said:

    I imagine that big DB version updates would be some of the most painful things to contemplate, and would involve finding the quietest time of the year and a lot of meetings.

    SQL Server does support replication across multiple server versions, so the world wouldn't end.


  • Discourse touched me in a no-no place

    @lolwhat said:

    SQL Server does support replication across multiple server versions, so the world wouldn't end.

    I was assuming that database engines would only do this across relatively minor version number changes. Major version number changes — when things like the replication algorithm might change — would be Scary Stuff, and would probably not be done while keeping things live (and quite possibly not at all ever). Plus there's still the problem of schema changes; they're quite capable of really causing huge trouble if done wrong.


  • Fake News

    @dkf said:

    Major version number changes — when things like the replication algorithm might change — would be Scary Stuff, and would probably not be done while keeping things live (and quite possibly not at all ever).

    My client has most certainly updated SQL Server versions without shutting down all the things. After all, Microsoft isn't Linux. 🚎 That being said, I'm also certain that a shitload of discussion and planning went into the move.


  • Fake News

    @dkf said:

    there's still the problem of schema changes

    My client has a very strict protocol to follow there also, with script reviews by fellow devs and in-house DBA's.

    These guys have a fuckton of data stored across something like a thousand SQL Server databases in production. They've pushed SQL Server to every conceivable limit, while still maintaining a pretty damn good uptime. They know what they're doing.


  • Discourse touched me in a no-no place

    @lolwhat said:

    That being said, I'm also certain that a shitload of discussion and planning went into the move.

    I'll bet there was some testing too. As long as the format of the DB on disk and the protocol spoken on the wire aren't changing (well, not instantly), then it's not too hard to migrate. The problem comes when you've gotten into a state where you need to do a Big Bang migration; those are usually a sign of someone really not knowing what they were doing (and usually that someone was not knowing things quite some time ago, so you're stuck).



  • "I need an 8 hour outage to migrate servers".

    TRWTF


  • mod

    @botondus said:

    TRWTF

    Let's assume the best-case scenario: that @Weng's company has sufficient hardware to allow them to do basic setup of the new servers prior to the migration (OS and software install, maybe setting up basic schemas). You're claiming that it should take less than 8 hours to:

    • Clone current data from old servers to new servers.
    • Switch from old servers to new servers.
    • Verify that everything is working properly.
    • And still have enough time to correct any issues or roll back to the old servers as appropriate.

    And you're making this assertion without knowing how many servers @Weng is talking about, or how many people are going to be available to help with said migration.

    Have I approximated your claim?



  • @botondus is TRWTF

    <!-- Posted by SockBot 0.16 "Hazardous Hera" on Mon, 20 Apr 2015 21:52:15 GMT-->
