Can we shut down production for a week?



  • Our system runs 7x24. It is our sole source of income. Getting downtime approved is a huge deal.

    We are finally upgrading our Oracle RAC setup from 10g to 11g. The idiot DBAs use Sungard. Note: that's not a shot at Sungard or DBAs in general, just the DBAs that work here.

    They just called and asked me if I can take down our production environment for a full week so they can set up Sungard.

    Erm, other businesses use Sungard and have managed to migrate without shutting down their primary customer-facing, revenue-generating system for any length of time; perhaps you should investigate further.

    But we need a week to do it!

    My boss is out of town for the holiday weekend, but approval has to come from several levels above him; please send an email to the whole hierarchy requesting that they close the business for a week so you can do your work.

    He does.

    The response comes fast and furious, and for your convenience, I will summarize: WTF?!

    In my boss' absence, I respond that other businesses have done this exact migration without major interruptions; perhaps we (i.e., the DBAs) should investigate how they pulled off this feat before shutting down our bread-and-butter operation.

    The emails that shot back and forth after that went something like this:

    Management to Head DBA: investigate how other companies did it

    Head DBA: we need a week

    M: Investigate 

    D: we need a week

    M: INVESTIGATE NOW (implied: you pin-headed twit)

    D: (meekly) ok.

     



  • @snoofle said:

    D:

    Yes, exactly.



  • This could be a good thing!  Think about it: this could expose them for the useless sacks of flesh they really are!

    Oh wait, this is reality, isn't it?  Hmm, I predict that particular DBA will get a raise and a bonus.



  • I am scared to think what voodoo magic they would be doing to the production database over that week.  I bet that by day 4 they would have somehow managed to corrupt all the production data.

    Our program, which is not as large, is also going through the fun of creating a migration plan from Oracle 10 to 11. They will only have one night to perform the actual switchover, and when I say one night, it's really a 4-hour slot; leadership would start to get worried and start pestering after the 2-hour mark, so they really only have 2 hours to do it and 2 hours of backup time.


  • @snoofle said:

    Our system runs 7x24...
    Minor picking of nits - you appear to have very short days and very long weeks....



  • No, the weeks are normal length; it's just the days that are short!


  • Seems too late for me to delete/edit that post without it being noticed, since I got it so wrong myself; at least the meaning wasn't totally lost...



  • @PJH said:

    @snoofle said:
    Our system runs 7x24...
    Minor picking of nits - you appear to have very short days and very long weeks....
    He could have just said 168.



  • @El_Heffe said:

    He could have just said 168.

    So it doesn't even run for half of the year?! :P



  • @snoofle said:

    Our system runs 7x24. It is our sole source of income.

    You have been brainwashed into the mindset.  It's not your system.  It's not your sole source of income.  You have been suborned.



  • @C-Octothorpe said:

    Think about it: this could expose them for the useless sacks of flesh they really are!

    Wait, I have loads of sacks of flesh.  Are you really saying they're useless?



  • @PJH said:

    @snoofle said:
    Our system runs 7x24...
    Minor picking of nits - you appear to have very short days and very long weeks....
    It's summer, when the nights are very short.

    On the other hand, they're also very wide.



  • @snoofle said:


    We are finally upgrading our Oracle RAC setup from 10g to 11g. The idiot DBAs use Sungard. Note: that's not a shot at Sungard or DBAs in general, just the DBAs that work here.

    They just called and asked me if I can take down our production environment for a full week so they can set up Sungard.

    Hmm...previous stories suggest that you guys have a variant of a Developmestuction environment. The 10g to 11g upgrade may be a much bigger problem than waiting for Sungard, etc. I know that we had several issues with the upgrade (and then more when we went from 11.0.1 to 11.0.2...or whatever). Have you guys done testing in an 11g environment?



  • @da Doctah said:

    @PJH said:

    @snoofle said:
    Our system runs 7x24...
    Minor picking of nits - you appear to have very short days and very long weeks....
    It's summer, when the nights are very short.

    On the other hand, they're also very wide.

    Some people are afraid of heights.  I'm afraid of widths.

     



  • @PJH said:

    @snoofle said:
    Our system runs 7x24...

    Minor picking of nits - you appear to have very short days and very long weeks....

    Probably just trying to sound more European or something by writing date type stuff backwards. At least he didn't say 365x7x24.



  • @boomzilla said:

    Hmm...previous stories suggest that you guys have a variant of a Developmestuction environment.
    True. But in this case, it's more about replicating our 40TB db under a new Sungard setup (I have no clue how it works so I need to rely on the DBAs, but I have to believe there's a way to do it in a reasonable window; a week just seems way too long).

    @boomzilla said:

    Have you guys done testing in an 11g environment?

    Extensively - done by me personally. We found a few (Java) driver issues, but have a working solution. I made a friend on the QA team, and whenever I want to force thorough testing, I just walk over and let him know what's what. He does the rest.



  • @boomzilla said:

    @PJH said:
    @snoofle said:
    Our system runs 7x24...

    Minor picking of nits - you appear to have very short days and very long weeks....

    Probably just trying to sound more European or something by writing date type stuff backwards. At least he didn't say 365x7x24.

    Yes, it would have been silly for him to suggest that the system only works for a total of 7 years.



  • @dtech said:

    @boomzilla said:
    @PJH said:
    @snoofle said:
    Our system runs 7x24...

    Minor picking of nits - you appear to have very short days and very long weeks....

    Probably just trying to sound more European or something by writing date type stuff backwards. At least he didn't say 365x7x24.

    Yes, it would have been silly for him to suggest that the system only works for a total of 7 years.

    I'd give an estimate of closer to one and a half.


  • @DaveK said:

    @snoofle said:

    Our system runs 7x24. It is our sole source of income.

    You have been brainwashed into the mindset.  It's not your system.  It's not your sole source of income.  You have been suborned.

     

    Why do you say so? It is the system that earns money to pay for his work. As Snoofle doesn't seem to have any other client at this time, it is probably his sole source of income.

     



  • More nitpicking. It's my understanding that there are 52 weeks in a year, not 365. So that should read 24x7x52 (yeah, I know some years have more than 52 weeks). To be truly pedantic, 24x365.25 would be the most accurate.


  • @DCL said:

    To be truly pedantic 24x365.25 would be the most accurate.

    But this assumes a leap year happens every fourth year, not every fourth year except every hundredth, except every four hundredth.

    I pasted the following in Firebug to calculate a more accurate number of days per year:

    function isLeap( year ) {
      if( ( year % 400 ) === 0 ) return true;
      else if( ( year % 100 ) === 0 ) return false;
      else if( ( year % 4 ) === 0 ) return true;
      else return false;
    }

    var count = 0;

    function calc( year ) {
      if( isLeap( year ) ) count++;
      if( ( year % 501 ) === 0 ) console.log( year + ': ' + ( 365 + ( count / year ) ) );
      setTimeout( function() { calc( year + 1 ); }, 0 );
    }

    calc( 1 );

    It seems to fall somewhere between 365.2424 and 365.2426 days per year.



  • @snoofle said:

    @boomzilla said:
    Hmm...previous stories suggest that you guys have a variant of a Developmestuction environment.
    True. But in this case, it's more about replicating our 40TB db under a new Sungard setup (I have no clue how it works so I need to rely on the DBAs, but I have to believe there's a way to do it in a reasonable window; a week just seems way too long).

    <twitch>Install the software on the new cluster.  Load the data from a backup of production.  Load the data changes since the backup.  Start the downtime.  Switch the production cluster out for the new cluster.  Load the data changes from the first re-up to the start of downtime.  Make sure everything works that you can without ending downtime.  End the downtime.  Make sure everything works.  Turn off the old hardware.</twitch>

    At least, this is what I've seen done on MySQL, MSSQL, PostgreSQL, Sun LDAP, OpenLDAP, Netscape LDAP, simple Oracle (no Sungard, no RAC), and at least a couple of environments I've completely forgotten about.  I've also seen places use a little longer downtime window, and only do a single data re-up pass.

    Now, if you are not running a production-worthy environment, there are one or two additional steps you need to insert into the process at some place.  In no particular order:

    • Turn off change logging.
    • Turn on change logging.

    Without one of those steps added at the appropriate place, two of the steps in my process above (or possibly one, if you back up often, or like longer downtimes) are very tricky.

    I don't know Sungard, and I don't really know RAC, but I can't imagine a software change that does not break everything horribly that would break that basic process.  This could be a limitation of my imagination.  Now, I have seen a number of times where the DBAs insisted that the above process would not work for their database due to some addon or another they had, but every time, it actually came down to them not having a change log running, so any major database corruption would result in a loss of all data since the last backup.  And, in every one of those cases, they were at best doing one backup a month.  And, finally, in every one of those cases, there was a VP that someone involved was able to find and inform, who was properly livid at the thought of possibly losing a month of production data because the DBAs wanted to shave 1% (or less) off of their system resource usage.

    Disclaimer: The vast majority of the migrations I mentioned above were entirely performed by other people.  I merely spectated, sometimes even vicariously.
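
    The sequence above can be sketched as a toy script. Every name below is invented for illustration (nothing here is a real Oracle or Sungard API), and each step is a stub that only records its own name, so the only thing demonstrated is the ordering: which steps happen before downtime starts.

```javascript
// Toy model of the cutover sequence; stubs only, no real database work.
var log = [];
function step(name) { return function () { log.push(name); }; }

// Everything before "start downtime" runs while the old cluster still serves traffic.
var plan = [
  step("install software on new cluster"),
  step("load data from production backup"),
  step("apply changes made since the backup"),      // first re-up
  step("start downtime"),
  step("swap the clusters"),
  step("apply changes made since the first re-up"), // only a few hours' worth
  step("smoke test"),
  step("end downtime"),
  step("full test"),
  step("turn off old hardware")
];

plan.forEach(function (s) { s(); });
console.log(log.indexOf("start downtime")); // 3: the three long steps precede any downtime
```

    The downtime window only covers the swap, the final (small) delta load, and a smoke test, which is why it is measured in hours rather than a week.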



  • @tgape said:

    @snoofle said:
    @boomzilla said:
    Hmm...previous stories suggest that you guys have a variant of a Developmestuction environment.
    True. But in this case, it's more about replicating our 40TB db under a new Sungard setup (I have no clue how it works so I need to rely on the DBAs, but I have to believe there's a way to do it in a reasonable window; a week just seems way too long).

    <twitch>Install the software on the new cluster.  Load the data from a backup of production.  Load the data changes since the backup.  Start the downtime.  Switch the production cluster out for the new cluster.  Load the data changes from the first re-up to the start of downtime.  Make sure everything works that you can without ending downtime.  End the downtime.  Make sure everything works.  Turn off the old hardware.</twitch>

    At least, this is what I've seen done on MySQL, MSSQL, PostgreSQL, Sun LDAP, OpenLDAP, Netscape LDAP, simple Oracle (no Sungard, no RAC), and at least a couple of environments I've completely forgotten about.  I've also seen places use a little longer downtime window, and only do a single data re-up pass.

    Now, if you are not running a production-worthy environment, there are one or two additional steps you need to insert into the process at some place.  In no particular order:

    • Turn off change logging.
    • Turn on change logging.

    Without one of those steps added at the appropriate place, two of the steps in my process above (or possibly one, if you back up often, or like longer downtimes) are very tricky.

    I don't know Sungard, and I don't really know RAC, but I can't imagine a software change that does not break everything horribly that would break that basic process.  This could be a limitation of my imagination.  Now, I have seen a number of times where the DBAs insisted that the above process would not work for their database due to some addon or another they had, but every time, it actually came down to them not having a change log running, so any major database corruption would result in a loss of all data since the last backup.  And, in every one of those cases, they were at best doing one backup a month.  And, finally, in every one of those cases, there was a VP that someone involved was able to find and inform, who was properly livid at the thought of possibly losing a month of production data because the DBAs wanted to shave 1% (or less) off of their system resource usage.

    Disclaimer: The vast majority of the migrations I mentioned above were entirely performed by other people.  I merely spectated, sometimes even vicariously.

    Moving 40TB is a different beast entirely. Using the backup/restore solution is time-consuming; a decent RAC will usually peak at 1TB per hour. The best way to move that volume of data is to use storage-level replication; if the source and destination are the same SAN, a simple snap copy will do and can be performed in a jiffy, with the front-end ready immediately and the heavy lifting being done in the backend with little performance impact (if it's an enterprise-grade unit). If the destination is a different SAN, then unless there is some kind of storage virtualization available (like Hitachi USPV or IBM SVC), replication will take a while and will also require a bit of downtime to quiesce the source and avoid data corruption.

    So this is no simple task, but if they have the proper hardware and software, and the source and destination are in the same data center, then it could be done in less than a week. They probably keep a big buffer to roll back.
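
    Rough arithmetic on the figures in this post (taking the quoted 1TB/hour restore rate at face value; real throughput varies with hardware and load):

```javascript
// Back-of-the-envelope: how long does a straight restore of the 40TB db take?
var dbSizeTB = 40;            // size of the production database, per snoofle
var restoreRateTBPerHour = 1; // "a decent RAC will usually peak at 1TB per hour"

var hours = dbSizeTB / restoreRateTBPerHour;
console.log(hours + " hours (about " + (hours / 24).toFixed(1) + " days)");
// 40 hours (about 1.7 days): long, but nowhere near a week of downtime
// if the restore runs while the old cluster is still serving.
```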



  • @joe.edwards said:

    But this assumes a leap year happens every fourth year, not every fourth year except every hundredth, except every four hundredth.

    ...

    It seems to fall somewhere between 365.2424 and 365.2426 days per year.

     

    To be precise, (365*400 + 100 - 3) / 400 = 365.2425 days.
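
    That figure can be checked with a single pass over one full 400-year Gregorian cycle (plain JavaScript, nothing beyond what the Firebug snippet above already uses):

```javascript
// Average year length over one full 400-year Gregorian cycle.
function isLeap(year) {
  return (year % 4 === 0 && year % 100 !== 0) || year % 400 === 0;
}

var leapDays = 0;
for (var year = 1; year <= 400; year++) {
  if (isLeap(year)) leapDays++;
}

// 97 leap days per cycle: 100 multiples of 4, minus 4 multiples of 100, plus 1 multiple of 400.
var daysPerYear = 365 + leapDays / 400;
console.log(daysPerYear); // ≈ 365.2425
```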

     



  • @tgape said:

    <twitch>Install the software on the new cluster.  Load the data from a backup of production.  Load the data changes since the backup.  Start the downtime now<explain type="redundant for speakerphone dude">, after you've already loaded all of the data, and loaded the changes that happened while the load was happening.</explain>  Switch the production cluster out for the new cluster.  Load the data changes from the first re-up to the start of downtime.  Make sure everything works that you can without ending downtime.  End the downtime<explain type="redundant for speakerphone dude">, after only needing to load a few hours' worth of changes, switch server identities, and test some stuff</explain>.  Make sure everything works.  Turn off the old hardware.</twitch>

    @Speakerphone Dude said:

    Moving 40TB is a different beast entirely. Using the backup/restore solution is time consuming;

    Sure.  But that's not downtime, if you're doing it right.  That recovery from backup is happening on the new cluster, while the old cluster is still chugging merrily along, blithely unaware that it's about to get the ax.  That's the entire point of the text you were replying to.

    I assumed they weren't sharing a SAN between the old and new environments, because if they were, it's completely trivial.



  • @Speakerphone Dude said:

    Moving 40TB is a different beast entirely. Using the backup/restore solution is time consuming
     

    I wasn't aware it was a move, either.

    I saw it as being a copy (which may take some time), then applying delta changes to the new cluster - the only downtime will be to freeze production whilst you roll forward those incremental changes, and even then Oracle has various ways and means to reduce the time taken by replicating the changes to other instances.



  • @tgape said:

    @tgape said:
    <twitch>Install the software on the new cluster.  Load the data from a backup of production.  Load the data changes since the backup.  Start the downtime now<explain type="redundant for speakerphone dude">, after you've already loaded all of the data, and loaded the changes that happened while the load was happening.</explain>  Switch the production cluster out for the new cluster.  Load the data changes from the first re-up to the start of downtime.  Make sure everything works that you can without ending downtime.  End the downtime<explain type="redundant for speakerphone dude">, after only needing to load a few hours' worth of changes, switch server identities, and test some stuff</explain>.  Make sure everything works.  Turn off the old hardware.</twitch>

    @Speakerphone Dude said:

    Moving 40TB is a different beast entirely. Using the backup/restore solution is time consuming;

    Sure.  But that's not downtime, if you're doing it right.  That recovery from backup is happening on the new cluster, while the old cluster is still chugging merrily along, blithely unaware that it's about to get the ax.  That's the entire point of the text you were replying to.

    I assumed they weren't sharing a SAN between the old and new environments, because if they were, it's completely trivial.

     

     

    Horrible thought: Maybe there is only one production cluster. And the downtime comes from running conversion and test scripts. It is snoofle's source of WTF after all.

     



  • @tgape said:

    @snoofle said:
    @boomzilla said:
    Hmm...previous stories suggest that you guys have a variant of a Developmestuction environment.
    True. But in this case, it's more about replicating our 40TB db under a new Sungard setup (I have no clue how it works so I need to rely on the DBAs, but I have to believe there's a way to do it in a reasonable window; a week just seems way too long).

    <twitch>Install the software on the new cluster.  Load the data from a backup of production.  Load the data changes since the backup.  Start the downtime.  Switch the production cluster out for the new cluster.  Load the data changes from the first re-up to the start of downtime.  Make sure everything works that you can without ending downtime.  End the downtime.  Make sure everything works.  Turn off the old hardware.</twitch>

    At least, this is what I've seen done on MySQL, MSSQL, PostgreSQL, Sun LDAP, OpenLDAP, Netscape LDAP, simple Oracle (no Sungard, no RAC), and at least a couple of environments I've completely forgotten about.  I've also seen places use a little longer downtime window, and only do a single data re-up pass.

    Now, if you are not running a production-worthy environment, there are one or two additional steps you need to insert into the process at some place.  In no particular order:

    • Turn off change logging.
    • Turn on change logging.

    Without one of those steps added at the appropriate place, two of the steps in my process above (or possibly one, if you back up often, or like longer downtimes) are very tricky.

    I don't know Sungard, and I don't really know RAC, but I can't imagine a software change that does not break everything horribly that would break that basic process.  This could be a limitation of my imagination.  Now, I have seen a number of times where the DBAs insisted that the above process would not work for their database due to some addon or another they had, but every time, it actually came down to them not having a change log running, so any major database corruption would result in a loss of all data since the last backup.  And, in every one of those cases, they were at best doing one backup a month.  And, finally, in every one of those cases, there was a VP that someone involved was able to find and inform, who was properly livid at the thought of possibly losing a month of production data because the DBAs wanted to shave 1% (or less) off of their system resource usage.

    Disclaimer: The vast majority of the migrations I mentioned above were entirely performed by other people.  I merely spectated, sometimes even vicariously.

     

    That sounds too complex. They should just take the production database offline for a week. It's simpler, cleaner.

     



  • @boomzilla said:

    Hmm...previous stories suggest that you guys have a variant of a Developmestuction environment.

    -1000 for even attempting to use that awful abortion of a neologism.

     



  • @Zylon said:

    @boomzilla said:
    Hmm...previous stories suggest that you guys have a variant of a Developmestuction environment.

    -1000 for even attempting to use that awful abortion of a neologism.

    Why are you such a tedious person? Oh, wait, we're scoring like golf, right? Thanks!



  • @boomzilla said:

    Oh, wait, we're scoring like golf right?
     

    No, we're not.



  • @Zylon said:

    @boomzilla said:
    Oh, wait, we're scoring like golf right?

    No, we're not.

    Oh, your mistake then...

    @Zylon said:

    @boomzilla said:
    Hmm...previous stories suggest that you guys have a variant of a Developmestuction environment.

    -+1000 for even attempting to use that awful abortion of a neologism.

    FTFY



  • @boomzilla said:

    <strike>-</strike>

    ... Did you just try to strike through a hyphen?



  • @Xyro said:

    @boomzilla said:
    <strike>-</strike>

    ... Did you just try to strike through a hyphen?

    Thanks for noticing.


  • @Xyro said:

    @boomzilla said:
    <strike>-</strike>

    ... Did you just try to strike through a hyphen?

    Yeah, he should be using the semantic <del> and <ins> tags! (Which aren't styled here.)



  • @boomzilla said:

    @Xyro said:
    @boomzilla said:
    <strike>-</strike>

    ... Did you just try to strike through a hyphen?

    Thanks for noticing.

     

    I noticed too. :(

