Okay, DBAs

sootzoo

I'm tasked with transitioning a non-trivial web app from ASP (I know...) / Access (I know!) to .NET / SQL Server. As it stands, every customer using the app currently has their own Access database with exactly the same schema, and the app connects to the appropriate client DB by looking an ID# up in a master DB accounts table with the accompanying file name to use in its client database connection string. Worked "well" with 50 or so customers. Now that we're eclipsing that several times a month, a redesign is probably in order.

In SQL Server, I'd love to have one schema (one copy of the dozen or so core tables in one database, currently repeated in every client database) and dump data from all the customers into that table set, adding a column for the client identification. This could, I assume, with intelligent indexing, have some serious performance benefits. Security pitfalls notwithstanding, I was wondering if anyone saw any obvious drawbacks to doing things this way. Each table could have potentially millions of rows at any given moment, and in production, we're talking several dozen to a few hundred inserts and deletes per second.

The app in question will run on one server mostly dedicated to ASP.NET apps we've written. The data files will move from a separate folder on the same machine to one database on a dedicated SQL Server box with the best specs out of any of our hosted machines. For this app, I'm limited to just those two boxes.

What would your suggestions be for the best way to optimize performance in this situation? I assume SQL can handle the type of throughput we want to achieve -- hell, the ASP/Access version is nearly achieving it now, but we're starting to get very sluggish response times (expected on the old architecture). I'm nervous about performance bottlenecks but I'm not a DBA pro and I want to know at least where I can expect to see problems / limitations.

If anyone is about to suggest that TRWTF is that we did not immediately call up Microsoft and ask for SQL Server consulting, be assured that we'll go that route or something similar before deploying the thing, but I'm looking to spot any obvious WTFs with this approach upfront if possible.

Thanking you all in advance,
sootzoo.

CodeWhisperer

It's hard to say if you'd get a performance benefit from this or not, because you'd be constantly working with subsets of the data that are non-contiguous. Further, you'd run the security risks, and the possibility you could destroy all of your customer's data at once rather than just one customer at a time :)

I'll defer to the more db oriented guys here, but you _can_ still just maintain separate DBs for each customer, and have the connection string set according to a similar scheme you have now. I guess you might lose some of the benefits of connection pooling, but that can probably be mitigated. But you also get the ability to move dbs to other servers -- say you have one or two really busy customers who are constantly churning on their data, you could move them to a separate server and isolate that churn from your 100+ other customers.

It also makes it easier to remove an old customer -- no worries about deleting records out of the middle of big DBs while others are acting on it.

I suppose a lot depends on your hardware, I can't say I've ever run SQL server with 100+ databases all going at once. Like I said, one of the guys who actually spends their days in DBs will probably have more comprehensive advice, I just shudder at the idea of merging all your customer's data together -- and I suspect they might shudder too :)

-cw

dhromed

@CodeWhisperer said:

I just shudder at the idea of merging all your customer's data together -- and I suspect they might shudder too :)

But as the app is "non-trivial", I'd suggest doing exactly that. A massive merger operation, and then after the dust has settled and the magma hardened, you have fertile ground on which to build an empire. If that system of "everybody his own DB" doesn't die soon, the app will probably implode and take the company with it.

@CodeWhisperer said:

I suppose a lot depends on your hardware, I can't say I've ever run SQL server with 100+ databases all going at once.

Our current setup is somewhere around 300+ DBs. I'm not DBA enough to say whether that's "moronic" or, "hey, can't hurt," but it runs smoothly, and any performance troubles we've had are the result of improper hosting (every so often, my SAs badger my boss to invest in proper, core-business hosting, but it's an issue with an inexplicably tremendous inertia), and have absolutely nothing to do with the SQL server. In fact, the SQL server has so far proven to be the most robust part of our hosting setup.

ammoQ

I suggest you simply do some benchmark tests of both versions (one schema for all customers / one database instance per customer) before you decide how to do it. If both perform equal, I'd prefer one database instance per customer, since it gives you more security and flexibility. But of course you have to consider topics like "How will I install upgrades on 300 similiar databases?" - even if it only takes 30 minutes per database, this means one month of work... This is definitely easier if you have just one instance, one schema.

KenW

Haven't any of you read the most recent WTF (http://thedailywtf.com/Comments/Roll_Your_Own_Clustered_Index.aspx)? It discusses the exact situation the OP here is asking about, and explains exactly why one DB per customer (or even one table per customer) is totally wrong.

Having all of the data in a single DB with a customer ID is the proper way to do this, and with appropriate indexes the performance will be much better than having 1000s of individual tables in a single DB.

ammoQ

@KenW said:

Haven't any of you read the most recent WTF (http://thedailywtf.com/Comments/Roll_Your_Own_Clustered_Index.aspx)? It discusses the exact situation the OP here is asking about, and explains exactly why one DB per customer (or even one table per customer) is totally wrong.
Having all of the data in a single DB with a customer ID is the proper way to do this, and with appropriate indexes the performance will be much better than having 1000s of individual tables in a single DB.

One table per customer is definitely stupid. Big pain, no gain.

Whiskey_Tango_Foxtro

@sootzoo said:

Security pitfalls notwithstanding, I was wondering if anyone saw any obvious drawbacks to doing things this way.

Not so far. That would probably be the best way to do things, although I'd suggest re-approaching the schema with that in mind, rather than just tacking a column onto each table.

@sootzoo said:

Each table could have potentially millions of rows at any given moment, and in production, we're talking several dozen to a few hundred inserts and deletes per second.

Ok, *now* I see a drawback. I'm obviously without a sense of context here, but this -- on the surface -- appears insane. Is your code doing this, or are you just talking about general users doing what users do? (i.e. create a new product/item/whatever, then thinkbetter of it and delete it). Take a good, *hard* look at your database, and what it's supposed to do, and ask yourself "do I ***REALLY*** need to constantly insert and delete from the database?" Seriously, why do you need to insert and delete all the time? I'm sure there has to be a better way. It's one thing if it's just user's doing what users do, it's an entirely different puppy if your code does this as part of "normal operations". If it's the latter, I again suggest and highly encourage approaching your problem from another direction.

@sootzoo said:

What would your suggestions be for the best way to optimize performance in this situation? I assume SQL can handle the type of throughput we want to achieve -- hell, the ASP/Access version is nearly achieving it now, but we're starting to get very sluggish response times (expected on the old architecture). I'm nervous about performance bottlenecks but I'm not a DBA pro and I want to know at least where I can expect to see problems / limitations.

SQL Server is a pretty powerful RDBMS. It should be able to handle your throughput with ease -- provided you don't do something inappropriate (see my previous paragraph). The problem with constantly inserting and deleting is that SQL Server will end up spending more time re-indexing than it will serving up queries. What you really want to do is optimize your database for queries and limit updates to "fringe use". Yes, you'll still have inserts and deletes and that's fine and won't bog down the server, but if your code inserts into temporary tables then deletes them -- perhaps as a means of implementing transactionality -- you need to re-think things.

If you wish to optimize performance, just "do things right". Approach your database re-design from a relational standpoint. Normalize it -- get it into at least third normal form, Boyce-Codd if possible. Utilize it appropriately in your code -- optimize your UI for queries, cache locally, avoid temporary tables, use transactions for updates. The normal Good Thing(tm)s.

CodeWhisperer

@ammoQ said:

One table per customer is definitely stupid. Big pain, no gain.

I was actually referring to a separate database, not a separate table. This would be the closest analog to what the OP is already doing, but it's big and I don't know what SQL server is like if you try doing that.

Merging data has it's benefits, but it's drawbacks too.

-cw

sootzoo

@Whiskey Tango Foxtrot? Over. said:

@sootzoo said:
Each table could have potentially millions of rows at any given moment, and in production, we're talking several dozen to a few hundred inserts and deletes per second.

Ok, *now* I see a drawback. I'm obviously without a sense of context here, but this -- on the surface -- appears insane. Is your code doing this, or are you just talking about general users doing what users do? (i.e. create a new product/item/whatever, then thinkbetter of it and delete it). Take a good, *hard* look at your database, and what it's supposed to do, and ask yourself "do I ***REALLY*** need to constantly insert and delete from the database?" Seriously, why do you need to insert and delete all the time? I'm sure there has to be a better way. It's one thing if it's just user's doing what users do, it's an entirely different puppy if your code does this as part of "normal operations". If it's the latter, I again suggest and highly encourage approaching your problem from another direction.

It's an e-commerce app. Fictional scenerio: the app is designed to capture PayPal IPN responses (transfer from customer account successful), provide them to an offline aggregator / reporting service via SOAP webservice calls, which then disposes of the original payment information -- security, you know. Pretend that we're breaking down things like the customer e-mail into one table, the transaction ID into another table, etc. Now have the app service about 5,000 websites where that's the only payment option. There are going to be some lulls and spikes in traffic, but in general, you've got almost constant traffic from PayPal (inserts) and frequent traffic from each client's aggregator (select, delete). No temp tables involved, but the residence of that data is somewhere between minutes and hours -- a few days, max -- for any given customer.

@Whiskey Tango Foxtrot? Over. said:

The problem with constantly inserting and deleting is that SQL Server will end up spending more time re-indexing than it will serving up queries.

I was afraid of this. Limiting to primary keys - let's say, a customer ID and a transaction ID - generally would work for all the major tables we use. A two-column clustered index over 50 million rows is probably safe, yeah?

@Whiskey Tango Foxtrot? Over. said:

If you wish to optimize performance, just "do things right". Approach your database re-design from a relational standpoint. Normalize it -- get it into at least third normal form, Boyce-Codd if possible. Utilize it appropriately in your code -- optimize your UI for queries, cache locally, avoid temporary tables, use transactions for updates. The normal Good Thing(tm)s.

Very helpful stuff - thanks!

ammoQ

@CodeWhisperer said:

@ammoQ said:
One table per customer is definitely stupid. Big pain, no gain.

I was actually referring to a separate database, not a separate table. This would be the closest analog to what the OP is already doing, but it's big and I don't know what SQL server is like if you try doing that.

Merging data has it's benefits, but it's drawbacks too.

-cw

One seperate database per customer could be acceptable, depending on a lot of things we both don't know. ;-)

Whiskey_Tango_Foxtro

@sootzoo said:

It's an e-commerce app. Fictional scenerio: the app is designed to capture PayPal IPN responses (transfer from customer account successful), provide them to an offline aggregator / reporting service via SOAP webservice calls, which then disposes of the original payment information -- security, you know. Pretend that we're breaking down things like the customer e-mail into one table, the transaction ID into another table, etc. Now have the app service about 5,000 websites where that's the only payment option. There are going to be some lulls and spikes in traffic, but in general, you've got almost constant traffic from PayPal (inserts) and frequent traffic from each client's aggregator (select, delete). No temp tables involved, but the residence of that data is somewhere between minutes and hours -- a few days, max -- for any given customer.

Hmm... Ok, I have a high-level picture here. Regarding security... IANAL, and I don't know enough about Sarbanes Oxley to be sure here, but I think you're supposed to *retain* transactions for accounting purposes, not delete them. I believe you are supposed to separate the transaction from the personal details of the transactor, which you've already done if you normalized the database correctly.

I'm not saying you *don't* have to delete the records... just questioning. If you still think a full delete is called for, I suggest you consider a flag and delete later scheme. An update of a record, as long as you don't modify an index, is relatively harmless. You could put a Deleted column into your transaction table and update *that* when you "delete" the record. Then, at the end of the day, week, whatever, run a single query that deletes every record that was flagged as "deleted". The index is only updated once, and you have a very performant database with an occasional slowdown at 2AM.

lpope187

So if I understand correctly, the process is somewhat like the following

PayPal request --> Queue Table --> Payment Processed --> Transaction History Log (with sensitive information removed)

So the queue table should have relatively few records since they get cleaned up after successful processing. The problem is the amount of data in the Transaction History Log - it is the table with the millions of rows. So the problem is aggregating the details of the Transaction History for each customer by some time frame. I smell an aggregation service that runs nightly/hourly that does the precalcs for you. IMO, OLTPs should make data modifications fast - If you need fast reporting, build a data mart / warehouse or belly up to the bar and cluster the thing.

sootzoo

A delete flag / scheduled delete would be easy to implement - sounds perfect here. Thanks again!

sootzoo

@lpope187 said:

So if I understand correctly, the process is somewhat like the following
PayPal request --> Queue Table --> Payment Processed --> Transaction History Log (with sensitive information removed)
So the queue table should have relatively few records since they get cleaned up after successful processing. The problem is the amount of data in the Transaction History Log - it is the table with the millions of rows. So the problem is aggregating the details of the Transaction History for each customer by some time frame. I smell an aggregation service that runs nightly/hourly that does the precalcs for you. IMO, OLTPs should make data modifications fast - If you need fast reporting, build a data mart / warehouse or belly up to the bar and cluster the thing.

Sort of. I may have obfuscated the example a bit too much to be helpful.

The reporting is of less actual concern here. It's better to think of the thing as a fairly dumb XML parser and data dump. "PayPal" would cause data to be inserted, each client would need to quickly output any "transactions" relevant to them. The current implementation cleans up after itself, deleting the successfully transmitted data remotely. The client pulls are not under our control - we want the data dumped by PayPal to be available to them as quickly as possible, so nightly aggregations / precalcs aren't really what I'm after here. The major concern I had was if there were any penalties for a large number of inserts, selects and deletes on tables that large. I'm pretty confident that we can make inserts and selects relatively painless. WTFO raised a concern about lots of simultaneous deletes, but I think that's easily solved with a delete flag and nightly maintenance plan to purge the flagged records.

Clustering, sadly, is not an option here. Not enough money in the budget, such as it is.

I guess my last concern here is transactions - any .NET / SqlClient gurus in the house? - if I do transacted inserts, does that not lock the tables in question until the transaction is committed? Is this a per-connection thing (in which case, no big deal) or table-level (which would be disastrous)?

lpope187

Okay that makes more sense. Just a queue table with variable client enabled polling.

In terms of transactions and locking, that is a complex thing - one could write a whole book on how Sql Server implements it. I'll try to give you a good overview without going too far into the details. First, all modifications require obtaining a lock on the required resources. The optimizer will determine the best level for the situation (row, page, extent, table, etc). Under normal circumstances, a single insert will just require a few row locks but there are times that the engine will escalate to a table lock.

By default, SQL2005 escalates locks when the number of locks on a single resource exceed 5,000 - A resource is a table or index, different indexes equal different resources The other time lock escalation occurs is when the number of locks on the instance exceeds the lock limit. Again by default, this limit is determined by dynamic memory allocation - the database engine will automatically allocate more memory for locks when it reaches 40% utilization. The engine will continue to allocate more lock memory until it uses 60% of the total available to the instance, at which point the allocation fails and lock escalation occurs.

To mitigate lock escalation, you should make your transactions as quick as possible, few statements and few indexes on the tables your editing. You can also use an isolation level that doesn't allocate shared locks for reads (snapshot and read committed with READ_COMMITTED_SNAPSHOT). Snapshot works by versioning rows - a transaction reads only the data that was committed when it started but requires more resources to manage all the row versions. Snapshot is good when you absolutely need consistent data, have long running aggregates, and/or have deadlock issues.

My gut tells me to whip up a quick test script to test the concept on a heavy load and see if you have any issues. Better to spend a few days on the concept now then to find out you have issues when your in production.