Storage in cost per giga-NULL



  • Every DBA has a story about some vendor's ... er ... interesting system. I think mine is getting to the point where it could fill a book, but here's just a snippet:

    DB Server has performance problems. Low 'page life expectancy' is in evidence -- for the non SQL-Server folks, that means the needed data won't fit in the RAM cache, which means disk I/O, or what I like to call I/SLOW.

    So we start to look at the tables. The largest table contains 96 columns and 16 million rows. No big thing. Ah, but look closer: of those columns, about half are totally empty of values. Within the remaining half, only 18 contain data in more than half the rows. Other tables look similar. Hmm.

    So we say, "Shucks, golly, Mr. Vendor, the database might run faster if its RAM were not 80% full of NULLs at any given time. And the DB would be a few tens of gigabytes smaller. And we might not fill the whole storage array in the next twelve months."

    Next came a few rounds of, "no it's not." "yes it is." "no it's not." "yes it is." ("it's only a flesh wound, come back here! I'll bite your leg off!")

    Then came their solution: "Upgrade to the next version." (Surprise.)

    The "Next Version" added 30 more empty columns to said transaction table, and added two more tables filled essentially with copies of that data to speed up reports (which reports could not be run on the actual data because the design of the tables is so poor, and the query plans are a nightmare rat maze of table-scanning horror.) The DB jumped in size from 40 to 70 gigs the weekend of the upgrade and now appears as if it will grow at ten percent per month, rather than the previous 2-3 percent, which was already a farce given what this application does.

    And the cool part: there are "no plans," says the vendor, "to change the schema of the existing tables." Awesome.

    Right now (I cringe each time I think of it) the server is madly swapping approximately 40 gigabytes of zeros and nulls in and out of a 1.6 GB cache, and we're getting ready to budget for the purchase of more drives to store more gigs of nulls and zeros next year. This rocks.

    The vendor does say we can delete transactions, though, as a workaround. Awesome.



  • Anything preventing you from doing a DELETE FROM ungodly_table WHERE col_1 IS NULL AND col_2 IS NULL AND .... col_n IS NULL?



  • Alas, each row does represent a transaction. It's just got about 10 columns with values and 120 that are empty. We really need "alter table drop column worthless_junk" about 120 times, but it'll break the application.
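    For the morbidly curious, here's a minimal sketch of what confirming and cleaning that up would look like. Purely hypothetical: the table and column names below are stand-ins, not the vendor's actual schema.

    -- COUNT(column) skips NULLs, so a result of 0 means the column is pure dead weight.
    SELECT
        COUNT(col_a) AS col_a_values,
        COUNT(col_b) AS col_b_values,
        COUNT(*)     AS total_rows
    FROM dbo.ungodly_table;

    -- And then, if the application could cope (it can't):
    -- ALTER TABLE dbo.ungodly_table DROP COLUMN col_a;
    -- ...repeated roughly 120 times.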



  • A transaction table having 120 columns?  Sounds like someone doesn't understand the concept of normalization. 

    I can feel your pain, though; I've worked in systems where the Invoice table references the Order table through a varchar column.  Since the Order table's primary key is numeric, you had to cast it to a varchar for the join.  Those reports took forever to run.
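    For anyone who hasn't seen it, the join ends up looking roughly like this (table and column names are invented for illustration):

    SELECT i.InvoiceNo, o.OrderDate, i.Amount
    FROM dbo.Invoice AS i
    JOIN dbo.[Order] AS o
        ON i.OrderRef = CAST(o.OrderNo AS varchar(20));
    -- Wrapping the numeric key in CAST means the predicate isn't sargable against
    -- the Order primary key, so the join tends to fall back to scans instead of seeks.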




    That's pretty sweet too. The closest thing I can find over here is a bunch of numeric(18) columns that store boolean values. I guess that's in case they need a numeric range of 18 digits for future possible values of true and false. Hey, at least we'll be ready when that happens. The storage will already be allocated, no need to go out and buy 24 more disks...
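    For the curious, the arithmetic (as I understand SQL Server's storage rules) is that a numeric(18, 0) column costs 9 bytes per row, while bit columns get packed 8 to a byte. A made-up illustration:

    -- Hypothetical table, just to show the two flag definitions side by side.
    CREATE TABLE dbo.FlagStorageDemo
    (
        Id           int IDENTITY(1, 1) PRIMARY KEY,
        bIsActive18  numeric(18, 0) NULL,  -- 9 bytes per row for a true/false flag
        bIsActiveBit bit            NULL   -- up to 8 bit columns share a single byte
    );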



    I can see that type of data model if it stores bitmasks, but if it's just storing 0 and 1 then someone needs to be shot.

    I just remembered some of my other "favorite" things about the app I mentioned.  First, it allowed 0 for an item quantity, so I could create an order with 1000 line items, each with a quantity of 0.  Who in their right mind orders 0 of something?  Second, all of the free-form notes attached to orders, invoices, etc. were stored in tables like the following:

    UniqueID    Number          (InvoiceNo, OrderNo, etc.)
    PageNo      Number          (1 - 999)
    LineNo      Number          (1 - 50)
    Text        Varchar(100)    (The actual text for that line)

    That really made my day when I was asked to put special order notes on reports, especially when you could have notes on page 200 but nothing on pages 1-199, and the same for line numbers.  I don't even want to think about the code required if you insert text into an existing line and it overflows, cascading across a few lines.
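    Getting those notes onto a report ends up being something like this (table and column names are hypothetical, mirroring the layout above):

    -- Pull every note fragment for one order, in page/line order.
    DECLARE @OrderNo int = 12345;    -- whatever order you're reporting on
    SELECT n.PageNo, n.LineNo, n.[Text]
    FROM dbo.OrderNote AS n
    WHERE n.UniqueID = @OrderNo
    ORDER BY n.PageNo, n.LineNo;
    -- The report layer then glues the 100-character fragments back together
    -- and copes with page 200 existing while pages 1-199 don't.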

    I have plenty of other WTFs from that piece of crap, but I've done my best to purge them from active memory.



  • Nice.

    Alas, no bitmasks. The boolean columns have these rather quaint properties:

    1. Valid values are 0, -1, and null - where 0 and null mean false and -1 means true.
    2. There are no constraints, though. Too slow, apparently.
    3. They are typically named using double negatives, e.g. isNull( bInactive, 0 ) = 0 means that inactive is false (the thing represented is "active"), while bInactive = -1 means true (the thing is inactive).

    I LOVE that. Makes for very clear and simple code.
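    To spell out just how "clear and simple" that is, every query for active rows ends up reading like the following (table and column names invented; only the 0 / -1 / NULL convention is real):

    -- Find the rows that are not not-active, treating NULL as false.
    SELECT c.CustomerID, c.CustomerName
    FROM dbo.Customer AS c
    WHERE isNull(c.bInactive, 0) = 0;
    -- Bonus: wrapping the column in isNull() also tends to block an index seek
    -- on it, which pairs nicely with "constraints are too slow".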



    If null is the absence of something, and zero represents nothing, then storing nothing billions of times should amount to storing nothing, which should take no time and no space! So why the problem? </smirk>



  • Some databases actually do have cheap NULL storage. They use a bitmap in the row header that says which columns are NULL. Therefore each column costs 1 bit + the size of the column's data type unless the column is NULL in which case the cost is only one bit. I have no idea whether SQL Server is implemented this way.

    Depending on schema, it can be a huge win if the database storage is structured this way and you have columns with a single common value that can be represented as NULL.

    On the other hand, it makes some code somewhere painfully complex--either the application needs to know that "NULL is zero" or "NULL is true" or "NULL is 3 (3 being the most common value of whatever column)", or you need to use views everywhere to translate the NULL value into something sane from the application's point of view. It's not something to bother with unless you're talking about saving double-digit percentages of the total storage size, and you own the application and employ all of its users personally.

    I kind of suspect that it still doesn't help in your vendor's application's case though.
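    If anyone were tempted by the view trick mentioned above, it would look roughly like this (purely a sketch with made-up names; I'm not claiming this is what SQL Server or this vendor actually does):

    -- Base table stores NULL wherever the status has the overwhelmingly common value 3;
    -- the view hands the application a sane, non-NULL picture of the data.
    CREATE VIEW dbo.OrderForApp
    AS
    SELECT
        OrderID,
        isNull(StatusCode, 3) AS StatusCode
    FROM dbo.OrderRaw;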



  • @Zygo said:

    Some databases actually do have cheap NULL storage. They use a bitmap in the row header that says which columns are NULL. Therefore each column costs 1 bit + the size of the column's data type *unless* the column is NULL in which case the cost is only one bit. I have no idea whether SQL Server is implemented this way.

    SQL Server does use a null bitmask.




  • @Grauenwolf said:

    @Zygo said:

    Some databases actually do have cheap NULL storage. They use a bitmap in the row header that says which columns are NULL. Therefore each column costs 1 bit + the size of the column's data type unless the column is NULL in which case the cost is only one bit. I have no idea whether SQL Server is implemented this way.

    SQL Server does use a null bitmask.


    That is a great blog. I've been reading it -- when I have the brain power (usually a little at a time, mornings :-)

    While there is a null bitmask, I don't think that SQL Server does what Zygo describes. I believe a record consisting of fixed-width columns (char, datetime, int, whatever) will have space allocated even in cases where the item is NULL. Each column will also be represented in the NULL bitmask, which indicates whether the item is null or not null.

    One test I've done for this is to make two test tables with a large number of fixed-width columns and check them with sp_spaceused. I could be wrong, but I'm pretty sure the nulls take the same amount of space, for a fixed width datatype, as the columns with values. The size reported for each table is the same.
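    In case anyone wants to repeat the test, it was along these lines (the table and column names here are just for illustration):

    -- Two tables with identical fixed-width schemas.
    CREATE TABLE dbo.NullTest_Values (c1 int, c2 int, c3 datetime, c4 char(10));
    CREATE TABLE dbo.NullTest_Nulls  (c1 int, c2 int, c3 datetime, c4 char(10));

    -- Load the same number of rows into each, real data in the first and all
    -- NULLs in the second, then compare:
    EXEC sp_spaceused 'dbo.NullTest_Values';
    EXEC sp_spaceused 'dbo.NullTest_Nulls';
    -- Both report essentially the same data size, because fixed-width columns
    -- reserve their space whether or not the value is NULL.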

    Also, the blog entry has this to say (my emphasis):

    "

    Record structure

    ...

    The record structure isn't relevant to a discussion on fragmentation but is for CHECKDB internals, so here it is:

    • record header
      • 4 bytes long
      • two bytes of record metadata (record type)
      • two bytes pointing forward in the record to the NULL bitmap
    • fixed length portion of the record, containing the columns storing data types that have fixed lengths (e.g. bigint, char(10), datetime)
    • NULL bitmap
      • two bytes for count of columns in the record
      • variable number of bytes to store one bit per column in the record, regardless of whether the column is nullable or not (this is different and simpler than SQL Server 2000 which had one bit per nullable column only) 
    ..."



    I see what you mean. I only use variable-length strings, so I wasn't counting that as a big cost. I can really see how fixed-length strings can be a problem. And with that many rows, even a bunch of null ints will be costly.





    Hey, you just need to change the storage to a filesystem with automatic compression. I bet with the right algorithm, or the right configuration, the compression/decompression time would be worth trading for the massive amount of space being wasted.


