Database Design



  • We have a database with about 100 tables comprising multiple TB of data. Most of the tables are exceedingly normalized. So you have table A with some values and an Id. Ditto for table B. Then a mapping table ABxref with just the Ids of stuff that references both tables. This is done in 99% of the cases.

    Then I find one table with a column: listOfCustomers, which contains a comma separated list of hundreds of customer Ids; sometimes with spaces after the commas, sometimes before, sometimes both; sometimes with multiple commas in a row (sometimes with intervening spaces but no Ids). And the code to access it parses out the string and iterates over it - both in stored procedures and in Java code.

    Sigh.
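
    To put the contrast in code - a rough sketch only, with made-up table and column names rather than the actual schema, and PostgreSQL syntax for the string splitting - here is the xref pattern the other 99% of tables use, next to what the listOfCustomers column forces on every read:

    -- The pattern the rest of the schema follows (sketch): two entity
    -- tables plus a mapping table holding only the pair of Ids.
    CREATE TABLE A (
        id integer PRIMARY KEY,
        value text NOT NULL
    );

    CREATE TABLE B (
        id integer PRIMARY KEY,
        value text NOT NULL
    );

    CREATE TABLE ABxref (
        a_id integer NOT NULL REFERENCES A (id),
        b_id integer NOT NULL REFERENCES B (id),
        PRIMARY KEY (a_id, b_id)
    );

    -- The rogue table (hypothetical name and shape).
    CREATE TABLE offending_table (
        id integer PRIMARY KEY,
        listOfCustomers text NOT NULL
    );

    -- Every single read has to split on commas, trim the stray spaces,
    -- and discard the empty entries left by doubled commas.
    SELECT trim(cust) AS customer_id
    FROM offending_table,
         regexp_split_to_table(offending_table.listOfCustomers, ',') AS cust
    WHERE trim(cust) <> '';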



  •  Think about the advantage: you have only one row to read to find every entry. What a time saver.



  • Hey, CSV is a long-standing industry standard.


  • Trolleybus Mechanic

    "We asked CommaTech for a database export of their customer list, and the idiots sent us a CSV. I'm going to have to write an import parser to..."

    "Whoa there, Mr. Ivory Tower with his 'importing'. They sent us data, and that's good enoguh. Just put it in the database. A CSV is an industry standard way of storing data, and I once read it was best practice."

    "But I'm still going to have to write a parser for the select..."

    "No, don't write a parser. That's not a good time to reward ratio. Just use a loop. That's why the commas are there. This is how it's been done for years."

    "[Insert the frustrated sound of one of those fudged-up Unicode tags people use in their signature here]"



  • @Lorne Kates said:

    "We asked CommaTech for a database export of their customer list, and the idiots sent us a CSV. I'm going to have to write an import parser to..."

    "Whoa there, Mr. Ivory Tower with his 'importing'. They sent us data, and that's good enoguh. Just put it in the database. A CSV is an industry standard way of storing data, and I once read a whitepaper saying it was best practice."

    "But I'm still going to have to write a parser for the select..."

    "No, don't write a parser. That's not a good time to reward ratio. Just use a loop. That's why the commas are there. This is how it's been done for years."

    "[Insert the frustrated sound of one of those fudged-up Unicode tags people use in their signature here]"

    FTFY



  • @Lorne Kates said:

    "[Insert the frustrated sound of one of those fudged-up Unicode tags people use in their signature here]"


    ZALGO!



  • Granted that this is a pretty bad facepalm, but the database I'm currently working on has, wait for it, 212 tables...  Want to know how many are actually used by the application?  28...  ;-(

    -- It's also worth noting that there are 300+ stored procs, and it would be safe to say that less than 50 are actually being used.

    So I was trying to dig through some data access routines (inline SQL + string concatenation, hurray!), and most of them point to views...  Which select from views, which select from views... ;-(

    So I opened the first view, and holy shit, it was like looking into the Matrix...  Each view contains no fewer than 6 case blocks (one had 20+ before I closed the designer because I had to run to the bathroom and throw up), along with hard-coded values that *may* have applied back when it was written in '02. The whole thing is a wall of badly indented SQL.

    The upshot of this is that I'm not maintaining this, although I do need to understand (read: document) it as I'm going to be rewriting it.

    Kinda funny how the technical specs/requirements are being written 10 years after the app was written by somebody who had nothing to do with the original...

    /rant



  • @C-Octothorpe said:

    Granted that this is a pretty bad facepalm, but the database I'm currently working on has, wait for it, 212 tables...  Want to know how many are actually used by the application?  28...  ;-(

    -- It's also worth noting that there are 300+ stored procs, and it would be safe to say that less than 50 are actually being used.

    Can any of those just be outright dropped, or is this one of those cases where a table isn't actually used but still gets referenced all over the place?


    @C-Octothorpe said:

    Kinda funny how the technical specs/requirements are being written 10 years after the app was written by somebody who had nothing to do with the original...

    /rant


    Hm...better late than never?



  • @Justice said:

    Can any of those just be outright dropped, or is this one of those cases where a table isn't actually used but still gets referenced all over the place?

    The database, from the looks of it, was the dumping ground for several old DB schemas, judging by the different naming conventions used.

    So I have tbl_SomeTable, SM_OLD_STNDRD, tablename_tbl, Some_Random_Name_Table, D_SM_OLD_STNDRD, tblTableName, not to mention 5 "user" tables... With passwords in clear text. AWESOME!

    And I've already documented what tables are being used and their relationships (BTW, there aren't any on the ones being used; referential integrity anyone?).  I'm not going to bother dropping the unused tables simply because it would take too long to remove the undocumented relationships in the right order and drop them.  Like I said, it's a rewrite so I'm not about to try to "fix" this one...

    @Justice said:

    Hm...better late than never?

    My thoughts exactly.



  •  @snoofle said:

    Most of the tables are exceedingly normalized. So you have table A with some values and an Id. Ditto for table B.
    As far as I know, ID fields and normalization don't go together.

     @snoofle said:

    Then I find one table with a column: listOfCustomers, which contains a comma separated list of hundreds of customer Ids; sometimes with spaces after the commas, sometimes before, sometimes both; sometimes with multiple commas in a row (sometimes with intervening spaces but no Ids).

    That was to be expected after this ID madness.


  • ♿ (Parody)

    @Dorus said:

    @snoofle said:
    Most of the tables are exceedingly normalized. So you have table A with some values and an Id. Ditto for table B.

    As far as I know, ID fields and normalization don't go together.

    Dude, come on. This isn't true. We told you all about this in a previous thread! Stop being a natural key chauvinist.



  •  I lacked time to write up another response in that topic, but at the very least, normalization absolutely does not say anything about adding ID fields. It doesn't say anything about them at all.

    At design time, you should not add ID fields or other unnatural keys. I know you add them during implementation to get questionable performance gains, and others add them out of laziness or a lack of understanding of database design. However, when you design a database, I *really* hope you try to apply normalization, not add surrogate keys all over the place.


  • ♿ (Parody)

    @Dorus said:


    At design time, you should not add ID fields or other unnatural keys.

    I know you add them during implementation to get questionable performance gains, and others add them out of laziness or a lack of understanding of database design. However, when you design a database, I *really* hope you try to apply normalization, not add surrogate keys all over the place.

    Nope. Sorry, but you've shoved concepts into normalization that just aren't there. Your natural key purity is really orthogonal to normalization.



  • The entire point of normalization is to find a unique key, not to create one (OK, you create one by moving columns to other/new tables and by splitting the data in one column into multiple columns). Nowhere during the normalization process do you add new data that wasn't there before.


  • ♿ (Parody)

    @Dorus said:

    The entire point of normalization is to find a unique key, not to create one (OK, you create one by moving columns to other/new tables and by splitting the data in one column into multiple columns). Nowhere during the normalization process do you add new data that wasn't there before.

    That is totally not the entire point of normalization. The definition I've always heard has more to do with eliminating redundant data in order to do things like enforce referential integrity and keeping data consistent across the database. I still say you're confusing concepts here.
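
    To make that concrete (a minimal sketch with made-up tables, nothing from the actual database): normalization takes a table that repeats the same fact and splits it so the fact is stored exactly once.

    -- Redundant: the customer's phone number repeats on every order, so
    -- an update has to touch many rows, and missing one leaves the data
    -- inconsistent.
    CREATE TABLE orders_flat (
        order_no integer PRIMARY KEY,
        customer_name text NOT NULL,
        customer_phone text NOT NULL
    );

    -- Normalized: the phone number is stored once and orders reference it.
    CREATE TABLE customer (
        name text PRIMARY KEY,
        phone text NOT NULL
    );

    CREATE TABLE orders (
        order_no integer PRIMARY KEY,
        customer_name text NOT NULL REFERENCES customer (name)
    );

    Note that the decomposition neither added nor required an ID column; which column gets to be the key is a separate question.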



  •  Please be more specific: at what step in eliminating redundant data, enforcing referential integrity, and keeping data consistent do you have to add substitute keys?

    Edit: and okay, finding the correct key is not the goal of normalization; it's the process.


  • ♿ (Parody)

    @Dorus said:

    Please be more specific: at what step in eliminating redundant data, enforcing referential integrity, and keeping data consistent do you have to add substitute keys?

    Edit: and okay, finding the correct key is not the goal of normalization; it's the process.

    And if you decide a natural key is fine, then you can use that. But using a natural or surrogate key is orthogonal to normalization. In other words, what part of using a surrogate key is inconsistent with normalization?
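
    To illustrate (hypothetical tables; serial is PostgreSQL shorthand for an auto-incrementing integer): the two designs below store identical facts and satisfy the same normal forms. The only difference is which candidate key is declared primary.

    -- Natural key as primary.
    CREATE TABLE country_natural (
        iso_code char(2) PRIMARY KEY,
        name text NOT NULL UNIQUE
    );

    -- Surrogate key as primary; the natural candidate key stays UNIQUE,
    -- so no redundancy or update anomaly is introduced.
    CREATE TABLE country_surrogate (
        id serial PRIMARY KEY,
        iso_code char(2) NOT NULL UNIQUE,
        name text NOT NULL UNIQUE
    );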


  • @Dorus said:

    Please be more specific: at what step in eliminating redundant data, enforcing referential integrity, and keeping data consistent do you have to add substitute keys?

    Either you're kidding, or I need you very badly to stay far away from any database I happen to be working on.

    CREATE TABLE user (
        firstname text NOT NULL,
        lastname text NOT NULL,
        PRIMARY KEY (firstname, lastname)
    );

    INSERT INTO user VALUES ('john', 'smith');
    INSERT INTO user VALUES ('john', 'smith'); -- error: duplicate key

    And don't even get me started on why a multifield primary key is NOT practical in almost any fathomable case where you refer to it as a foreign key.

    The 'key' or 'id' is something internal; it's a guaranteed-unique representation of a row. Almost ANY other natural data IS NOT. Phone numbers? Nope, people change them. E-mail addresses? Same. Until we can use our DNA reliably as a primary key, we're stuck with serials.

    Exceptions are linking or pivot tables (PK = refKEY1+refKEY2), i18n tables (where the PK should be id + language) and anything which happens to already contain a UUID or something else unique. Which, I would like to remind you, are usually machine-generated too, only from an external source.
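
    To spell out the compound-key pain (again a sketch, with hypothetical tables; serial is PostgreSQL shorthand): every table referencing a compound natural key has to carry the whole key, and a rename cascades through all of them, while with a surrogate the child carries one opaque column and the name lives in exactly one row.

    -- Compound natural key: both columns repeat in every referencing table.
    CREATE TABLE person (
        firstname text NOT NULL,
        lastname text NOT NULL,
        PRIMARY KEY (firstname, lastname)
    );

    CREATE TABLE login_event (
        firstname text NOT NULL,
        lastname text NOT NULL,
        logged_in_at timestamp NOT NULL,
        FOREIGN KEY (firstname, lastname)
            REFERENCES person (firstname, lastname)
    );

    -- Surrogate key: the child references a single opaque column.
    CREATE TABLE person_s (
        id serial PRIMARY KEY,
        firstname text NOT NULL,
        lastname text NOT NULL
    );

    CREATE TABLE login_event_s (
        person_id integer NOT NULL REFERENCES person_s (id),
        logged_in_at timestamp NOT NULL
    );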




  • @Dorus said:

    Please be more specific: at what step in eliminating redundant data, enforcing referential integrity, and keeping data consistent do you have to add substitute keys?

    Edit: and okay, finding the correct key is not the goal of normalization; it's the process.

    You know there's a large contingent of database designers who don't believe natural keys even exist?

    (Or, better put: the odds of a natural key existing are so low, you're better off assuming they never exist. And as Monomelodies points out, in the extremely rare case you do find a workable natural key, it's usually only because some other database system which didn't believe in natural keys created it for you.)

    Our faction is arming itself. Soon there will be war. You'll be the first against the wall when the revolution comes!



  • @Monomelodies said:

    Until we can use our DNA reliably as a primary key, we're stuck with serials.

    My identical twin and I really hope you're being sarcastic here.


  • Discourse touched me in a no-no place

    @Someone You Know said:

    @Monomelodies said:
    Until we can use our DNA reliably as a primary key, we're stuck with serials.
    My identical twin and I really hope you're being sarcastic here.

    Depends on whether you're assuming he's talking about the limited set of markers that are typically used in, say, DNA profiling, or the other bits of the DNA string.



  • @snoofle said:

    We have a database with about 100 tables comprising multiple TB of data. Most of the tables are exceedingly normalized. So you have table A with some values and an Id. Ditto for table B. Then a mapping table ABxref with just the Ids of stuff that references both tables. This is done in 99% of the cases.

    Then I find one table with a column: listOfCustomers, which contains a comma separated list of hundreds of customer Ids; sometimes with spaces after the commas, sometimes before, sometimes both; sometimes with multiple commas in a row (sometimes with intervening spaces but no Ids). And the code to access it parses out the string and iterates over it - both in stored procedures and in Java code.

    Sigh.

    TRWTF is that it should have been in XML, right?



  • @Monomelodies said:

    @Dorus said:

    Please be more specific: at what step in eliminating redundant data, enforcing referential integrity, and keeping data consistent do you have to add ~~substitute~~ surrogate keys?

    Either you're kidding, or I need you very badly to stay far away from any database I happen to be working on.

    CREATE TABLE user (
        firstname text NOT NULL,
        lastname text NOT NULL,
        PRIMARY KEY (firstname, lastname)
    );

    INSERT INTO user VALUES ('john', 'smith');
    INSERT INTO user VALUES ('john', 'smith'); -- error: duplicate key

    And don't even get me started on why a multifield primary key is NOT practical in almost any fathomable case where you refer to it as a foreign key.

    Either your data model is bogus, or you might need a substitute key. However, that's related to your domain, not your database design. It's not something you as a database designer can solve. Adding keys still isn't part of normalization.

    And damn, I'm confusing substitute with surrogate again.



  • @C-Octothorpe said:

    The upshot of this is that I'm not maintaining this, although I do need to understand (read: document) it as I'm going to be rewriting it.

    Kinda funny how the technical specs/requirements are being written 10 years after the app was written by somebody who had nothing to do with the original...

    /rant

    Welcome to my hell.  I have to rewrite a Table-Oriented transaction processing and transmission engine written in MS Access and VBA, with:

    Every control named something like Ctl_0100145

    The average function length is about 8 pages.

    No classes.

    All the SQL statements contain references to the value of form controls in the Access application.

    No indentation.

    All variables are Variant.

    Extremely abbreviated variable names, like "rs" and "ors" for naming DAO record sets that are being looped over in a nested fashion.

    Most functions have giant, nested CASE statements for handling what should have been subclasses, visitors, or configurations.

    The entire thing was built ad-hoc.  New features meant finding a point in the code/SQL/form event handlers and inserting a quick, uncommented hack.  Of which there have been at least a thousand inserted since the last time the application made sense - if it ever did to begin with.

    Eventually, one of the programmers on the project learned .NET.  He then proceeded to write new .NET apps... that use only static classes and 15-page-long functions, and call into Access for their local data store.

    This is my inheritance, and my curse.


