Help arguing Flat vs RDBMS

DZ_Jay

Does anybody know a good and convincing argument in favor of desiging and using a normalized relational database versus the flat-file-in-sqldb approach many developers take? For example, in the company I work for, many databases have long tables with columns such as:

recid
firstname
lastname
addrs1_1
addrs2_1
city_1
state_1
zip_1
addrs1_2
addrs2_2
city_2
state_2
zip_2
addrs1_3
addrs2_3
city_3
state_3
zip_3
special_code1
special_code2
special_code3
special_code4
special_code5
extra_special_code1
extra_special_code2
extra_special_code3
extra_special_code4
extra_special_code5

Every time a new attribute is needed, a new column is added to the table, usually just another "super_extra_special_code" or something like that. And then, of course, anything that doesn't fit there falls into the catch-all "key-value" table.

It seems impossible to me to convince anybody how bad and braindead that mess is, and whenever I propose a more normalized solution, it usually is met with resistance and suggestions of unnecessary "over-engineering". "Things just work," I'm told, "so what's the real advantage?". For every one of my arguments regardng performance gains, easier maintenance, conceptual integrity, etc. I am again reminded how everything has worked so far and have been maintainable -- in its own way, so again: why change?

Seeing databases like that make my eyes burn, but no matter what I say, its all perceived as "my way of doing things" versus "their way", neither right nor wrong; just two different ways. Then I feel like killing someone.

Can anybody help with some good arguments?

-dZ.

MasterPlanSoftware

My advice? Run. Just run, don't look back.

Sunstorm

I don't think there's anything inherently wrong about long tables, providing there's a lot of different data to store about something. As far as person tables go, that doesn't seem to be particularly horrible. All of that is information about the person. I'm not sure how would you like to normalize this further. Separating addresses into their own table?

asuffield

There are no obvious places where this table is denormalised. Even if there were, you cannot justify anything on the basis of 'optimisation' unless you have a real performance problem. 'Maintainance' would have to be justified on the code we cannot see; it's valid only if it would significantly reduce the size and/or complexity of the code accessing the database.

Any attempt to apply some standard of 'conceptual integrity' to a project that has left the design phase is doomed to failure and insanity. Don't do that. If you don't have it from the start, you can't add it later.

Pap

I agree; person tables always look unnormalized so I wouldn't lose much sleep over it.

You never told us what your job position is and what "their" job position is. Are you the dev or the dba?

And of course: TRWTF is why you posted in the sidebar forum for help instead of the general forum.

DZ_Jay

OMG... I should have known better than to post in here.

Do I work with any of you? He, he, he.

-dZ.

Eve_Teschlemacher

The main flaw I see with the table structure you describe is the field names. I'm hoping that was just done in the interest of anonymization. If the fields have descriptive names, adding more fields is not necessarily bad.

Normalization isn't even always the best approach. How much data there is, how often it changes, how often it's queried, and what it's used for are all factors to consider. Maybe you save some storage space and your upload processes run faster. But, what's the productivity cost for your database users if many of the queries they write now need multiple table joins?

Seriously, compare "SELECT recid, firstname, lastname FROM someTable WHERE extra_special_code5 = 4" to "SELECT recs.recid, users.firstname, users.lastname FROM userTables users INNER JOIN requisitions recs on recs.userID = users.userID INNER JOIN extra_special_code5_table esc5 ON esc5.recID = recs.recid WHERE esc5.code = 4". Now imagine a more complicated query, in which the user wants to filter or report on multiple different codes. Every table you have to join in makes the query take that much longer to type, not to mention the time needed to look at the database spec to figure out how the tables are related so that you know how to structure the join and what all the names of the foreign keys are.

You might think that the users should just suck it up and deal with it but there is a real cognitive burden for humans to comprehend normalized relational database structure. If it takes a salaried analyst 10 minutes to figure out and write a query against your elegant normalized tables versus one minute against a flattened table, that's a huge productivity hit.

RichardNeill

Agreed - a single large table is often better. There are 2 cases where you don't want it.

1)If you end up having repeated rows with only one different element.

2) Where one of the columns contains a very much larger amount of data than the rest (eg you have 20 columns of varchar (50), and a column of varchar(20000). Then, you should split the table, so that the small table can be resident in RAM, and be quickly scannable.

Iago

@Eve Teschlemacher said:

Now imagine a more complicated query, in which the user wants to filter or report on multiple different codes. Every table you have to join in makes the query take that much longer to type, not to mention the time needed to look at the database spec to figure out how the tables are related so that you know how to structure the join and what all the names of the foreign keys are.

Does the phrase CREATE VIEW mean anything to you?

Sunstorm

@Iago said:

@Eve Teschlemacher said:
Now imagine a more complicated query, in which the user wants to filter or report on multiple different codes. Every table you have to join in makes the query take that much longer to type, not to mention the time needed to look at the database spec to figure out how the tables are related so that you know how to structure the join and what all the names of the foreign keys are.
Does the phrase CREATE VIEW mean anything to you?

So you're advocating splitting the data into normalized tables, just so you can then promptly join it back together in a view, and read it as it was before?

Do you constantly excercise in futility?

drinkingbird

@Eve Teschlemacher said:

Normalization isn't even always the best approach. How much data there is, how often it changes, how often it's queried, and what it's used for are all factors to consider. Maybe you save some storage space and your upload processes run faster. But, what's the productivity cost for your database users if many of the queries they write now need multiple table joins?

What's the productivity cost of having a denormalised database structure which doesn't properly reflect any reasonable mental model of the data it contains?

I've worked with those conditions, and it's ridiculously costly, since queries are necessarily contrived and difficult to understand. A complex and well reasoned mental model is much more easily understood than a simple model bearing little resemblance to reality.

Besides, if your users are constantly typing up queries, that's where your main productivity cost is going to be, normalised or not, and there should be a better interface to the system.

You seem to have missed the fact that the whole point of an RDBMS is to store related data in a normalised state, and allow you to exploit the relationships between the data to analyse or retrieve it as needed for different circumstances.

asuffield

@drinkingbird said:

You seem to have missed the fact that the whole point of an RDBMS is to store related data in a normalised state, and allow you to exploit the relationships between the data to analyse or retrieve it as needed for different circumstances.

First normal form is sufficient for this purpose - and since first normal form is precisely equivalent to the set of things which are databases (rather than just a random pile of unrelated data) that means you can use pretty much any old table structure that you like and still satisfy that condition.

More sophisticated forms of normalisation are nothing more than tools to deal with certain classes of problem (mostly relating to data integrity across updates). Those problems can also be solved in other ways, or they can simply not exist in your system. There is no evidence of them here.

CodeWhisperer

@asuffield said:

There are no obvious places where this table is denormalised.

Wait...what?

Granted, it's been a few years since my DB classes, and I realize there is some debate about exactly what consitututes 1NF, but I don't think that you can look at that table and say "nope, no normalization issues here"...however, leaving behind raw definitions: multiple occurances of the data from the same domain, tales of "Every time a new attribute is needed, a new column is added to the table"...these aren't hallmarks of a well designed database.

Now, if there are no maintenance issues and they aren't planning to add Address4 & Address5 along with super_special_code1 thru 5, then I might not bother with the argument. But, if there are changes -- if you want to add an address, or a new special code -- you have to change the code that reads, writes & queries it. That's pretty much exactly what you want to avoid.

If you end up with 10 'code' fields and you want to start formulating queries like "are there any rows with codes 'A', 'B', or 'C'" and now you've got queries with WHERE clauses that have 30 comparisons in them. If someone wasn't really on the ball and dropped comparison (say 'or special_code_8 = 'B')...then you end up with a really wacky bug that can be hard to track down. Or if you add a new special code and update all of your views except one. Stuff like that can be a bitch to find in initial testing, too, since it depends on specific data configurations.

Finally, I'm not saying that he should run out and try to change the db just because he doesn't like it. There's definitely truth to "don't fix it if it ain't broken".. but...

@asuffield said:

If you don't have it from the start, you can't add it later.

WTF?

A big chunk of my career was spent inheriting crappy systems built by morons and turning them into something better. Usually that's by figuring out what the conceptual integrity should have been and then slowly moving it in that direction. Systems run better, maintenance costs go down, flexibility increases, scalability goes up, etc. You certainly can add it later, but it costs a hell of a lot more than just trying to do it right in the first place.

-cw

asuffield

@CodeWhisperer said:

@asuffield said:
There are no obvious places where this table is denormalised.

Wait...what?

Granted, it's been a few years since my DB classes, and I realize there is some debate about exactly what consitututes 1NF, but I don't think that you can look at that table and say "nope, no normalization issues here"...however, leaving behind raw definitions: multiple occurances of the data from the same domain, tales of "Every time a new attribute is needed, a new column is added to the table"...these aren't hallmarks of a well designed database.

Because as we all know, the well designed database looks like:

{

key : string
value : string

}

Thereby eliminating any need for the terrible sin of having to add a new column.

Now, if there are no maintenance issues and they aren't planning to add Address4 & Address5 along with super_special_code1 thru 5, then I might not bother with the argument. But, if there are changes -- if you want to add an address, or a new special code -- you have to change the code that reads, writes & queries it. That's pretty much exactly what you want to avoid.

You have to change the interface regardless. All the rest can be easily automated. No issues here.

If you end up with 10 'code' fields and you want to start formulating queries like "are there any rows with codes 'A', 'B', or 'C'" and now you've got queries with WHERE clauses that have 30 comparisons in them.

Assumption of facts not in evidence. We have no reason to think that these fields are of comparable types or have any kind of structure like this. You're just making up a data structure that you think might exist and assuming that is what is happening here. For all we know, special_code1 is the weight of the apples and special_code2 is their tax rate, in which case such a query would be meaningless. In the absence of any evidence to the contrary, you always assume that no such requirement exists, or you'll end up with a {key, value} table.

@asuffield said:
If you don't have it from the start, you can't add it later.

WTF?

A big chunk of my career was spent inheriting crappy systems built by morons and turning them into something better. Usually that's by figuring out what the conceptual integrity should have been and then slowly moving it in that direction.

I bet a lot of those systems had an is_set and an isnull in them by the time you were done "improving" them. Bikeshedding never works.

EpsilonCool

OK, you decide to add another address to your record. This will involve

Schema changes
Code changes
Testing
Documentation
User interface changing.

If you used a normalised structure then it would be an trivial data change.

Let us suppose that you don't need to fill in every address. This implies that the additional address fields have to be nullable, or at least store empty strings.

You are storing wider information than you need so are increasing network traffic every time you bring back a record.

The ANSI standard says that NULL values are not equal to anything, even another NULL value therefore to test for NULLs you need to use a special function. That means if you want to search for records where field <> value you have to code up field<> value AND field IS NOT NULL or the equivalent. In highly stressed systems this makes the query less efficient.

There are occassions when denormalised data is useful. Normally (no pun intended) this is in applications whose principal use is for data retrieval. Systems where data has to be maintained rarely use them.

PJH

I'm missing something. I must be...

Normalising:

[<name fields> <address fields set 1> <address fields set 2> <address fields set 3>] (My understanding of the OP's currrent schema omitting 'special codes')

normaly(!) results in an extra [<uid> <address fields>] table and reducing the original table to:

[<name fields> <foreign key for address 1> <foreign key for address 2> <foreign key for address 3>]

Unless, of course, you have a third table of

[<foreign key to user> <type of address> <address fields>]

which removes the address references from the original table but introduces its own fuckery.

Adding an extra address to either normalisation scheme I've mentioned will involve:

Schema changes (adding a field to my first try/creating a new type for my second)
Code changes (to cope with new field/to cope with new type)
Testing (See 2)
Documentation (See 1 and 2)
User interface changing. (To cope with the new address type)

I fail to see how normalisation reduces the problem to a "trivial data change" - is there another way of normalising my first example you are thinking of?

Not that there aren't good reasons to normalise in either manner described above depending on circumstance; I just don't see how normalisation can be done with so little impact.

asuffield

@EpsilonCool said:

OK, you decide to add another address to your record.

Which you might not have any intention of doing. I've seen a lot of forms that had "Address of administrative contact", "Address of technical contact", and "Address of billing contact", but very few that had space for an arbitrary number of addresses.

You too are operating under the assumption that the data model has some properties that are not in evidence (and frankly, I cannot imagine why anybody would want a data model that had an unlimited number of addresses per person).

You are storing wider information than you need so are increasing network traffic every time you bring back a record.

You are aware that the 'select' statement accepts parameters other than '*' in the first argument?

@PJH said:

I'm missing something. I must be...

Normalising:
[<name fields> <address fields set 1> <address fields set 2> <address fields set 3>] (My understanding of the OP's currrent schema omitting 'special codes')
normaly(!) results in an extra [<uid> <address fields>] table and reducing the original table to:
[<name fields> <foreign key for address 1> <foreign key for address 2> <foreign key for address 3>]
Unless, of course, you have a third table of
[<foreign key to user> <type of address> <address fields>]
which removes the address references from the original table but introduces its own fuckery.
Adding an extra address to either normalisation scheme I've mentioned will involve:
Schema changes (adding a field to my first try/creating a new type for my second)
Code changes (to cope with new field/to cope with new type)
Testing (See 2)
Documentation (See 1 and 2)
User interface changing. (To cope with the new address type)
I fail to see how normalisation reduces the problem to a "trivial data change" - is there another way of normalising my first example you are thinking of?
Not that there aren't good reasons to normalise in either manner described above depending on circumstance; I just don't see how normalisation can be done with so little impact.

Adding a new type to a enum is not even close to equivalent to adding new tables. Some code might use column numbers directly, in which case that code could be broken. No code could be broken by adding a new type.

CodeWhisperer

@asuffield said:

You have to change the interface regardless. All the rest can be easily automated. No issues here.

Only if you build crappy interfaces. You can design the interface correctly to accept a list of addresses, for instance, without regard to how many of them there are.

@asuffield said:

Assumption of facts not in evidence.

And when this case comes before the courts, I'll be more careful about hypotheticals.

@asuffield said:

We have no reason to think that these fields are of comparable types or have any kind of structure like this. You're just making up a data structure that you think might exist and assuming that is what is happening here

I assumed it was a simple (var)char field holding a single char, for purposes of illustration and in the absence of further information. When the OP tells me i'm wrong, I'll change my hypothesis. Having seen this a dozen times, I don't feel all that bad making the assumption.

@asuffield said:

For all we know, special_code1 is the weight of the apples and special_code2 is their tax rate, in which case such a query would be meaningless. In the absence of any evidence to the contrary, you always assume that no such requirement exists

If that's what the columns do indeed mean, then they're bigger idiots than I imagined and that table really needs to be fixed. However, since they managed to name their address columns vaguely correctly, I chose to assume (gasp) that they hadn't quite ascended to that level of stupidity quite yet. I know, I know, I'm an optimist.

@codewhisperer said:

A big chunk of my career was spent inheriting crappy systems built by morons and turning them into something better. Usually that's by figuring out what the conceptual integrity should have been and then slowly moving it in that direction.

@asuffield said:
I bet a lot of those systems had an is_set and an isnull in them by the time you were done "improving" them. Bikeshedding never works.

Err...the presence of the isnull in a table would be the color of the bikeshed of which you speak. I couldn't give a crap about the letter of the law and whether or not isnull violates 1FN. I do care that a database design is constraining what an organization can or can't do with it's data, how long it takes to respond to business requests, the maintenance cost of the db, the complexity of the queries that need to be run against it, etc. These are all things worth scratching your head over.

If the OP's organization suffers none of those problems, then don't touch it -- but the OP's coworker doesn't necessarily qualify as an good judge of it. I used to work with an org that firmly believed that their systems were perfect (they used that word), and couldn't possibly have been made any better...regardless of the fact that the systems crashed hard twice a week and corrupted data all over the place. I tried to get them to fix things, but they wouldn't have any of it -- the corrupted data was just evidence of their perfection, I guess.

-cw

asuffield

@CodeWhisperer said:

@asuffield said:
You have to change the interface regardless. All the rest can be easily automated. No issues here.

Only if you build crappy interfaces. You can design the interface correctly to accept a list of addresses, for instance, without regard to how many of them there are.

The more generic you make an interface, the more it sucks. Being specific to the requirements of the problem ("three addresses on the form" or whatever the heck they need) is precisely the thing which distinguishes a good interface from a bad one. All the rest is just details of how you implement that specificity.

ammoQ

@asuffield said:

You too are operating under the assumption that the data model has some properties that are not in evidence (and frankly, I cannot imagine why anybody would want a data model that had an unlimited number of addresses per person).

Been there, seen that. One of the projects I'm currently maintaining allows for that. Main reason: A user-maintainable table "address type", though eventually there are only a few address types in use.

@EpsilonCool said:

OK, you decide to add another address to your record. This will involve
Schema changes
Code changes
Testing
Documentation
User interface changing.
If you used a normalised structure then it would be an trivial data change.

It's only trivial if the new address has no special meaning for the program. If it has a special meaning, which is likely, you need 2.-5. anyway.

@CodeWhisperer said:

@asuffield said:
We have no reason to think that these fields
are of comparable types or have any kind of structure like this. You're
just making up a data structure that you think might exist and assuming
that is what is happening here

I assumed it was a simple (var)char field holding a single char, for purposes of illustration and in the absence of further information. When the OP tells me i'm wrong, I'll change my hypothesis. Having seen this a dozen times, I don't feel all that bad making the assumption.

@asuffield said:
For all we know, special_code1 is the weight of the apples and special_code2 is their tax rate, in which case such a query would be meaningless. In the absence of any evidence to the contrary, you always assume that no such requirement exists

If that's what the columns do indeed mean, then they're bigger idiots than I imagined and that table really needs to be fixed.

Less unlikely than you think. In one of my projects, there are database columns named statistic_code_1..statistic_code_10 (or something like that). Our customers use them to group the products in the product database by arbitrary criteria. Those columns are only used in a statistical report and have no meaning for the rest of the program. In fact, I don't even know how they actually use them. Anyway, the list of values is different for each column. (The list of values is of course user maintainable, but for all I know, they do not use the same values for different columns)

matk

@PJH said:

I'm missing something. I must be...

Normalising:
[<name fields> <address fields set 1> <address fields set 2> <address fields set 3>] (My understanding of the OP's currrent schema omitting 'special codes')
normaly(!) results in an extra [<uid> <address fields>] table and reducing the original table to:
[<name fields> <foreign key for address 1> <foreign key for address 2> <foreign key for address 3>]
Unless, of course, you have a third table of
[<foreign key to user> <type of address> <address fields>]
which removes the address references from the original table but introduces its own fuckery.

Ideally you would remove <foreign key for address> fields from the original table, and add a <recid> field to the address table. So, the address table could look like:

addressID

addrs1
addrs2
city
state
zip
recid

where <recid> contains the ID of the original table that the address belongs to. Presto, you can have an unlimited number of addresses for any given row in the original table without having to change anything else in your application as the number of addresses increases.

The_General

@DZ-Jay said:

Does anybody know a good and convincing argument in favor of desiging and using a normalized relational database versus the flat-file-in-sqldb approach many developers take? For example, in the company I work for, many databases have long tables with columns such as:

recid
firstname
lastname
addrs1_1
addrs2_1
city_1
state_1
zip_1
addrs1_2
addrs2_2
city_2
state_2
zip_2
addrs1_3
addrs2_3
city_3
state_3
zip_3
special_code1
special_code2
special_code3
special_code4
special_code5
extra_special_code1
extra_special_code2
extra_special_code3
extra_special_code4
extra_special_code5

-dZ.

I can't comment on the [extra_]special_codes, but an advantage of denormalising the address details is that each distinct address only need be stored once. Hopefully, this way, it's much harder to end up with e.g. "22 Foo Ct, Barville", "22 Foo Court, Barville", "22 Foo Court, Bazville" etc, which really all refer to the same physical address (the last one being a typo - see below).

A single address table also makes it easier to create a list of existing addresses to select from and query on, if you need to do such a thing.

Various tasks are made simpler:
- retrieving and formatting address details (just pass in the address ID)
- looking for duplicate / near-duplicate addresses (anti-typo / standardisation issues)
- following on from this, correcting mistyped addresses or standardising details (e.g. "Street" vs "St")
- grouping together customers who share a given address

The users may not see any direct benefit, but the usual argument about maintainability, reliability and consistency applies.

Changing the schema is also easier. This is from experience. The system I work on is a financial services app, with (among other things) account and customer tables, as well as a separate address details table. The relationship between accounts and customers is many-to-many, so we also have an 'association' table. This table used to have a separate address ID on it, for "this customer's correspondence address when associated with this particular account".

After many years of confusion ("Why don't the association and customer addresses match?" "You used the re-enter option to change it on the association" "Can we link one to the other then? Can we automatically change address X when we alter Y?" "OK, that's a change request; the other thing's a fault...") we finally removed the separate association address column altogether.

Changing SQLs was much easier - essentially, just replacing the old association address with the customer address; nearly all the SQLs already joined in the customer table anyway. Address retrieval was still just as easy - the relevant code just had to be fed its address ID from a different column. With address detail fields on each table, there would have been more to change, and thus more to potentially get wrong.

Another thing we had to do was change the definition of the address table itself (specifically, adding the country ISO code, which became required for a piece of financial reporting that our system does.) This change involved adding precisely one column to the database, instead of one for each set of address fields. Amending the input/output side of the system was correspondingly simpler to do too. Functions working with addresses still only needed to know the address ID.

As others have said, you have to consider the cost of implementing the changes in the first place against the benefits they bring (and how relevant those benefits are to you).

ammoQ

@The General said:

I can't comment on the [extra_]special_codes, but an advantage of denormalising the address details is that each distinct address only need be stored once. Hopefully, this way, it's much harder to end up with e.g. "22 Foo Ct, Barville", "22 Foo Court, Barville", "22 Foo Court, Bazville" etc, which really all refer to the same physical address (the last one being a typo - see below).

I think it's beyond dispute that there should be a table for address details; especially, since in many cases, the same adress is used for general contact, billing and delivery. The more diffcult question is rather

- do we have three columns ADDRESS_ID_GENERAL, ADDRESS_ID_BILLING, ADDRESS_ID_DELIVERY in the customers table or

- do we build a m:n relationship between customers and addresses, using a CUSTOMER_ADDRESS table (with the columns CUSTOMER_ID, ADDRESS_ID, ADDRESSTYPE)

Sunstorm

@The General said:

I can't comment on the [extra_]special_codes, but an advantage of denormalising the address details is that each distinct address only need be stored once. Hopefully, this way, it's much harder to end up with e.g. "22 Foo Ct, Barville", "22 Foo Court, Barville", "22 Foo Court, Bazville" etc, which really all refer to the same physical address (the last one being a typo - see below).

I can't see how normalizing the address details helps remove similar addresses that refer to the same place, unless you somehow put a unique index on a soundex or something. What it would do is remove the need to store a long address string several times, replacing the string with an integer numbers, which would pay off in space if the address stored is longer than 4 ASCII letters and used in more than one place, which is likely the case. On the other hand, it'll also make it considerably more complicated for when someone changes address, because then you'll have to create a new row and change the index. You can't just edit the address as you would before, unless you want all the other users at that address to be magically moved as well.

@ammoQ said:

I think it's beyond dispute that there should be a table for address details; especially, since in many cases, the same adress is used for general contact, billing and delivery. The more diffcult question is rather
- do we have three columns ADDRESS_ID_GENERAL, ADDRESS_ID_BILLING, ADDRESS_ID_DELIVERY in the customers table or
- do we build a m:n relationship between customers and addresses, using a CUSTOMER_ADDRESS table (with the columns CUSTOMER_ID, ADDRESS_ID, ADDRESSTYPE)

I suppose that would depend on whether the spec says that you need three addresses, or any number of addresses, which at this time are three.

dhromed

@Sunstorm said:

@ammoQ said:
I think it's beyond dispute that there should be a table for address details; especially, since in many cases, the same adress is used for general contact, billing and delivery. The more diffcult question is rather
- do we have three columns ADDRESS_ID_GENERAL, ADDRESS_ID_BILLING, ADDRESS_ID_DELIVERY in the customers table or
- do we build a m:n relationship between customers and addresses, using a CUSTOMER_ADDRESS table (with the columns CUSTOMER_ID, ADDRESS_ID, ADDRESSTYPE)
I suppose that would depend on whether the spec says that you need three addresses, or any number of addresses, which at this time are three.

Or, more generically, whether you need to identify an address in some way, as fr example, the billing address.

And if you do, and choose the normal form, then you need another table listing "address types" which are FK'd by the address table, so that you can tell what sort of address you have -- which wiuld require an interface change for something other than a list of any length.

A step further; it depends on how far you want to go in solving this problem.

Finally, it depends on whether you actually have a problem, which, currently, I'm not so sure of.

cursive

Sunstorm

@dhromed said:

@Sunstorm said:
@ammoQ said:
I think it's beyond dispute that there should be a table for address details; especially, since in many cases, the same adress is used for general contact, billing and delivery. The more diffcult question is rather
- do we have three columns ADDRESS_ID_GENERAL, ADDRESS_ID_BILLING, ADDRESS_ID_DELIVERY in the customers table or
- do we build a m:n relationship between customers and addresses, using a CUSTOMER_ADDRESS table (with the columns CUSTOMER_ID, ADDRESS_ID, ADDRESSTYPE)
I suppose that would depend on whether the spec says that you need three addresses, or any number of addresses, which at this time are three.
Or, more generically, whether you need to identify an address in some way, as fr example, the billing address.
And if you do, and choose the normal form, then you need another table listing "address types" which are FK'd by the address table, so that you can tell what sort of address you have -- which wiuld require an interface change for something other than a list of any length.
A step further; it depends on how far you want to go in solving this problem.
Finally, it depends on whether you actually have a problem, which, currently, I'm not so sure of.
cursive

I'm pretty sure that if you're actually going through with defining a table for a list of address types, then yes, now you have a problem.

ammoQ

@dhromed said:

Or, more generically, whether you need to identify an address in some way, as fr example, the billing address.
And if you do, and choose the normal form, then you need another table listing "address types" which are FK'd by the address table, so that you can tell what sort of address you have -- which wiuld require an interface change for something other than a list of any length.
A step further; it depends on how far you want to go in solving this problem.
Finally, it depends on whether you actually have a problem, which, currently, I'm not so sure of.
cursive

As you can see in my proposal, there already is a column ADDRESSTYPE in the table CUSTOMER_ADDRESS (which relates customers to addresses, but doesn't contain the actuall adress); it belongs there, since the same address is (in many cases) both billing address and delivery address, but you probably don't want two identical (except address type) copies of the address in the address table.

clively

@DZ-Jay said:

Does anybody know a good and convincing argument in favor of desiging and using a normalized relational database versus the flat-file-in-sqldb approach many developers take?

...

It seems impossible to me to convince anybody how bad and braindead that mess is, and whenever I propose a more normalized solution, it usually is met with resistance and suggestions of unnecessary "over-engineering". "Things just work," I'm told, "so what's the *real* advantage?". For every one of my arguments regardng performance gains, easier maintenance, conceptual integrity, etc. I am again reminded how everything has worked so far and have been maintainable -- in its own way, so again: why change?

The real problem here isn't in coming up with a good argument; that is easily done. The real problem is managing change. I've been in your shoes and have seen datamodels where it took literally months to add something due to modifying the database, queries, underlying code, and testing because of a similiar flat data structure. If the db had been laid out correctly those changes would have been as simple as adding a row to a table. However the architects and people in power just couldn't get their minds around doing things differently. Another factor was an interest in not making things better. By keeping the flat structure, major proposed changes were always cancelled by the higher ups as too expensive to implement. Which meant these guys had a long term cushy maintenance job.

Changes like this have to come from the top. If you can get the guy in charge (Dev Manager?) to do it, then everyone else will change or leave. If you can't get the guy in charge to do things the right way, then you have to either conform or leave.

Just take a look at the responses so far, here are supposedly decent developers arguing for a flat structure. The main reason is simply that it is extremely easy to visualize. Unfortunately, this tends to trump maintenance and long term stability quite frequently.

Good luck

PJH

@lilakuh said:

@PJH said:
<snip>
Unless, of course, you have a third table of
[<foreign key to user> <type of address> <address fields>]
which removes the address references from the original table but introduces its own fuckery.
Ideally you would remove <foreign key for address> fields from the original table, and add a <recid> field to the address table. So, the address table could look like:
addressID
addrs1
addrs2
city
state
zip
recid

where <recid> contains the ID of the original table that the address belongs to. Presto, you can have an unlimited number of addresses for any given row in the original table without having to change anything else in your application as the number of addresses increases.

So, that would be identical to my second proposed solution then.

asuffield

@clively said:

Just take a look at the responses so far, here are supposedly decent developers arguing for a flat structure.

Congratulations, you have failed to learn how to read. Nobody has said that.

(Not that there's even anything wrong with flat databases like those used in an RDBMS, but the subject hasn't come up in this thread)

DZ_Jay

@clively said:

Changes like this have to come from the top. If you can get the guy in charge (Dev Manager?) to do it, then everyone else will change or leave. If you can't get the guy in charge to do things the right way, then you have to either conform or leave.

And that's what I meant by "help with arguments": How can I "get the guy in charge" (or anybody else, for that matter) to buy into this idea. By the way, as I alluded in my original comment, my intent is not re-engineering the current system (although I would hope that would happen someday, at least piecemeal), but to influence others to *not* use such models when building new applications -- new applications which I may have to maintain. So far, this is the current de facto approach, even for mission-critical applications.

If you've been successful in the past regarding this problem, I would appreciate any pointers in conveying proper arguments in favor of a more relational model.

-dZ.

DZ_Jay

@asuffield said:

@clively said:
Just take a look at the responses so far, here are supposedly decent developers arguing for a flat structure.
Congratulations, you have failed to learn how to read. Nobody has said that.
(Not that there's even anything wrong with flat databases like those used in an RDBMS, but the subject hasn't come up in this thread)

I'm sorry, but that sample table I mentioned in my original post -- the one most people here are arguing is perfectly fine and needs no real modifications -- is in essence a flat structure. I believe that was his point.

dZ.

asuffield

@DZ-Jay said:

If you've been successful in the past regarding this problem, I would appreciate any pointers in conveying proper arguments in favor of a more relational model.

A model is either relational or it is not. There is no scope for "more relational" or "less relational". The model you have shown here is relational. Examples of non-relational models would be hierarchical or object-oriented models (also valid and in common use).

You aren't going to convince anybody of anything when you don't know what you're talking about. You haven't even managed to find any actual problems with what they're currently doing.

asuffield

@DZ-Jay said:

I'm sorry, but that sample table I mentioned in my original post -- the one most people here are arguing is perfectly fine and needs no real modifications -- is in essence a flat structure. I believe that was his point.

Nobody has said it is "perfectly fine". Numerous people have said that there are no apparent problems with it. If you cannot understand the difference, you are not qualified to comment on database design.

There is nothing wrong with flat structures. They may or may not be applicable to a given problem.

matk

Not quite. Your second solution proposes this layout for the users table:

[<name fields> <foreign key for address 1> <foreign key for address 2> <foreign key for address 3>]

Note that you have several "foreign key for address" fields. My solution doesn't have any foreign keys in the users table, but rather the address table has a foreign key back to users. I'm not sure what you meant by "Unless, of course, you have a third table of" - what would your third table contain? Are you referring to a junction table?

Soviut

Agreed. The issue here is scalability. Sure, this solution works "for now". But since the original poster has already been required to add new fields which pretty much duplicate each other, the likelihood of cross-talk becomes more and more apparent. If this app was a one-off, guarenteed never to grow, that's fine, but its obviously not. Higher normalized forms compartmentalizes data, reduces redundancy and overlap, and provides a model that is scalable, diagram-able, and easy to understand. He's thinking about the future while his co-workers are not.

The_General

@Sunstorm said:

@The General said:
I can't comment on the [extra_]special_codes, but an advantage of denormalising the address details is that each distinct address only need be stored once. Hopefully, this way, it's much harder to end up with e.g. "22 Foo Ct, Barville", "22 Foo Court, Barville", "22 Foo Court, Bazville" etc, which really all refer to the same physical address (the last one being a typo - see below).

I can't see how normalizing the address details helps remove similar addresses that refer to the same place, unless you somehow put a unique index on a soundex or something. What it would do is remove the need to store a long address string several times, replacing the string with an integer numbers, which would pay off in space if the address stored is longer than 4 ASCII letters and used in more than one place, which is likely the case. On the other hand, it'll also make it considerably more complicated for when someone changes address, because then you'll have to create a new row and change the index. You can't just edit the address as you would before, unless you want all the other users at that address to be magically moved as well.

I didn't mean that you'd get the de-dup functionality for free by denormalising - just that it's easier to do de-dup-ing, close-matching etc (whatever algorithm you choose) if you've got all the address details on one table.

With regard to editing, a direct edit option is useful for correcting wrongly-entered addresses, where "magically moving" all users of that address is exactly what you want, but of course it's also a good idea to provide the option of selectively transferring links from one address to another.<FONT size=4>

</FONT>

If one individual customer is changing address, then yes, you do have to create a new address row in the background and amend the address ID, but it's not that hard, and when you've built the interface for this once you can re-use it. Also, when the new address is entered, you can use the same de-dup checks on it.

The system I work on has all of these functions. The close-match checking is pretty simplistic, but then it doesn't force you to choose one of the close matches, it just pops up a list and suggests you check.

PJH

@lilakuh said:

Not quite. Your second solution proposes this layout for the users table:
[<name fields> <foreign key for address 1> <foreign key for address 2> <foreign key for address 3>]

That would be my first solution you have there.

Note that you have several "foreign key for address" fields. My solution doesn't have any foreign keys in the users table, but rather the address table has a foreign key back to users. I'm not sure what you meant by "Unless, of course, you have a third table of" - what would your third table contain? Are you referring to a junction table?

My third table is identical to what you proposed - an address per record with a foreign key to the user, with no address foreign keys in the user table.

asuffield

@Soviut said:

Higher normalized forms compartmentalizes data, reduces redundancy and overlap, and provides a model that is scalable, diagram-able, and easy to understand.

I only need a "synergy" and I'll have a bingo.

Database normal forms are nothing more than well-known solutions to common problems. They are not some magic recipe for database design. Like every solution in software, they should be applied only to their respective problems; applying them when the problem does not exist merely complicated things without reason.

He's thinking about the future while his co-workers are not.

He's starting with a solution and looking for a problem to justify it. That does qualify as "thinking", but it's salesman thinking, not engineering. It is no surprise that his co-workers, who probably have some actual engineering experience, rejected it out of hand as being unjustified.