Meaningful Keys



  • We need to add a new key column to contain the date something happened to several of our db tables. Four out of five of us decided that the column should contain a date (type) value. One person said (exact quote) "I am not in favor of meaningful keys. Instead, we should use a sequence and a mapping table. For each new entry, put the date in the mapping table and put the newly generated sequence number in the data table, and then do joins when the information is needed."

    Mind you, the tables we are dealing with are under 20 columns so it's not like adding one more column is a big deal. Also, whether the column contains a date or number, it's still one more column. However, for a human looking at the data, which would you rather see in a column titled 'DateXyzHappened': 20110519 or 123456789?

     



  •  That depends. By key column, do you mean that all data has to be unique?

    If no, then you're right. Just add another column for the date. (Assuming this isn't a case where there are 50 Date[whatever]Happened, and that it should be normalized off to a history table).

    But if it is meant to be a unique key (though it doesn't sound that way), then yes, the one guy is right. Add an int identity column, because there is no guarantee that you'll have a unique datetime in each row.



  • @snoofle said:

    We need to add a new key column to contain the date something happened to several of our db tables. Four out of five of us decided that the column should contain a date (type) value. One person said (exact quote) "I am not in favor of meaningful keys. Instead, we should use a sequence and a mapping table. For each new entry, put the date in the mapping table and put the newly generated sequence number in the data table, and then do joins when the information is needed."

    After which you told him to head back to marketing, right?

     



  • There are plenty of benefits to moving the dates out to a different table.  First, if you use a native date type, it's most likely implemented as a floating point, and using a floating point for a key could blow up in your face.  Second, with a mapping table you can add a lot of attributes to each day like holiday/working day, fiscal year, quarter, or month and do some very useful aggregations.

    On the other hand, it may be extra work for no benefit.  However, I certainly wouldn't choose to use a date in the table simply because it would make the table easier to read.
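A minimal sketch of the mapping-table layout described above, in SQLite via Python; all table and column names here (dim_date, fact_event, and so on) are invented for illustration:

```python
import sqlite3

# A date dimension keyed by a surrogate integer, referenced from the
# data table; attributes like holiday flags live on the dimension.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_date (
        date_id    INTEGER PRIMARY KEY,
        full_date  TEXT NOT NULL UNIQUE,   -- ISO 8601, e.g. '2010-09-04'
        is_holiday INTEGER NOT NULL,
        fiscal_qtr TEXT NOT NULL
    );
    CREATE TABLE fact_event (
        event_id INTEGER PRIMARY KEY,
        date_id  INTEGER NOT NULL REFERENCES dim_date(date_id)
    );
    INSERT INTO dim_date VALUES (1, '2010-09-04', 0, 'Q3-2010'),
                                (2, '2010-12-31', 1, 'Q4-2010');
    INSERT INTO fact_event (date_id) VALUES (1), (1), (2);
""")

# The kind of aggregation a mapping table buys you: events per fiscal
# quarter, reached through a join on the surrogate key.
rows = conn.execute("""
    SELECT d.fiscal_qtr, COUNT(*)
    FROM fact_event e JOIN dim_date d ON d.date_id = e.date_id
    GROUP BY d.fiscal_qtr ORDER BY d.fiscal_qtr
""").fetchall()
print(rows)  # [('Q3-2010', 2), ('Q4-2010', 1)]
```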



  • @snoofle said:

    @Them said:
    I am not in favor of meaningful keys.

    Sounds like he's in favour of IDs purely for the sake of having IDs. "Why have only one table, when you can have two?"



    It's akin to having a two-column table of ISO 3166 country codes with an autoincrement key field, when simply putting the ISO 3166 code in the referencing table to begin with would save a join and some headaches.
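The natural-key alternative can be sketched the same way; the customer table here is a hypothetical example, and the point is that the ISO 3166 code reads directly off the referencing row, join-free:

```python
import sqlite3

# Natural-key version: the ISO 3166 code itself is the primary key of
# the country table and is stored directly in the referencing table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE country (
        iso_code TEXT PRIMARY KEY,          -- 'AU', 'US', ...
        name     TEXT NOT NULL
    );
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        iso_code    TEXT NOT NULL REFERENCES country(iso_code)
    );
    INSERT INTO country VALUES ('AU', 'Australia'), ('US', 'United States');
    INSERT INTO customer (iso_code) VALUES ('AU'), ('US'), ('AU');
""")

# No join needed just to see which country each customer is in:
codes = [r[0] for r in conn.execute(
    "SELECT iso_code FROM customer ORDER BY customer_id")]
print(codes)  # ['AU', 'US', 'AU']
```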



  • @Lorne Kates said:

     That depends. By key column, do you mean that all data has to be unique?

    If no, then you're right. Just add another column for the date. (Assuming this isn't a case where there are 50 Date[whatever]Happened, and that it should be normalized off to a history table).

    But if it is meant to be a unique key (though it doesn't sound that way), then yes, the one guy is right. Add an int identity column, because there is no guarantee that you'll have a unique datetime in each row.

     

    It's a non-unique key, and there would only ever be one such key (column).



  • @Jaime said:

    Second, with a mapping table you can add a lot of attributes to each day like holiday/working day, fiscal year, quarter, or month and do some very useful aggregations.

    This can be done without a mapping table. Furthermore your approach is not normalized since duplicate information is stored. The information about dates is unique to each date and as such the date should be the key, not some stupid key from a non-related table.

    Example:

    id    date        isHoliday  FiscalYear
    840   2010-09-04  0          Q3-2010
    870   2010-09-04  0          Q3-2010
    886   2010-09-04  0          Q3-2010
    1050  2010-12-31  1          Q4-2010

    Proper way:

    date        isHoliday  FiscalYear
    2010-09-04  0          Q3-2010
    2010-12-31  1          Q4-2010



  • @dtech said:

    @Jaime said:
    Second, with a mapping table you can add a lot of attributes to each day like holiday/working day, fiscal year, quarter, or month and do some very useful aggregations.

    This can be done without a mapping table. Furthermore your approach is not normalized since duplicate information is stored. The information about dates is unique to each date and as such the date should be the key, not some stupid key from a non-related table.

    Example:

    id    date        isHoliday  FiscalYear
    840   2010-09-04  0          Q3-2010
    870   2010-09-04  0          Q3-2010
    886   2010-09-04  0          Q3-2010
    1050  2010-12-31  1          Q4-2010

     

    Proper way:

    date        isHoliday  FiscalYear
    2010-09-04  0          Q3-2010
    2010-12-31  1          Q4-2010

     

    Depending on database implementation, your model may be using a float as a foreign key.  Both may be appropriate under different circumstances.  Also, the key isn't from the other table; it is defined in the date table and is referenced from the other table, so it is as normalized as any table that uses surrogate keys.


  • @snoofle said:

    We need to add a new key column to contain the date something happened to several of our db tables.

    I'm having trouble understanding what sort of "key" this is. Part of the primary key? A foreign key? Or just some information (maybe that you want to index for later access)? Due to underlying data types, I suppose there could be a difference if you throw an index on there (like what Jaime was saying). That all seems super dependent on the implementation and the amount of data, though.

    I guess that storing the same date in multiple rows violates some ultra-pure normalization fetish, but it sure seems like extra work for no real benefit, and maybe even a net negative benefit, depending on how you then have to use it, etc.



  • The one advantage of an integer-keyed table is that you don't need to deal with magic dates for certain conditions.  Example, what if you want to track whether a date is "Not Applicable" versus "Hasn't Occurred Yet"?  With an integer-keyed table, you add a type column with (Normal, Not Applicable, and Hasn't Occurred Yet) and then leave the date column null for anything not of type Normal.  With a date-keyed table, you'd have to resort to magic dates like 9999-12-31 to designate Hasn't Occurred Yet and then a null foreign key to designate Not Applicable, or vice versa.  The integer-keyed approach is commonly seen in data warehouses and can be overkill for most scenarios.  It all really depends on what level of tracking you need.
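    A sketch of that type-column approach, in SQLite via Python; the status values, table name, and CHECK constraints are all invented for illustration and stand in for whatever integrity rules a real schema would use:

```python
import sqlite3

# A status column distinguishes Normal / Not Applicable / Hasn't
# Occurred Yet, so no magic date like 9999-12-31 is ever needed;
# the second CHECK forces a date exactly when the type is NORMAL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE event (
        event_id   INTEGER PRIMARY KEY,
        date_type  TEXT NOT NULL CHECK (date_type IN
                   ('NORMAL', 'NOT_APPLICABLE', 'NOT_YET')),
        event_date TEXT,
        CHECK ((date_type = 'NORMAL') = (event_date IS NOT NULL))
    );
    INSERT INTO event (date_type, event_date) VALUES
        ('NORMAL', '2011-05-19'),
        ('NOT_APPLICABLE', NULL),
        ('NOT_YET', NULL);
""")

# "Hasn't occurred yet" is now an explicit state, not a sentinel date:
pending = conn.execute(
    "SELECT COUNT(*) FROM event WHERE date_type = 'NOT_YET'").fetchone()[0]
print(pending)  # 1
```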




  • @Jaime said:

    if you use a native date type, it's most likely implemented as a floating point

    Wait. What?

     



  • @Jaime said:

    First, if you use a native date type, it's most likely implemented as a floating point, and using a floating point for a key could blow up in your face.

    ... what the fuck databases are you working with?

    @Jaime said:

    Second, with a mapping table you can add a lot of attributes to each day like holiday/working day, fiscal year, quarter, or month and do some very useful aggregations.

    If you need that data, you can have SQL calculate it for you and shove it into a View. Which would be a much, much cleaner way of implementing it. IMO, at least... I know everybody has their own favorite DB techniques.

    @Jaime said:

    However, I certainly wouldn't choose to use a date in the table simply because it would make the table easier to read.

    This I agree with.

    EDIT: Oops, kind of redundant reply. Sorry.
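    A rough sketch of the view idea, using SQLite's strftime to derive quarter and weekend flags on demand rather than storing them per row; the table and view names are made up:

```python
import sqlite3

# Derived calendar attributes computed in a view, not stored.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE event (event_id INTEGER PRIMARY KEY, event_date TEXT);
    INSERT INTO event (event_date) VALUES ('2010-09-04'), ('2010-12-31');
    CREATE VIEW event_calendar AS
    SELECT event_id,
           event_date,
           -- month 1-3 -> Q1, 4-6 -> Q2, etc.
           'Q' || ((CAST(strftime('%m', event_date) AS INTEGER) + 2) / 3)
               || '-' || strftime('%Y', event_date)            AS quarter,
           -- %w is day of week, '0' = Sunday, '6' = Saturday
           strftime('%w', event_date) IN ('0', '6')            AS is_weekend
    FROM event;
""")
rows = conn.execute(
    "SELECT event_date, quarter, is_weekend FROM event_calendar").fetchall()
print(rows)  # [('2010-09-04', 'Q3-2010', 1), ('2010-12-31', 'Q4-2010', 0)]
```

    Holiday flags are the one attribute a view can't derive, for the reasons discussed below, which is where a lookup table genuinely earns its keep.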



  • On our project, we have a big huge (well I thought it was huge) table full of every date from like 1000 to .. 2020?

    SELECT MIN(calendar_date), MAX(calendar_date), COUNT(*) FROM motherfucking_calendar_table

    01-JAN-1000 12.00.00 AM    17-SEP-2020 12.00.00 AM    13778

    Those among you with a calculator or some brains will notice that in 1020 years there are 372555 (or so) days, not 13778.

    Anyway, the point of this table is to quickly tell a bunch of different info about a certain day, like what fiscal week/month it belongs to, and what day of the year it is.

    Personally I think it's overkill but that's the reason it's used in our project.



  • @belgariontheking said:

    Anyway, the point of this table is to quickly tell a bunch of different info about a certain day, like what fiscal week/month it belongs to, and what day of the year it is.

    ?????

    Why can't this be computed on demand with a function or sproc or view or something??



  • @Xyro said:

    @belgariontheking said:
    Anyway, the point of this table is to quickly tell a bunch of different info about a certain day, like what fiscal week/month it belongs to, and what day of the year it is.

    ?????

    Why can't this be computed on demand with a function or sproc or view or something??

    Your confusion echoes mine.  I don't like it any more than you do.

    Recent developers have taken to putting the date right on tables in addition to creating a foreign key to the calendar table, so at least we don't have to acknowledge the calendar table most of the time.  In a couple of tables, that date was part of the partition strategy so it was significantly quicker to filter on that date than to join to the calendar table.

     



  • @blakeyrat said:

    @Jaime said:
    Second, with a mapping table you can add a lot of attributes to each day like holiday/working day, fiscal year, quarter, or month and do some very useful aggregations.
    If you need that data, you can have SQL calculate it for you and shove it into a View. Which would be a much, much cleaner way of implementing it. IMO, at least... I know everybody has their own favorite DB techniques.
    You can calculate holidays?  Easter is possible, but too messy to even think about attempting.  Things like Christmas and the Fourth of July are always the same calendar date, but the business holidays associated with them change, sometimes on a whim.



  • @Jaime said:

    You can calculate holidays?  Easter is possible, but too messy to even think about attempting.  Things like Christmas and the Fourth of July are always the same calendar date, but the business holidays associated with them change, sometimes on a whim.
     

    Of course you can, and with business holidays you just need to check if the holiday lands on Saturday (so it's Friday off) or on Sunday (so it's Monday off).
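    That rule, sketched in Python (with the caveat, raised in later posts, that real observance policies vary by jurisdiction):

```python
from datetime import date, timedelta

def observed(holiday: date) -> date:
    """Shift a fixed-date holiday to its observed weekday: Saturday
    holidays are taken on the Friday before, Sunday holidays on the
    Monday after. Real policies differ by country and employer."""
    if holiday.weekday() == 5:           # Saturday
        return holiday - timedelta(days=1)
    if holiday.weekday() == 6:           # Sunday
        return holiday + timedelta(days=1)
    return holiday

print(observed(date(2010, 12, 25)))  # Christmas 2010, a Saturday -> 2010-12-24
print(observed(date(2010, 7, 4)))    # July 4th 2010, a Sunday    -> 2010-07-05
```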



  • @blakeyrat said:

    @Jaime said:
    Second, with a mapping table you can add a lot of attributes to each day like holiday/working day, fiscal year, quarter, or month and do some very useful aggregations.

    If you need that data, you can have SQL calculate it for you and shove it into a View. Which would be a much, much cleaner way of implementing it. IMO, at least... I know everybody has their own favorite DB techniques.

    As many have mentioned, it also depends on the types of occasions you're talking about. Also, not all parts of the enterprise might observe the same days, with no obvious logic behind it (i.e., for reasons external to the app).



  • @locallunatic said:

    @Jaime said:

    You can calculate holidays?  Easter is possible, but too messy to even think about attempting.  Things like Christmas and the Fourth of July are always the same calendar date, but the business holidays associated with them change, sometimes on a whim.
     

    Of course you can, and with business holidays you just need to check if the holiday lands on Saturday (so it's Friday off) or on Sunday (so it's Monday off).

     

    Calculating Easter programmatically is insane.  If you need to track Easter, you really are just going to have to store a list of Easter dates in a table somewhere.  The simplified version of Easter date selection ("first Sunday after first full moon after March 21") is inadequate, because the real moon has nothing to do with it.  Instead, there are a series of tables and corrections that determine the placement of Easter.  The process, called computus, is staggeringly complex, especially if you need to compute Easter for arbitrary years that may be some time in the past or future, as there are two cyclical corrections (the "solar equation" and "lunar equation") that also factor into the cycle and can and do cancel out.  Furthermore, there are a number of "special corrections" to fix problems in the tables; the most recent of these moved the date of Easter for the Catholic Church in 1690.  In all, the process of determining the date of Easter IS a repeating cycle ... one that repeats every 5,700,000 years.  Although well before 5.7My has passed, changes to the length of the year, lunar month, and day will break the current system.  Also, receiving a request to write holiday-tracking code robust over several million years would be an immediate sign that new employment is necessary.



  • @Serpentes said:

    Easter

    :( This sort of nonsense is what keeps us purchasing expensive SAP contracts. :(



  • @Serpentes said:

    Calculating Easter programmatically is insane.

    Yes, also, if you're a school or something, your spring break (Easter Vacation in my day) may or may not be based at all on Easter. Not to mention that the Eastern Orthodox Church is different than everybody else.



  • @Serpentes said:

    @locallunatic said:

    @Jaime said:

    You can calculate holidays?  Easter is possible, but too messy to even think about attempting.  Things like Christmas and the Fourth of July are always the same calendar date, but the business holidays associated with them change, sometimes on a whim.
     

    Of course you can, and with business holidays you just need to check if the holiday lands on Saturday (so it's Friday off) or on Sunday (so it's Monday off).

     

    Calculating Easter programmatically is insane.  If you need to track Easter, you really are just going to have to store a list of Easter dates in a table somewhere.  The simplified version of Easter date selection ("first Sunday after first full moon after March 21") is inadequate, because the real moon has nothing to do with it.  Instead, there are a series of tables and corrections that determine the placement of Easter.  The process, called computus, is staggeringly complex, especially if you need to compute Easter for arbitrary years that may be some time in the past or future, as there are two cyclical corrections (the "solar equation" and "lunar equation") that also factor into the cycle and can and do cancel out.  Furthermore, there are a number of "special corrections" to fix problems in the tables; the most recent of these moved the date of Easter for the Catholic Church in 1690.  In all, the process of determining the date of Easter IS a repeating cycle ... one that repeats every 5,700,000 years.  Although well before 5.7My has passed, changes to the length of the year, lunar month, and day will break the current system.  Also, receiving a request to write holiday-tracking code robust over several million years would be an immediate sign that new employment is necessary.

     

    Hmm, didn't realize that Easter was so complex.  I guess I'm lucky in which holidays I need to track (I just need to know when a day isn't a business day because of a holiday, which means I don't need to care about that one).



  • @Serpentes said:

    Calculating Easter programmatically is insane.

    No it's not. It's not that simple, but it can be (and has been) done: http://en.wikipedia.org/wiki/Computus#Algorithms



  • Not exactly a "holiday," but the day federal income taxes were due (personal, at least, I think corporate stuff is often handled differently) was pushed out due to a recently (a few years ago) declared holiday in DC.



  • @PJH said:

    @Serpentes said:

     

    Calculating Easter programmatically is insane.

    No it's not. It's not that simple, but it can be (and has been) done: http://en.wikipedia.org/wiki/Computus#Algorithms
     

    There are means of computing Easter, to be sure.  And yes, this is very non-optimized, because you'd really store the substeps as variables like Wikipedia shows.  That hides the weirdness of the formula, though.  In the end, Easter algorithms wind up being akin to this (given year Y):

    month =  floor ((((19(Y mod 19) + (floor (Y / 100)) − (floor ((floor (Y / 100)) / 4)) − (floor (((floor (Y / 100)) − (floor (((floor (Y / 100)) + 8) / 25)) + 1) / 3)) + 15) mod 30) + ((32 + 2((floor (Y / 100)) mod 4) + 2(floor ((Y mod 100) / 4)) − ((19(Y mod 19) + (floor (Y / 100)) − (floor ((floor (Y / 100)) / 4)) − (floor (((floor (Y / 100)) − (floor (((floor (Y / 100)) + 8) / 25)) + 1) / 3)) + 15) mod 30) − ((Y mod 100) mod 4)) mod 7) − 7(floor (((Y mod 19) + 11((19(Y mod 19) + (floor (Y / 100)) − (floor ((floor (Y / 100)) / 4)) − (floor (((floor (Y / 100)) − (floor (((floor (Y / 100)) + 8) / 25)) + 1) / 3)) + 15) mod 30) + 22((32 + 2((floor (Y / 100)) mod 4) + 2(floor ((Y mod 100) / 4)) − ((19(Y mod 19) + (floor (Y / 100)) − (floor ((floor (Y / 100)) / 4)) − (floor (((floor (Y / 100)) − (floor (((floor (Y / 100)) + 8) / 25)) + 1) / 3)) + 15) mod 30) − ((Y mod 100) mod 4)) mod 7)) / 451)) + 114) / 31)

    day = ((((19(Y mod 19) + (floor (Y / 100)) − (floor ((floor (Y / 100)) / 4)) − (floor (((floor (Y / 100)) − (floor (((floor (Y / 100)) + 8) / 25)) + 1) / 3)) + 15) mod 30) + ((32 + 2((floor (Y / 100)) mod 4) + 2(floor ((Y mod 100) / 4)) − ((19(Y mod 19) + (floor (Y / 100)) − (floor ((floor (Y / 100)) / 4)) − (floor (((floor (Y / 100)) − (floor (((floor (Y / 100)) + 8) / 25)) + 1) / 3)) + 15) mod 30) − ((Y mod 100) mod 4)) mod 7) − 7(floor (((Y mod 19) + 11((19(Y mod 19) + (floor (Y / 100)) − (floor ((floor (Y / 100)) / 4)) − (floor (((floor (Y / 100)) − (floor (((floor (Y / 100)) + 8) / 25)) + 1) / 3)) + 15) mod 30) + 22((32 + 2((floor (Y / 100)) mod 4) + 2(floor ((Y mod 100) / 4)) − ((19(Y mod 19) + (floor (Y / 100)) − (floor ((floor (Y / 100)) / 4)) − (floor (((floor (Y / 100)) − (floor (((floor (Y / 100)) + 8) / 25)) + 1) / 3)) + 15) mod 30) − ((Y mod 100) mod 4)) mod 7)) / 451)) + 114) mod 31) + 1

    I hope you comment appropriately!  Also, quite a few of the published Easter algorithms are wrong, or only work for a set window of 100-200 years or so.  I'm not sure if that one above has an "expiration date", although I think it's actually robust, excepting the dates changed by the Catholic Church manually.  Of course, Easter isn't the only moveable holiday, especially if you're working worldwide.  Diwali is based on the actual lunar calendar.  Eid al-Fitr is linked to the Islamic calendar, which is also a lunar calendar.  However, the observed date of Eid al-Fitr sometimes varies by a day between countries because of differences in lunar phase visibility; also, periodically Eid al-Fitr occurs twice within a single Gregorian calendar year.

    If you're doing something where these events matter to you, it's probably in everyone's best interests to just put them in a table.
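    For what it's worth, the same anonymous Gregorian computus is far less frightening once the repeated subexpressions are stored as named variables, as the Wikipedia article presents it; a sketch:

```python
def gregorian_easter(year: int) -> tuple[int, int]:
    """Anonymous Gregorian (Meeus/Jones/Butcher) computus: returns
    (month, day) of Western Easter Sunday. This is the same formula
    as the one-liner above, with the substeps given names."""
    a = year % 19                       # position in the 19-year Metonic cycle
    b, c = divmod(year, 100)            # century and year-of-century
    d, e = divmod(b, 4)
    f = (b + 8) // 25
    g = (b - f + 1) // 3
    h = (19 * a + b - d - g + 15) % 30  # epact-derived offset
    i, k = divmod(c, 4)
    l = (32 + 2 * e + 2 * i - h - k) % 7
    m = (a + 11 * h + 22 * l) // 451    # correction for late full moons
    month, day = divmod(h + l - 7 * m + 114, 31)
    return month, day + 1

print(gregorian_easter(2011))  # (4, 24) -- Easter Sunday 2011 was April 24
```

    Like the expanded version, it carries no stated expiration date, but it only covers the Gregorian calculation; it won't help with Orthodox Easter or with manual adjustments.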



  • @Serpentes said:

    However, the observed date of Eid al-Fitr sometimes varies by a day between countries because of differences in lunar phase visibility
    I was thinking of mentioning Ramadan at some point, and how the relevant dates are actually decided at the whim (well, not quite) of one person. Something I found slightly weird one year when I was out in the Middle East: Eid was supposed to be one day, then it was delayed. (I didn't realise it at the time, since all the previous times I'd had cause to be out there during Ramadan, Eid was on the expected day.)



  • @PJH said:

    @Serpentes said:
    However, the observed date of Eid al-Fitr sometimes varies by a day between countries because of differences in lunar phase visibility
    I was thinking of mentioning Ramadan at some point, and how the relevant dates are actually decided at the whim (well, not quite) of one person. Something I found slightly weird one year when I was out in the Middle East: Eid was supposed to be one day, then it was delayed. (I didn't realise it at the time, since all the previous times I'd had cause to be out there during Ramadan, Eid was on the expected day.)

    Yep.  In most Islamic countries, the months are not a fixed length.  They are either 29 or 30 days long, but that length is not determined beforehand.  If the thin crescent moon cannot be seen after the sunset on the 29th day of the month, then the month doesn't end for an extra day.  Sometimes this is caused by astronomy, but bright western skies or simply weather can make the month longer; it is based on literal observations, approved by the governing Islamic authority (unless they ignore the real observations and just make a declaratory statement).  Eid al-Fitr is the first day of the month after Ramadan.  Thus, if Ramadan runs long, the holiday gets bumped back a day. Ramadan is subject to that sort of adjustment less often than random other months, due to its importance, but it still happens sometimes.

    Other countries, including Egypt and Indonesia, have established rule sets for their calendars, although they haven't been entirely strict about sticking to their own rules.

    Some Muslim communities outside the Islamic world use a fixed Islamic calendar, called the Fatimid calendar, that avoids all these problems, but it's not recognized anywhere that has Islam as a state religion.  Microsoft products, like Office, convert between Gregorian and Islamic calendar dates using a custom algorithm that winds up being the same as the Fatimid calendar.  In the US, the media tends to pretend that things in the Islamic world happen according to the Fatimid calendar, which would make bookkeeping nice, but is far from a safe assumption.



  •  Using non-meaningful keys, like ID fields, is always wrong. A database contains data about reality, and an ID field never matches anything real.

    Besides, ID fields hurt performance: to get your data you need to do a join, which is always more expensive than having the date stored in the right table directly. Not to mention that dates (especially with ms precision) are unique anyway. Not that date fields should have a unique constraint; there is no reason you can't have 2 different events on the same date.

     

    However, and I quote you, "We need to add a new key column to contain the date something happened to several of our db tables."  This indicates you need to add the same date to several tables. That is a normalization problem by default. If that's the case, you need to identify what data connects these tables, and make a new table that uses the connecting data as its PK, with a new date field (not part of the PK) where the new data is added.

    This is what's meant by the 3NF rule "Every non-prime attribute is non-transitively dependent on every candidate key in the table." Every sane database design should comply with that.



  • @locallunatic said:

    Of course you can, and with business holidays you just need to check if the holiday lands on Saturday (so it's Friday off) or on Sunday (so it's Monday off).
     

    Around here (Australia, for those who can't look left) if the holiday falls on a weekend then it is moved to the Monday. If Christmas is on a Saturday then both the Monday and Tuesday are holidays (due to the Boxing Day holiday).

    However, if Anzac Day falls on a Saturday there's no substitute holiday (there is if it falls on a Sunday).

    And this year Anzac Day fell on Easter Monday (both public holidays) so that's another collision! (I made a thread in GD about having five consecutive days off.)

    Then there are local public holidays (for example local show holidays, which are different in local council areas, and there are state-based holidays) so you'll probably need to determine a fairly fine location. Where I work and live are different areas for show holidays (I would get the Ekka holiday at home and the Gold Coast Show holiday at work).

    The only way to allocate holidays is from a pre-determined list/table, not programmatically. If you had an algorithm that could accurately predict holidays then I'm sure you could make some slight modifications and get it to predict the stock market. :-)



  • @Dorus said:

    Besides, ID fields hurt performance: to get your data you need to do a join, which is always more expensive than having the date stored in the right table directly.

    Speaking of normalization and performance in the same breath is a bit weird. Normalization has always been about data integrity. Good performance often requires denormalization.


    @Dorus said:

     Using non-meaningful keys, like ID fields, is always wrong. A database contains data about reality, and an ID field never matches anything real.

    How do you propose a database handle situations where there is no natural key, like a list of people? I've found that the only statements that are always wrong are statements that contain the word "always". Unfortunately, this rule nullifies itself.



  • @Serpentes said:

    If you're doing something where these events matter to you, it's probably in everyone's best interests to just put them in a table.

    But before you can put them in a table you still have to figure out the dates somehow - so you're not solving that problem, you're just moving it to a different domain, at the additional cost of now also having to maintain the table to make sure your precomputed holiday dates don't run out unexpectedly.



  • @Jaime said:

    Speaking of normalization and performance in the same breath is a bit weird. Normalization has always been about data integrity. Good performance often requires denormalization.
    I have yet to see an example of this. Good performance always requires things like good indexes; it hardly ever requires denormalization. Poorly normalized databases either suffer data integrity problems, or have so much checking going on that it hurts performance so much that normalization wouldn't have made a difference.

    Imagine you store your employees' living address both in the employee table and the pay-check table, to spare an inner join. Now you need to add a check that these 2 address columns always update at the same time, or you run into trouble when an employee moves and only 1 field is updated (thus your data is bogus).

     @Jaime said:

    How do you propose a database handle situations where there is no natural key, like a list of people? I've found that only only statements that are always wrong are statements that contain the word "always". Unfortunately, this rule nullifies itself.
    How do you know what person belongs to what row if you have 5 times Mister John Doe in your table, and 1 of them stands in front of you and asks you to change his name to Karl Doe? You are probably going to ask his birthday, but then still 2 John Does remain, and there is no guarantee anything else you ask him will make him unique in your database. Not to mention how much fun it is to make a web form for this. "If name is not unique, ask for birthday, then ask for ..., then ask for ...".

    First and last name can hardly be considered a candidate key, since multiple people can have the same name. Some organizations rely on birthday, but even that fails at times. In the end you will need to add something to a user that also exists in the real world. A good PK would be a username, or, for example, my school resorts to student numbers. A student number is, unlike an identity field, something that exists in the real world. My student number is printed on my student ID, on all letters the school sends me, and is also a valid username when I log into the school website.
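    The update anomaly Dorus describes can be made concrete in SQLite; all names here are invented, and of course a real paycheck table might deliberately keep the historical address, which is its own debate:

```python
import sqlite3

# The employee's address is denormalized into the paycheck table,
# then only one copy is updated when the employee moves.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employee (emp_id INTEGER PRIMARY KEY, address TEXT);
    CREATE TABLE paycheck (
        check_id INTEGER PRIMARY KEY,
        emp_id   INTEGER REFERENCES employee(emp_id),
        address  TEXT                    -- duplicated from employee
    );
    INSERT INTO employee VALUES (1, '12 Old Street');
    INSERT INTO paycheck (emp_id, address) VALUES (1, '12 Old Street');

    -- The employee moves, but only one copy gets updated:
    UPDATE employee SET address = '99 New Avenue' WHERE emp_id = 1;
""")
row = conn.execute("""
    SELECT e.address, p.address
    FROM employee e JOIN paycheck p ON p.emp_id = e.emp_id
""").fetchone()
print(row)  # ('99 New Avenue', '12 Old Street') -- the copies now disagree
```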



  • @Zemm said:

    And this year Anzac Day fell on Easter Monday (both public holidays) so that's another collision! (I made a thread in GD about having five consecutive days off)

    Yes, I was going to mention this year's holidays as an example of why you can't just compute them programmatically - for those who didn't read Zemm's post in GD (I haven't), the Anzac Day holiday was kept on Monday and the Easter Monday holiday was on Tuesday. Would've been even more fun if Anzac Day had been on Easter Sunday. :)

    @Anonymouse said:

    @Serpentes said:

    If you're doing something where these events matter to you, it's probably in everyone's best interests to just put them in a table.

    But before you can put them in a table you still have to figure out the dates somehow - so you're not solving that problem, you're just moving it to a different domain, at the additional cost of now also having to maintain the table to make sure your precomputed holiday dates don't run out unexpectedly.

    The difference is that the problem may be solvable in the other domain. For instance, you can wait until the year's public holiday dates have been gazetted by the appropriate branch of government, and then look them up and put them into your table. Good luck trying to do that algorithmically.

    @Dorus said:

    @Jaime said:
    @Dorus said:
    Using non-meaningful keys, like ID fields, is always wrong. A database contains
    data about reality, and an ID field never matches anything real.
    How do you propose a database handle situations where there is no natural key, like a list of people? I've found that the only statements that are always wrong are statements that contain the word "always". Unfortunately, this rule nullifies itself.
    How do you know what person belongs to what row if you have 5 times Mister John Doe in your table, and 1 of them stands in front of you and asks you to change his name to Karl Doe? You are probably going to ask his birthday, but then still 2 John Does remain, and there is no guarantee anything else you ask him will make him unique in your database. Not to mention how much fun it is to make a web form for this. "If name is not unique, ask for birthday, then ask for ..., then ask for ...".


    If he's standing in front of you, you can look to see what differences there are between the records in your database, and then ask questions based on that. Harder to do on a webform, I admit. But you seem to be arguing against your own point here - this is why an ID field is good, because it guarantees that there will be something unique about that user's record, and furthermore you know which field will be unique. So now you just have to ask him for his ID, and you know which record to update.

    @Dorus said:

    First and last name can hardly be considered a candidate key, since multiple people can have the same name. Some organizations rely on birthday, but even that fails at times. In the end you will need to add something to a user that also exists in the real world. A good PK would be a username, or, for example, my school resorts to student numbers. A student number is, unlike an identity field, something that exists in the real world. My student number is printed on my student ID, on all letters the school sends me, and is also a valid username when I log into the school website.


    Are you deliberately trying to undermine your own argument? Do you think that your student number represented something in the real world before your school allocated it to you? Your student number is a prime example of the kind of ID field you are saying is not necessary and "always wrong" to use. Once your student number has been assigned to you, of course it can be used on many things. That is the point of an ID field, so that anything that needs to refer to a specific person can use that ID and be guaranteed that it always refers to only the intended person.



  • @Dorus said:

    A good PK would be a username, or, for example, my school resorts to student numbers. A student number is, unlike an identity field, something that exists in the real world. My student number is printed on my student ID, on all letters the school sends me, and is also a valid username when I log into the school website.

    OK, so we can't create our own id fields unless we print them on cards?



  •  @Scarlet Manuka said:

    If he's standing in front of you, you can look to see what differences there are between the records in your database, and then ask questions based on that. Harder to do on a webform, I admit. But you seem to be arguing against your own point here - this is why an ID field is good, because it guarantees that there will be something unique about that user's record, and furthermore you know which field will be unique. So now you just have to ask him for his ID, and you know which record to update.

    Good, so we agree that rows in a database should be unique based on something that's unique in the real world.

     

    Now I challenge you to ask Mister Datetime what his ID is. If you can't, then it's probably not a good idea to add an ID to your event's date.

     @Scarlet Manuka said:

    If he's standing in front of you, you can look to see what differences there are between the records in your database, and then ask questions based on that.

     Yes, but there is no guarantee there's even a difference between those records, since you did not put a constraint on your database that enforces unique records other than the ID field. And how does Mister John Doe know his ID? It was generated for him in the background when he first ordered a marble from your company, too lazy to fill in his real name in the request form.

    Or are you going to communicate his ID to him after he places his first order? Seems like you are now building your business logic around your database instead of the other way around.

     

    @boomzilla said:

    @Dorus said:
    A good PK would be a username, or, for example, my school resorts to student numbers. A student number is, unlike an identity field, something that exists in the real world. My student number is printed on my student ID and all letters the school sends me, and is also a valid username when I log into the school website.

    OK, so we can't create our own id fields unless we print them on cards?

    Well, if that's how you decide to identify your rows: yes. Most websites use usernames as a unique identifier, but we all know the fun with helpdesk support numbers, customer numbers, tracking numbers, etc. All these are perfectly valid solutions, and very much printed somewhere.

     



  • @Dorus said:

    @Jaime said:
    Speaking of normalization and performance in the same breath is a bit weird. Normalization has always been about data integrity. Good performance often requires denormalization.
    I have yet to see an example of this. Good performance always requires things like good indexes; it hardly ever requires denormalization.



    yes you are right, that is why data warehouses, olap cubes and such are so unpopular



  • @Dorus said:

    Good, so we agree that rows in a database should be unique based on something that's unique in the real world.
    @boomzilla said:
    @Dorus said:
    A good PK would be a username, or, for example, my school resorts to student numbers. A student number is, unlike an identity field, something that exists in the real world. My student number is printed on my student ID and all letters the school sends me, and is also a valid username when I log into the school website.

    OK, so we can't create our own id fields unless we print them on cards?

    Well, if that's how you decide to identify your rows: yes. Most websites use usernames as a unique identifier, but we all know the fun with helpdesk support numbers, customer numbers, tracking numbers, etc. All these are perfectly valid solutions, and very much printed somewhere.

    So... if we enforce natural keys whenever possible, what happens when one changes? Suppose that, like the federal government (against explicit statute language), you use people's Social Security numbers as your PK. It's possible for people to change them (e.g., after being a victim of identity theft). There are many other situations that are more domain specific. But either way, if that was your PK, then you have to update all references to it when it changes.



  • @Dorus said:

     Using non-meaningful keys, like ID fields, is always wrong. A database contains data about reality, and an ID field never corresponds to something real.
     

    I'm not sure if you only mean this in the context of snoofle's situation, but the second sentence suggests not. ID fields come into their own if you need to store references to a few rows from other tables (say, in table Z) temporarily for some reason, as Z only needs a table-name field and a number field, rather than needing fields for the primary key of each table it can store rows from.



  • @boomzilla said:

    you use people's social security numbers as your PK?
    As the federal government has no doubt found out by now, SSNs are not good candidates for primary keys, not because of the one-to-many connection between people and SSNs that you imply, but because there's sometimes a many-to-many connection between them. The most well known, I think, being Hilda Schrader Whitcher's, when lots of people decided to use hers.



  • @PJH said:

    @boomzilla said:
    you use people's social security numbers as your PK?
    As the federal government has no doubt found out by now, SSNs are not good candidates for primary keys, not because of the one-to-many connection between people and SSNs that you imply, but because there's sometimes a many-to-many connection between them. The most well known, I think, being Hilda Schrader Whitcher's, when lots of people decided to use hers.

    Yes, but it's still done a lot, which is why I used it as an example (also it fits the "printed on a card" nonsense). As I said, it's easy to think of other examples in many different domains, especially once you get away from tables that identify people. Obviously, we can create usernames for people who need to log into a system. This is not really any different than using a GUID or a number from a sequence, conceptually. And I don't think that just because someone has it printed on a card, it magically becomes "OK, because it's something from the real world."



  •  @Nelle said:

    yes you are right, that is why data warehouses, olap cubes and such are so unpopular
    Data for a data warehouse always comes from a primary system that does enforce constraints and data integrity, usually a normalized database. The whole point of a data warehouse is to skip those steps in favour of (read) performance. Even a data warehouse will have a certain degree of normalization. Most of the performance gains come from ignoring business rules and constraints, because those are already enforced by the first layer.

    Even then, this is not very common, as you can also get good report data from a normalized database; a view and an index will dramatically speed this up without losing normalization. (It does slow down write actions, just like denormalization would, but at least you won't need any insane data-integrity checking.)

     @boomzilla said:

    So...if we enforce natural keys whenever possible, what happens when it changes

    Last time I checked, there was no reason a PK couldn't change. It only has to be unique.

    @boomzilla said:

    There are many other situations that are more domain specific. But either way, if that was your PK, then you have to update all references to it when it changes.

    Because "ON UPDATE CASCADE" is sooo difficult to use. You are talking about an FK, btw, not a PK.

     

    @__moz said:

    @Dorus said:

     Using non-meaningful keys, like ID fields, is always wrong. A database contains data about reality, and an ID field never corresponds to something real.

    I'm not sure if you only mean this in the context of snoofle's situation, but the second sentence suggests not. ID fields come into their own if you need to store references to a few rows from other tables (say, in table Z) temporarily for some reason, as Z only needs a table-name field and a number field, rather than needing fields for the primary key of each table it can store rows from.

    I'm advocating that reference fields should store "meaningful data".

    Besides, I'm not exactly sure what you're trying to say. Table Z needs an FK that refers to a table, certainly not a column with a table name (WTF). Also, what's wrong with needing the fields of the primary key of the referenced tables? Unless your database designer is Notepad, most of them let you drag an arrow from one table to the other and add the required columns for you. Not to mention the data in those PK columns directly refers to data in table Z. If you store an ID instead of the PK, you will need to join table Z and the others every time you use it. If you store the PK instead, this can be avoided half the time.

    @PJH said:

    @boomzilla said:
    you use people's social security numbers as your PK?
    As the federal government has no doubt found out by now, SSNs are not good candidates for primary keys, not because of the one-to-many connection between people and SSNs that you imply, but because there's sometimes a many-to-many connection between them. The most well known, I think, being Hilda Schrader Whitcher's, when lots of people decided to use hers.

    If storing people who use fake SSNs in your database is a high priority, using SSN as a PK might not fit. However, storing many-to-many relations is hardly an issue in databases; you just need a wider PK.
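Since cascading key updates keep coming up in this argument, here is a minimal sketch of what ON UPDATE CASCADE actually does, using Python's bundled sqlite3 purely for illustration (all table and column names are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs with this on

# Natural key: the username itself is the PK; child rows reference it.
conn.execute("CREATE TABLE users (username TEXT PRIMARY KEY)")
conn.execute("""
    CREATE TABLE orders (
        order_no INTEGER PRIMARY KEY,
        username TEXT NOT NULL
            REFERENCES users(username) ON UPDATE CASCADE
    )
""")
conn.execute("INSERT INTO users VALUES ('jdoe')")
conn.executemany("INSERT INTO orders (username) VALUES (?)", [("jdoe",)] * 3)

# Renaming the user rewrites every referencing row: the engine does it
# for you, but it is still three physical row updates, not one.
conn.execute("UPDATE users SET username = 'john.doe' WHERE username = 'jdoe'")
print(conn.execute("SELECT DISTINCT username FROM orders").fetchall())
```

The cascade is one line of DDL to write, but the single logical rename became three physical row updates here; on a table with millions of referencing rows, that hidden write (and locking) cost is exactly what is being debated.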

     



  • @Dorus said:

    @boomzilla said:
    But either way, if that was your PK, then you have to update all references to it when it changes.

    Because "ON UPDATE CASCADE" is sooo difficult to use. You are talking about an FK, btw, not a PK.

    Oh, yeah, let's inject performance issues just to keep to a notion of natural key purity. Yes, I was talking about having to update all of the FK references to your PK. So, in fact I was talking about both.

    @Dorus said:

    Using non-meaningful keys, like ID fields, is always wrong. A database contains data about reality, and an ID field never corresponds to something real.

    Sorry, but this is still stupid. A database is a tool to be used by an application. If using arbitrary keys improves the development, maintenance, performance or function of the database, then forcing the use of a natural key seems silly. What if your natural key is composite? Now, so are your FKs. That certainly can have performance implications. Theoretical purity is nice, but it rarely survives the real world.
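For the record, the usual compromise between the two camps can be sketched in a few lines (Python's sqlite3 for illustration; all names invented): a surrogate PK for references, with the natural key still enforced by a UNIQUE constraint.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A common middle ground: a surrogate integer PK for joins, plus a UNIQUE
# constraint so the natural ("printed on a card") key is still enforced.
conn.execute("""
    CREATE TABLE students (
        student_id INTEGER PRIMARY KEY,   -- surrogate; never changes
        student_no TEXT NOT NULL UNIQUE   -- natural key; may change
    )
""")
conn.execute("INSERT INTO students (student_no) VALUES ('S-2011-0042')")

# The natural key can change without touching any referencing rows,
# because FKs elsewhere would point at student_id instead.
conn.execute("UPDATE students SET student_no = 'S-2011-9999' WHERE student_id = 1")
print(conn.execute("SELECT student_id, student_no FROM students").fetchone())
```

This keeps the integrity argument (the real-world key is unique) and the practicality argument (references never need cascading updates) at the same time.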



  • @boomzilla said:

    Oh, yeah, let's inject performance issues just to keep to a notion of natural key purity. Yes, I was talking about having to update all of the FK references to your PK. So, in fact I was talking about both.
    So now you want fast write performance instead of read performance? Joining all those tables each time you need to do a read is a lot more work than updating 2-3 values with ON UPDATE CASCADE when you change one.

    @boomzilla said:

    Sorry, but this is still stupid. A database is a tool to be used by an application. If using arbitrary keys improves the development, maintenance, performance or function of the database, then forcing the use of a natural key seems silly. What if your natural key is composite? Now, so are your FKs. That certainly can have performance implications. Theoretical purity is nice, but it rarely survives the real world.

    Can you explain how this improves development or maintenance? Last time I checked, normalization especially helped development and maintenance. As for performance, the more data you put in one table, the fewer joins you need. Thus more performance. This is especially nice if you remember that normalization just allowed you to put more data in a single table.



  • @Dorus said:

    @boomzilla said:
    Oh, yeah, let's inject performance issues just to keep to a notion of natural key purity. Yes, I was talking about having to update all of the FK references to your PK. So, in fact I was talking about both.

    So now you want fast write performance instead of read performance? Joining all those tables each time you need to do a read is a lot more work than updating 2-3 values with ON UPDATE CASCADE when you change one.

    I'm generally interested in some balance of write and read performance, depending on the circumstances. If all you need is just that simple reference value, then I guess your comment makes sense. But assuming you need anything else (which is probably pretty often--at least it is for me), you're going to have to do the join anyways. Also, what could / should be updating a single column in a single row can suddenly balloon into updating thousands or millions of other rows. Not to mention all of the locks that would create. I don't hate my users that much (yet).

    @Dorus said:

    @boomzilla said:
    Sorry, but this is still stupid. A database is a tool to be used by an application. If using arbitrary keys improves the development, maintenance, performance or function of the database, then forcing the use of a natural key seems silly. What if your natural key is composite? Now, so are your FKs. That certainly can have performance implications. Theoretical purity is nice, but it rarely survives the real world.

    Can you explain how this improves development or maintenance? Last time I checked, normalization especially helped development and maintenance. As for performance, the more data you put in one table, the fewer joins you need. Thus more performance. This is especially nice if you remember that normalization just allowed you to put more data in a single table.

    You're all over the place here. I guess it's possible that you deal with situations where the reference key is all that you need. Then yeah, I guess I can understand how the performance would be better for you. But if you need that data, then you need that data. Sometimes you need further joins, too, in order to wade through the hierarchy of objects / tables / data / whatever. My point is that in this case, you've probably made the DB use a lot more data in its indices (due to composite keys vs a simple key), which also means it has to look at more information to do the join.

    And ON UPDATE CASCADE is not a magical device. If you don't see the performance implications of that, you're not paying attention.
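To make the composite-key cost concrete, here is a toy sketch (Python's sqlite3, invented names): with a composite natural key, every child table has to carry all the key columns, and every join has to compare all of them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Composite natural key: every child table must carry (and index)
    -- all three columns, and every join must compare all three.
    CREATE TABLE people (
        first_name TEXT, last_name TEXT, dob TEXT,
        PRIMARY KEY (first_name, last_name, dob)
    );
    CREATE TABLE orders (
        order_no INTEGER PRIMARY KEY,
        first_name TEXT, last_name TEXT, dob TEXT,
        FOREIGN KEY (first_name, last_name, dob)
            REFERENCES people (first_name, last_name, dob)
    );
    INSERT INTO people VALUES ('John', 'Doe', '1980-01-01');
    INSERT INTO orders (first_name, last_name, dob)
        VALUES ('John', 'Doe', '1980-01-01');
""")

# A three-column join where a surrogate key would need one column.
n = conn.execute("""
    SELECT COUNT(*) FROM orders o
    JOIN people p ON p.first_name = o.first_name
                 AND p.last_name  = o.last_name
                 AND p.dob        = o.dob
""").fetchone()[0]
print(n)
```

Every index on the child side now covers three columns instead of one integer, which is the "more data in its indices" point above.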



  • @Dorus said:

    @Jaime said:
    Speaking of normalization and performance in the same breath is a bit weird. Normalization has always been about data integrity. Good performance often requires denormalization.
    I have yet to see an example of this. Good performance always requires things like good indexes; it hardly ever requires denormalization.



    @Dorus said:

     @Nelle said:

    yes you are right, that is why data warehouses, olap cubes and such are so unpopular
    Data for a data warehouse always comes from a primary system that does enforce constraints and data integrity, usually a normalized database. The whole point of a data warehouse is to skip those steps in favour of (read) performance. Even a data warehouse will have a certain degree of normalization. Most of the performance gains come from ignoring business rules and constraints, because those are already enforced by the first layer.

    Even then, this is not very common, as you can also get good report data from a normalized database; a view and an index will dramatically speed this up without losing normalization. (It does slow down write actions, just like denormalization would, but at least you won't need any insane data-integrity checking.)




    i still do not agree with the notion that denormalization is "hardly ever required".

    i've worked on projects where databases were in 3NF, all indexes were optimised for the executing queries, tables partitioned, servers top-notch, and still we had to resort to denormalisation in order to create reports for managers who want to see EVERY FIELD from EVERY database on ONE BLOODY SCREEN. (and preferably formatted as a spreadsheet, but NOT exported to Excel, because then it is not REAL TIME, although they would be happiest if we would connect the Excel to the database, which I know is possible by the way, but please don't tell THEM that)

    and judging from some stories, this is not at all uncommon ...



  • @Dorus said:

    Can you explain how this improves development or maintenance? Last time I checked, normalization especially helped development and maintenance. As for performance, the more data you put in one table, the fewer joins you need. Thus more performance. This is especially nice if you remember that normalization just allowed you to put more data in a single table.

    What is this magical database system you're using where joins affect performance? Joins only slow you down if:

    1) Your PKs are ridiculously huge (first_name, last_name, DOB, user_name)

    2) You're joining in tables that are irrelevant to the query (joining in "car_owned", then never referring to it in another clause)

    Have you actually benchmarked any of this, or are you just echoing some bearded Unix geek who was an expert in OS/400 databases in 1986, and just assuming that's still relevant?

    EDIT: this thread is ridockulous. Is it just full of trolls? I'm starting to think so.
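That claim is easy to poke at on a toy schema (Python's sqlite3, invented names): with the FK column indexed, the join resolves through index lookups rather than scans, which is why joins on sane keys stay cheap.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("""
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id)
    )
""")
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
conn.executemany("INSERT INTO customers (name) VALUES (?)",
                 [("cust%d" % i,) for i in range(1000)])
conn.executemany("INSERT INTO orders (customer_id) VALUES (?)",
                 [(i % 1000 + 1,) for i in range(5000)])

# The query plan shows index lookups on both sides, not table scans.
plan = conn.execute("""
    EXPLAIN QUERY PLAN
    SELECT o.id FROM customers c
    JOIN orders o ON o.customer_id = c.id
    WHERE c.id = 42
""").fetchall()
print(plan)

n = conn.execute("""
    SELECT COUNT(*) FROM customers c
    JOIN orders o ON o.customer_id = c.id
    WHERE c.id = 42
""").fetchone()[0]
print(n)
```

With 5,000 orders spread evenly over 1,000 customers, the join touches only the handful of rows the index points at.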



  • @Nelle said:

    i still do not agree with the notion that denormalization is "hardly ever required".

    i've worked on projects where databases were in 3NF, all indexes were optimised for the executing queries, tables partitioned, servers top-notch, and still we had to resort to denormalisation in order to create reports for managers who want to see EVERY FIELD from EVERY database on ONE BLOODY SCREEN. (and preferably formatted as a spreadsheet, but NOT exported to Excel, because then it is not REAL TIME, although they would be happiest if we would connect the Excel to the database, which I know is possible by the way, but please don't tell THEM that)

    and judging from some stories, this is not at all uncommon ...

    That still sounds like "hardly ever required" to me. How much of your database's traffic was dedicated to the denormalized view? Did it ever hit 1/10th of 1%?

    It would be stupid to denormalize tables based on a requirement that, statistically speaking, never happens, and can easily be implemented with a view into the normalized tables.
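A sketch of that last point (Python's sqlite3, invented names): the "every field on one screen" report lives in a view, while the base tables stay normalized.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE departments (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE employees (
        id INTEGER PRIMARY KEY,
        name TEXT,
        dept_id INTEGER REFERENCES departments(id)
    );
    INSERT INTO departments (name) VALUES ('Sales'), ('IT');
    INSERT INTO employees (name, dept_id) VALUES ('Alice', 1), ('Bob', 2);

    -- The wide report is just a view over the normalized tables.
    CREATE VIEW everything_report AS
        SELECT e.id AS employee_id, e.name AS employee, d.name AS department
        FROM employees e JOIN departments d ON d.id = e.dept_id;
""")
rows = conn.execute(
    "SELECT * FROM everything_report ORDER BY employee_id").fetchall()
print(rows)
```

The reporting query reads one flat "table" while the writes still go through the normalized schema with its constraints intact.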



  • @blakeyrat said:

    @Nelle said:
    i still do not agree with the notion that denormalization is "hardly ever required".

    i've worked on projects where databases were in 3NF, all indexes were optimised for the executing queries, tables partitioned, servers top-notch, and still we had to resort to denormalisation in order to create reports for managers who want to see EVERY FIELD from EVERY database on ONE BLOODY SCREEN. (and preferably formatted as a spreadsheet, but NOT exported to Excel, because then it is not REAL TIME, although they would be happiest if we would connect the Excel to the database, which I know is possible by the way, but please don't tell THEM that)

    and judging from some stories, this is not at all uncommon ...

    That still sounds like "hardly ever required" to me. How much of your database's traffic was dedicated to the denormalized view? Did it ever hit 1/10th of 1%?

    It would be stupid to denormalize tables based on a requirement that, statistically speaking, never happens, and can easily be implemented with a view into the normalized tables.


    in these cases you cannot use database traffic as a measurement, or when you do, you have to weight it with an "importance" factor.
    for example, when busy worker bees hit the database 20000 times and everything goes snappy, and the queen bee hits the database only once and has to wait 5 seconds for its data, you'll still get to hear how "horribly slow" the application is, and your own queen bee is going to order you to "fix" it.



  • @Dorus said:

    As for performance, the more data you put in one table, the fewer joins you need. Thus more performance. This is especially nice if you remember that normalization just allowed you to put more data in a single table.

    Head asplode.

    Normalization almost never creates more columns; it almost always pushes data out to other tables. You aren't talking about the benefits of normalization, you are talking about the benefits of natural keys. The debate between intelligent and surrogate keys is far from over and far from one-sided. I don't have a problem with intelligent keys when they are appropriate, but I do have a problem with your insistence that surrogate keys are never the answer. Your arguments also reek of premature optimization.



  • @Nelle said:

    in these cases you cannot use database traffic as a measurement, or when you do, you have to weight it with an "importance" factor.

    for example, when busy worker bees hit the database 20000 times and everything goes snappy, and the queen bee hits the database only once and has to wait 5 seconds for its data, you'll still get to hear how "horribly slow" the application is, and your own queen bee is going to order you to "fix" it.

    So grow a fucking spine and explain to your queen bee that she's an idiot.

    Then stop trying to fix people problems with technology.

