Primary Key/Clustered Index



  • Okay,

    I generally don't post much worthwhile stuff, but here's one. Something I find a bit irksome about SQL Server: in the table designer (yes, I use it, so what!), if you select a field to be your primary key, it creates a clustered index. My understanding of clustered indexes was that they should be used to reduce the number of records that need to be scanned. For example (to make it relevant to message boards), you have a table of posts. You might have a clustered index on the posts table using TopicID as its key, rather than the default PostID. What this means is that the rows will no longer be physically arranged by their primary key, but rather by their topic (i.e. when you list the posts in a topic, you narrow the number of rows you have to look at, using another index on a column like PostTime to sort the (much shorter) list).

    My question is: what is the reason for SQL Server using a clustered index on a primary key, and are there situations where this is a strong advantage? (Yes, it should be indexed, that's not in question...)


  • ♿ (Parody)

    First (and you seemed to expect this), you should get in the habit of writing DDL instead of using Enterprise Manager. As the system grows, it becomes migration hell to move between environments (Dev, Test, UAT, Prod, etc.) without DDL scripts and change management. Another advantage is that you have control and can explicitly see everything that's done.

    Second, fields do not exist in a relational database. Columns do. There is a fundamental difference between the two terms. Ditto for rows versus records. Using the proper terms ensures a strong mental model of the domain; think of mixing up your derivatives and integrals in calculus.

    That said, a clustered index will physically order the rows as they are inserted/updated, based on the column(s) indexed. This also happens to be why only one clustered index is permitted per table. SQL Server defaults to a clustered index on the primary key because PKs are unique (therefore ensuring an even spread of rows across pages with no "hotspots") and because PKs don't change very often (or shouldn't, at least).

    As you indicated, a TopicID is a better candidate for your clustered index because you'll often be pulling an entire Topic's worth of posts. Since they're close together physically, it saves a lot of IO. But if you don't have such a requirement, you're usually best sticking your clustered index on the PK so you've got a good random distribution across pages.
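
    To make that concrete, here's a minimal DDL sketch of the arrangement described above (the table and column names come from the example; the types and index names are my own assumptions):

    CREATE TABLE Posts (
        PostID   int      NOT NULL,
        TopicID  int      NOT NULL,
        PostTime datetime NOT NULL,
        -- The PK stays unique, but is declared nonclustered on purpose...
        CONSTRAINT PK_Posts PRIMARY KEY NONCLUSTERED (PostID)
    )

    -- ...so that the single clustered index allowed per table can go on
    -- TopicID, physically grouping each topic's posts together on disk.
    CREATE CLUSTERED INDEX IX_Posts_TopicID ON Posts (TopicID)

    -- An ordinary index to sort the (much shorter) per-topic list by time.
    CREATE NONCLUSTERED INDEX IX_Posts_PostTime ON Posts (TopicID, PostTime)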



  • @Alex Papadimoulis said:

    First (and you seemed to expect this), you should get in the habit of writing DDL instead of using Enterprise Manager. As the system grows, it becomes migration hell to move between environments (Dev, Test, UAT, Prod, etc.) without DDL scripts and change management. Another advantage is that you have control and can explicitly see everything that's done.


    It's a hard habit to break when everyone around you is in the same habit, but it tends to really come back to haunt you when you need to build scripts for changes to tables... Past employers of mine had a rule against using anything but DDL scripts, because the script that you used to change the schema had to be added to source control, and also needed to be combined later with other scripts when an update was sent to the data center.
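
    For instance, a change might get checked in as a small, rerunnable script; this is purely a hypothetical sketch (the file name and index are invented):

    -- 0042_add_posts_topic_index.sql -- lives in source control with the code change
    IF NOT EXISTS (SELECT * FROM sysindexes WHERE name = 'IX_Posts_TopicTime')
        CREATE NONCLUSTERED INDEX IX_Posts_TopicTime ON Posts (TopicID, PostTime)
    GO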

    Second, fields do not exist in a relational database. Columns do. There is a fundamental difference between the two terms. Ditto for rows versus records. Using the proper terms ensures a strong mental model of the domain; think of mixing up your derivatives and integrals in calculus.

    I don't know why, but I've always had difficulty with this terminology. (I tend to get it mixed up frequently; it's more along the lines of "I know what I mean", hence the inconsistent use in my own post.)

    That said, a clustered index will physically order the rows as they are inserted/updated, based on the column(s) indexed. This also happens to be why only one clustered index is permitted per table. SQL Server defaults to a clustered index on the primary key because PKs are unique (therefore ensuring an even spread of rows across pages with no "hotspots") and because PKs don't change very often (or shouldn't, at least).

    As you indicated, a TopicID is a better candidate for your clustered index because you'll often be pulling an entire Topic's worth of posts. Since they're close together physically, it saves a lot of IO. But if you don't have such a requirement, you're usually best sticking your clustered index on the PK so you've got a good random distribution across pages.


    Yep, makes sense.



  • Second, fields do not exist in a relational database. Columns do. There is a fundamental difference between the two terms. Ditto for rows versus records. Using the proper terms ensures a strong mental model of the domain; think of mixing up your derivatives and integrals in calculus.




    I've never been formally schooled in databasing, nor do I have the kind of in-depth knowledge displayed here so often, but in my experience I've come to define a Field as the intersection of a row and column in a database, and a Record as a row with data. Synonyms mostly, but Field and Record because it's a database table, not yer average table within whatever context or environment.

    The derivatives vs. integrals comparison means nothing to me. 8-|


  • ♿ (Parody)

    @dhromed said:

    I've never been formally schooled in databasing, nor do I have the kind of in-depth knowledge displayed here so often, but in my experience I've come to define a Field as the intersection of a row and column in a database, and a Record as a row with data. Synonyms mostly, but Field and Record because it's a database table, not yer average table within whatever context or environment.

    The derivatives vs. integrals comparison means nothing to me. 8-|

    A lot of (non-database) people say "rows/records, fields/columns, tomato/tomatah". But there really are fundamental differences between the two. They're not synonymous at all. A non-calc example would be comparing a "type" (or "class") with an "object". Similar, but fundamentally different.

    Correction for your terminology:

    • The intersection of a Row and a Column is a Value.
    • A Field, on the other hand, is an application-specific part of a Record.
    • A Row cannot, by definition, exist without data. Tables need not have rows, but a row must contain data.

    And as an FYI, here are some more differences on Records vs Rows, Fields vs Columns. I wrote this in some other post here ....

    • Records are defined within the application (think COBOL or C structs reading from binary files). Rows are defined in the database.
    • Records do not have a consistent structure. In file systems, flags within a file record determine what data follow. All rows within a table have the same structure.
    • Records are ordered. Rows have no order.
    • An empty file (where you store records) contains nothing: zero bytes on disk. An empty table still has columns, permissions, constraints, etc. (sketched below).
    • Most importantly, records contain other records (or a pointer to another record in another file). Rows cannot contain other rows and *should not* contain pointers to other rows (many programmers make this mistake by putting pointers in their rows, known as GUID or IDENTITY).
    • Fields are ordered; columns are not.
    • Fields cannot be "null" (what combination of bytes would you use to represent an integer null?); they can only contain a "dummy value" known to the application (999999999).
    • Fields can contain other records. Columns are always scalar (many programmers also make the mistake of the "CSV" column).
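
    To make the empty-table bullet concrete, a throwaway sketch (the table itself is hypothetical):

    CREATE TABLE Customers (
        Cust_Num  int         NOT NULL PRIMARY KEY,
        Cust_Name varchar(50) NOT NULL
    )

    -- Zero rows, yet the table still has columns, a key constraint, and
    -- permissions; an empty record file would be exactly zero bytes on disk.
    SELECT COUNT(*) FROM Customers   -- returns 0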


  • @Alex Papadimoulis said:

    Most importantly, records contain other records (or a pointer to another record in another file). Rows cannot contain other rows and *should not* contain pointers to other rows (many programmers make this mistake by putting pointers in their rows, known as GUID or IDENTITY).

    Ah, my head.. :)

    In reference to the above - if you're storing addresses in an Addresses table, with the primary key being an autoincrementing identity column, does using that identity column in the Orders table to locate the related address violate the above rule? Or what about a self-referencing constraint? You have an Offices table, for instance, with a "Central Office" column. The CO column requires a value that exists within the Offices table.

    I guess I'm asking what you consider a "pointer" in terms of the database.

    Thanks.



  • @bmschkerke said:

    @Alex Papadimoulis said:

    Most importantly, records contain other records (or a pointer to another record in another file). Rows cannot contain other rows and *should not* contain pointers to other rows (many programmers make this mistake by putting pointers in their rows, known as GUID or IDENTITY).

    Ah, my head.. :)

    In reference to the above - if you're storing addresses in an Addresses table, with the primary key being an autoincrementing identity column, does using that identity column in the Orders table to locate the related address violate the above rule? Or what about a self-referencing constraint? You have an Offices table, for instance, with a "Central Office" column. The CO column requires a value that exists within the Offices table.

    I guess I'm asking what you consider a "pointer" in terms of the database.

    Thanks.



    I was going to ask the same thing.

    Also, I appreciate your discussion about the differences between "field" and "column," "record" and "row," etc. Because I wanted to know what the hell you were going on about, I read up on the subject and feel that I have really deepened my knowledge of the relational model. I saw things that brought back horrible, terrible memories from some of my more theoretical math classes in college. So, again, thanks for spurring me on to learning!

    That said, I don't understand why you make such a big deal whenever someone slips up and says "record" when they meant "row." With absolutely no intention of offending you, it seems a little odd. It's like if you were sitting next to someone, reviewing some poorly written code, and he said, "what do you suppose he means to be doing with this class here" as he very clearly pointed to an object. Chances are, he's not a total idiot and he realizes the difference between an object and a class, so there's no need to scream, "That's an object, not a class! They really are very different things!" every time he slips.

    Anyway, like I said, I sincerely am not trying to be rude or condescending. It's just that I find your behavior in this instance puzzling, that's all. Also, please realize that I realize that it's probably my own lack of experience that's causing your behavior to seem puzzling to me. I'm willing to bet I'll understand when I've got more experience.



  • ♿ (Parody)

    @bmschkerke said:

    I guess I'm asking what you consider a "pointer" in terms of the database.

    A pointer is something that does not exist in the real-world problem domain and is not exposed to end-users of the system.

    @bmschkerke said:

    In reference to the above - if you're storing addresses in an Addresses table, with the primary key being an autoincrementing identity column, does using that identity column in the Orders table to locate the related address violate the above rule? 

    Possibly. Why do you need an IDENTITY column? Addresses are standardized and will usually have a DUNS number. If not, it's really a problem for the business to solve. Generally, they solve this with (Cust_No + Address_Seq).
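
    A sketch of that shape, assuming a Customers table keyed on Cust_No (the address columns and types are invented for illustration):

    CREATE TABLE Customer_Addresses (
        Cust_No     int         NOT NULL REFERENCES Customers (Cust_No),
        Address_Seq smallint    NOT NULL,   -- 1 = first address, 2 = second, ...
        Street      varchar(60) NOT NULL,
        City        varchar(40) NOT NULL,
        -- The key is the customer plus a business-visible sequence,
        -- not a system-generated pointer.
        CONSTRAINT PK_Customer_Addresses PRIMARY KEY (Cust_No, Address_Seq)
    )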


  • ♿ (Parody)

    @UncleMidriff said:

    I don't understand why you make such a big deal whenever someone slips up and says "record" when they meant "row" ... I find your behavior in this instance puzzling, that's all.

    Programming is one of the few professions where anyone can grab a compiler and make a living writing code. Plumbers, barbers, and even insurance CSRs are required to maintain a license to do their day-to-day job. But unlike a bad haircut or a leaky faucet, bad programming can cause massive financial loss and even death; I've personally seen the former and have heard quite a few anecdotes about the latter.

    Over the years, I've come to realize that bad programming (at the level I mentioned) is rarely done by a lazy or incompetent programmer; it's done by programmers who have absolutely no idea what they are doing yet eagerly do it anyway. Worse still, they don't even realize that they don't know what they're doing.

    To my point, when I see people use the wrong terminology, it implies that they really don't understand the basics, yet are trying to do it anyway. Mathematicians don't mix up sines and integrals; carpenters don't mix up biscuit joints and bevels; and database programmers shouldn't mix up fields and columns.

    And usually, I've found that when people don't know the basics, they are still doing it anyway as professionals. Myself included. I can only be thankful that I didn't cause too much financial loss (only an unmaintainable system here and there). I only wish someone had "yelled" at me earlier!



  • @Alex Papadimoulis said:

    @UncleMidriff said:

    I don't understand why you make such a big deal whenever someone slips up and says "record" when they meant "row" ... I find your behavior in this instance puzzling, that's all.

    Programming is one of the few professions where anyone can grab a compiler and make a living writing code. Plumbers, barbers, and even insurance CSRs are required to maintain a license to do their day-to-day job. But unlike a bad haircut or a leaky faucet, bad programming can cause massive financial loss and even death; I've personally seen the former and have heard quite a few anecdotes about the latter.

    Over the years, I've come to realize that bad programming (at the level I mentioned) is rarely done by a lazy or incompetent programmer; it's done by programmers who have absolutely no idea what they are doing yet eagerly do it anyway. Worse still, they don't even realize that they don't know what they're doing.

    To my point, when I see people use the wrong terminology, it implies that they really don't understand the basics, yet are trying to do it anyway. Mathematicians don't mix up sines and integrals; carpenters don't mix up biscuit joints and bevels; and database programmers shouldn't mix up fields and columns.

    And usually, I've found that when people don't know the basics, they are still doing it anyway as professionals. Myself included. I can only be thankful that I didn't cause too much financial loss (only an unmaintainable system here and there). I only wish someone had "yelled" at me earlier!



    I see... that makes sense. Thanks for the explanation.

    About pointers: if you have the time and the inclination, could you give an example of a few tables with rows that contain pointers and what you would do to make them not suck? Even though you posted about it twice, I'm still having trouble understanding exactly what it is that you mean. It almost sounds to me like you're saying that meaningless primary keys are a bad thing, but I figure if that's what you meant you would have come right out and said it already. If that is what you mean, I don't know what to think, because I've heard convincing arguments on both sides of that debate.

  • ♿ (Parody)

    @UncleMidriff said:

    It almost sounds to me like you're saying that meaningless primary keys are a bad thing ... I've heard convincing arguments on both sides of that debate.

    Yes, that's pretty much another way of putting it. Non-domain keys are by definition meaningless and functionally act as row pointers. Pointers (according to the information rule of the RDBMS) are non-relational.

    When I first started hacking around in a database, I had no idea what I was doing and ended up adding a sequence on every table (this was back in the day when DBs did not easily support AUTOID columns). It just seemed the logical way to do things.

    After studying relational theory and its predecessors (CODASYL, IMS, etc.), it became pretty clear how the relational model is a much better way of doing things. If you're interested in the topic, Elmasri and Navathe have a good textbook that goes into much detail about non-relational models.

    You'll find that those who advocate pointer-based data modeling (IDENTITY, GUID, etc.) are not seasoned data modelers, nor do they have an understanding of the history of data modeling. If they were, they would have run into the same problems with their methods that the creators of the RM ran into back in the late '60s.



  • @Alex Papadimoulis said:

    To my point, when I see people use the wrong terminology, it implies that they really don't understand the basics, yet are trying to do it anyway. Mathematicians don't mix up sines and integrals; carpenters don't mix up biscuit joints and bevels; and database programmers shouldn't mix up fields and columns.

    I'm sure a lot of the confusion stems from the programming side. Take, for example, ADO, which allows interaction between the application and an RDBMS. ADO refers to a set of rows as a recordset, and a set of values as a list of fields.



    I'm defending myself in this instance: I understand rows and columns, and actually understand some of the finer differences of rows vs records and columns vs fields now, thanks to you, but with APIs still referring to things using the wrong terminology, it's very easy to get tripped up.



    Your opinion on IDENTITY fields is very interesting, and I have often thought about just that, because it seems to be pretty much a habit of the IT industry in general: have a row of information? Tack an ID on it so we can get to it later. It's almost a mantra, though in some cases, some kind of ID number would be useful. Take the example of a table of customers. Using first and last name as the identifying piece of information will not work, because there's always more than one John Smith. Another piece of information is necessary, such as a phone number. In this case, it could be simpler to select by an arbitrary assigned value to look up a particular customer record. Also, in tables such as order history, the foreign key would require 3 pieces of information. It seems more efficient to store an IDENTITY value, or some other unique value such as a GUID or a hash of the 3 pieces of identifying information (though a hash can, by its very nature, collide...). In this case, it seems better just to go with IDENTITY values. Am I wrong?



  • ♿ (Parody)

    @Mike R said:

    I'm sure a lot of the confusion stems from the programming side.

    Absolutely. The relational model just doesn't "make sense" from a procedural programming perspective. This is why it took 20 years for relational databases to evolve out of the earlier models.

    @Mike R said:


    ADO refers to a set of rows as a recordset, and a set of values as a list of fields.

    I wondered that, too. I'm glad to see ADO.NET changed it, but perhaps one reason behind the original naming conventions was the generalized nature of ADO. Anyone could write a provider for any source (from AD to MAPI), and in many of those sources, tables are not the mechanism by which data is stored.

    Another possibility is that "field" sounds more "techie" than "column". Tables are, after all, those things in Word and HTML.

    @Mike R said:


    IDENTITY ... seems to be pretty much a habit of the IT industry in general

    Someone else mentioned that to me a while ago, and I think it's an exposure thing. On big systems developed by experienced data specialists, you'll never see an autonumber or equivalent in the system. These same experienced data folks don't write in general publications; they have their own set of magazines that we non-data specialists would never read. The people who do write in the general publications don't fall into the category of specialist; they seem to be more generalist. And you know the jack of all trades / master of none rule ...

    @Mike R said:

    It seems more efficient to store an IDENTITY value, or some other unique value such as a GUID or a hash of the 3 pieces of identifying information (though a hash can, by its very nature, collide...). In this case, it seems better just to go with IDENTITY values. Am I wrong?

    Exactly; it "seems" better, and "seems" more efficient. This is why the early data models did it this way.

    The nearest "cousin" to the relational model is CODASYL (aka "network database"), with the most popular platform being IDMS. I think Oracle has/had a CODASYL db as well. Anyway, all records a CODASYL database have a DBKEY, which was a system generated hash based on their physical storage. Any record in the database could be retreived simply by knowing it's DBKEY; yoou didn't even need to know it's type (Customer, Product, etc). One of the huge advantages to this approach is that it was rediculously fast. The reason we don't have CODASYL databases anymore are because the of the inherent problems with pointer-based models: data integrity, pointer-chains, etc.

    When you use pointers (i.e. GUID, IDENTITY, HASH, etc.) in a relational database, you are essentially playing CODASYL. Worse, you're not using a CODASYL database to do it, so you have to try to reinvent all the features they stuck in CODASYL databases to make pointer-based models work. As the system grows, it gets horribly ugly and the data begins to corrode. And when it comes time for HIPAA/SOX/etc. audits, whooo-boy: you don't even want to know how many hundreds of dollars an hour, and for how many hours, data consultants charge to fix your system and bring it into compliance.

    One of my favorite "massive financial loss" stories comes from a procedures system developed by a reputable consulting companies (for <$100k) that used the "ID everwhere" model (they obviously didn't bring in any data specialists). When the company failed an ISO audit, they had to pay over $500,000 to fix the data in this system. The data corrosion came from an incredibly tiny bug in the procedural code that, had they done the database right, would have only caused an error message to popup instead of data corruption.

    Whenever I see people advocating for pointer-models, I think back to Santayana: Those who cannot remember the past are condemned to repeat it.



  • @Alex Papadimoulis said:

    @Mike R said:

    I'm sure a lot of the confusion stems from the programming side.

    Absolutely. The relational model just doesn't "make sense" from a procedural programming perspective. This is why it took 20 years for relational databases to evolve out of the earlier models.

    @Mike R said:

    ADO refers to a set of rows as a recordset, and a set of values as a list of fields.

    I wondered that, too. I'm glad to see ADO.NET changed it, but perhaps one reason behind the original naming conventions was the generalized nature of ADO. Anyone could write a provider for any source (from AD to MAPI), and in many of those sources, tables are not the mechanism by which data is stored.

    Another possibility is that "field" sounds more "techie" than "column". Tables are, after all, those things in Word and HTML.

    @Mike R said:

    IDENTITY ... seems to be pretty much a habit of the IT industry in general

    Someone else mentioned that to me a while ago, and I think it's an exposure thing. On big systems developed by experienced data specialists, you'll never see an autonumber or equivalent in the system. These same experienced data folks don't write in general publications; they have their own set of magazines that we non-data specialists would never read. The people who do write in the general publications don't fall into the category of specialist; they seem to be more generalist. And you know the jack of all trades / master of none rule ...

    @Mike R said:

    It seems more efficient to store an IDENTITY value, or some other unique value such as a GUID or a hash of the 3 pieces of identifying information (though a hash can, by its very nature, collide...). In this case, it seems better just to go with IDENTITY values. Am I wrong?

    Exactly; it "seems" better, and "seems" more efficient. This is why the early data models did it this way.

    The nearest "cousin" to the relational model is CODASYL (aka the "network database"), with the most popular platform being IDMS. I think Oracle has/had a CODASYL db as well. Anyway, all records in a CODASYL database have a DBKEY, which was a system-generated hash based on their physical storage. Any record in the database could be retrieved simply by knowing its DBKEY; you didn't even need to know its type (Customer, Product, etc.). One of the huge advantages to this approach is that it was ridiculously fast. The reason we don't have CODASYL databases anymore is the inherent problems with pointer-based models: data integrity, pointer chains, etc.

    When you use pointers (i.e. GUID, IDENTITY, HASH, etc.) in a relational database, you are essentially playing CODASYL. Worse, you're not using a CODASYL database to do it, so you have to try to reinvent all the features they stuck in CODASYL databases to make pointer-based models work. As the system grows, it gets horribly ugly and the data begins to corrode. And when it comes time for HIPAA/SOX/etc. audits, whooo-boy: you don't even want to know how many hundreds of dollars an hour, and for how many hours, data consultants charge to fix your system and bring it into compliance.

    One of my favorite "massive financial loss" stories comes from a procedures system developed by a reputable consulting company (for <$100k) that used the "ID everywhere" model (they obviously didn't bring in any data specialists). When the company failed an ISO audit, they had to pay over $500,000 to fix the data in this system. The data corrosion came from an incredibly tiny bug in the procedural code that, had they done the database right, would have only caused an error message to pop up instead of data corruption.

    Whenever I see people advocating for pointer-models, I think back to Santayana: Those who cannot remember the past are condemned to repeat it.



    But how does using an identity column or other form of meaningless primary key on a table lead to data corruption? I'm just not seeing it. The only thing in my admittedly limited experience that I have seen cause problems is when someone decides not to use a meaningless primary key. That has inevitably led to severe problems, because whatever meaningful bit of data they chose for the primary key needed to be changed in some way later on in the life of the project. Granted, just because a few yahoos before me couldn't pick a proper, meaningful primary key doesn't mean that meaningful primary keys are bad, and I'm not trying to imply that. But I just can't see how meaningless primary keys, by their very nature, lead to data corruption.

  • ♿ (Parody)

    @UncleMidriff said:


    I just can't see how meaningless primary keys, by their very nature, lead to data corruption.

    They realized long ago that, no matter how good the programmers are, you will still end up with defects in procedural "middle-tier" code. If it's the occasional error message or crash, it's rarely a big deal. When the defect causes data corruption (especially historical/compliance data), you have serious problems.

    One of the biggest innovations of the RDBMS was that data integrity is handled in the database. Instead of relying on the middle tier, the RDBMS is responsible for enforcing business rules on whether or not data is valid: everything from birthdays occurring at least eighteen years ago to orders having existing products.

    When you use pointers to relate data instead of the data itself, you lose the ability to maintain referential integrity. It's hard to see it in smaller systems. One of the simplest examples comes from the case I mentioned (the $500k bug) and was a direct result of the inability to have data integrity (DI) at the database:

    Every time a procedure was changed, a new revision of the procedure was created and the old revision became inactive. Inactive procedures could not be changed, and a trigger was used to enforce this ( IF EXISTS(SELECT * FROM deleted WHERE Status='Inactive') ROLLBACK ).
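
    Spelled out in full, that trigger would look roughly like this (a sketch; the table name and error text are my guesses from the description):

    CREATE TRIGGER trg_Procedures_Inactive ON Procedures
    FOR UPDATE, DELETE
    AS
        -- deleted holds the before-image of every affected row; if any of
        -- them was an inactive revision, the whole statement is thrown out.
        IF EXISTS (SELECT * FROM deleted WHERE Status = 'Inactive')
        BEGIN
            RAISERROR ('Inactive procedure revisions cannot be changed.', 16, 1)
            ROLLBACK TRANSACTION
        END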

    Procedures had user-maintainable discrete domains (picklists, dropdowns) of various things. For example, Key_Output_Variables := (Id int, Variable varchar(15), Active_Ind char(1)). The procedure table referenced a KOV with a foreign key constraint.

    If the user wanted to rename a KOV, the system would check whether any procedures used the KOV and, if so, create a new KOV and deactivate the old one, thereby ensuring all old KOVs remained in the system. However, there was a bug in this maintenance module that caused it to update the KOV anyway (in some cases, if the user did it in a certain way). There were a handful of test cases to test this, but none of them caught the peculiar way of changing the KOV.

    After five years of being in production, auditors discovered this by comparing paper and electronic copies of inactive revisions. The only way they could correct the problem (and pass ISO) was to take three years' worth of tape backups, paper copies, etc., hire an army of temps, and try to reconcile the thousands upon thousands of procedures.

    Had they not used an ID pointer, the trigger would have rejected the update before it snowballed into half a million dollars. This is pretty much how it usually happens: a minute bug in the middle tier slowly corrupting historical data.

    All in all, if you use these non-relational techniques, you end up having to put in gobs of workaround code (the band-aid for this problem was to put in a trigger that queried the meta-tables and used dynamic SQL to check which tables used the value), which (since it's lots of code) will be inherently prone to defects. But the true nightmare begins when you try to build a data warehouse.

    Again, you see these problems in large systems. But, a lot of times, systems start out small and become large.



  • I don't need an identity column, but I didn't have a suitable primary key replacement. Duplicate addresses are to be accepted, if entered. We have an entity that has up to five addresses. Rather than create columns of StreetAddress1, StreetAddress2, City, State, ZipCode five times, I moved them into a separate table. Since each entity can have multiple addresses, I cannot use the entity table's primary key as an address primary key. We deal primarily with individuals and so don't track DUNS numbers.

    The customer number and sequence would work, possibly. However, I have linked this table to other tables, not all of which share a similar logical structure. (Not all addresses are related to customers, for example.)

    Should each table have a separate Addresses table, so that it can be managed using a parent table's keys? Or should I denormalize it if they're so closely linked, even if the information might often be null for the optional addresses?



  • @Alex Papadimoulis said:

    @UncleMidriff said:


    I just can't see how meaningless primary keys, by their very nature, lead to data corruption.

    They realized long ago that, no matter how good the programmers are, you will still end up with defects in procedural "middle-tier" code. If it's the occasional error message or crash, it's rarely a big deal. When the defect causes data corruption (especially historical/compliance data), you have serious problems.

    One of the biggest innovations of the RDBMS was that data integrity is handled in the database. Instead of relying on the middle tier, the RDBMS is responsible for enforcing business rules on whether or not data is valid: everything from birthdays occurring at least eighteen years ago to orders having existing products.

    When you use pointers to relate data instead of the data itself, you lose the ability to maintain referential integrity. It's hard to see it in smaller systems. One of the simplest examples comes from the case I mentioned (the $500k bug) and was a direct result of the inability to have data integrity (DI) at the database:

    Every time a procedure was changed, a new revision of the procedure was created and the old revision became inactive. Inactive procedures could not be changed, and a trigger was used to enforce this ( IF EXISTS(SELECT * FROM deleted WHERE Status='Inactive') ROLLBACK ).

    Procedures had user-maintainable discrete domains (picklists, dropdowns) of various things. For example, Key_Output_Variables := (Id int, Variable varchar(15), Active_Ind char(1)). The procedure table referenced a KOV with a foreign key constraint.

    If the user wanted to rename a KOV, the system would check whether any procedures used the KOV and, if so, create a new KOV and deactivate the old one, thereby ensuring all old KOVs remained in the system. However, there was a bug in this maintenance module that caused it to update the KOV anyway (in some cases, if the user did it in a certain way). There were a handful of test cases to test this, but none of them caught the peculiar way of changing the KOV.

    After five years of being in production, auditors discovered this by comparing paper and electronic copies of inactive revisions. The only way they could correct the problem (and pass ISO) was to take three years' worth of tape backups, paper copies, etc., hire an army of temps, and try to reconcile the thousands upon thousands of procedures.

    Had they not used an ID pointer, the trigger would have rejected the update before it snowballed into half a million dollars. This is pretty much how it usually happens: a minute bug in the middle tier slowly corrupting historical data.

    All in all, if you use these non-relational techniques, you end up having to put in gobs of workaround code (the band-aid for this problem was to put in a trigger that queried the meta-tables and used dynamic SQL to check which tables used the value), which (since it's lots of code) will be inherently prone to defects. But the true nightmare begins when you try to build a data warehouse.

    Again, you see these problems in large systems. But, a lot of times, systems start out small and become large.



    So, in this system, the values in the Variable column of the KOV table are supposed to be unique, and thus it could have been used as a primary key.  Also, the users can change the value in the Variable column of a given row whenever the heck they please.  Provided that these assumptions are correct, here's what I would have done:

    Since the value of Variable for a given KOV is expected to change, and since I wouldn't want to change every row in every table that referenced a particular KOV every time someone changed the value of Variable for that KOV, I would have used an Id column as the primary key. Then I would have added a trigger that would whine if someone tried to add or change a KOV with the same Variable value as an existing KOV. To keep a history of changes, I would have added another table, named something like "KOVChangeLog," in which I would store the Id and old Variable values of the changed KOV, as well as the user who changed it and the time at which it was changed.

    When someone changes a KOV, the system would check to see if any procedures reference that KOV. If so, the KOVChangeLog table would have added to it the Id and old Variable value of that KOV, along with the user responsible for and the time of the change, and then that KOV would be updated. If not, that KOV would just be updated. The trigger would ensure that the user wouldn't be able to change that KOV to a KOV that already exists, and the meaningless Id column used as the primary key would ensure that every row in every table referencing that KOV wouldn't have to be updated every time that KOV might be changed.
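
    Something like this is what I'm picturing for the log table (names and types are just guesses based on the KOV definition above):

    CREATE TABLE KOVChangeLog (
        KOV_Id       int         NOT NULL,   -- Id of the KOV that was changed
        Old_Variable varchar(15) NOT NULL,   -- the value before the change
        Changed_By   varchar(30) NOT NULL,
        Changed_At   datetime    NOT NULL DEFAULT GETDATE()
    )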

    I apologize if the above is a huge WTF for any reason, and I offer the excuse that I haven't been at this very long.

    Another question:  how is the use of pointers "non-relational" according to the Information Rule?  I'm assuming here that you are talking about Codd's 12 Rules.

    @Wikipedia.org said:

    The information rule:
    All information in the database is to be represented in one and only one way, namely by values in column positions within rows of tables.


    Perhaps I'm just dense, but I don't see how using meaningless primary keys violates this rule.

  • ♿ (Parody)

    @bmschkerke said:


    The customer number and sequence would work, possibly.  However, I have linked this table to other tables, not all of which share a similar logical structure.  (Not all addresses are related to customers, for example.) 

    Should each table have a separate Addresses table, so that it can be managed using a parent table's keys?  Or denormalize it if they're so closely linked, even if the information might often be null for the optional addresses?

    You're on the right track by putting it in a (single) addresses table. Two things to think about are orphaned addresses (an address without a Customer, Vendor, etc) and shared addresses (an address shared between a Customer and a Vendor). Both may or may not be problematic depending on your problem domain.


  • ♿ (Parody)

    @UncleMidriff said:


    I would have used an Id column as the primary key.  Then I would have added a trigger that would whine if someone tried to add or change a KOV with the same Variable value as an existing KOV.  To keep a history of changes, I would have added another table, named something like "KOVChangeLog," in which I would store the Id and old Variable values of the changed KOV, as well as the user who changed it and the time at which it was changed.

    This is almost identical to the approach that was used. The problem came with implementing the requirement: "Renaming a KOV: if an inactive procedure exists that uses the original name, then a new KOV is created and assigned to all active/pending procedures that used the original. If no inactive procedure exists, the original KOV is renamed." The actual requirement and system were a bit more complicated, but suffice it to say that a loophole existed that allowed a KOV to change despite having procedures that could not change. The procedure didn't know the KOV changed because the pointer to it remained the same.

    As far as putting a trigger on the KOV table goes, the general practice with DI is to have the referencing table complain about changes, not the referenced table. There are good reasons for this, the strongest being that it's much less prone to errors and much more scalable. You don't want to have to modify a table every time another table wants to reference it.

    Also, a thing to note about audit trails: they should be stored externally to the database. It's the same principle as not keeping backups and originals on the same disk.

    @UncleMidriff said:

    Since the value of Variable for a given KOV is expected to change, and since I wouldn't want to change every row in every table that referenced a particular KOV every time someone changed the value of Variable for that KOV

    Why not? That's *exactly* what you want to happen, because that's *exactly* what is happening in the "real world". If the user changes something here and expects it to change there, then it effectively changes in both places. DRI actions make this so much easier to accomplish.
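
    For example, with the value itself as the key, a rename propagates declaratively (a sketch; the Procedures columns are invented):

    CREATE TABLE Key_Output_Variables (
        Variable   varchar(15) NOT NULL PRIMARY KEY,   -- the value itself is the key
        Active_Ind char(1)     NOT NULL
    )

    CREATE TABLE Procedures (
        Proc_Num int         NOT NULL PRIMARY KEY,
        Variable varchar(15) NOT NULL
            REFERENCES Key_Output_Variables (Variable)
            ON UPDATE CASCADE   -- a rename cascades here; no middle-tier code involved
    )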

    @UncleMidriff said:


    Another question:  how is the use of pointers "non-relational" according to the Information Rule?  I'm assuming here that you are talking about Codd's 12 Rules.

    Whoops, actually, I was thinking of another one, the "Guaranteed Access rule."  And now that I look up the "12 rules" myself, I don't think it's very clear reading it as a bullet point either:

    - All data should be accessible without ambiguity. This can be accomplished through a combination of the table name, primary key, and column name.

    However, the "no pointers" thing really is a fundamental component of the relational model. Unlike hierarchical and network databases (the two predecessors to relational), data are referenced by their values, not by pointers. That was actually one of the largest criticisms of the relational model back in its day, especially given the limited disk and processing power of the computers of the time.



  • [Crap -- posted this as a new topic, too. Someone please delete that one? My apologies.]

    First of all, if you're going to tell people to use correct vocabulary, you should get it right yourself. Columns, rows, and tables are all terms borrowed from the spreadsheet world. The only one you avoided using was 'cell', and used 'field' instead. You are correct that there is no order to them, but in spreadsheets, there is -- which is why re-using the same terms is a bad idea.

    Database values, database variables, database domains (headers).
    Relation values, relation variables, relation domains (headers), relation body.
    Tuple values, tuple variables, tuple domains (headers).
    Attributes, domains, values.

    Those are the terms you should be using. A database value is a named set of relation variables. A relation variable always has a value that is a set of tuples whose headers all match the relation variable's type's header. And so forth.

    But do we care? No.

    As to autonumber fields, Alex, you're again wrong. They're called surrogate keys -- that is, candidate keys added to a relation in order to simplify things. You may have any number of candidate keys on a relation header, each with any number of fields. Each one is its own unique constraint on the tuples of the relation's body. One of them is picked (usually, but optionally) as the primary key. If you do pick one as a primary key, you'll use those same attributes in all referencing relations. But you're not required to. If you have two different unique ways of addressing a row, you can use either of them, interchangeably, anywhere in the database. An autonumber field is just one example of defining a surrogate key, that is, an invented candidate key.

    Codd's rules require that any value in the database be addressable by the relation variable name and values for at least one candidate key, plus the attribute name for which you want to retrieve the value. The idea was that everything was "out in the open". You can see autonumber values, you can manipulate them. They are not the same thing as pointers, which are hidden and opaque.

    In CODASYL, pointers were external to the tuple, not internal. When you had a tuple, you didn't have the pointer -- the pointer was not part of the information. It was physical, not logical. You had to follow pointers around to get rows. That's not what autonumbers are about. When you get a tuple from a relation that uses autonumbers, you have the number. It's part of the information. Nothing is hidden. You can change that number without breaking pointers -- most database systems will automatically update all FK fields for you, though that's not required. What is required is merely that the database system enforce constraints, and in the case of FKs, that should be "values of these attributes must match values of those attributes". That is, a DBMS is only required to reject a COMMIT if you change the PK of a row currently referenced elsewhere. The rest is kindness.

    In CODASYL, you had to navigate the records by following pointers around between rows in various tables. A master row would maintain a pointer into its first child, the children would have pointers to each other, and then there'd be another pointer back to the master row. That's what pointers are about. Order, physical, opaque. With autonumbers, we're talking about "find where this is true" -- no order, no physical aspect, and it's not opaque. It's a key, not a pointer.

    The problem you described wasn't a problem of autonumbers at all. The problem was the rest of the code. You failed to maintain a proper set of candidate keys on your lookup tables -- one for the surrogate (autonumber) attribute, yes, but another for the text attribute that was being changed. Don't blame autonumbers for your mistakes. Autonumbers don't naturally lead to data corruption as such. If you forget to properly model the rest of the data, then yes, there can be problems -- but that'd be the case regardless. (Actually, someone else reading your description understands you to have been making entire copies of the lookup table, under a different name each time? If so, you were -way- off course, and I'm not sure I can properly explain how much.)

    While speaking of misunderstanding about the relational model, we should also address the CSV point you brought up earlier. CSV attributes are, in fact, allowed by the relational model. The requirement is merely that they be treated atomically. That is, the value "A,B,C" is not the same as the value "A,B", and any tuples containing such a value are just one tuple each, not two or three. The relational model does not, however, prevent you from defining functions such as IS_SUBSET_OF that would allow you to do joins between relations across CSV columns, if you should so desire. You should be able to do, for example, SELECT A.*, B.* FROM A INNER JOIN B ON IS_SUBSET_OF(B.FIELD_B, A.FIELD_A);. The requirement merely prohibits you from treating a single tuple as if it were several, automatically breaking up a CSV attribute into non-CSV values for several virtual tuples. In fact, a lot of work has gone into making sure the relational model is properly understood to allow complex types, not just 'text', 'date', and 'integer'. Arrays, sets, images, polygon descriptions, etc. are all theoretically allowed, so long as you define proper functions for manipulating them. CSV is just one such example, albeit probably a slow one. (CSV is mainly human-readable -- you could come up with slightly better general-purpose set or array types to use.)
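
    If your product does let you define UDFs, here's a quick sketch of such a function in SQL Server dialect (entirely hypothetical, and it assumes plain CSV with no quoting or embedded spaces):

    CREATE FUNCTION dbo.IS_SUBSET_OF (@part varchar(8000), @whole varchar(8000))
    RETURNS bit
    AS
    BEGIN
        -- Treats both arguments as atomic CSV values and answers whether
        -- every element of @part appears somewhere in @whole.
        DECLARE @pos int, @elem varchar(8000)
        SET @part = @part + ','
        WHILE LEN(@part) > 0
        BEGIN
            SET @pos  = CHARINDEX(',', @part)
            SET @elem = LEFT(@part, @pos - 1)
            IF CHARINDEX(',' + @elem + ',', ',' + @whole + ',') = 0
                RETURN 0   -- one element missing: not a subset
            SET @part = SUBSTRING(@part, @pos + 1, 8000)
        END
        RETURN 1
    END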

    Autonumbers are:
    a) faster, in many cases, than comparing text values
    b) smaller, less footprint
    c) allowed by relational theory
    d) single field vs. multi-field FK's, thus faster
    e) allow you to change uniqueness constraints in the referenced relation without changing the headers for every referencing relation -- particularly useful when requirements change slightly or previously-unknown outlier cases are discovered

    You should always remember to add proper constraints to the referenced relation, of course. But there's more to constraints than merely using a primary key built out of existing attributes; making sure sums remain positive, that things balance, that certain dates always maintain a certain relationship between them (date_of_death > date_of_birth, for example), etc.
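
    The date example, written out as a declared constraint (a sketch; the table is invented):

    CREATE TABLE People (
        Person_Num    int      NOT NULL PRIMARY KEY,
        Date_Of_Birth datetime NOT NULL,
        Date_Of_Death datetime NULL,
        -- Integrity beyond the key: the database itself rejects impossible data.
        CONSTRAINT CK_People_Dates
            CHECK (Date_Of_Death IS NULL OR Date_Of_Death > Date_Of_Birth)
    )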

    I've seen people try to get by without autonumbers when they had improperly identified candidate keys. Relations with 16 fields making up the primary key -- and they were lucky: this particular system allowed NULL states for attributes in the primary key (this is normally a bad thing, as NULL != NULL, which should make such rows impossible to address). Despite this, we still had cases where we needed to enter two tuples with the same PK value, as distinct tuples. And it creates a mess -- every relation referencing this one now has 16 fields devoted just to that. That's a lot of data, and it's still wrong.

    You fear autonumbers because of pointers, but you misunderstand the distinction. Come to think of it, if autonumbers are so evil ... why do we use, instead, autonumbers provided to us by the government? Social security numbers, federal tax IDs, etc. are all just forms of autonumbers in their database. And on top of that, theirs aren't necessarily unique. (There have been documented cases of two people having the same SSN and DOB, yet being entirely unrelated otherwise. Don't trust SSNs to be unique.)

    Grab a copy of 'Foundation for Object/Relational Databases -- The Third Manifesto' (C.J. Date and Hugh Darwen) if you haven't already. They'll clear up your misconceptions for you, though they'll add a few too (their opinions on inheritance should be ignored, for example.) The relational model has fewer requirements than many people believe, and is therefore much more flexible than they've thought.


  • ♿ (Parody)


    @unordained said:


    Database values, database variables, database domains (headers).
    Relation values, relation variables, relation domains (headers), relation body.
    Tuple values, tuple variables, tuple domains (headers).
    Attributes, domains, values.

    Those are the terms you should be using. A database value is a named set of relation variables. A relation variable always has a value that is a set of tuples whose headers all match the relation variable's type's header. And so forth.

    If I were talking about relational theory, that would be the proper terminology. In the context of a relational database, tables, rows, columns, domains, etc. are the proper terminology according to everything from Codd's earliest papers on the topic to Date's latest edition of his Databases book.


    @unordained said:


    They're called surrogate keys -- that is, candidate keys added to a relation in order to simplify things.

    A cursory read of Date's INTRODUCTION TO DATABASE SYSTEMS discusses the surrogate key as a system-generated placeholder to be used in place of large multi-value keys and keys with missing information. Further reading on the topic (Date, Elmasri/Navathe, etc.) will place IDENTITY/GUID/AUTOID in the category of "artificial key." Surrogate keys exist outside of the semantic model, whereas artificial keys are a part of it.

    @unordained said:


    In CODASYL, pointers were external ... you had to follow pointers around to get rows ... navigate the records by following pointers around between rows in various tables

    True, but you end up mimicking this approach by using the ID approach:

    Products := (Prod_Id, SKU, Unit_Price, ... )
    Customers := (Cust_Id, Cust_Num, Cust_Name, ...)
    Orders := (Ord_Id, Ord_Num, Ord_Date, ...)
    Ordered_Products := (Ord_Prod_Id, Ord_Id, Prod_Id)
    Picks := (Pick_Id, Ord_Prod_Id, Pick_Date)

    Now tell me when SKU#12345 was picked for cust#12345. You have to walk the pointer chain.

    @unordained said:

    The problem you described wasn't a problem of autonumbers at all. The problem was the rest of the code.

    In cases like this it's very easy to lay blame. Is it the robber or the guy who left the window open who's at fault for the stolen jewelry? The only way to implement RI in such a situation is to have the referenced table know the business rules of *all* referencing tables before committing an update.

    @unordained said:

    CSV attributes are, in fact, allowed by the relational model. The requirement is merely that they be treated atomically.

    How could anyone possibly consider Comma Separated Values to be an atomic value? The name itself implies multiple values.

    @unordained said:


    Autonumbers are:
    a) faster, in many cases, than comparing text values
    b) smaller, less footprint
    c) allowed by relational theory
    d) single field vs. multi-field FK's, thus faster
    e) allow you to change uniqueness constraints in the referenced relation without changing the headers for every referencing relation -- particularly useful when requirements change slightly or previously-unknown outlier cases are discovered

    Three of your five "pros" are related to physical considerations. Design first, optimize later.

    Besides, who's to say that a database has to use the same physical storage for referenced keys? SQL Server does, but consider databases designed for size and performance: Teradata, Sybase, etc store keys only once and internally hash referenced values.

    (c) is debatable and I have yet to see a case of (e) that was a big deal at all. Find me a business that allows duplicate production Customer Numbers and Order Numbers, and I'll show you one that won't stay in business very long.

    @unordained said:

    I've seen people try to get by without autonumbers when they had improperly identified candidate keys. Relations with 16 fields making up the primary key ... That's a lot of data, and it's still wrong.

    Wow. If they were having so much trouble coming up with a data model that they had to have a 16-column primary key, I really do believe that there were many more fundamental things wrong with the system as a whole. But I do agree that something is seriously wrong with that approach.

    @unordained said:

    If autonumbers are so evil ... why do we use, instead, autonumbers provided to use by the government? Social security numbers, federal tax id's

    Numbers provided by someone are not "autonumbers." Autonumbers are not verifiable and have no rules regarding their generation. Of course, SSN is a pretty poor identifier. Privacy, etc., aside, it's an example of the problems you get when you don't design verifiable identifiers.



  • I would reply about the customer numbers -- but you're missing an FK between orders and customers. However, the result would be a join, which is a relational operation. Relational joins have nothing to do with following pointers. We're talking sets and values. The difference is astronomical. Most likely though, from your naming convention, I would have ...

    SELECT PICK_DATE
    FROM PICKS
        INNER JOIN ORDERED_PRODUCTS ON ORDERED_PRODUCTS.ORD_PROD_ID = PICKS.ORD_PROD_ID
        INNER JOIN ORDERS ON ORDERS.ORD_ID = ORDERED_PRODUCTS.ORD_ID
        INNER JOIN PRODUCTS ON PRODUCTS.PROD_ID = ORDERED_PRODUCTS.PROD_ID
    WHERE ORDERS.CUST_ID = '12345'
        AND PRODUCTS.SKU = '12345';

    Did any of that involve functions crawling through a database, with IF statements and fetch requests going out based on pointers? Were there even any sub-selects? No. That would be the old world of hierarchical and network (terrible name) databases, where you -did- have to do such things manually, by pointer. This is fundamentally different -- and the joins would be the same regardless of whether we were using autonumbers or some other candidate key. It's all logically the same. Is the distinction still not clear? Consider spreadsheets: they use pointers. When you address a cell, it's by location. You can't go to a row and just type a different row number in and have everything update cleanly. The row number is physical, and is external to the row data. That's not the case with candidate keys in a relational database. The key is part of the tuple, always. Oracle, however, does have a facility involving row pointers, as real pointers, with all the issues that creates. Very different.

    As to CSV being atomic -- databases are scalar, when you have an expression like (database_a == database_b). You consider them in their entirety, as scalar values. Relations can be scalar too -- (relation_a == relation_b or IsSubsetOf(relation_a, relation_b)). Everything's a value. In fact, the only reason this is even an issue at all is 'multi-value' database systems that existed back in 'the day', and that Codd and Date and so forth react against. As I described, the issue there was stuffing several values into a single field, but then treating them as entirely separate. If you stuff, say, several dates into a single field, you could join and treat the field as if it always had a single date, and you'd get a match if any of the dates matched. (I could provide an SQL example to illustrate what Codd wanted -not- to have happen, if required.) That's wrong. What isn't wrong is joining using a function such as "is one of", or anything that treats the whole value as a whole value -- of course, a lot of database products don't let you define UDFs properly, so you can't always make use of the theory.

    If you read Codd and Date carefully, you see that they have no problem at all with the concept of storing entire relations as values inside another relation. They think it'd be a pain in the ass to deal with, but not wrong. Values are values, regardless of how complex they are. (If you passed an array to a function, and this somehow resulted in the function always automatically being called exactly once for each of the values in the array, that'd be like the multi-valued databases. What I'm saying is that you can pass arrays to functions, and the function receives the whole array. It's the same concept. One value, but complex. Functions and tuples are related in that you can see functions as always only taking one value, which is sometimes a tuple of several values, and returning another value, which might itself be a complex value.)

    As to duplicate customer numbers and order numbers -- none exist. That's the point. They're candidate keys, and therefore unique. But why would you ask that, as it implies the legitimate use of autonumbers? In fact, if you were avoiding using autonumbers ... what would you use as a PK for orders? Customer number (or other identifier) plus date? What if they order several times on the same date? The list of items? That's not stored in the tuple itself, so you'd have to have scalar (but internally multi-value) lists of item SKU's as part of the PK ... that'd work, sort of, maybe, but it would be nasty regardless. The timestamp when they ordered? Well, that only holds if different departments of the same company can never call at exactly the same time (in theory, at least) and place distinct orders ... and that's exactly why autonumbers are used.

    All relations always contain a unique set of tuples. So if you can't seem to find something sufficiently unique, the entire tuple is always a candidate key. You can always find it if you know the whole thing -- it's stupid, but it's true. The point of a candidate key is being able to uniquely identify a tuple by less than the entire tuple. You don't have to use autonumbers, but using the entire row as a PK also seems rather ... [pick your own pejorative.]

    I'm not sure what you mean about verifiable identifiers -- data in a database is checked for validity, not correctness. Verification == correctness, so it's out of our realm? Could you elaborate?

    And yes, I suggested physical (rather than logical) reasons to go with autonumbers. Why? Because logically, they're the same. It's all optional. But if you have a choice, and all other things are equal, you might as well go with a physically faster solution. Logically, it's all the same. Autonumbers are neither prohibited nor required. I'm not saying anyone must go with autonumbers, I'm saying they're not evil.



  • @Alex Papadimoulis said:

    @unordained said:

    CSV attributes are, in fact, allowed by the relational model. The requirement is merely that they be treated atomically.

    How could anyone possibly consider Comma Separated Values to be an atomic value? The name itself implies multiple values.

    "Comma Separated Values" in no way implies the nature of that data or the relationships between that data.  Dates and timestamps are considered atomic/scalar values when they consist of multiple component parts.  A mathematical vector is a better example, because it has individual parts that mean something independently, because those parts are comma separated, and mainly, because it makes the most sense when looked at as a whole.  Thus, while it may not fit a rigorous definition of atomic/scalar, a set of values can be treated as a single value, so long as proper functions are provided for doing so.



  • @Alex Papadimoulis said:

    A cursory read of Date's INTRODUCTION TO DATABASE SYSTEMS discusses the surrogate key as a system-generated placeholder to be used in the place of large multi-value keys and keys with missing information. A further reading on the topic (Date, Elmasri/Navathe, etc) will place IDENTITY/GUID/AUTOID in the category of "artificial key." Surrogate keys exist outside of the semantic model whereas artificial keys are a part of it.

    I tried reading and interpreting that several different ways, and none of them made much sense.  Even by your own definitions (and I haven't read the original sources you're referencing), it sounds to me like surrogate keys are the same as artificial keys, so the last sentence just doesn't make sense.  Could you make the distinction more clear?


  • ♿ (Parody)

    @unordained said:


    Did any of that involve functions crawling through a database, with IF statements and fetch requests going out based on pointers? Were there even any sub-selects? No. That would be the old world of hierarchical and network (terrible name) databases, where you -did- have to do such things manually, by pointer. This is fundamentally different

    OK -- I'll concede that there is a fundamental difference between CODASYL pointers and IDs. But the practice of automatically assigning *every* row an AUTOID is conceptually the same as assigning every record a DBKEY.

    @unordained said:


    Oracle, however, does have a facility involving row pointers, as real pointers, with all the issues that creates. Very different.

    This is one of the many reasons I have the I-Hate-Oracle-Club. I'm not sure which is worse ... that, or its horrible thing known as nested types.


    @unordained said:

      ... what isn't wrong is joining using a function such as "is one of", or anything that treats the whole value as a whole value ...

    We can agree to this as well. I think we would both benefit, however, by saying "CSV columns are horrible things." Any indication that a practice like tblCalendar <http://thedailywtf.com/ShowPost.aspx?PostID=22355> is acceptable may encourage newbies to use it.


    @unordained said:

    In fact, if you were avoiding using autonumbers ... what would you use as a PK for orders? Customer number (or other identifier) plus date?

    For orders, I would use Order_Number. For customers, I would use Customer_Number. I would not add Order_Id or Customer_Id.

    @unordained said:

    I'm not sure what you mean about verifiable identifiers -- data in a database is checked for validity, not correctness. Verification == correctness, so it's out of our realm? Could you elaborate?

    Standard stuff, part of the business domain: check digits, digit-repetition rules (i.e., no number like 788888942), etc.
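
    For illustration, a minimal sketch of a check-digit validation -- the seven-digit format, the weights, and the function name are all assumptions, not anything specified in this thread:

        -- Weighted mod-10 check: digits 1-6 carry weights 7 down to 2;
        -- the 7th digit is the check digit.
        CREATE FUNCTION dbo.HasValidCheckDigit (@num CHAR(7))
        RETURNS BIT
        AS
        BEGIN
            DECLARE @sum INT;
            DECLARE @i INT;
            SET @sum = 0;
            SET @i = 1;
            WHILE @i <= 6
            BEGIN
                SET @sum = @sum + CAST(SUBSTRING(@num, @i, 1) AS INT) * (8 - @i);
                SET @i = @i + 1;
            END;
            IF (@sum + CAST(SUBSTRING(@num, 7, 1) AS INT)) % 10 = 0
                RETURN 1;
            RETURN 0;
        END;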

    @unordained said:

    Logically, it's all the same. Autonumbers are neither prohibited nor required. I'm not saying anyone must go with autonumbers, I'm saying they're not evil.

    I have yet to see a business requirement defining an identifier that an autonumber would fulfill. Therefore, when I see IDENTITY/AUTOID/GUID, I assume that it's part of a schema in which *every* row has one as a key (regardless of Order_Num, etc). The key is never exposed to the user and is used for all FK's, etc.  In my experience, this is the practice that newbie programmers take.



  • @unordained said:

    Actually, someone else reading your
    description understands you to have been making entire copies of the
    lookup table, under a different name each time? If so, you were -way-
    off course, and I'm not sure I can properly explain how much.





    You guys are way above my head.  I'm one of those "jack of all
    trades, master of none" folks that has to implement things and "just
    get it done."  You're the folks called in to clean up the mess I
    make when my employer decides they want to spend some cash and time on
    the problem rather than doing it correctly from the beginning.  I
    would love to know all that you do about database theory, but I
    don't.  I do try to learn - thus my original question and my
    continuing to follow this post.  (Actually I've considered sending
    in some of my own code from the first projects I worked on for the
    Daily WTF. :p)



    That said, I wanted to clarify a comment I made earlier, since I think it's being used incorrectly.  I had asked if I should separate addresses into logical tables based on who their parent table was, to avoid using identity values (CountyAddresses, OrderAddresses, CustomerAddresses vs Addresses).  The information wasn't duplicated; the only reason the data would not be conjoined into a single table was that their parents would be different tables.  This wasn't suggested by Alex anywhere - this was an extrapolation on my part while I was turning what he said over in my head.  Even when I suggested it, it seemed more of an "object oriented" way to handle the issue rather than a "relational" or "set" method.


  • ♿ (Parody)

    @insignis said:

    @Alex Papadimoulis said:

    A cursory read of Date's INTRODUCTION TO DATABASE SYSTEMS discusses the surrogate key as a system-generated placeholder to be used in the place of large multi-value keys and keys with missing information. A further reading on the topic (Date, Elmasri/Navathe, etc) will place IDENTITY/GUID/AUTOID in the category of "artificial key." Surrogate keys exist outside of the semantic model whereas artificial keys are a part of it.

    I tried reading and interpreting that several different ways, and none of them made much sense.  Even by your own definitions (and I haven't read the original sources you're referencing), it sounds to me like surrogate keys are the same as artificial keys, so the last sentence just doesn't make sense.  Could you make the distinction more clear?

    A surrogate key is one that acts behind the scenes. Teradata uses surrogate keys in place of the actual keys when referencing tables. Surrogate keys are not visible to the user.

    An artificial key is one that does not exist in the real world but is added to identify the entity. It is visible to the user.



  • CSV columns -- I've had to remove one from our own database when one of our programmers (now gone) decided to go that route. Yes, we avoid it at all costs. But I don't think it's so much because it's a fundamentally -wrong- thing to do so much as because we don't write UDF's to properly handle those cases. And efficiency-wise, our DBMS can't index a CSV string such that it can quickly join to it. (And in the example you link to, which I'd seen before, there's a limit to the length of the string -- which means there's an arbitrary limit to the number of items, which is bad. Then again, they also use really random logic, with -1 for 'not', for understanding the field, which makes a UDF almost required for consistency. And I'm sure they didn't have one.) So yes, generally, it's a bad idea, particularly for newbies. They'll just screw it up. In theory, it's allowed, if you're very very careful. That's all I cared to point out.
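
    For what it's worth, the dbo.IsOneOf assumed in the earlier sketch could be as simple as this (SQL Server syntax; the name and types are still hypothetical):

        CREATE FUNCTION dbo.IsOneOf (@item VARCHAR(50), @list VARCHAR(8000))
        RETURNS BIT
        AS
        BEGIN
            -- The list is handled as one whole value; membership is asked
            -- for explicitly instead of the DBMS unpacking the field.
            IF ',' + @list + ',' LIKE '%,' + @item + ',%'
                RETURN 1;
            RETURN 0;
        END;

    Note that, as said above, nothing can index such a predicate -- it forces a scan -- which is the practical (rather than theoretical) argument against CSV columns.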

    So you would have order_number and customer_number, which are likely autonumbered? I agree that if you already have a valid PK, there's no reason to go adding an autonumber just for the hell of it ... except ... well, maybe for consistency. If you know you'll always avoid multi-field PK's, if you always know the datatype of the PK ... a lot of functions can be more easily re-used (and are simpler to code) across a large project. But purely from a database standpoint (ignoring any code that might be running against it), it's not useful to add more surrogate keys, no. But then there's also inheritance, having tables 'extended' into other tables, and ... yeah. That's for another day, I think we can all agree.

    Thanks for being such a good sport about it. Hadn't had caffeine yet.



  • @Alex Papadimoulis said:

    I have yet to see a business
    requirement defining an identifier that an autonumber would fulfil.
    Therefore, when I see IDENTITY/AUTOID/GUID, I assume that it's part of
    a schema in which every row has one as a key (regardles of Order_Num,
    etc). The key is never exposed to the user and is used for all FK's,
    etc.  In my experience, this the practice that newbie programmers
    take.





    Our file numbers are created by taking the largest order number
    created, adding one, and then prefixing the resulting number with a
    three letter abbreviation that represents the office handling the
    file.  Not quite an autonumber, but close. :)  We could use
    an autonumber column and combine it with the office column and achieve
    the same effect.  We don't care about holes in the order numbers.



    I have to admit that using an autonumber column in a design that uses
    them very little feels odd.  In the design I'm currently working
    on the only autonumber columns are the ones being used for the
    Addresses table.  Relationships exist everywhere else, which makes
    me try and find a relationship between Addresses and <everything
    else>.



  • @bmschkerke said:


    Our file numbers are created by taking the largest order number created, adding one, and then prefixing the resulting number with a three letter abbreviation that represents the office handling the file.

    I hope you're not getting the 'largest order number created' from a SELECT against the table. That's prone to errors (or rather, conflicts) in multi-user environments where several long-running transactions might try to insert new order numbers near-simultaneously. Both would see the same largest number, both would take that plus one, and ... then you'd get a PK conflict. Most DBMS's have some sort of separate concept of generator, etc. that acts outside the context of transactions to avoid such conflicts. People who care about gaps will refuse to use them, but for you, that wouldn't be an issue. But then, maybe that's what you meant ...
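
    To sketch both the failure mode and one common workaround (table and column names assumed; SQL Server of this vintage has no sequence objects, so locking hints inside an explicit transaction are the usual substitute):

        -- The race: two sessions both run SELECT MAX(ORDER_NUM) + 1,
        -- see the same maximum, and collide on INSERT.

        -- A workaround: serialize readers of the maximum.
        BEGIN TRANSACTION;
            DECLARE @next INT;
            SELECT @next = COALESCE(MAX(ORDER_NUM), 0) + 1
            FROM ORDERS WITH (UPDLOCK, HOLDLOCK);  -- a second session blocks here
            INSERT INTO ORDERS (ORDER_NUM) VALUES (@next);  -- other columns omitted
        COMMIT TRANSACTION;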



  • "Prone" is understating it.



    No, the file number is calculated in the stored procedure which is
    responsible for adding orders, and it returns that value to the
    client.  I don't think SQL Server has anything like Oracle's
    "nextval."  I haven't had any major problems with the stored
    procedure, and I think it's safe because stored procedures run inside a
    transaction, but if that's not the case, please, tell me now. :)






  • Clustered indexes affect the way data is physically stored on the disk
    by the DBMS. That's why you can only have one per table. I suspect that
    since primary keys are "usually" the most used column for joining, you
    want accesses to those rows to be super fast, hence you want the data
    to be laid out on disk to facilitate that, hence the clustered index.
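
    For reference, overriding that default is a one-line decision at DDL time -- a minimal sketch with assumed names:

        CREATE TABLE ORDER_ITEMS (
            ORDER_ITEM_ID INT      NOT NULL,
            ORDER_ID      INT      NOT NULL,
            SKU           CHAR(10) NOT NULL,
            CONSTRAINT PK_ORDER_ITEMS PRIMARY KEY NONCLUSTERED (ORDER_ITEM_ID)
        );
        -- Cluster on the column whose rows are fetched together:
        CREATE CLUSTERED INDEX IX_ORDER_ITEMS_ORDER ON ORDER_ITEMS (ORDER_ID);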


  • ♿ (Parody)

    @tufty said:


    You use an artificial key. "customer number". And there is only a semantic difference between "customer number" and "customer id". It's an artificial value generated by some function which is guaranteed to produce a unique key. Whether it's done by autoincrementing an integer, pulling values from a randomised set of distinct values, generating a UUID, picking bits of the domain data with randomised values added, or some other method entirely is completely irrelevant. It's artificial, and can't be derived from the domain data.

    My point was that the rules for generation of Customer Numbers come from business requirements: "Customer Numbers are seven digits with no leading Zeros and no more than three repeating numbers." That's a very typical requirement, and I have yet to personally see a business requirement that an IDENTITY would fulfill.
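
    If one did want to enforce such a rule in the schema itself, it might look like this -- a minimal sketch, assuming SQL Server, a CHAR(7) column, and reading "no more than three repeating numbers" as "no run of four identical digits" (all names here are hypothetical):

        ALTER TABLE CUSTOMERS ADD CONSTRAINT CK_CUSTOMERS_CUST_NUM CHECK (
            CUST_NUM LIKE '[1-9][0-9][0-9][0-9][0-9][0-9][0-9]'  -- 7 digits, no leading zero
            AND CUST_NUM NOT LIKE '%0000%' AND CUST_NUM NOT LIKE '%1111%'
            AND CUST_NUM NOT LIKE '%2222%' AND CUST_NUM NOT LIKE '%3333%'
            AND CUST_NUM NOT LIKE '%4444%' AND CUST_NUM NOT LIKE '%5555%'
            AND CUST_NUM NOT LIKE '%6666%' AND CUST_NUM NOT LIKE '%7777%'
            AND CUST_NUM NOT LIKE '%8888%' AND CUST_NUM NOT LIKE '%9999%'
        );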

    @tufty said:

    There's actually nothing wrong with that. One of the main benefits of artificial keying is that it decouples the keying from the problem domain

    They are one and the same. Data sets already have keys: Customer_Nums, Order_Nums, etc. There is no need to add non-domain keys (Customer_Id, Order_Id).

    @tufty said:

    when the inevitable edge case(s) turn up, all one has to do is remove a unique constraint from some combination of table columns, and not completely rethink your primary keys, resulting in a complete refactor of everything that uses the database

    You honestly want to plan for the requirement: "make it so we can have two different customers with the same customer number"? It makes as much sense to plan for that as it does to plan for "rolling back" to the Julian calendar.


  • ♿ (Parody)

    @tufty said:


    How, for example, would you propose to create a unique constraint for an address? To encode my address uniquely without some artificial id, one would have to use a compound key using all the columns in the table

    Depends on the scenario. I've used DUNS in many B2B applications. Other times, Cust_Num + Addr_Seq. In utilities systems, address keys are hierarchical and based on census numbers.

    @tufty said:

    No. As I said, if there is already a single guaranteed unique key, use it ...

    So we agree then ;-). I am trying to disabuse those who believe that, given Order_Num and SKU, OrderDetails should look like "OrderDetail_GUID, Order_GUID, Product_GUID, ..."
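
    Spelling out that contrast as DDL (a sketch; the column types are assumptions):

        -- Using the keys the domain already provides:
        CREATE TABLE ORDER_DETAILS (
            ORDER_NUM CHAR(10) NOT NULL REFERENCES ORDERS (ORDER_NUM),
            SKU       CHAR(10) NOT NULL REFERENCES PRODUCTS (SKU),
            QUANTITY  INT      NOT NULL,
            CONSTRAINT PK_ORDER_DETAILS PRIMARY KEY (ORDER_NUM, SKU)
        );
        -- versus the pattern being objected to, where the real keys
        -- (ORDER_NUM, SKU) are demoted to mere attributes:
        --   ORDER_DETAILS (OrderDetail_GUID PK, Order_GUID, Product_GUID, ...)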

    @tufty said:

    I might suggest that the cases where the domain data provides a single, guaranteed unique key for every table in your model are in the minority.

    It depends on what you consider minority. As far as big systems (not Access-sized), you'll never find a bank that delivers the spec "each customer will have a unique bank account number" to IT for development. On small systems, you're going to have to sacrifice a lot; good specs are one of the sacrifices.

    @tufty said:

    7 columns out of 8 used to generate the primary key

    If you need that many columns for something like an address table, I think you'll be having a lot more problems than just that. If a relation was naturally complex, I'd probably use a surrogate.


  • ♿ (Parody)

    @tufty said:

    This forum software makes ickle baby jesus cry more than (nearly)any of the WTFs posted here. I'm sure it's great under ie, but under Safari it's a bag of festering sick. Simon

    When I was looking for forums software, everyone was like, "you gotta go with Community Server! It's written by ... and is all .NET ... it's great!"

    These people obviously never used this software. It's really, really bad. Like, unbelievably bad. And the version I'm using now is a **huge** improvement over the old version I had before.

    @tufty said:

    How would you model a table of addresses (handling differing countries, of course)? Seriously, I'm interested to know. If you can get it down to less than 7 columns and still have it manageable, I'm all ears.

    I meant 7 columns as a primary key. I have never had a need to seriously model international addresses, so I cannot tell you offhand. The first thing I would do is look for an ISO/IEC standard for international addresses. I don't believe one exists, but I'd use what I could (like ISO-3166 for country codes). For a key, I'd use Address Owner Id (Customer Num, Vendor Num, etc) plus a sequence number.
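
    A sketch of that keying (names and types are assumptions; COUNTRY_CODE illustrates the ISO-3166 borrowing):

        CREATE TABLE CUSTOMER_ADDRESSES (
            CUST_NUM     CHAR(7)      NOT NULL REFERENCES CUSTOMERS (CUST_NUM),
            ADDR_SEQ     SMALLINT     NOT NULL,  -- 1, 2, 3 ... per customer
            COUNTRY_CODE CHAR(2)      NOT NULL,  -- ISO-3166 alpha-2
            ADDR_TEXT    VARCHAR(200) NOT NULL,
            CONSTRAINT PK_CUSTOMER_ADDRESSES PRIMARY KEY (CUST_NUM, ADDR_SEQ)
        );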



  • @Alex Papadimoulis said:

    My point was that the rules for
    generation of Customer Numbers come from business requirements:
    "Customer Numbers are seven digits with no leading Zeros and no more
    than three repeating numbers." That's a very typical requirement, and I
    have yet to personally see a business requirement that an IDENTITY
    would fulfill.


    Allow me to share an anecdote:



    I once consulted for a company with a critically understaffed
    development department and a database which contained several million
    customers.  The marketing department came up with a campaign which
    would require a bunch of items snail mailed to a certain subset --
    numbering in the thousands -- of those customers.  Well, one of
    the guys in that department fudged their Excel spreadsheets and we had
    a few thousand letters go out with errors in the customer code. 
    This resulted in quite a number of customers being sent letters
    referencing the wrong customer code.  Typically only two or three
    customers would have the same code, but it was quite significant.



    When the impact of this problem was discovered, a voice from on high
    boomed out: From this day forth the customer service department shall
    identify customers by their last name as well as by customer code!



    Fortunately the primary customer key was maintained behind the scenes
    by one of those devilish identity columns.  As the flood of calls
    was pouring in, the development department quickly altered the unique
    constraint to include the customer last name.  Next, they simply
    archived the wrong customer codes and backfilled the codes that the
    customers actually knew.  By the end of the day everything was
    good.
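
    The change described would amount to little more than this -- a sketch, assuming SQL Server and hypothetical constraint and column names:

        ALTER TABLE CUSTOMERS DROP CONSTRAINT UQ_CUSTOMERS_CODE;
        ALTER TABLE CUSTOMERS ADD CONSTRAINT UQ_CUSTOMERS_CODE_NAME
            UNIQUE (CUSTOMER_CODE, LAST_NAME);
        -- Foreign keys elsewhere keep pointing at the identity column,
        -- which is why nothing else had to change.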



    Domain constraints can change in an instant.  How would your large systems handle such a swift change to its key structure?


  • ♿ (Parody)

    @Chris F said:

    From this day forth the customer service department shall identify customers by their last name as well as by customer code!

    I don't believe your story. The only scenario I can see this happening in is a small little shop with an Access database where they know every customer.

    Honestly. A company with that many customers cannot have managers that incompetent. You don't change your entire business model to remedy a mistake a few postcards would solve. That's like spilling a drop of paint on the carpet and deciding to paint the entire carpet to match.

    Secondly, relying on the uniqueness of last names across a million customers would be a total disaster. Just amongst several hundred people you're statistically guaranteed to have duplicate last names. And it only gets worse over the lifetime of the system, since you've totally thrown your data integrity away.

    And finally, what of the application? How could a system that handles millions of customers possibly be architected such that all that is required to change is a UNIQUE constraint? Just on a very basic app (certainly not one for millions of customers), you'd need to change the UI (where they input Cust#), the middle tier classes (GetCustomerByCode, etc), and the database Stored Procedures which assume only one cust/code.

    I call bullshit. No way this happened in a company of the size you described.



  • @Alex Papadimoulis said:

    Honestly. A company with that many customers cannot have managers that incompetent.


    I submit that you do not have an open mind =).



    @Alex Papadimoulis said:
    You don't change your entire business
    model to remedy a mistake a few postcards would solve. That's like
    spilling a drop of paint on the carpet and deciding to paint the entire
    carpet to match.


    How is this a change to the entire business model?  From the point
    of view of every department except development, this change is
    imperceptible.  There were essentially zero business processes
    that needed to change, because customer service already asked for the
    person's last name for verification purposes.



    In some companies the public admission of a company's mistake is
    anathema, to be avoided at all costs.  If development can wave
    that wand and make the problem go away, the management would jump on it
    every time.



    @Alex Papadimoulis said:
    Secondly, relying on the uniqueness of
    last names across a million customers would be a total disaster.


    You misunderstand.  It was not relying solely on the last name,
    since it included the previously unique customer codes as well. 
    New customer codes were indeed generated in the same manner that they
    always had been.



    @Alex Papadimoulis said:
    And finally, what of the application?
    How could a system that handles millions of customers
    possibly be architected such that all that is required to change is a UNIQUE
    constraint?  Just on a very basic app (certainly not one for
    millions of customers), you'd need to change the UI (where they input
    Cust#), the middle tier classes (GetCustomerByCode, etc), and the database
    Stored Procedures which assume only one cust/code.


    ROFL.  I wish the application had a middle-tier.  Your
    estimation ability is bedazzled by the number of customers and is not
    anywhere grounded in reality.



    You're right that the UI needed to change -- the app detected the
    unique constraint violation and said 'hey, duplicate customer
    code'.  This needed to be changed to include the new constraint
    information, of course.  But that was purely informational and
    wasn't critical to the day's operations.



    No stored procedures assumed only one customer code!  Like I said,
    everything db-side used the identity column.  The customer code
    was used for display and user search purposes only.  Development
    could hardly care that it existed.



    @Alex Papadimoulis said:
    I call bullshit. No way this happened in a company of the size you described.


    Hahaha, so what size company did I describe?  How many millions a
    year do you think this company makes?  And how many millions of
    customers were in the database?  I'd love to hear your guesses.



     The only parts of my anecdote you find unbelievable were either
    factual misunderstandings or completely subjective guesses.



  • Unordained -- that was one of the best posts on the topic that I have seen, anywhere.  And I have read several posts on "natural" versus "meaningless" key values.
     
    Thank you.

  • ♿ (Parody)

    @tufty said:

    > When I was looking for forums software, everyone was
    > like, "you gotta go with Community Server! It's written by
    > ... and is all .NET ... it's great!"

    well, .NET. and, dare I say it, "try before you buy"

    > And the version I'm using now is a **huge** improvement
    > over the old version I had before.

    Yes. Still shit, though, isn't it?

    To be fair, I did do some testing. I wrote a script to import my .TEXT blog posts in, created some additional forums, posted a few times, and it seemed to work fine. The bugs it has are mostly minor, but there are *so* many. The exceptions logging table is a joke; there are literally 1,000's of exceptions logged each day.


    @tufty said:


    Yep. Because country codes never change, do they? Countries never split and become entirely separate countries with entirely separate ISO codes. And ISO3166 codes don't overlap, either, do they? (hint: all of the previous 3 statements are untrue)

    All that aside, ISO3166 is not a _really_ bad choice. It's only broken for the first and third items above; the second is an issue however you key your country table. I should mention that I've seen country tables keyed on country _name_ before...

    That's interesting to hear. I've never had to do an international addresses table, so I haven't come across it. Either way, I'd do quite a bit more research before jumping in.

    @tufty said:


    Okay, so that blows out the possibility of cutomers sharing an address, and also adds one of those oh-so-devilish 'sequence numbers'. How you intend to generate that in a transaction safe way without using an 'identity' or similar is beyond me.

    You're right, but I can't imagine why someone would want such a requirement. Really, I can't. Why would customers share addresses?

    As far as transaction-safety goes, I can't imagine this being a problem. This would not be a heavy-insert table and, even with page-level locks, (Cust_Num + Addr_Seq) would be clustered, minimizing the probability of problems occurring even further.

    In any case, in the scenario given, I would not object to usage of an IDENTITY/SEQUENCE/AUTONUM.
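
    Following the same locking-hint pattern sketched earlier for order numbers, a per-customer Addr_Seq could be generated transaction-safely like this (names assumed; @cust_num would be a stored procedure parameter):

        BEGIN TRANSACTION;
            DECLARE @seq SMALLINT;
            SELECT @seq = COALESCE(MAX(ADDR_SEQ), 0) + 1
            FROM CUSTOMER_ADDRESSES WITH (UPDLOCK, HOLDLOCK)
            WHERE CUST_NUM = @cust_num;  -- only this customer's range is locked
            INSERT INTO CUSTOMER_ADDRESSES (CUST_NUM, ADDR_SEQ)  -- other columns omitted
            VALUES (@cust_num, @seq);
        COMMIT TRANSACTION;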


  • ♿ (Parody)

    That's certainly an interesting application, indeed a very specialized one.

    I was trying to fathom how odd that would be if it were a normal Entity-Address scenario; when Bob moves out of his roommate's pad, they'd have some interesting shipping scenarios.

