Database design... use XML column?

Disclaimer : I'm not a DBA nor have I built many databases from scratch.

Alright, I'm looking down the barrel of a system redesign... The part I need some added brain power is on the database side of things, I hope someone can offer some insignt or at least point out flaws.

The basic premise is taking data files from various customers, the only real commonality is they will all contain address information. (Each customer has data that will be used to produce a physical product and then mailed.)

My current plan would be to have a single data table where all customer data could be stored, thus I would have named columns for address related info, and one XML column for customer specific data. Each customer would have a schema for their XML data. The reason I'm considering XML is I want to stay away from column names like "CustomerData1", "CustomerData2", etc...

Finally, I would need to be able to load multiple customer files, and store them as "Jobs" so our production department would be able to have a work queue, something we lack entirely right now.

Here is a -basic- and off the cuff attempt at a possible DB structure

Customers { uniqueID, name, etc...}

Data {dataID, customerID, Address1, Address2, City, State, etc... XML}

Jobs { jobID, customerID, itemCount }

Customer_Jobs { jobID, dataID }

Question : Is this a decent use for the XML column datatype? If not, how would you build a table to store data with dissimilar pieces?

boomzilla

I guess my question would be, what sort of "customer data" do you expect to receive, and what exactly are you going to do with it? From your description, it sounds totally arbitrary, such that you probably aren't going to use it in your application, other than maybe to display it.

If it's reallly that arbitrary, you could have another table something like:

CustomerData{ customer_id, name, value }

My first goal would be keeping as much of the data records together as possible for all the odd ad hoc reporting I get asked to do.

Right now our structure is each customer has a complete set of tables just for them.. i.e Customer1_preprocess, Customer1_production, Customer1_history, Customer1_errors, etc...

It makes reporting a pain, also we only work out of each customers "production" table, so we can't queue work..

Speaking to your question directly : The data is usually arbitrary, but each customer works slightly different. Also the table you suggested appears to be just a name/value pair.. which XML seems to represent.

Besides XML columns being pretty new (relative to the age of data bases in general) is there any reason not to use them for this type of application?

boomzilla

@Abbacabba said:

My first goal would be keeping as much of the data records together as possible for all the odd ad hoc reporting I get asked to do.
Right now our structure is each customer has a complete set of tables just for them.. i.e Customer1_preprocess, Customer1_production, Customer1_history, Customer1_errors, etc...
It makes reporting a pain, also we only work out of each customers "production" table, so we can't queue work..

Well, it shouldn't be too hard to simply join the customer data table.

@Abbacabba said:

Speaking to your question directly : The data is usually arbitrary, but each customer works slightly different. Also the table you suggested appears to be just a name/value pair.. which XML seems to represent.
Besides XML columns being pretty new (relative to the age of data bases in general) is there any reason not to use them for this type of application?

I guess I would answer your question with another question. Since you're using a database, why wouldn't you use the database to store the data, rather than stuffing XML in there? Unless perhaps your customers send you the XML, and you simply send it back at some point. Yes, I described the table that way supposing that since you were proposing XML, that's probably the sort of thing you'd need. If you're handing the data off to some application that wants XML, then I suppose that your XML column makes more sense, since it's just some opaque data as far as the database is concerned.

I guess my problem is that I'll most likely be the "DBA" and Developer for this.<o:p></o:p>

From my developer point of view, writing a class and functions for that say a customerData class would be made very easy by having essentially *one* type of data row, and always knowing any data line to produce is simply in the Data table.<o:p></o:p>

So db table Data {a,b,c,d,XML} becomes <o:p></o:p>

class customerDataLine
{<o:p></o:p>

string a,b,c,d; // Address data, db column name is the same as the code variable name.<o:p></o:p>

xmldocument x; // "blob" data from the db easily worked with using the schema and standard xpath, LINQ(C#), or language specific XML class. <o:p></o:p>

};<o:p></o:p>

........................................<o:p></o:p>

what I'm trying to avoid is this type of structure:

DB :

CustomerA_addressData
CustomerA_propritaryData
CustomeB_addressData
CustomerB_propritaryData<o:p></o:p>

or any similar setup.<o:p></o:p>

<o:p></o:p>

From my perspective at this point it seems as if the DB design has potential. And the application code that will tie into it will be made more consistent across customers using the XML datatype...<o:p></o:p>

<o:p></o:p>

All this becomes really important because I'm the only
developer here, and our customer base might explode this year... I need a
system where I can quickly add customers to the DB structure, and also code up their
processing on the application side very quickly.

zedhex

If I had a buck for every piece of god-awful code I've seen that started like this I'd have, ... well, enough to get seriously drunk..)

The best advice I can give is start reading up on basic database design. Don't ever get the idea that db design is something you can just <<do>> without some very serious thought. As a basic first step, you need to model all the data entities and their relationships. Find where they have commonalities and differences. Then draw them up as a schema and start imagining use-cases. If you schema can handle all your various use cases, you have your first candidate. From reading your ideas, I think you will have complete mess if you try putting disparate data into one single table. Much better to abstract out the commonalities into one master table, and then create specializations in related tables. Then add a specialization field in the master table (maybe call it "CustomerType"?) and use it to ensure that the correct customer data goes into the respective specialized table. That way you will find that your data more correctly models your problem space.

I've gotten the rest of my schema mapped out, and it followed a pretty easy design... Its just this damn main data issue, its the only real point in the process where data will be "unique" per product. Like I stated in my previous post(s), my developer side is wanting to jam it into a single table, the DB design side says there is most likely a better design... I'm just trying to find it.

The only problem I was running into was how best to store the actual customers data everything else is pretty much the same for all customers.

I thought about haveing the "main" or "master data" have a column that would "point" to the customers unique table, but is this a good/bad thing?

Example :

Data {dataID(PK), customerID, address1,address2, etc..., customer_data_table} (or have a seperate table... XXXX { something(PK), customerID, ProductID, customerDataTable} )?

CustomerA_Data

{ dataID(FK), customerDataID(PK), customerdata1, customerData2, customerData3}

Is this what your talking about? (again just an ugly quick try)

dhromed

@Abbacabba said:

Speaking to your question directly : The data is usually arbitrary, but each customer works slightly different. Also the table you suggested appears to be just a name/value pair.. which XML seems to represent.

XML is not specifically designed for name/value pairs, and the DB table does a far better job. Boomzilla's original idea is the sanest approach.

zedhex

@Abbacabba said:

I've gotten the rest of my schema mapped out, and it followed a pretty easy design... Its just this damn main data issue, its the only real point in the process where data will be "unique" per product. Like I stated in my previous post(s), my developer side is wanting to jam it into a single table, the DB design side says there is most likely a better design... I'm just trying to find it.

The only problem I was running into was how best to store the actual customers data everything else is pretty much the same for all customers.

I thought about haveing the "main" or "master data" have a column that would "point" to the customers unique table, but is this a good/bad thing?

Example :

Data {dataID(PK), customerID, address1,address2, etc..., customer_data_table} (or have a seperate table... XXXX { something(PK), customerID, ProductID, customerDataTable} )?

CustomerA_Data

{ dataID(FK), customerDataID(PK), customerdata1, customerData2, customerData3}

Is this what your talking about? (again just an ugly quick try)

Thats pretty much correct. Using master table/child tables is often a reasonable approach when designing a solution based on data specializations. I tend to use it a lot, especially in combination with a Linq-est object model. Just derive your business classes from a common base class. The base clase would be hydrated from the Master table, and the child tables would be based on one of the Child tables. You would have to add your own data-layer to control selectively loading the correct data, but that isn't too much work. Extending the schema is then just a matter of adding a new child table and corresponding class. All the other code would be blissfully unaffected.

Grovesy

@boomzilla said:

CustomerData{ customer_id, name, value }

I'm sure I've seen several WTF's on here with said database design...<?xml:namespace prefix = o ns = "urn:schemas-microsoft-com:office:office" /><o:p></o:p>Anyhow, I would be very cautious of trying to create a 'generic' database to hold any old piece of data (why build something generic on top of something generic?)... Unless very very well designed and thought out you can quickly find yourself being the subject of an article on here. It’s sometimes easier to have ‘lots of tables’ and write out longhand queries to pull out the data than to have a single ‘generic’ data structure.<o:p></o:p>I would highly recommend going out and picking up a good database theory book, such as Chris Dates Introduction to databases.<o:p></o:p>

lpope187

@Grovesy said:

@boomzilla said:
CustomerData{ customer_id, name, value }

I'm sure I've seen several WTF's on here with said database design...Anyhow, I would be very cautious of trying to create a 'generic' database to hold any old piece of data (why build something generic on top of something generic?)... Unless very very well designed and thought out you can quickly find yourself being the subject of an article on here. It’s sometimes easier to have ‘lots of tables’ and write out longhand queries to pull out the data than to have a single ‘generic’ data structure.I would highly recommend going out and picking up a good database theory book, such as Chris Dates Introduction to databases.

Here's my 2 cents. This type of table is fine if it is used for human reference only and is not used programmatically. If you need to adjust program behavior based on these values, profile the data and develop a different data model.

Outlaw_Programmer

Once place that I used to work had a DB like this. There was a UserConfiguration table that had 2 columns: key and xmlData. If that wasn't bad enough, it wasn't even XML. We would serialize objects as XML via the XMLEncoder class and store that in the database. When you want to read this object, the data access layer would look up the XML and use XMLDecoder to give you back the original object. It was almost a clever idea except for the fact that the XML generated was unreadable. God help you if you ever needed to see what a user's preferences were by just trying to look at the database.

They chose this approach because there was literally no way of knowing what kind of data they wanted to store. The system was billed as "completely customizable" and you could write your own tools and GUIs for it. Rather than have each client create their own tables, they just provided a service like UserConfiguration.saveObject(String key, Object o). Again, it was pretty convienient to use but a complete mess if you were actually trying to read the database.

I'm actually building out the seperate tables design right now to see how it works..

Here is what I'm looking at right now :

products { id(PK), customerID(fk), description }

customer_jobs { id(PK), jobID(fk), customerID(fk), productID(fk), address1, address2, city, state, postalCode,country, customerDataID(fk), status } <-- "main data table"

customer_customerDataTable {customerID(fk), productID(fk), customerDataTableName} <--- PK will most likely be autogenerated.

CustomerA_P1 {id(pk), a, b, c, d, e} <--- customer A, product 1 table

CustomerA_P2 {id(pk), a, b, f, g, h, z} <-- customer A, product 2 table

CustomerB_P1 {id(pk), j, e, c, y, p} <---- customer B, product 1

Still a WTF design?

And also :

Here is an example of afew data lines I might receive (comma delimited):

Customer1 : 123 Main St., SomeCity, TX, 12345-1234, MemberID, Message, MemberPin

Customer2 : 987 1st Street, City2, VA, 54321, eyeColor, Height, Weight, Sex

Customer 3 : 21 Jump Street, New York, NY, 22222, badgeNo, Classification, Rank, DepartmentNo, DeskNo, licenseToKill, status

As you can see, all the data records contain mailing address information, but each customer also sends in number of other pieces of information for a record. Also the number of pieces of data beyond address info a customer sends is customer specific, one might have 3 fields while another might have 12 or 20. (in the above examples Customer3 has many more fields than 1 or 2, and no customer sends in the same data items.)

		<!-- controls -->&nbsp;<a href="http://www.dbforums.com/editpost.php?do=editpost&amp;p=6318716" class="" name="vB::QuickEdit::6318716"></a> </p>

zedhex

@Abbacabba said:

I'm actually building out the seperate tables design right now to see how it works..

Here is what I'm looking at right now :

products { id(PK), customerID(fk), description }

customer_jobs { id(PK), jobID(fk), customerID(fk), productID(fk), address1, address2, city, state, postalCode,country, customerDataID(fk), status } <-- "main data table"

customer_customerDataTable {customerID(fk), productID(fk), customerDataTableName} <--- PK will most likely be autogenerated.

CustomerA_P1 {id(pk), a, b, c, d, e} <--- customer A, product 1 table

CustomerA_P2 {id(pk), a, b, f, g, h, z} <-- customer A, product 2 table

CustomerB_P1 {id(pk), j, e, c, y, p} <---- customer B, product 1

Still a WTF design?

And also :

Here is an example of afew data lines I might receive (comma delimited):

Customer1 : 123 Main St., SomeCity, TX, 12345-1234, MemberID, Message, MemberPin

Customer2 : 987 1st Street, City2, VA, 54321, eyeColor, Height, Weight, Sex

Customer 3 : 21 Jump Street, New York, NY, 22222, badgeNo, Classification, Rank, DepartmentNo, DeskNo, licenseToKill, status

As you can see, all the data records contain mailing address information, but each customer also sends in number of other pieces of information for a record. Also the number of pieces of data beyond address info a customer sends is customer specific, one might have 3 fields while another might have 12 or 20. (in the above examples Customer3 has many more fields than 1 or 2, and no customer sends in the same data items.)

Yes, it is still a WTF. One of the main precepts that E.F.Codd came up with (he was the guy who originally came up with the idea of relational databases) was that all fields in a table should relate directly to the PK. This means that your CustomerJobs table really should not have address data in it. Far better to have separate Customer, CustomerAddresses (in case your customers have more than one address - I know all mine have), Products and then specialized Customer tables as mentioned previously. The you'd add your CustomerJobs table as a junction between the Customers and Products tables. It also looks as if you will need to parse and validate your data to load the database correctly (EyeColor??? - why do they want you to know that?).

Looking at your example data is just depressing - why is it that customers never have the first clue about organizing data? - oh yeah, cos they aren't tech people.

But we don't have that excuse.

lpope187

@zedhex said:

Yes, it is still a WTF. One of the main precepts that E.F.Codd came up with (he was the guy who originally came up with the idea of relational databases) was that all fields in a table should relate directly to the PK. This means that your CustomerJobs table really should not have address data in it. Far better to have separate Customer, CustomerAddresses (in case your customers have more than one address - I know all mine have), Products and then specialized Customer tables as mentioned previously. The you'd add your CustomerJobs table as a junction between the Customers and Products tables. It also looks as if you will need to parse and validate your data to load the database correctly (EyeColor??? - why do they want you to know that?).

This is one of those gray areas between theory and practice. Most definitely I would not have the job reference any address associated with customer directly, I'd copy the address entry from the customer and associate it with the job directly. There are several reasons I'd do this. One is for drop shipments - I buy something but I want it shipped to my sister (a new address entry associated only with that order). Another reason is location moves - either a company changes their headquarters or gets purchased and absorbed. You need the original shipoing address for tax and auditing purposes, but any new orders go to the updated address on the customer master. The choice of adding those actual address fields to the job somewhat eliminates the accidental rewrite of shipping history by updating the master record - so in this case having the address fields directly on the job might be appropriate.

@Abbacabba said:

Here is an example of afew data lines I might receive (comma delimited):

Customer1 : 123 Main St., SomeCity, TX, 12345-1234, MemberID, Message, MemberPin
Customer2 : 987 1st Street, City2, VA, 54321, eyeColor, Height, Weight, Sex
Customer 3 : 21 Jump Street, New York, NY, 22222, badgeNo, Classification, Rank, DepartmentNo, DeskNo, licenseToKill, status

From what I can tell, this looks like product specific attributes. For example, customer A sells clothes so they need size and color attributes but customer B sells stationary so they need style and letterhead attributes. Those product specific attributes are specific to each product, customer and job line item and are displayed on the workorder. Is this a correct understanding of the sample data?

Your are correct in your assumption that they are product specific attributes.

I've been working on a real schema in SQL server 2005 and found a good and/or working way to represent this while following up 1NF, 2NF & 3NF.

I really just had to supress my developer wants and go at it as a good DB design. I was -enticed- by the thought of just having "one table to rule them all" (I know that was lame).

right now I have my Jobs, customers, products, data records all split apart and using intermediate tables for joining them together, no more mashing things all together.

Grovesy

@lpope187 said:

This is one of those gray areas between theory and practice. Most definitely I would not have the job reference any address associated with customer directly, I'd copy the address entry from the customer and associate it with the job directly. There are several reasons I'd do this. One is for drop shipments - I buy something but I want it shipped to my sister (a new address entry associated only with that order). Another reason is location moves - either a company changes their headquarters or gets purchased and absorbed. You need the original shipoing address for tax and auditing purposes, but any new orders go to the updated address on the customer master. The choice of adding those actual address fields to the job somewhat eliminates the accidental rewrite of shipping history by updating the master record - so in this case having the address fields directly on the job might be appropriate.

no reason that you would need to replicate the fields in different tables, we could do something like...

customer {customer#, name...}

pk {customer#}

address {address#, postcode, addressline1....}

pk {address#}

customer_address {customer#, address#, address_type, active}

pk {address#, customer#, address_type}

fk {customer# customer.customer#, address# address.address#};

job {job#, customer#, location#, start_date, estimated_end_date}

pk {job#}

fk {customer# customer.customer#, location# customer_address#}

job_completed {job#, actual_end_date, completed_by, notes}

pk {job#}

fk {job# job.job#}

type address_type {main, secondary, shipping, satelite}

customer_data {customer#, name, value}

pk {customer#, name}

fk {customer# customer.customer#}

This would preserve customer address history (all addresses and all historic addresses), and would allow us to compose the following queries. all jobs and the address of the job, all customers at a certain address, all jobs at a certain address, all jobs for a customer at a certain address.