Database structure (relational) for variable amount of custom data per object

daniel_c_w

In general: An object has a variable amount of custom user data. An administrator can set up the textfields, and the user can type in any text.

example: additional infos (haircolour, favourite movie, etc) in user profiles.

My approach would be to set up a table with one field to referecne the object (user), one or more fields to reference the data (ID, type, etc), and one field to hold the data.

The approach of a coworker of mine would be to alter the table (its structure) every time an administrator sets up (or deletes) a custom field.

Initially I cringed about my coworkers approach, but I didn't know why. The more I think about it, the less I think, that my coworker's approach has any flaw. The only reason I dislike it, is because I think, that an apllcation should only alter data, and not the data modell. But hat thought is rather dogmatic.

Do you have any arguments, pro or contra, about these two approaches?
Or does anybody have a better idea?

I would never have the application alter the database schema.

dhromed

@daniel c w said:

My approach would be to set up a table with one field to referecne the object (user), one or more fields to reference the data (ID, type, etc), and one field to hold the data.

You don't mention the second table that holds the field-value pairs plus FK to the object table, but I assume that you have it, because having the admin change the data model is on exactly the same footing as having the application modify it: you just automate the WTF, not mitigate it.

If you have a variable amount of keys belonging to a single object, you "always" get the two-table model. I say "always", because in truth I recently implemented a language fields table with fields [key]-[nl]-[en], which is theoretically a variable amount of languages and goes against the relational model, but hey, these languages are "never" going to change (that is, not within the next two or three years for this project), so on this tiny scale it's manageable.

Any larger + assured field changes => two-table model, no question.

ammoQ

You approach is easy to set up, but you will have a hard time if you ever need to query that data, e.g. "which user is blond and likes Star Wars Episode 1 best"?

Your coworker's approach makes such queries easier, but it means that all statements to access the table have to be dynamically created at run time. Which makes it a bit harder to use; I don't think that tools like Hibernate are much help in such a setup.

A possible third alternative is to have one large field in the table, that contains all the additional data e.g. in an XML string. But this isn't easy to query, either.

@ammoQ said:

You approach is easy to set up, but you will have a hard time if you ever need to query that data, e.g. "which user is blond and likes Star Wars Episode 1 best"?
Your coworker's approach makes such queries easier, but it means that all statements to access the table have to be dynamically created at run time. Which makes it a bit harder to use; I don't think that tools like Hibernate are much help in such a setup.
A possible third alternative is to have one large field in the table, that contains all the additional data e.g. in an XML string. But this isn't easy to query, either.

While a field-value mapping table is not the easiest thing to query, it is definitely possible, because I've done it with arbitrary filters on any field you want. As for the XML idea, I think that might actually be the best way to go. With modern databases, you can have XQuery built right into your relational queries.

lpope187

As the others have stated, modifying the schema is a bad idea. Personally I'd go with the key/value table design over the xml column. But that's primarily because if you ever need to bulk process the data in the column (like extracting the info to a data warehouse), the xml column will be much slower and complex. Take the deletion of a custom field definition that you mentioned. This would be a very easy query in plain SQL for the key/value pair approach, but I can't even begin to think about the syntax needed to delete the data in the xml field approach.

@lpope187 said:

As the others have stated, modifying the schema is a bad idea. Personally I'd go with the key/value table design over the xml column. But that's primarily because if you ever need to bulk process the data in the column (like extracting the info to a data warehouse), the xml column will be much slower and complex. Take the deletion of a custom field definition that you mentioned. This would be a very easy query in plain SQL for the key/value pair approach, but I can't even begin to think about the syntax needed to delete the data in the xml field approach.

I don't know what database you are using, but if it has XML support (SQL Server and oracle both have it), this would be as simple as a single XQuery.

tgape

@daniel c w said:

My approach would be to set up a table with one field to referecne the object (user), one or more fields to reference the data (ID, type, etc), and one field to hold the data.

The approach of a coworker of mine would be to alter the table (its structure) every time an administrator sets up (or deletes) a custom field.

An SQL database row takes up a certain amount of minimum space based on the number of columns it has, and possibly the types of the columns. (NULL columns may take up less space than a non-null varchar with 0 characters, depending on the SQL database. But it at least takes up a bit to say whether or not it is null.) As such, your coworker's solution will not scale. Further, the larger the database becomes, the more time it takes to alter the table, and the more temporary space it will need to do it.

If you need to be able to find entries which do not have an arbitrary additional tag, your approach won't work. One alternative may be to use a hierarchical or object-oriented database. For example, I've seen an LDAP database ~~kludged~~ set up to handle arbitrary data of that sort. Basically, it had one custom attribute added, we'll call it "data". All values of this attribute were of the form '<custom tag>=<value>'. While this did take additional space for every custom field used, it required no space for any custom fields not used, for any particular entry.

I've also seen an SQL database set up with one column being defined as a potentially ginormous blob; custom fields were glued together in here via an XML overlay. While this did allow searches to look for "this but not that", MySQL indexes were not helpful on any complicated queries, so they were slow as, well, something really slow.

Of course, one needs to choose ones database based on the overall need of your data; this one aspect may not be sufficient to change databases.

RaspenJho

XML is very useful in this scenario. Here is a little test script (for SQL Server 2008) that plays around with the XML.

declare @t table (
	Id			int not null,
	Username	nvarchar(64) not null,
	CustomData	xml
)

insert @t (Id,UserName,CustomData) 
values	(
		1, 
		'testuser@test.com', 
		'<data><CustomData Name="HairColor" Value="White" /><CustomData Name="HomePage" Value="http://thedailywtf.com" /></data>'
		),
		(
		2, 
		'testuser2@test.com', 
		'<data><CustomData Name="HairColor" Value="Red" /><CustomData Name="HomePage" Value="http://thedailywtf.com" /></data>'
		),
		(
		3, 
		'testuser3@test.com', 
		'<data />'
		)

select	T.*
from	@t	T
cross apply CustomData.nodes('/data/CustomData[@Name="HairColor"]') HairColor(T)
cross apply CustomData.nodes('/data/CustomData[@Name="HomePage"]') HomePage(T)
where	HairColor.T.value('@Value', 'nvarchar(64)') in ('White', 'Red')
  and	HomePage.T.value('@Value', 'nvarchar(64)') = 'http://thedailywtf.com'

Nice example. I think XML is really going to be the easiest, cleanest and most elegant solution which is available in most DBMS.

Jaime

I think XML really is overkill here.

Here is the pre-XML "standard solution":

CREATE TABLE UserProfiles
(
UserID int NOT NULL PRIMARY KEY,
UserName varchar(200) NOT NULL
)

CREATE TABLE Properties
(
PropertyID int NOT NULL PRIMARY KEY,
PropertyName varchar(30) NOT NULL,
)

CREATE TABLE UserProperties
(
UserID int NOT NULL,
PropertyID int NOT NULL,
PropertyValue varchar(200) NOT NULL
)

INSERT UserProfiles VALUES (1, 'Bob')
INSERT UserProfiles VALUES (2, 'Mary')

INSERT Properties VALUES (1, 'HairColor')
INSERT Properties VALUES (2, 'HomePage')

INSERT UserProperties VALUES (1, 1, 'White')
INSERT UserProperties VALUES (1, 2, 'http://thedailywtf.com')
INSERT UserProperties VALUES (2, 1, 'Red')
INSERT UserProperties VALUES (2, 2, 'http://thedailywtf.com')

In the App GUI, a user chooses to search for Red hair and TDWTF home page. The UI needed to know the names and IDs for HairColor and HomePage to build the form. The app then issues a query similar to the folowing:

SELECT
*
FROM
Users
WHERE
  UserID IN (SELECT UserID FROM UserProperties WHERE PropertyID = 1 AND PropertyValue = 'Red')
    AND
  UserID IN (SELECT UserID FROM UserProperties WHERE PropertyID = 2 AND PropertyValue = 'http://thedailywtf.com')

It's pretty easy to generate standard search clauses using this schema.

RaspenJho

Just as a side note (I'm not sure when they added it for SQL server), you can shorten this script by inserting multiple rows at once...

INSERT UserProfiles VALUES 
    (1, 'Bob'), 
    (2, 'Mary')

INSERT Properties VALUES 
    (1, 'HairColor'), 
    (2, 'HomePage')

INSERT UserProperties VALUES 
    (1, 1, 'White'), 
    (1, 2, 'http://thedailywtf.com'), 
    (2, 1, 'Red'), 
    (2, 2, 'http://thedailywtf.com')

Lingerance

@RaspenJho said:

you can shorten this script by inserting multiple rows at once...

Seeing as how MySQL does the same thing, it's probably a common feature, if not part of the standard.