Database design question



  • Hello All,

    I would like your perspective on a (re)design for data aggregation.

    This is the general scoop:
    - We have many "data providers" grouped in categories.
    - Providers in the same category send data with the same structure.
    - Providers can be added/removed dynamically.
    - The data provided is linked to other objects in our database.

    In the current system, each instance of a data provider writes data
    to a separate table. Since we have many providers, we end up with tons
    of tables whose names are not known a priori. As a result, we have to
    use naming conventions for tables, which is not a clean approach.

    I considered having only one table for all providers, with a typed
    XML column for the actual data. That column is associated with a
    schema collection that contains the schemas for all data
    providers.

    From the architecture standpoint, this approach simplifies my
    scenario a lot. However, I have some concerns:

    1. Bulk: this table will be very large (possibly millions of rows).
    2. Load: I need to search, add, update, and modify entries in this
    table from multiple providers, so there is an issue of contention.
    3. Performance: XML is usually very slow.

    I would appreciate your comments on this matter.

    What do you think about this approach?
    How would you compare the two designs?
    Are there any provisions or techniques to mitigate the concerns with
    the single-table XML approach?

    Kind regards
    CD



  • You're trying to make a database do the job of a server daemon. This is fundamentally insane.



  • XML in the database is IMO not a good idea, unless it's simply "payload" (just like a blob) and doesn't need to be searched, sorted, etc.

    From your description of the problem, I'd try to go with one table per category; if the structures for categories are very similar, you could go with just one table with the superset of all necessary columns.
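
    The superset variant might look something like this (a rough sketch;
    every table and column name here is made up):

    -- One row per delivery: the union of all category-specific columns,
    -- with the ones a category doesn't use left NULL.
    CREATE TABLE provider_data (
        id          INT           NOT NULL PRIMARY KEY,
        provider_id INT           NOT NULL,  -- which provider sent the row
        category    VARCHAR(50)   NOT NULL,
        -- columns used by one category...
        reading     DECIMAL(12,4) NULL,
        unit        VARCHAR(10)   NULL,
        -- ...and columns used by another
        status_code INT           NULL,
        message     VARCHAR(400)  NULL
    );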



  • Like ammoQ says, it sounds like you want a single table with columns for all providers plus an extra 'providerID' column? Alternatively, one table per category, with perhaps a view across all categories if you need to see the whole lot in one go.
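
    For the per-category variant, the view across all categories might
    look roughly like this (all names invented):

    CREATE VIEW all_provider_data AS
        SELECT 'sensors' AS category, provider_id, reading, received_at
          FROM sensor_data
        UNION ALL
        SELECT 'logs' AS category, provider_id, reading, received_at
          FROM log_data;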

    I would never, ever go the XML route you have just suggested.



  • @crbd98 said:

    […]

    Check out SQL Server 2005's new Service Broker system.  It fits your requirements perfectly, assuming you are on (or have access to) SQL 2K5 of course.
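
    The moving parts would look roughly like this (only a sketch, and the
    object names are invented):

    CREATE MESSAGE TYPE ProviderData VALIDATION = WELL_FORMED_XML;
    CREATE CONTRACT ProviderContract (ProviderData SENT BY INITIATOR);
    CREATE QUEUE ProviderQueue;
    CREATE SERVICE ProviderService ON QUEUE ProviderQueue (ProviderContract);

    Each provider would then BEGIN DIALOG with that service and SEND its data as messages, while a reader (or activation procedure) drains the queue into your tables, which keeps the write contention off the tables themselves.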

     



  • If you're not fixed on a certain DBMS, you could have a look at PostgreSQL's inheritance capabilities.

    Like (basically, RTFM!):

    CREATE TABLE main_table (id serial, provider int, data text);

    CREATE TABLE first_provider () INHERITS (main_table);

    CREATE TABLE second_provider () INHERITS (main_table);

    This will give you all entries from all inheriting tables when SELECTing from main_table.
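
    For example, using the names above:

    SELECT provider, data FROM main_table;       -- parent plus all child tables
    SELECT provider, data FROM ONLY main_table;  -- parent table alone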

    Could reduce the needed changes to what you have right now...

    I agree with the others' opinion regarding XML as anything but pure payload.
    It's not too hard to "unroll" the real data if it's not too deeply nested.

    Cheers,

    Nick



  • 1. Bulk: Since this table will be very large (possibly millions of rows).
    Wouldn't worry about that: millions of rows is not much for a decent database.

    2. Load: I need to search, add, update, modify entries on this table from multiple providers. Therefore, there is an issue of contention.
    That's hard to judge without knowing the load on your system, but again, a decently configured DB should be able to handle 100 or so "upserts" per second while maintaining decent select response time.

    3. Performance: XML is usually very slow.
    If it's primarily a matter of generating XML from the provider data and putting that into one un-indexed column as "payload", that should not be slow. In fact, generating XML + one column insert is likely to be faster than inserting multiple (indexed) columns. That said, if you're planning to search on the XML content, you should not take this route.

    You say that the provider data is linked to other data in the system, I assume by relational references - in that case XML is out of the question.

    Surely different categories of provided data have something in common? If not, I would suggest using different schemas (and applications) altogether instead of different tables.

    My tip is to do a little bit of application analysis. Try to find the common patterns in the different data structure categories. You should be able to fit it into one or two base tables that cover 80% of the provided data. Check out (ask your users and log application queries) which data is most relevant - it's not likely to be a lot more than the 80% common ground you'll already have.

    Put a layer between provider input and the database that massages the data and puts it in the 80/20 tables, properly indexed and related to the other data in your schema.

    For the remainder of the data you can either choose your XML approach (if it's not going to be heavily searched) or use an auxiliary column table with essentially 3 columns: one FK to the "80/20" record with the primary data, one FK to a logical column name (from the provider category), and one column for the data. Depending on circumstances, you can have separate auxiliary tables for the different data types, or just one auxiliary table with a data column per data type.
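
    A rough sketch of that auxiliary-table layout (every name here is
    invented, using the one-table, one-data-column-per-type variant):

    -- Hypothetical "80/20" base table holding the common columns.
    CREATE TABLE provider_record (
        id          INT       NOT NULL PRIMARY KEY,
        provider_id INT       NOT NULL,
        received_at TIMESTAMP NOT NULL
        -- ...plus the ~80% of columns shared across categories
    );

    -- Logical column names, defined per provider category.
    CREATE TABLE logical_column (
        id       INT          NOT NULL PRIMARY KEY,
        category VARCHAR(50)  NOT NULL,
        name     VARCHAR(100) NOT NULL
    );

    -- Auxiliary table for the remaining ~20%: one row per extra value,
    -- one data column per type (only one non-NULL per row).
    CREATE TABLE auxiliary_value (
        record_id  INT NOT NULL REFERENCES provider_record (id),
        column_id  INT NOT NULL REFERENCES logical_column (id),
        text_value VARCHAR(4000) NULL,
        num_value  DECIMAL(18,4) NULL,
        date_value TIMESTAMP     NULL,
        PRIMARY KEY (record_id, column_id)
    );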

