@Aaron said:
@Giskard said:
I don't know about you guys.... but if I had to do this...
Would making a second table for the historical data with all the same columns as the first and the additional date column make more sense?
One table gets live data, fast indexes without junk in the way, no convoluted queries and to copy it to the second table you can do it in an INSERT (...) VALUES (date, (SELECT ...)) statement.
'Historical' table can probably even live with reduced indexing if you don't report much (or at all)
Several problems with this:
- No way to automatically enforce data integrity; requires triggers or application logic that could be broken, buggy, or accidentally removed or disabled;
- All of the above needs to be maintained if the table schema is ever changed;
- Completely hoses several common methods of replication and ETL;
- Makes queries slower and more complicated when you actually want to see the disabled/deleted rows (and there must be some instances when you do, otherwise why keep them?);
- Requires that you maintain a separate "deleted" table for every child relationship (horrible).
If there's some reason for you to have a "graveyard" table, for example offline synchronization, that's fine, but don't raise the dead. A tombstone is not an undo state. Seems a lot of programmers and database newbies make this mistake.
The materialized and/or indexed view approach is by far the most reliable and efficient.
With points 1 and 2, yeah it does require some care to maintain, but any SQL bug that slips in will most likely mean some missing audit data instead of a bug in live data. Still crappy, but usually not as deadly
Point 3... yeah.... no fight there.
Point 4: I've never yet had to write anything that mixes live data with logged data, so reports just run on the second tables. No convoluted mix of tables required. Might not always hold true, but has been holding in the business case I've used it for
Point 5: Ah, I will admit that would be disgustingly nasty, and the one case where we needed that kind of info I actually used the date column strategy. Later to find out we didn't even need that part of the data retained, so it just deletes the rows.
I guess I didn't realize I kind of used both. Also, for the use we have, replication of the 'deleted' or 'logged' data actually isn't that important and can be excluded from any replication sets.
My main concern with the information I've dealt with was always making sure live data was entirely separate from historical data, but I'm probably just thinking too defensively.