Data Change Auditing




  • Looking for opinions here. I'm in the initial stages of a database design with a requirement for change auditing/logging. I'm basically analyzing four different options and thought others here might be interested in sharing their own experiences, disasters, success stories, WTFs, etc.

    Option 1: History Tables

     For each table that needs to be audited, create another table that matches the schema to keep each version of the row as it changes over time.

    Pros: easy, fast; the same code that loads records from the primary tables can be used to load historical data, since the schemas match.
    Cons: potentially doubles the number of tables in the database, lots of data, and the history tables' schema must be kept in sync with the primary tables.
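
    A rough per-table sketch, assuming SQL Server and a hypothetical Customer table (the names, types and trigger below are illustrative, not from the original design):

    CREATE TABLE Customer (
        CustomerId INT           NOT NULL PRIMARY KEY,
        Name       NVARCHAR(100) NOT NULL,
        Email      NVARCHAR(255) NULL
    );

    -- Same columns again, plus audit metadata
    CREATE TABLE CustomerHistory (
        HistoryId  INT IDENTITY(1,1) PRIMARY KEY,
        CustomerId INT           NOT NULL,
        Name       NVARCHAR(100) NOT NULL,
        Email      NVARCHAR(255) NULL,
        ChangedBy  SYSNAME       NOT NULL DEFAULT SUSER_SNAME(),
        ChangedAt  DATETIME      NOT NULL DEFAULT GETDATE()
    );
    GO

    -- Copy the old row into the history table on every update or delete
    CREATE TRIGGER trg_Customer_History
    ON Customer
    AFTER UPDATE, DELETE
    AS
    BEGIN
        INSERT INTO CustomerHistory (CustomerId, Name, Email)
        SELECT CustomerId, Name, Email
        FROM deleted;   -- 'deleted' holds the pre-change rows
    END;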

    Option 2:  Same tables, versioned

    For each table that needs to be audited, add some type of version identifier - maybe 'VersionNumber' or something - where the highest value (or NULL) marks the 'current' row. This is not my idea, so don't flame me too much here. I think this is doomed to fail if attempted.

    Pros: ahhhh… no doubling of tables? The same code to load records can be used regardless of whether the record is an 'audit' record or the 'current' record, and the data schema automatically stays in sync (in comparison to option 1).
    Cons: queries become much more complex throughout the application, and lots of data.
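
    Sketched with an IsCurrent flag rather than a NULL version marker (names and types assumed):

    CREATE TABLE Customer (
        CustomerId    INT           NOT NULL,   -- business key
        VersionNumber INT           NOT NULL,
        IsCurrent     BIT           NOT NULL,
        Name          NVARCHAR(100) NOT NULL,
        CONSTRAINT PK_Customer PRIMARY KEY (CustomerId, VersionNumber)
    );

    -- Every "normal" query now needs the extra filter
    SELECT CustomerId, Name
    FROM Customer
    WHERE IsCurrent = 1;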

    Option 3: Change Rollup (Just the new data at the column level)

    A history table with five columns: ChangedTable, ChangedColumn, UserName, DateChanged, NewValue.

    Pros: a single table (or far fewer tables) to store history, less data.
    Cons: more complex to load a record at a specific point in history, slower (assumed), and you lose referential integrity with the data (though that might not be a huge issue).
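
    As a sketch, using the five columns from the post (types assumed; note that every value ends up coerced to a string, which is where the loss of referential integrity comes from):

    -- One row per changed column value
    CREATE TABLE ChangeLog (
        ChangedTable  SYSNAME       NOT NULL,
        ChangedColumn SYSNAME       NOT NULL,
        UserName      NVARCHAR(128) NOT NULL,
        DateChanged   DATETIME      NOT NULL,
        NewValue      NVARCHAR(MAX) NULL
    );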

    Option 4: Store the entire record, in some type of serialized fashion (XML Serialization)

    Pros: Easy
    Cons: lots of data, and if the object model changes over time, deserializing can cause issues unless all of the historical data is updated to match whenever a new version is deployed.
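
    A sketch of the storage side, assuming SQL Server's XML column type (table and column names hypothetical):

    -- One serialized snapshot of the whole record per change
    CREATE TABLE RecordSnapshot (
        SnapshotId INT IDENTITY(1,1) PRIMARY KEY,
        RecordType NVARCHAR(128) NOT NULL,   -- e.g. 'Customer'
        RecordId   INT           NOT NULL,
        TakenAt    DATETIME      NOT NULL DEFAULT GETDATE(),
        Snapshot   XML           NOT NULL    -- the serialized object graph
    );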


    I know which option I'm leaning towards; I've done this on a smaller scale before. The current design is around 250 tables, but a portion of those don't need to be audited as they're lookup data or similar. Just figured I'd see what others have experienced before moving forward.



  •  Option 5:  SQL Server 2008 with Change Data Capture enabled.  
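
    For reference, enabling CDC is roughly this (these are the real SQL Server 2008 procedures; the table name is just an example):

    -- Enable CDC for the database, then per table. SQL Server Agent must be
    -- running for the capture job to harvest changes from the transaction log.
    EXEC sys.sp_cdc_enable_db;

    EXEC sys.sp_cdc_enable_table
        @source_schema = N'dbo',
        @source_name   = N'Orders',
        @role_name     = NULL;   -- NULL = no gating role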

    IMO, Option 1 is good for things that have gone to a steady state. Example: orders. An order gets placed and changes until it is either shipped or canceled, at which point it goes to the history table. I've seen this in ERP systems to help speed up the MRP process.

    Option 2 is what you see in data warehouses, and despite your misgivings it actually works very well. The only bad part is that most of your queries will need to have WHERE IsCurrent = 1 or similar in them. The advantage here is rollback: delete the most current record, set IsCurrent = 1 for the previous record, and you're done. It's also a good choice if the client needs to view but not change previous versions - think document/electronic form version control.
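
    That rollback, sketched with assumed table and column names (@DocumentId standing in for the record's key, e.g. a stored procedure parameter):

    BEGIN TRANSACTION;

    -- Drop the newest version...
    DELETE FROM Document
    WHERE DocumentId = @DocumentId
      AND IsCurrent = 1;

    -- ...and flag the one before it as current again
    UPDATE Document
    SET IsCurrent = 1
    WHERE DocumentId = @DocumentId
      AND VersionNumber = (SELECT MAX(VersionNumber)
                           FROM Document
                           WHERE DocumentId = @DocumentId);

    COMMIT TRANSACTION;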

    Option 3 just sucks. Complex to load, complex to query, complex to roll back.

    Option 4 sucks as well. XML in a relational database just seems wrong, not to mention the poor performance and the fact that one table can't do it all if you actually want to use a DTD to enforce constraints.

     

    I've just finished a large application that used Option 2. It's basically a large workflow-enabled document/form version control library, so it makes sense there. Without knowing any more details, I'd probably go with either Option 1 or Option 2.



  • Option 3 sucks completely. Even if "DateChanged" also includes the time in seconds, fast updates on the same row within the same second make it literally impossible to recreate the state of a table at a specific point in time. At the very least you would need an update counter so you know a) which changes happened together and b) which changes happened first (when timestamps are equal).

    Option 4 might be ok if you do not really need to query the data by value, just by timestamp.

    Option 2 makes all queries more complicated. I would only consider it if I had a data access layer for all queries that access those tables, hiding the nasty details. It's also hard to enforce the integrity of the history records with this option - a faulty update statement could affect them as well. With all the other options, the history tables can be put into an "insert only" mode that doesn't allow updates or deletes.
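
    One way to approximate that insert-only mode in SQL Server would be to deny the application's role the rights it shouldn't need (history table and role names assumed):

    -- The history table stays trustworthy if the app simply can't change it
    DENY UPDATE, DELETE ON dbo.CustomerHistory TO app_role;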

    Option 1 is my favorite. You might want to create another schema for those history tables, so they don't flood the table list in tools etc.

     

    BTW, Oracle offers features in the current version that take care of most of that stuff...

     



  • @ammoQ said:

    Option 3 sucks completely. Even if "DateChanged" also includes the time in seconds, fast updates on the same row within the same second make it literally impossible to recreate the state of a table at a specific point in time. At the very least you would need an update counter so you know a) which changes happened together and b) which changes happened first (when timestamps are equal).

    Option 4 might be ok if you do not really need to query the data by value, just by timestamp.

    Option 2 makes all queries more complicated. I would only consider it if I had a data access layer for all queries that access those tables, hiding the nasty details. It's also hard to enforce the integrity of the history records with this option - a faulty update statement could affect them as well. With all the other options, the history tables can be put into an "insert only" mode that doesn't allow updates or deletes.

    Option 1 is my favorite. You might want to create another schema for those history tables, so they don't flood the table list in tools etc.

    I agree with your analysis, but would also like to add that Option 2 will most likely slow the application down. It will bloat the tables and indexes of the "current" data and make the queries more complicated, not just for the application but for the query optimizer. It will make things like backups, replication and restoration much slower, and require more frequent ANALYZEs so the optimizer can estimate the version column accurately. It also requires establishing some kind of mutex for each record so that there are no concurrency issues with version numbers. And if you're using MySQL it's a total no-go, as MySQL will never use more than one index per table, so you end up resorting to full scans for every single query. It's essentially building an MVCC transaction system on top of an existing RDBMS, which is really quite stupid. If you're going to keep a history, at least do it "offline" in a separate table, or use an RDBMS that can handle it natively.

     

    Ultimately, it all depends on what you are trying to accomplish.  If the history is going to be used a lot, there may be a less WTFy way to implement it.  If it's really only an auditing record, I would just use something like a log file on disk, as that is much, much better suited to being continuously appended to if you don't need random access to a certain revision frequently.



  • IMHO the best option is, as lpope187 already stated, SQL Server 2008 with versioning...

    That being said, I have the same problem in my current project and unfortunately only have 2005, so I did some googling and found this:

    http://www.codeproject.com/KB/architecture/LisRecordVersioning.aspx 

    It's Option 2. It works nicely so far and I haven't noticed any performance issues.



  • OK, here's my three-penny worth. My qualifications? 22 years in IT, many, many years of Sybase and others.

    Firstly, you say you have a "requirement for change auditing/logging". If that is all, then all of your suggested options are a bit over the top. If all you need to do is to audit "who changed what and when", then you just need to write log records with some key details and changed details to a log table (do it with triggers on each table). No, you won't be able to restore or roll back data from it (well, you might be able to), but that isn't stated as a requirement. Handy Hint: Don't use the word "audit" in your table name, otherwise the powers-that-be will never ever ever let you delete data from it, even if it's 25 years old and totally useless. Call it "event log" or something technical, to indicate that it's *your* (ie IT) data.
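
    A minimal sketch of that kind of trigger-fed log table, assuming SQL Server and a hypothetical Customer table (all names illustrative):

    CREATE TABLE EventLog (
        EventId   INT IDENTITY(1,1) PRIMARY KEY,
        TableName SYSNAME       NOT NULL,
        KeyValue  NVARCHAR(100) NOT NULL,
        ChangedBy SYSNAME       NOT NULL DEFAULT SUSER_SNAME(),
        ChangedAt DATETIME      NOT NULL DEFAULT GETDATE(),
        Details   NVARCHAR(MAX) NULL    -- whichever changed details you care about
    );
    GO

    -- One trigger like this per audited table
    CREATE TRIGGER trg_Customer_EventLog
    ON Customer
    AFTER UPDATE, DELETE
    AS
    BEGIN
        INSERT INTO EventLog (TableName, KeyValue, Details)
        SELECT 'Customer',
               CAST(d.CustomerId AS NVARCHAR(100)),
               'old name: ' + d.Name
        FROM deleted d;   -- the pre-change rows
    END;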

    Otherwise, I reckon that Options 1 and 2 are OK, but you should choose depending on what you're doing. If you have many, many versions of an item and need to be able to restore or roll back, that will need a different design from one where you're just storing off previous versions for safety's sake or for occasional historical lookups.

    If I had to do this, and there were a lot of calls for restoring or looking at time snapshots, I'd probably use Option 2, with 'from' and 'to' dates. You then have to include the 'to' date as an additional column in the unique index of each table, and in the where-clause of each select. Set up a conventional 'end-of-time' date (something like 1 Jan 9999), and that will be the value of the 'to' date for a current record. When inserting a new version, make sure you update the current version's 'to' date to now() before inserting the new record with the 'end-of-time' value, or you'll get conflicts. You say queries become much more complex - not true, just one extra line in your where-clause. If you're worried about data volumes hurting performance, you can periodically archive to a history table and re-index, but that does make de-archiving or rollbacks harder.
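
    Sketching that 'from'/'to' pattern with assumed names, using 1 Jan 9999 as the 'end-of-time' value:

    BEGIN TRANSACTION;

    -- Close off the current version first, or the unique index that
    -- includes ValidTo will reject the new row
    UPDATE Product
    SET ValidTo = GETDATE()
    WHERE ProductId = @ProductId
      AND ValidTo = '9999-01-01';

    INSERT INTO Product (ProductId, Name, Price, ValidFrom, ValidTo)
    VALUES (@ProductId, @Name, @Price, GETDATE(), '9999-01-01');

    COMMIT TRANSACTION;

    -- The "current row" filter is just one extra predicate
    SELECT ProductId, Name, Price
    FROM Product
    WHERE ValidTo = '9999-01-01';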



  • @Nelle said:

    That being said, I have the same problem in my current project and unfortunately only have 2005, so I did some googling and found this:

    http://www.codeproject.com/KB/architecture/LisRecordVersioning.aspx 

    It's Option 2. It works nicely so far and I haven't noticed any performance issues.

     

    No, it's not Option 2, since it involves another table. And IMO it sucks completely. It's far worse than the naive version of Option 2, where every table has a few additional columns.

    Here's why:

    a) performance

    let's look at a typical bread-and-butter query:

    select * from Blog join Audit on Blog.Id = Audit.Id where IsActive=1 and UserId='moron' 

    So where is the problem, you may ask. Both tables are indexed, aren't they?

    Turns out that to find a single record, the database has to loop through all versions of that record (really, no way around that) and join them to the audit table. What would otherwise be the simplest indexed single-row access becomes far more expensive, though you might not notice it in small databases.

    b) joins

    let's look at a query that returns for all blogs the number of comments

    select  a1.PermanentRecordId, Blog.title, count(*)
      from Blog
      join Audit a1 on Blog.Id = a1.Id
      join Comment on Comment.PermanentRecordId=a1.PermanentRecordId
      join Audit a2 on Comment.Id=a2.Id
      where a1.IsActive=1
        and a2.IsActive=1
      group by a1.PermanentRecordId, Blog.title

    now compare that to the naive version of Option 2:

    select Blog.Id, Blog.title, count(*)
      from Blog
      join Comment on Blog.Id = Comment.BlogId
      where Blog.IsActive=1
         and Comment.IsActive=1
      group by Blog.Id, Blog.title

    or Option 1,3,4:

    select Blog.Id, Blog.title, count(*)
      from Blog
      join Comment on Blog.Id = Comment.BlogId
      group by Blog.Id, Blog.title

    and it's getting worse with every additional table, subquery etc.

    c) complex update statements

    we see on the website you've referenced how complicated inserting, updating or deleting a single record has become.

    Now consider the following statement:

    update order set foobar = (select count(*) from orderline where orderid=order.id and foo='bar')
      where id in (select orderid from orderinfo where bar = 'foo');

    Turns out you can't easily do that in that bastardized option 2 model. You'll have to loop through all records and process them one-by-one.

    d) impossible to implement with triggers

    Options 1, 3 and 4 can easily be implemented with triggers, so the rest of the program does not have to care about versioning at all. The naive version of Option 2 requires a bit more work, but at least you can use a trigger that creates a copy when you run update statements.

    e) surrogate keys

    Some people like them, some don't, but anyway, the given model doesn't allow the use of natural keys at all. No choice.

    f) no tool support

    Forget visual query builders, ERD reengineering etc.

    g) referential integrity

    Forget referential integrity constraints. Since all tables technically reference the audit table, referential integrity constraints become worthless.

     

    Conclusion: by implementing that model, you throw away most of the advantages an RDBMS has to offer. You might as well go back to ISAM files or similar pre-relational data storage options.



  • a) performance

    @ammoQ said:

    select * from Blog join Audit on Blog.Id = Audit.Id where IsActive=1 and UserId='moron'

    this will (should) get automatically optimised to:

    select * from Blog join Audit on Blog.Id = Audit.Id and IsActive=1 where UserId='moron'

    and the DBMS joins only the active records from the Audit table, not all versions...

    imo it still is the indexed single-row access ...

    @ammoQ said:

    Turns out that to find a single record, the database has to loop through all versions of that record

    b, c, d, e, f, g

    you are right ... I agree 100% ...

    good thing I was still in the design stage ...

    thx

     



  • @Nelle said:

    and the DBS joins only the active records from the Audit table not all versions...


    the rows in the Audit table have no clue which table they belong to, so the number of active rows in the Audit table is the number of all active rows in all tables.

    so the access will follow one of these plans:

    a) foreach row in Audit where active = 1: join row(s) in Blog where UserId='moron'

    b) foreach row in Blog where UserId ='moron': join row(s) in Audit where active = 1

    c) query both tables individually, hash join

    Apparently a) and c) would be extremely slow, so you have to hope the query optimizer chooses plan b).




  •  @ammoQ said:

    Option 3 sucks completely. Even if "DateChanged" also includes the time in seconds, fast updates on the same row within the same second make it literally impossible to recreate the state of a table at a specific point in time. At the very least you would need an update counter so you know a) which changes happened together and b) which changes happened first (when timestamps are equal).

    Option 4 might be ok if you do not really need to query the data by value, just by timestamp.

    Option 2 makes all queries more complicated. I would only consider it if I had a data access layer for all queries that access those tables, hiding the nasty details. It's also hard to enforce the integrity of the history records with this option - a faulty update statement could affect them as well. With all the other options, the history tables can be put into an "insert only" mode that doesn't allow updates or deletes.

    Option 1 is my favorite. You might want to create another schema for those history tables, so they don't flood the table list in tools etc.

     

    BTW, Oracle offers features in the current version that take care of most of that stuff...

     

     

    SQL Server has CDC as well - not too sure about Oracle, but we have licensing for SQL Server 2008 approved so no chance of using anything else anyway.  

    I've looked at the CDC feature in SQL 2008 - looks promising, but the three issues I see with it from the perspective of my project:

    First, each 'Record' in the system spans many tables and can be quite complex. While it would be fairly straightforward to get a record at a point in time with CDC, I like the idea of a 'HistoryMaster' table. All of the individual history tables would FK to the history master table, which would contain root-level data such as the date/time the record was updated, the user who performed the update, etc. Loading a historical/audit record would be as simple as 'Get Record at History Id #1234'. Doing the same thing with CDC, I would have to figure out which rows of each table belong to the specific instance of the 'record'.
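
    A rough sketch of that HistoryMaster idea, with hypothetical names and an Order example:

    -- One master row per save; every per-table history row points back to it
    CREATE TABLE HistoryMaster (
        HistoryId INT IDENTITY(1,1) PRIMARY KEY,
        ChangedAt DATETIME      NOT NULL DEFAULT GETDATE(),
        ChangedBy NVARCHAR(128) NOT NULL,   -- the application user
        IpAddress VARCHAR(45)   NULL
    );

    CREATE TABLE OrderHistory (
        HistoryId INT          NOT NULL REFERENCES HistoryMaster (HistoryId),
        OrderId   INT          NOT NULL,
        Status    NVARCHAR(50) NOT NULL
        -- ...the remaining Order columns...
    );

    -- 'Get Record at History Id #1234'
    SELECT *
    FROM OrderHistory
    WHERE HistoryId = 1234;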

    Second: using the CDC method, there is no built-in way that I can see to get the application user who performed the update. A master table row for each update/modification of the record could store the application user, date, time, IP address, and any other important information.

    Third: Subsets of the data need to be taken 'Out on the road.'  While I won't get into the business reason behind this requirement, the business need goes something along these lines: A specific customer's data is transferred into another instance of the database that is blank (other than the lookup, and essential stuff).  Database goes on the road, updates are made, additions, deletions.  Data comes back and is merged into the primary system.  While on the road, the database is housed in a SQL server 2005 Express edition instance, and hence, no SQL Server agent.  CDC relies on SQL Server Agent to work, so as I see it, this is a deal-breaker for CDC.

     



  • @kswanton said:


    Third: Subsets of the data need to be taken ‘Out on the road.’  While I won’t get into the business reason behind this requirement, the business need goes something along these lines: A specific customer’s data is transferred into another instance of the database that is blank (other than the lookup, and essential stuff).  Database goes on the road, updates are made, additions, deletions.  Data comes back and is merged into the primary system.  While on the road, the database is housed in a SQL server 2005 Express edition instance, and hence, no SQL Server agent.  CDC relies on SQL Server Agent to work, so as I see it, this is a deal-breaker for CDC.

     

    Isn't this what Microsoft Sync Framework is supposed to address?  



  •  @lpope187 said:

    @kswanton said:

    Third: Subsets of the data need to be taken ‘Out on the road.’  While I won’t get into the business reason behind this requirement, the business need goes something along these lines: A specific customer’s data is transferred into another instance of the database that is blank (other than the lookup, and essential stuff).  Database goes on the road, updates are made, additions, deletions.  Data comes back and is merged into the primary system.  While on the road, the database is housed in a SQL server 2005 Express edition instance, and hence, no SQL Server agent.  CDC relies on SQL Server Agent to work, so as I see it, this is a deal-breaker for CDC.

     

    Isn't this what Microsoft Sync Framework is supposed to address?  

     

    Yes, it is a candidate to be used for this function. 



  • I've normally gone for the history table route, implementing triggers on update, delete and insert on the tables I want to track. These write to a second table which captures date/time, username, the action and, in cases where we actually care, the old data.

     

    I normally try to keep this to an absolute minimum, though. Using my current contract as an example, some parts of the business process are naturally self-'journaled', for example tracking an order through: as the order goes through each stage, the previous stages are effectively locked down and become immutable.

     

    You may ask how an order can be amended. As long as the order hasn't been queued for picking, it can be amended by effectively cancelling the order and creating a new one; the old order is then linked to the new order for tracking. As the order system is pretty much in 5NF, it is broken down into lots of small tables, and once a row is inserted you can't update or delete it. From the customer clicking 'order', the process is pretty much automated, from the payment processors to the distribution centre, which has auto pickers and auto packers that land the package on a conveyor belt before it goes into the awaiting lorry; the very idea of being able to amend already-recorded data is pretty much a WTF.

     

    Where we do journal, we use option 1 (interactions with customers and customer services), occasionally option 2 for things like product price changes and other state changes (we don't care about the performance impact, as this data is sucked out and shoved into a caching system), and sometimes we allow data to be transient up to a point (say, creating a promotion) with no tracking; once published and approved it becomes immutable, but it can be 'cloned' and the old one cancelled.

     

    Oh, and one I forgot, similar to price changes: say a customer's name changes. The account data, which holds the account number and other key account information, is divorced from the personal details of the account holder(s); those tables (account holder names, contact details and addresses are all separated out) are naturally built for history, to track telephone number and name changes.

     

    So, IMHO there is no such thing as a one-size-fits-all approach to tracking: sometimes option 1, sometimes option 2, and sometimes just don't allow changes! As long as your tracking mirrors the actual business process, you can't go wrong.

     

    Out of interest, has anyone read Temporal Data and the Relational Model? It was quite an interesting read, though of course it can't be done with current SQL implementations.

