Fear of deleting



  • In a project some smart ass adopted the principle that records should not be physically deleted but leave an audit trail instead. Each table has a "deleted" field which is time stamp. When a record has a NULL value, it is for real but otherwise it is considered deleted. Unfortunately, they didn't bother to implement cascades. The database is full of records that refer to others that are virtually deleted and by inference they must be considered deleted as well, that means not show up in queries.

    Every query on that database has to take this into account. So instead of simply

       SELECT stuff FROM orders WHERE something

    It goes like this:

       SELECT stuff FROM orders o

       JOIN customers c on ...

       JOIN products p on ...

       JOIN warehouses w on ...

       JOIN shipments s on ..

       JOIN distributors d on ...

       JOIN etcetera e on ...

       WHERE something

       AND o.deleted is null and c.deleted is null and p.deleted is null and w.deleted is null and s.deleted is null and d.deleted is null and e.deleted is null

    There are thousands of such queries working their way up the reference tree. Obviously, this is error prone and many of them miss an important reference, leading to completely unreliable reports.

    The suggestion to at least cascade the deletes throughout the database every once in a while wasn't appreciated because they're afraid they might loose something important.



  • Simple: Create views (Matrialized or "normal") that reflect the undeleted rows.  Use those in all your reporting queries.

     



  • "The suggestion to at least cascade the deletes throughout the database every once in a while wasn't appreciated because they're afraid they might loose something important."

    Surely one of the major advantages of the virtual delete is that if you do accidentally delete something important you can get it back again, therefore even with cascading virtual deletes you wont lose anything. We're actully planning on doing something vaguely similar to this, however all database access goes through a common library that is 'aware' of the database structure and will handle these cascading soft-deletes for you (as well as automatically making sure you don't load something that has been deleted unless you really want to) therefore you wont need to do the joiney retardedness you've got here.



  • @Auction_God said:

    Simple: Create views (Matrialized or "normal") that reflect the undeleted rows.  Use those in all your reporting queries.

     

    Good suggestion. I've used that to solve a similar problem in a database I maintain. (Some jerk will inevitably gripe that performance when querying from Views sucks, but... well, performance sucked before anyway, and this way the code is sane.)



  • @Auction_God said:

    Simple: Create views (Matrialized or "normal") that reflect the undeleted rows.  Use those in all your reporting queries.
     

    Your suggestion has indeed been implemented. For every table, they have vw_table that does this. The view for orders is like  the whole SELECT statement above but without the WHERE something clause!  This view traverses almost the entire database, which has about 50 million records total.

    A basic query like SELECT stuff FROM vw_orders JOIN vw_customers ON ... WHERE customer_zip = 'AB 12345' only returns a handful of rows but has horrendous performance (minutes). No surprise these neat vw_table are largely left unused.The alternative that works directly on the tables completes in milliseconds.

    Fortunately I don't have to work with this database except extract some data, do some calculations and write stuff back in a side application.



  • @JvdL said:

    A basic query like SELECT stuff FROM vw_orders JOIN vw_customers ON ... WHERE customer_zip = 'AB 12345' only returns a handful of rows but has horrendous performance (minutes). No surprise these neat vw_table are largely left unused.The alternative that works directly on the tables completes in milliseconds.
     

    you are doing something wrong  



  • @Mithious said:

    Surely one of the major advantages of the virtual delete is that if you do accidentally delete something important you can get it back again

    Not in this scenario:

    1: Create order, add order line 1 and order line2.

    2: Intentionally delete order line 2.

    3: Accidentally delete order and its cascaded order line 1.

    4: Undo the delete of the order and cascade to order line 1 and order line 2

    5: Another customer goes to the competition.

    Because of such complexities, I believe the owners of this database have never used their audit trail to undelete something.


    @Mithious said:

    All database access goes through a common library that is 'aware' of the database structure

    Famous last words. Twenty years from now the database is alive and kicking but the library is dead. Do the cascades in triggers.



  • @Auction_God said:

    Simple: Create views (Matrialized or "normal") that reflect the undeleted rows.  Use those in all your reporting queries.
    I would recommend materialized views.  Views can't be indexed, but materialized views can.  Normal views would not even be an option for me in this circumstance.



  • @JvdL said:

    @Mithious said:

    All database access goes through a common library that is 'aware' of the database structure

    Famous last words. Twenty years from now the database is alive and kicking but the library is dead. Do the cascades in triggers.

     

    I would prefer to make the database raise an error if you try to delete something which references something which isn't deleted.



  • @JvdL said:

    4: Undo the delete of the order and cascade to order line 1 and order line 2

    If you're saving the timestamp of when things are deleted then you only want to cascade to children with a timestamp equal to or after the parent record. Should solve that problem :)

    @JvdL said:

    Famous last words. Twenty years from now the database is alive and kicking but the library is dead. Do the cascades in triggers.

    The library is our code and an integral part of the product so it's not going anywhere. It also does a lot of stuff you cant, or wouldn't want to, do in triggers. Can't really go into the details on here I'm afraid.

     



  • If you use Oracle then you get out-of-the-box functionality to go back in time so you can just go ahead and delete knowing that you never lose data and can recover to any consistent state that the database has ever been in.  Cute eh?

     



  • @JvdL said:

    In a project some smart ass adopted the principle that records should not be physically deleted but leave an audit trail instead. Each table has a "deleted" field which is time stamp.

    The suggestion to at least cascade the deletes throughout the database every once in a while wasn't appreciated because they're afraid they might loose something important.

     

     

    *sigh*  the proper way to do an audit trail is to copy the data to an archive table, flattened.

     and when it comes to being scared that you might lose something....  um, hello?  BACKUPS.



  • @LoztInSpace said:

    If you use Oracle then you get out-of-the-box functionality to go back in time so you can just go ahead and delete knowing that you never lose data and can recover to any consistent state that the database has ever been in.  Cute eh?

     

     

    I think SQL server has the same thing.  However, having "virtual" deletes like this allows the system to have "restore" functionality without ever having to bother with the the database.  It is nice to be able to resurrect deleted accounts from within the application for instance.



  •  @tster said:

    I think SQL server has the same thing.  However, having "virtual" deletes like this allows the system to have "restore" functionality without ever having to bother with the the database.  It is nice to be able to resurrect deleted accounts from within the application for instance.

    On the other hand, it makes it impossible for me to delete my username, then re-create a new account using the same username.

    There are always trade-offs.


  • Discourse touched me in a no-no place

    @blakeyrat said:

     @tster said:

    I think SQL server has the same thing.  However, having "virtual" deletes like this allows the system to have "restore" functionality without ever having to bother with the the database.  It is nice to be able to resurrect deleted accounts from within the application for instance.

    On the other hand, it makes it impossible for me to delete my username, then re-create a new account using the same username.

    You say that like it's a bad thing....



  • @blakeyrat said:

    On the other hand, it makes it impossible for me to delete my username, then re-create a new account using the same username.
    thank god



  • @blakeyrat said:

    @Auction_God said:

    Simple: Create views (Matrialized or "normal") that reflect the undeleted rows.  Use those in all your reporting queries.

     

    Good suggestion. I've used that to solve a similar problem in a database I maintain. (Some jerk will inevitably gripe that performance when querying from Views sucks, but... well, performance sucked before anyway, and this way the code is sane.)

     

    "Views = Performance Problems" yea, just like queries are performance problems if not done right, so on the off chance of doing it wrong NO QUERIES! In fact NO DB ACCESS. DB ACCESS = Security concerns... In fact Computer Access = Security Concerns. All programming should be done on a piece of paper which is burned immediately.


  • @astonerbum said:

    All programming should be done on a piece of paper which is burned immediately.
     

    I spy a fun 1-up game!

    All programming should be thought out in full, then immediately forgotten by the application of copious and sudden amounts of pornography.



  • @dhromed said:

    sudden amounts of pornography
     

    The Daily Wank The Fap...



  • @dhromed said:

    @astonerbum said:

    All programming should be done on a piece of paper which is burned immediately.
     

    I spy a fun 1-up game!

    All programming should be thought out in full, then immediately forgotten by the application of copious and sudden amounts of pornography.

    All programming should be done by using cheap slave labor--each person can represent on bit--then immediately deleted with a machine gun.



  • I don't know about you guys.... but if I had to do this...


    Would making a second table for the historical data with all the same columns as the first and the additional date column make more sense?



    One table gets live data, fast indexes without junk in the way, no convoluted queries and to copy it to the second table you can do it in an INSERT (...) VALUES (date, (SELECT ...)) statement.

    'Historical' table can probably even live with reduced indexing if you don't report much (or at all)


    I'd take some extra tables over query soup any day...



  • @Giskard said:

    I don't know about you guys.... but if I had to do this...
    Would making a second table for the historical data with all the same columns as the first and the additional date column make more sense?

    One table gets live data, fast indexes without junk in the way, no convoluted queries and to copy it to the second table you can do it in an INSERT (...) VALUES (date, (SELECT ...)) statement.
    'Historical' table can probably even live with reduced indexing if you don't report much (or at all)

    I'd take some extra tables over query soup any day...

     I was thinking this also, but didn't want to suggest this since I don't do much db programming.  It does give your live data much less rows to search through instead of all the junk also that might never get used. I'd rather relocate the data to a location that can be used for auditing purposes, but doesn't represent live data like you suggest.



  • @Giskard said:

    I don't know about you guys.... but if I had to do this...
    Would making a second table for the historical data with all the same columns as the first and the additional date column make more sense?

    One table gets live data, fast indexes without junk in the way, no convoluted queries and to copy it to the second table you can do it in an INSERT (...) VALUES (date, (SELECT ...)) statement.
    'Historical' table can probably even live with reduced indexing if you don't report much (or at all)
     

    Several problems with this:

    • No way to automatically enforce data integrity; requires triggers or application logic that could be broken, buggy, or accidentally removed or disabled;
    • All of the above needs to be maintained if the table schema is ever changed;
    • Completely hoses several common methods of replication and ETL;
    • Makes queries slower and more complicated when you actually want to see the disabled/deleted rows (and there must be some instances when you do, otherwise why keep them?);
    • Requires that you maintain a separate "deleted" table for every child relationship (horrible).

    If there's some reason for you to have a "graveyard" table, for example offline synchronization, that's fine, but don't raise the dead.  A tombstone is not an undo state.  Seems a lot of programmers and database newbies make this mistake.

    The materialized and/or indexed view approach is by far the most reliable and efficient.



  • @Aaron said:

    @Giskard said:

    I don't know about you guys.... but if I had to do this...


    Would making a second table for the historical data with all the same columns as the first and the additional date column make more sense?



    One table gets live data, fast indexes without junk in the way, no convoluted queries and to copy it to the second table you can do it in an INSERT (...) VALUES (date, (SELECT ...)) statement.

    'Historical' table can probably even live with reduced indexing if you don't report much (or at all)
     

    Several problems with this:

    • No way to automatically enforce data integrity; requires triggers or application logic that could be broken, buggy, or accidentally removed or disabled;
    • All of the above needs to be maintained if the table schema is ever changed;
    • Completely hoses several common methods of replication and ETL;
    • Makes queries slower and more complicated when you actually want to see the disabled/deleted rows (and there must be some instances when you do, otherwise why keep them?);
    • Requires that you maintain a separate "deleted" table for every child relationship (horrible).

    If there's some reason for you to have a "graveyard" table, for example offline synchronization, that's fine, but don't raise the dead.  A tombstone is not an undo state.  Seems a lot of programmers and database newbies make this mistake.

    The materialized and/or indexed view approach is by far the most reliable and efficient.





    With points 1 and 2, yeah it does require some care to maintain, but any SQL bug that slips in will most likely mean some missing audit data instead of a bug in live data. Still crappy, but usually not as deadly

    Point 3... yeah.... no fight there.

    Point 4: I've never yet had to write anything that mixes live data with logged data, so reports just run on the second tables. No convoluted mix of tables required. Might not always hold true, but has been holding in the business case I've used it for

    Point 5: Ah, I will admit that would be disgustingly nasty, and the one case where we needed that kind of info I actually used the date column strategy. Later to find out we didn't even need that part of the data retained, so it just deletes the rows.



    I guess I didn't realize I kind of used both. Also, for the use we have, replication of the 'deleted' or 'logged' data actually isn't that important and can be excluded from any replication sets.
    My main concern with the information I've dealt with was always making sure live data was entirely separate from historical data, but I'm probably just thinking too defensively.



  • @blakeyrat said:

     @tster said:

    I think SQL server has the same thing.  However, having "virtual" deletes like this allows the system to have "restore" functionality without ever having to bother with the the database.  It is nice to be able to resurrect deleted accounts from within the application for instance.

    On the other hand, it makes it impossible for me to delete my username, then re-create a new account using the same username.

    There are always trade-offs.

    Only if you play along.  There's still an SQL interface you can abuse, after all.  Or was I not supposed to say that out loud?  (That's OK.  I only typed it.)


Log in to reply