The data doesn't have to be accurate



  • In our application, there are two tables. For simplicity, I'll call them Data and Description. For each row in Data, there can be one or more rows in Description. The data and descriptions can change over time, both programmatically and via user entry, so all rows are timestamped.

    I was just tasked with creating a new screen to enter descriptions. Basically, each update operation would need to:

    • verify that the manually entered PK of the data it would reference exists
    • verify that the data is "active"
    • update the description on the data
    • update the link on Data to reference the newly saved description
    • keep the old value of the data on the returned record
    • set a success/error-message status on the row to be returned

    Logically, this makes sense. However, they want to be able to enter thousands of rows (presumably via paste from Excel) at a time. This means several db queries/updates per record times thousands of records = a significant amount of time from start to finish.

    During this time, someone else may (via this or another mechanism) update the data. Wait a minute. What happens if the user submits an update for a large number of (Description) rows, including row x, someone else updates Data for row x in the meantime, and then my process finally gets around to doing the update for (Description) row x? If my process does not re-query the current value of Data for row x, I could be returning stale values for row x: values that existed before the other user did the update.
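    To make the race concrete, here is that interleaving sketched as SQL (the table and column names are invented for illustration):

        -- Session A (my bulk process), early in the batch:
        DECLARE @old_value NVARCHAR(100), @new_description_id INT;
        SELECT @old_value = value FROM Data WHERE pk = 42;   -- reads the pre-update value

        -- Session B (another user), while the batch is still grinding along:
        UPDATE Data SET value = 'something new' WHERE pk = 42;   -- commits immediately

        -- Session A, minutes later, finally reaches row 42:
        UPDATE Data SET description_id = @new_description_id WHERE pk = 42;
        -- ...and the returned record carries @old_value, which is now stale.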

    My team lead ponders this for a moment and decides: it's ok; the data doesn't have to be accurate.



  • That's one solution, but a better one would most likely be to use a transaction and do a "select ... for update".
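    Schematically, with invented names (note that SQL Server has no SELECT ... FOR UPDATE; there you'd use a locking hint such as WITH (UPDLOCK) instead):

        BEGIN TRANSACTION;

        -- Lock the Data row so nobody else can change it until we commit
        SELECT value
          FROM Data
         WHERE pk = :pk
           AND active = 1
           FOR UPDATE;

        INSERT INTO Description (data_pk, description, updated_at)
        VALUES (:pk, :new_description, CURRENT_TIMESTAMP);

        UPDATE Data
           SET description_id = :new_description_id
         WHERE pk = :pk;

        COMMIT;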



  • @snoofle said:

    it's ok; the data doesn't have to be accurate.
     

    OK, so save time by querying random shit from Descriptions and inserting that.

    You know, TOP 100 * ORDER BY newid()



  •  He's basically telling you: "Stupid business req's. I don't care about them... Where's my coffee?"



  • "Use the CRS database to size the market"

    "That data is wrong"

    "Then use the SIBS database"

    "That data is also wrong"

    "Can you average them?"



  •  Averaging smooths out the errors.



  • I know this is the sidebar, and the WTFs are the point rather than the solutions, but anyway.

    As a cheap solution you could probably implement a locking mechanism on those updates. The query that gets its lock in first wins ;)
    Then if a user tries to update the description of an item, you can report that it is currently being changed by someone else and ask whether they want to retrieve the updated description, or perhaps a diff of their changes versus the changes made by the other person.
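    In SQL Server, for example, that can be as simple as an application lock; a minimal sketch, with the resource naming invented:

        BEGIN TRANSACTION;

        DECLARE @got_lock INT;
        EXEC @got_lock = sp_getapplock
             @Resource    = 'Data_row_42',   -- one lock name per Data PK
             @LockMode    = 'Exclusive',
             @LockTimeout = 0;               -- fail immediately instead of queueing

        IF @got_lock < 0
        BEGIN
            ROLLBACK;
            -- Report "currently being changed by someone else" and offer the
            -- user the updated description, or a diff against their changes.
            RETURN;
        END

        -- ...perform the Description insert and the Data update here...

        COMMIT;  -- the lock is owned by the transaction, so this releases it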

     

    Also I hate people who disrespect data integrity with a passion. At some point in time it will always come back to bite you in the ass.



  • Create a stored procedure server-side to perform steps 1-5, returning the success/error message (passing the PK and new data for a single record).

    Have the error message return a different value based on the error condition (1 = PK does not exist, 2 = Data is not active, etc.)

    In your code, have the routine loop through each update record, making one change at a time through the stored procedure, and build an in-memory result table showing the success/error state for each row. I would also attempt to return the resulting values for each record to that table.

    Since the changes are made incrementally, a single data collision won't dump the whole process, and you would also be returning enough data for the user to validate whether each change occurred as requested or hit a problem.
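    Roughly like this in T-SQL (just a sketch: the table, column, and procedure names are invented, and Description is assumed to have an identity PK):

        CREATE PROCEDURE UpdateDescription
            @pk        INT,
            @new_desc  NVARCHAR(MAX),
            @old_value NVARCHAR(MAX) OUTPUT,
            @status    INT           OUTPUT  -- 0 = ok, 1 = PK missing, 2 = not active
        AS
        BEGIN
            SET NOCOUNT ON;
            BEGIN TRANSACTION;

            -- Step 1: verify the manually entered PK exists (and lock the row)
            SELECT @old_value = value
              FROM Data WITH (UPDLOCK)
             WHERE pk = @pk;

            IF @@ROWCOUNT = 0
            BEGIN
                SET @status = 1;
                ROLLBACK;
                RETURN;
            END

            -- Step 2: verify the data is "active"
            IF NOT EXISTS (SELECT 1 FROM Data WHERE pk = @pk AND active = 1)
            BEGIN
                SET @status = 2;
                ROLLBACK;
                RETURN;
            END

            -- Steps 3 and 4: save the description and re-link Data to it
            INSERT INTO Description (data_pk, description, updated_at)
            VALUES (@pk, @new_desc, SYSDATETIME());

            UPDATE Data
               SET description_id = SCOPE_IDENTITY()
             WHERE pk = @pk;

            -- Steps 5 and 6: the old value and status go back with the row
            SET @status = 0;
            COMMIT;
        END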


  • @stratos said:

    As a cheap solution you could probably implement a locking mechanism on those updates. The query that gets its lock in first wins ;)
    Then if a user tries to update the description of an item, you can report that it is currently being changed by someone else and ask whether they want to retrieve the updated description, or perhaps a diff of their changes versus the changes made by the other person.
    To expand a bit on this: you basically have two options, either optimistic locking or pessimistic locking:

    With pessimistic locking, you reserve the record for yourself when entering your transaction. The first one to lock the record wins. All the others have to wait for you to finish your transaction.

    With optimistic locking, you don't actually lock, but you save the state of the record in memory when entering your transaction. You only commit your changes if the original row hasn't changed. Here, the first one to commit his change wins.
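    For example, optimistic locking against a rowversion column might look like this (names invented):

        DECLARE @pk INT = 42;                  -- hypothetical key
        DECLARE @original_version BINARY(8);   -- version read at load time
        DECLARE @new_description_id INT = 7;   -- hypothetical new Description PK

        -- When loading the record, remember its version:
        SELECT @original_version = row_version
          FROM Data
         WHERE pk = @pk;

        -- On save, commit only if nobody changed the row in the meantime:
        UPDATE Data
           SET description_id = @new_description_id
         WHERE pk = @pk
           AND row_version = @original_version;

        IF @@ROWCOUNT = 0
            -- Someone else committed first: report a conflict instead of
            -- silently overwriting their change.
            RAISERROR('Row was modified by another user.', 16, 1);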

    I'm afraid the only WTF I see is that Snoofle didn't explain this to his manager so he could make an informed decision.



  • @snoofle said:

    ...the manually entered PK...

    TRWTF has just entered the building



  • Accurate data is so overrated anyway. I mean, who really needs accurate data? The important thing is that the credit card information is valid so we can charge somebody. If the client doesn't check whether he is charged the right amount, that's his problem.



  • @bjolling said:

    I'm afraid the only WTF I see is that Snoofle didn't explain this to his manager so he could make an informed decision.
    Well, what about the manager, whose response to reports of possible consistency problems in an audited database is "It's okay, the data doesn't have to be accurate"?



  • @SuperJames74 said:

    @snoofle said:
    ...the manually entered PK...

    TRWTF has just entered the building

    I was wondering when someone was going to point out this bit.



  • @bjolling said:

    ...I'm afraid the only WTF I see is that Snoofle didn't explain this to his manager so he could make an informed decision.

    Actually, I offered several solutions, but was told not to bother because "the data doesn't have to be accurate".

    This guy is technical; he just doesn't care about doing it right, and he controls who does what on this team. Sad, really.



  • @snoofle said:

    Actually, I offered several solutions, but was told not to bother because "the data doesn't have to be accurate".

    As someone whose primary responsibility is the BI data for my company, and who is in the middle of doing detailed regression testing because we're moving our source database (not the BI database, but the one that we take data from to put in BI) to a new database server, you made me cry a little on the inside.



  • @snoofle said:

    My team lead ponders this for a moment and decides: it's ok; the data doesn't have to be accurate.

    Wow! If you can get that in writing then you have an excuse for poor job performance for the rest of your employment.

    "What's that, boss, you say I didn't restore the right data from backup? That's OK, it doesn't have to be accurate."

    "So what if all ten million rows have the same last name after I did an update without a where clause? The data doesn't have to be accurate."

     

    Or possibly for blackmail?

    "Boss, you know that the Sarbanes Oxley/HIPAA/Whatever compliance requires us to be 100% accurate? Well, this piece of paper from you says otherwise, and if I don't get that pay rise then this piece of paper finds its way to the auditors' office."

     

    If nothing else, it is the CYA doc for when the auditors come after you.

     

    Seriously, though, how do people like that team lead keep their jobs? The word I would use to describe the "the data doesn't have to be accurate" attitude is "corrupt".



  • @havokk said:

    "So what if all ten million rows have the same last name after I did an update without a where clause? The data doesn't have to be accurate."
     

    That's a whole lotta Smith.


  • @snoofle said:

    @bjolling said:

    ...I'm afraid the only WTF I see is that Snoofle didn't explain this to his manager so he could make an informed decision.

    Actually, I offered several solutions, but was told not to bother because "the data doesn't have to be accurate".

    This guy is technical; he just doesn't care about doing it right, and he controls who does what on this team. Sad, really.

    You're right, that is sad.

    I hadn't understood from the OP that you actually proposed solutions. It looked like you didn't know how to solve it and turned to the manager for guidance. Instead of explaining the problems, I would have just asked the manager to pick a locking mechanism, based on the pros and cons of each.

    It just takes a few lines of code to implement a pessimistic locking system (in TRANSACT-SQL: "SET TRANSACTION ISOLATION LEVEL SERIALIZABLE" or "...REPEATABLE READ"), so if the manager proposed not to have any locking, you could have said that pessimistic locking was already in place, was already working fine, and removing it would introduce bugs. I don't know any manager who would ask you to remove it.
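    That is, something like this sketch (names invented):

        SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
        BEGIN TRANSACTION;

        -- Rows (and ranges) read here stay locked against concurrent
        -- writers until the transaction ends.
        SELECT value FROM Data WHERE pk = 42;

        -- ...validate, insert the Description, update the link on Data...

        COMMIT;  -- locks are released and any blocked writers proceed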



  •  It sounds to me like the team leader is really thinking "in our company, this will never happen. Updates only happen through this process, and only one person at a time will be doing this process, so it's not cost-effective to code (and test, and document, and support...) a locking scheme."

    Tradeoffs like this happen all the time.  Sometimes you can afford (or are required to provide) complete up-and-down data integrity features, to make the code solid, etc.  Other times, it is sufficient to have it work for the majority of cases. 

