Handling when invalid data sneaks past into the db

Benjamin Hall

Looking for opinions here.

Assume you have some kind of business rules like "All Foo.count values, if they're set, must be between 2 and 43, inclusive." But someone messed up the validation on the server so that in one particular workflow, Foo.count could be anything. And so some invalid (but still type-consistent) values like 123 or -2 have gotten into the database.

There's validation when the application tries to reconstruct the Foo entity, so it fails on those bad values. Which is good, it's doing what it says it should, preserving the business rules. But bad because errors.

What's better?

Do some manual database munging to unset the invalid Foo.count values. Downside--Risks data loss, and manual database munging is bad practice.
Insert some special-case validation logic to ignore invalid values when it's coming from the database. Downside--lets bad data propagate. Special case logic is brittle.
Throw self off of nearest tall building. Downside--the nearest tall building isn't tall enough to guarantee self-removal. And ending up in a hospital over such a thing is just messy. Oh, and suicide is messy for others to clean up, so....

Rhywden

@Benjamin-Hall My approach is a cron job which tries to replace the invalid entries with proper ones.

The other approach is highlighting the invalid entries for review, coupled with semi-automatic fixes if possible.

Both crop up due to 3rd party software outside my control.

loopback0

@Rhywden said in Handling when invalid data sneaks past into the db:

The other approach is highlighting the invalid entries for review, coupled with semi-automatic fixes if possible.

This - pull a list of all the shit data and have the relevant business owner provide the correct values. Assuming it's not an impractically large quantity of the stuff, which there shouldn't be if actual business rules prevent it.

Carnage

If possible, I'd get the correct values in there, the method I'd do that works very depending on circumstances, but everything between a nightly once run job that fetches correct data for every faulty one though fetch-and correct-on-read to collating a list of the things that are wrong and having someone do manual data entry. Not an exclusive list.

If not possible I'd remove the bad values. Incorrect data is worse than missing data.

I wouldn't like bad data to rot away though. And I don't much like having code that tries to work around bad data. That way lies insanity.

Unperverted Vixen

I'd go the manual database-munging route, after the root bug was fixed.

The cron job others proposed just seems like the manual munge with extra steps.

ObjectMike

Pre-write trigger in the db that rejects the insert/update if that one value (or any others too that have rules) are invalid. Force the issue to be really obvious really quickly.

Benjamin Hall

Ran a quick scan of the database for this one particular rule. Looks like manual munging isn't going to be a sane option--there are ~350 values that are bad. And these are login ids (which for various other reasons are also emails we have to send emails to occasionally). So just "fixing" them will be a big problem when people can't log in anymore.

aitap

@Benjamin-Hall said in Handling when invalid data sneaks past into the db:

Assume you have some kind of business rules like "All Foo.count values, if they're set, must be between 2 and 43, inclusive." But someone messed up the validation on the server so that in one particular workflow, Foo.count could be anything. And so some invalid (but still type-consistent) values like 123 or -2 have gotten into the database.

How do you get the correct value for Foo.count once you realise it's wrong? In this example, it could be an expensive select count(...) and a materialised view, but I expect your actual use case to be nastier.

acrow

@Benjamin-Hall said in Handling when invalid data sneaks past into the db:

Ran a quick scan of the database for this one particular rule. Looks like manual munging isn't going to be a sane option--there are ~350 values that are bad. And these are login ids (which for various other reasons are also emails we have to send emails to occasionally). So just "fixing" them will be a big problem when people can't log in anymore.

So, you've got login IDs. Which may sometimes be e-mails. And all of these are presumably stored as varchar.

First of all, please tell me that you didn't design this database, so I can stop looking for your address.

Second, if they are login IDs, then how exactly do they go against business rules? E-mails as usernames where that's not allowed, or the other way around?
If so, then I'd start by separating the e-mail and other forms of username, and generating userID as an index separate from either. Then ask those users missing one or the other to provide it upon their next login.

dkf

@Benjamin-Hall Can you do automated munging to work out what the correct values should be? (Doesn't matter if that's expensive. The aim is to write a program that fixes things instead of having to do it all manually. Doesn't matter much if there's a spots which need manual fixing so long as it is just one or two, not hundreds.) Can you add check clauses or triggers to the DB to stop more of this shit from getting put in there?

I remember dealing with a DB that was broken in its encoding of values in string columns. The encoding in use would vary from one value to another, both within the same column and within the same row. I don't want to think about whatever ghastly input process lead to that, but the result was only found by us when a downstream tool started to choke on the XML we were spitting out from queries. There was much nausea that day.

Arantor

I have had the misfortune to have 700 bad values in a database and manual fixing after solving the underlying problem was the only option - it would have taken longer to write something to fix it than it took me to fix manually.

Kamil Podlesak

@Benjamin-Hall said in Handling when invalid data sneaks past into the db:

Ran a quick scan of the database for this one particular rule. Looks like manual munging isn't going to be a sane option--there are ~350 values that are bad.

That's actually not that many. Easily fixed in one afternoon (I mean, you've already spent more time asking here). Then just slap a constraint check on the table column and it's done!

And these are login ids (which for various other reasons are also emails we have to send emails to occasionally).

Ok, that's bad and will definitely come and bite you. You should add a second column for the email address ASAP, because people

will come and request email change (because the old one is not usable anymore).
will come and request personal data removal (which include may email addresses, especially those that contain real name). You might not be affected by GDPR, but keep in mind that laws are often copied so you might be slapped with regulatory practice anyway.
will come and try to use one email for several accounts. Fortunately, this is the one requirement that is easily shot down by a big "over my dead body" from the product manager. But I used to work on an application where this was quite a common thing (because some important customers had two email addresses in the whole company ).

So just "fixing" them will be a big problem when people can't log in anymore.

If you separate login from email, you can easily validate just the email and leave the login "free-form"

Zenith

Software, like retail, is an exercise in psychology.

You have to make it painful for the decision maker. Flagging bad data for review works in a smart organization that doesn't let this stuff get out of hand. In a typical circus, though, it's just another pile of stupid dumped on somebody forever. So consider your options carefully.

loopback0

@Benjamin-Hall said in Handling when invalid data sneaks past into the db:

So just "fixing" them will be a big problem when people can't log in anymore.

That's not necessarily your problem though.

If there are rules defining the format of a login ID that haven't been followed then the responsible person/people should be providing corrected data and communicating that to the people who are impacted.

boomzilla

@Benjamin-Hall said in Handling when invalid data sneaks past into the db:

Do some manual database munging to unset the invalid Foo.count values. Downside--Risks data loss, and manual database munging is bad practice.

We do this when we find stuff like this. Though it's done via script, and propagated and tested through our environments (local test, user acceptance testing, production).

And then, of course, fix the validation and if the issue is simple enough, add a database constraint to make sure it doesn't slip by.

boomzilla

@Benjamin-Hall said in Handling when invalid data sneaks past into the db:

Ran a quick scan of the database for this one particular rule. Looks like manual munging isn't going to be a sane option--there are ~350 values that are bad. And these are login ids (which for various other reasons are also emails we have to send emails to occasionally). So just "fixing" them will be a big problem when people can't log in anymore.

The ultimate fix might be a multi-step thing, assuming this requires some user input for each case. A temporary login id, then pick a permanent one next time you login, etc. Difficult to say without knowing more details.