SQL Update Question



  • I have a table with data that is not normalized, in the sense that because of abnormal conditions, some data has been entered in one format and some data has been entered in another. It is also not normalized in the more traditional SQL sense, but it's only a minor WTF.

    I need to normalize the data, in the first sense, by bringing it all to a common format. I plan to run an update of the form:

    update examtreatment
         set note = TRIM( IFNULL(zzInterv, "") + " " + IFNULL(zzRespInt, "") + " ") + ...
    

    In other words, I'm taking some fields called zzInterv, zzRespInt, and a few others, and concatenating them with appropriate whitespace. Is there a less insane way to do this than with n/2 trims? ANSI SQL please.

    It needs to work correctly the first time. (Yes, there are WTFs there... if this organization wasn't so nice, I might share them.)

    Edit: is there maybe an intercalate command?



  • update examtreatment set note =
    (case when zzInterv is null then '' else zzInterv + ' '  end) +
    (case when zzRespInt is null then '' else zzRespInt + ' ' end) 
    ... 
    

    Leaving it up to you is it's more or less insane. Unless you can just strip double spaces from the result with a function...



  • Why are you concerned about the number of trims?
    If it's for performance raisins then you can do a dry-run with;

    begin transation
    update (lots of stuff)
    rollback transaction

    which will give you a (very) rough idea of time to complete.



  • Personally I'd make new columns to hold the normalized forms, check them for correctness, then rename the new columns to the old columns, check your application still works, then if all is correct drop the old columns.

    As far as performance goes, I'd consider that a drop in the bucket compared to ensuring it works.



  • What's your concern here? Performance? Downtime?

    Otherwise, the code seem correct.



  • A combination of correctness and performance (so really, downtime). I'll have half an hour to run it. I can run it more than once, but not many times. I'm not concerned about the number of trims except that I don't want to do the algebra I'd need to prove correctness. Maybe it will look easier in the morning, or maybe there is a saner approach.



  • I like that. I wasn't sure case syntax was an option.



  • Sensible, but not an option for me.



  • I hope you have multiple testing environments then, because you're fighting with dragons here. Godspeed.



  • For performance, you'll want to set up some testing dataset and measure a few versions of the script. Also, whatever db engine you're using, it might have some proprietary stuff better than posix, i would't limit myself that way.

    Nothing better comes to mind, sorry. I'm sure this was all really obvious to you beforehand.


  • Notification Spam Recipient

    @Captain said:

    I can run it more than once, but not many times

    Lol contradiction. 😛

    @blakeyrat said:

    multiple testing environments

    Sounds like we're promoting straight to prod, first try or bust!



  • Don't forget to put a trim() around it for the trailing space.


  • Discourse touched me in a no-no place

    @Captain said:

    concatenating them with appropriate whitespace.

    @Captain said:

    ANSI SQL please.

    The ANSI SQL string concatenation operator is ||, not +.

    @blakeyrat said:

    Personally I'd make new columns to hold the normalized forms

    Or a temporary table if you're more worried about stuff. It might be possible to do the dry runs in that case before taking things down.

    @blakeyrat said:

    As far as performance goes, I'd consider that a drop in the bucket compared to ensuring it works.

    Ten thousand times this. If wrong answers are permitted, it's possible to make code extremely fast, but it usually makes people very upset.



  • It might be possible to do the dry runs in that case before taking things down.

    Dry runs aren't really possible. I don't have access to a test/development database, and the application running over the production database locks a lot of tables (including the one that needs updating).

    Ten thousand times this. If wrong answers are permitted, it's possible to make code extremely fast, but it usually makes people very upset.

    I just need the code to run in a reasonable amount of time -- say, less than 30 minutes for 1500 records. What I don't have is time to debug. The database will be down while the clinicians are in a meeting, and I have to be done, one way or another, by the time the meeting is over. (If it wasn't such a "trivial" update, I'd be talking to the boss about backups and testing, etc. But I can get this done right, the first time, with formal methods.)

    I ended up settling on using the COALESCE function. It returns the first non-null value in the list, so the update can look like:

    update examtreatment
       set notes = TRIM ( COALESCE(         zzInterv,   '')
                       || COALESCE( " " ||  zzRespInt,  '')
                       || COALESCE( " " ||  zzFutPlan,  '')
                       || COALESCE( " " ||  zzClProg,   '')
                       || COALESCE( " " ||  zzRiskSpec, '')
                       || COALESCE( " " ||  zzSessCont, '')
                       ) 
    

    Almost Haskell-like. I can verify correctness by induction, and by checking for typos.


  • Discourse touched me in a no-no place

    @Captain said:

    I can verify correctness by induction, and by checking for typos.

    Simple enough to be obviously right. Good approach.



  • @Captain said:

    Dry runs aren't really possible.

    Mmm.

    @Captain said:

    I don't have access to a test/development database,

    Mmmmmm.

    @Captain said:

    What I don't have is time to debug.

    Mmmmmmmmmmmmmmmm.

    @Captain said:

    (If it wasn't such a "trivial" update, I'd be talking to the boss about backups and testing, etc. But I can get this done right, the first time, with formal methods.)

    Mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm disapproving_murmur_overload


  • Discourse touched me in a no-no place

    @Captain said:

    I just need the code to run in a reasonable amount of time -- say, less than 30 minutes for 1500 records.

    What's the database software? Do you have the ability to make a local test database, copy the schema for this table, and copy the affected records? If so, you can just test to your heart's content.



  • Can I ask why you want the solution in ANSI SQL? I'm not even sure if there are any rDBMS that fully implement ANSI SQL. The only reason I point this out is that the IFNULL keyword isn't supported in a lot of major rDBMS, but there is an equivalent keyword. Here's a partial summary - http://www.w3schools.com/sql/sql_isnull.asp. Just make sure IFNULL is the right keyword for whatever system you are running against. coalesce does seem to have better support.

    In all honesty, this isn't a complicated query that needs performance tuning. You're concatenating adjacent columns together within each record. Regardless of whether you use ifnull, case statements or whatever bastardization of the coalesce function you came up with, it is almost certainly going to result in the same query plan under the hood. SQL is (in most senses) a declarative language, if there is such a thing. You tell the db engine what you want and you don't have that much control over how it's done . You can certainly give hints to optimize performance, but this is not the type of query that needs it, nor is it a situation that calls for it. If you had to lookup data from other rows or tables within each record you're updating, and this was something that would be run more than once, then it's a different story.

    And I will second/third/fourth everyone else who has told you that it's more important for this to be correct than look pretty or perform well. I'd love to hear why you couldn't at least restore an earlier backup to a test system and do a dry run. It'd be a lot easier to have a conversation with your boss NOW about getting a proper backup and testing strategy in place than explaining to them why the production database got borked in spite of the "formal methods" you learned in your computer science classes. I don't care how good you are - you're going to fuck something up at some point. There's no shame in doing that....as long as it's in a dev/test environment!

    Also, if it takes 30 minutes to run this query against only 1,500 records, then you have more issues to worry about than denormalized data. You'd probably have a very wide table, a lot of (probably poorly thought out) indexes, and some evil triggers that call web services through a command shell or some other nefarious shit.



  • @Captain said:

    COALESCE( " " || zzRespInt, '')

    You're not using an Oracle database are you? IIRC Oracle's shitty null-empty strings would never evaluate that first parameter as null.



  • @bighuskermav said:

    result in the same query plan

    The query plan matters fuckall in this case. The cost of computing a value does.

    SELECT col + 'x' and SELECT col + calculate_pi_to_billion_decimal_places() will have the same plan, but drastically different performance.



  • @Captain said:

    Dry runs aren't really possible.

    2 points on this...
    The first, and most obvious, is; what is so hard about backing the database up, restoring it to your PC and fiddling around with the copy? Even if it's 1TB in size, a new disk is $100 so cost shouldn't be an issue.

    Second is; you can do your dry run in "readonly mode" by using rollback transaction, and a whole bunch of "select ... from" to check the intermediate results. Something like;

    begin transaction
    // a whole load of inserts, updates and other write-code
    select * from updated-table // figure out whether it worked
    rollback transaction

    You'd probably want to schedule the run in a quiet period as there may be issues with page locks for other users, plus make sure your transaction log has enough space to grow.

    But rather than saying "dry runs are not possible", it will be far less WTFey to try and figure out how to do dry runs...


  • Discourse touched me in a no-no place

    @skotl said:

    The first, and most obvious, is; what is so hard about backing the database up, restoring it to your PC and fiddling around with the copy?

    I get the impression that there's a very specific window of opportunity to do this. Which seems even more of a :wtf: but there you go. (@Captain mentioned medics, and they're well-known for being the epicentres of IT disasters. Because they're Doctors and so don't need to listen to other professionals on any topic at all ever.)



  • @dkf said:

    Because they're Doctors and so don't need to listen to other professionals on any topic at all ever.)

    If captain's just doubling-over and taking it in the gut about implementing this with zero testing, then I agree with the doctors: they shouldn't be listening to him.



  • Jesus fuck, you have 1500 records and expect to not have time to run the update several times? You are either running the worst hardware in the century, or you have literally no idea how sql works or is capable of, so SHOULD NOT BE TOUCHING A PRODUCTION SERVER, PERIOD, ESPECIALLY BECAUSE YOU AREN'T WILLING TO RUN A SELECT ON THE DATA BEFORE THE UPDATE!

    I can NOT stress this enough, you are more than playing with dragons. You are preparing to drop a nuclear bomb with no concept of half life radiation.

    It's 1500 records. Run a fucking select first, review your results and run the update. You should have more than enough time to do both.



  • Good question. I'm running Advantage Database. I'd better check that its COALESCE works how I expect.



  • @bighuskermav: In all honesty, this isn't a complicated query that needs performance tuning.

    I'm not really interested in "tuning." I am interested in finding an expression that is obviously correct, so I don't have to debug it "live".

    @skotl: The first, and most obvious, is; what is so hard about backing the database up, restoring it to your PC and fiddling around with the copy? Even if it's 1TB in size, a new disk is $100 so cost shouldn't be an issue.

    I can't even connect to the database unless the application that uses it is turned off. That's why I have such a short window to get this done. In fact, I'll be copying the whole database (which is pretty small) during that window, so I can do things like have a testing database, etc in the future. But I can't clone a database and set up a development database and then write and debug an update in 30 minutes, especially since I'm unfamiliar with the Advantage Database server. Which is why I want the update to be ready to roll.

    This isn't such a complicated update that it shouldn't be ready to roll.

    Also, if it takes 30 minutes to run this query against only 1,500 records, then you have more issues to worry about than denormalized data.

    It won't. But I don't want to count on having any time to debug. 30 minutes is the most the application that runs on the database can be down. And it needs to be down because it locks the table I need to read (and update). That's also why I can't just make a back up and test. Because I won't have a long enough window to make the back up, test, and then run the query. The update has to happen today. In 2 hours, in fact.

    @Matches: It's 1500 records. Run a fucking select first, review your results and run the update. You should have more than enough time to do both.

    I AM WILLING TO RUN A SELECT BASED ON WHAT I WANT TO UPDATE. BUT EITHER WAY I NEED THE UPDATE STATEMENT TO BE READY TO WORK. I AM UNWILLING TO COUNT ON HAVING TIME TO "DEBUG" IT BECAUSE I HAVE A SHORT WINDOW.

    @dkf:
    @Captain mentioned medics...

    The practice's records could get audited at any time, and some of the notes in the "notes" field are incomplete. The way to make them complete is to normalize them as in the update (concatenating all of the fields, with appropriate whitespace) -- in other words, the data is there, but there was a bug in the EHR system (which I don't control the source for) that meant that the data didn't get put together correctly.

    If they get audited, it's a big deal. This can't wait until the place is closed for Christmas.

    If the database is down for longer than 30 minutes, it's not a terrible problem. And I can make a backup and restore if it comes to that. And will. But it would delay everybody's appointments for as long as it takes me to get the problem fixed.

    Which is why the update needs to work right the first time. Because the update needs to happen today, and it needs to go smoothly, and having the update ready to roll will mean a shorter service interruption. This is why I didn't want a TRIM-based solution to the whitespace problem -- it's hard to reason about completely, and I have limited time to fix any problems. I don't need another potential problem to fix in my short window.

    But because the database is locked, I can't do all the fancy development database kinds of testing people in IT shops expect. At least not today, with two day's notice.



  • @Salamander said:

    You're not using an Oracle database are you? IIRC Oracle's shitty null-empty strings would never evaluate that first parameter as null.

    and on the flip side, what databases would consider " " || anything else to be null? Last I checked, string concatenating anything with a single space is, by definition, not null.


  • I survived the hour long Uno hand

    @Captain said:

    I can't even connect to the database unless the application that uses it is turned off. That's why I have such a short window to get this done. In fact, I'll be copying the whole database (which is pretty small) during that window,

    Schedule outage. Take backup. Restore application. Use backup to dry run and/or fine-tune your query. Schedule second outage window. Run query. Restore application.

    The answers are out there.



  • Schedule outage. Take backup. Restore application. Use backup to dry run and/or fine-tune your query. Schedule second outage window. Run query. Restore application.

    That's sensible, but unfortunately there's no good window for a second outage before Christmas. I had this discussion with the operations manager in depth.





  • You are correct that I phrased it poorly, but what I was trying to say was that in this specific case, isnull or a case statement is basically the same thing. I have to imagine that the isnull or coalesce is implemented as a case statement under the hood in most engines.

    In any event, I think the OP has gotten all the advice they need. If you want to be a cowboy DBA, then go for it. Just remember that plenty of "trivial" update statements have fucked up databases beyond repair. It really sounds like you don't actually have a backup strategy in place if your only means of getting a backup is to run one manually from production. Why that isn't your #1 concern is beyond me.



  • facepalm

    You'd think I'd remember that, particularly since I had to fix an issue in a SQL Server stored procedure before that was concatenating values without checking if they were NULL first.



  • You're just looking for ways to complain about my work.

    Guess what: we have a backup strategy in place. The IT department runs that.

    I'm in the Data Team, which does something completely different, and was tasked by the operations manager to get this fucking done today, because I have the skills and expertise to do it right the first time.

    If I fail catastrophically, I will look awful, and then life will go on. But getting a ticket in to the slow ass IT department is not a sensible possibility.



  • You asked for advice and I told you how I would minimize the risk of something going haywire. Personally, I would never do what you are suggesting or recommend that anyone else do it that way. I'm sorry that your work environment sucks so much that you can't simply communicate with a DBA to get a backup from last night. That sounds horrible (honestly).

    Just a bit of advice because I don't know how much experience you have, but it's OK to push back on things if you think they're a bad idea. I have done it several times, but I have been lucky enough to be at employers that trust my judgement.



  • Fair enough, I'm a little stressed out and I'm getting a lot of (what I think is unnecessary) push back from you guys. You're right, you guys did give me enough advice, and I do feel confident that things will work now.

    I know I can push back on things.

    Normally, my job consists of doing reports. I query the database with Crystal Reports and format it. Or I run queries in the EHR system. I don't typically have to go into the guts of the database, which is why I don't have a test database set up. Or a lot of the infrastructure you guys are used to. It's a strange role, because I've become an "expert" at how the EHR works, but I don't even have source code for it.

    So when this special, emergency circumstance came up, they came to me, and I'm trying to do my best at it, on short notice.

    I have lots of IT experience, but this isn't an IT position.

    My plan is:

    1. Write select statement based on the concatenation I want
    2. Copy table from the drive.
    3. Connect to database.
    4. Run select statement to verify formatting of field.
    5. If it works, run update statement.
    6. Run select statement to see if the format looks good.
    7. Restore from backup or quit.
    8. File ticket with IT for a real backup

    Although it's cowboy-ish (I get it), I think I have all the important bases covered.

    My problem is that there's an implicit <"If it doesn't work, try not to panic, but probably panic"> step after each step. I know that this is what the typical development infrastructure helps you with. But getting that set up in time is not an option. So to alleviate performance anxiety, I'm trying to set things up to go as smoothly as possible.

    I was particularly worried about the update statement itself, because of how rarely I do updates, and how much of a panic-inducing hassle debugging things can be.


  • BINNED

    Well, good luck! Tell us how it went!

    We won't judge. Too harshly.



  • Thanks, will do.



  • Took an hour to get it all done. Query ran in about 1.5 seconds.

    Advantage Database has a lot of weird gotchas. For example, it does not support || syntax. I had to change back to +. It also automatically inserts appropriate whitespace, and does not support concatenating whitespace to fields manually at all. So, ultimately,

    update examtreatment
       set notes = TRIM( COALESCE( zzInterv,   '')
                       + COALESCE( zzRespInt,  '')
                       + COALESCE( zzFutPlan,  '')
                       + COALESCE( zzClProg,   '')
                       + COALESCE( zzRiskSpec, '')
                       + COALESCE( zzSessCont, '')
                       )
    

    was good enough. I think the clinicians are still in their meeting, so all's well that ends well.

    Now to file a ticket for a copy of a complete backup.



  • @Captain said:

    The IT department runs that.

    How/when do they do it?



  • I don't know. A local ISP/IT company does it. They're pretty good at what they do, so I'm sure it's a sensible plan.

    You're probably right that I should find out, since I have become the resident EHR expert and will have to deal with this kind of thing again.


  • Discourse touched me in a no-no place

    @Captain said:

    so I don't have to debug it "live".

    As multiple people pointed out, you may not have to if you can replicate enough of the database to test locally.


  • Discourse touched me in a no-no place

    @Captain said:

    Because I won't have a long enough window to make the back up, test, and then run the query.

    If I were in that situation, I'd make a new, local database, with that table, and create 10K rows with random junk data that was broadly representative of the real data (e.g., perhaps 10% of every field has leading or trailing blanks) and test on that.


  • Discourse touched me in a no-no place

    @Captain said:

    If the database is down for longer than 30 minutes, it's not a terrible problem. And I can make a backup and restore if it comes to that. And will. But it would delay everybody's appointments for as long as it takes me to get the problem fixed.

    Why can't this happen at 5PM or whenever the office closes? (other than "because this isn't that kind of setup", obviously.)


  • Discourse touched me in a no-no place

    @Captain said:

    Guess what: we have a backup strategy in place. The IT department runs that.

    Then just restore to your PC.



  • Runs 24/7 until Christmas.

    It's done, everybody is happy, I talked with the Executive Director about never letting this happen again and we have a plan in place but it would have taken weeks to implement anyway.


  • Discourse touched me in a no-no place

    @Captain said:

    Runs 24/7 until Christmas.

    Gotcha.

    In the future you should try to get a restore-to-your-PC every so often, so this can't happen again, though.



  • Going to be honest, everything about this reeks of shady shit.

    You're literally changing existing work through a live hack that's untested to get around audits to prevent exactly that kind of bullshit. This sounds like a place liable to create extraordinary damage.



  • This whole issue is the result of crap UI. The note field is supposed to be compiled from the 5 fields I concatenated in my update. But it happens at the application layer, and a field only gets included in the note if the clinician leaves (e.g. hits tab when done) the field widget before saving. Nobody knew that until I got on a call with the vendor and they explained and confirmed that has always been the case.

    Did you see the update I ran? Hardly unethical. Merely faster than having each clinician check each record themselves.

    That way, they can help more kids.

    I will also create a "worklist" to catch exams when the clinician makes that mistake, so they can fix them as they make them.

    The audits are in place to ensure quality care. If the medical record is wrong because of a software fault, the auditor will be mad. They won't care if I went behind the system's back to fix a software fault.



  • @Captain said:

    I can't even connect to the database unless the application that uses it is turned off.

    :wtf:

    This sounds like a bad system in which to be using SQLite.



  • It's Advantage Database. SAP has a weird driver that lets their apps connect to it concurrently, but they didn't include in the "data architect" tool, the only tool I've found that can run arbitrary queries on a database.

    It's a strange, closed ecosystem.


Log in to reply