Bookkeeping code or: how I learned to stop worrying and love data duplication



  • So at work, we have certain things that users can process. I can't go into what sort of things without giving away the industry I work in. Let's say these things are stories, and they are divided into chapters, and each chapter has one or more paragraphs. Basically, our client has customers, and each customer gives out licenses to their own customers; we'll call those the licensees. The licensees have several people assigned to them; we'll call those storytellers. The storytellers supply the end users with stories.

    We built a feature for our client to prepare her own template stories. She can allocate those to some or all of her customers, and their licensees' storytellers can then make a copy and edit the story, or simply use it as-is. I personally think this is a great feature, considering the business our client is in. Anyway, yesterday our client called. She said, "Hey toon, those three stories we've been writing, I've tried to couple them to WritaCorp but it seems to be taking a long time." Sure enough, it turns out it takes three and a half minutes for a single story to get coupled to a single customer: their biggest one, with about 60 licenses.

    So I took the debug log that the operation dumped, and saw that it took not one, not two, not three, not four, not five, not six, but seven fucking thousand (7,000) database queries. I spent some time alleviating the problem and turned those 7,000 into 3,600 in about half an hour, shaving about a minute off the running time. The decoupling, the reverse operation, took about 4,500 queries and I got that down to about a dozen.

    Why, O toon, you might ask, are so many database queries needed? After all, you still have 3,600 left. Well, perhaps considering how this coupling was implemented will illuminate things.

    In the olden days, there used to just be stories, with chapters and paragraphs. Then templates arrived, adding a column to the story table that equals 1 if and only if the story is a template. The story table also gained a license ID column and a customer ID column, so you can have a story template for just one license, or a story template that's available to all licensees of the customer. So far, so reasonably decent.
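
    For illustration, the schema presumably boils down to something like this (a rough sketch with invented names; assuming MySQL, since it's a PHP shop):

        -- Sketch of the story/template schema as described; all names invented.
        CREATE TABLE story (
            id          INT AUTO_INCREMENT PRIMARY KEY,
            title       VARCHAR(255),
            is_template TINYINT(1) NOT NULL DEFAULT 0, -- 1 if and only if it's a template
            license_id  INT NULL,  -- template for a single license...
            customer_id INT NULL   -- ...or for all licensees of the customer
        );
        CREATE TABLE chapter (
            id       INT AUTO_INCREMENT PRIMARY KEY,
            story_id INT NOT NULL, -- no foreign key constraint, as you'll see below
            position INT NOT NULL,
            title    VARCHAR(255)
        );
        -- paragraphs hang off chapter in the same way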

    Subsequently, a coworker who is no longer with the company was tasked with giving our client the ability to make template stories that she can then couple to customers, or to some or all of their licenses. A set of screens was added that gives our client the means to do this. What happens when you couple a story template to a customer is that it gets duplicated and added to each license. So in the case of WritaCorp, 60 copies were made, each with a set of chapters, and each of those with a set of paragraphs, all of which were inserted into the database separately. This is why I was able to easily halve the number of queries: I simply combined thousands of INSERT queries into a single one.
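
    Combining them is as simple as turning thousands of single-row statements like these (a sketch; names invented):

        INSERT INTO paragraph (chapter_id, position, body) VALUES (123, 1, '...');
        INSERT INTO paragraph (chapter_id, position, body) VALUES (123, 2, '...');
        -- ...and thousands more like it...

    into one multi-row statement per table:

        INSERT INTO paragraph (chapter_id, position, body)
        VALUES (123, 1, '...'),
               (123, 2, '...'),
               (124, 1, '...');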

    "Foreign key constraints? We don't need no stinkin' foreign key constraints. We'll just insert one row, get the inserted ID back, then insert all the rows in the other table. If they have "children" then we'll obviously have to do the same for them!" Of course, even without foreign key constraints there may be several better ways to go about this, but this, apparently, is how this sort of thing is done.

    Wait, you shout! What happens if the client changes something in the story template? All those duplicates will need to change along with it! Well, my ex-coworker thought of that in advance. He added a "set ID" column to the story template table, even though there is no such thing as a "set" table in our application. What this column does is group all of those template stories together, so you can find the templates that belong together! So if the client changes the title of a chapter, why, you just get all of the chapters of all of the other templates with the same set ID, but just the ones with the same position, and then update those (one by one, of course). Rearranging chapters becomes a great exercise in creative bookkeeping, and of course there are other WTFs that arise from this implementation.
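
    So a chapter title change presumably fans out to something like this (a sketch with invented names; and remember, the real code issues one UPDATE per row):

        -- Propagate a chapter title to every duplicate in the same "set".
        UPDATE chapter
           SET title = 'New chapter title'
         WHERE position = 3
           AND story_id IN (SELECT id FROM story WHERE set_id = 42);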

    Now, I thoroughly dislike this sort of thing, because I hate what I like to call "bookkeeping code". With an extra hour of thinking, probably making the thought process span about 65 minutes, my ex-coworker should have come up with a much better solution. See, the whole premise behind this bookkeeping code my coworker wrote is, of course, that you're somehow guaranteed to start out with a consistent state; you do something (without using transactions or anything like that); ergo, you're sure to end up with the state still being consistent afterwards.

    Now recall that I said the coupling took three and a half minutes. Our server (a PHP 4 box, for a whole other extra bonus WTF) is set to time out any script at four minutes. So if someone decided to refresh the page because editing a story took too long, or the script timed out after only 6,000 out of 7,000 queries finished, you're stuck with an inconsistent database state. Nobody realizes this at the time, of course, and our client, thinking we built something that works, merrily keeps editing, only making the database state more inconsistent, after which this sort of situation turns into what I believe Donald Knuth, in many of his seminal works, terms a "clusterfuck".
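
    For what it's worth, even just wrapping the whole coupling in a transaction would have made a timeout harmless (a sketch, assuming a transactional storage engine such as InnoDB):

        START TRANSACTION;
        -- ...all of the template-copying INSERTs go here...
        COMMIT;
        -- If the script dies before the COMMIT, everything rolls back and the
        -- database stays consistent, instead of keeping a partial copy.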

    I don't even know how to unmangle this mess without spending several days on rethinking and rewriting whole parts of our application.



  • My God.... +1 for not jumping off a bridge after being given that mess.

    My rule of thumb: If you duplicate records in a database, stop and think. You are probably doing something wrong.


  • BINNED

    @cartman82 said:

    If you duplicate records in a database, stop and

    ... (clue) hammer time!


  • Discourse touched me in a no-no place

    @toon said:

    I don't even know how to unmangle this mess without spending several days on rethinking and rewriting whole parts of our application.

    That's normal when unmangling a complex system. Sometimes that can take months if the data model is truly terrible…



  • Sounds like you needed to use RDF





  • The fact that you're being shafted with PHP 4 is a sign of terror in itself.



  • There's a fair bit of, shall we say, legacy code on that server. The hosting company is scrapping the hardware that the server runs on, and they're not getting us another PHP 4 server (and let's face it, why should they?), so sooner or later this box will be migrated to PHP 5. I've been trying to warn the company that these legacy applications might not run, or worse, but apparently it's nothing to worry about. Let's hope this is one of those times where I'm being wrongly pessimistic!



  • Yesterday at band practice, I explained this to my non-programmer bandmates. I think it's telling that they understood exactly what the WTF was, even though they know nothing about databases or web development.



  • @toon said:

    but seven fucking thousand (7,000) database queries

    So that is how you pronounce the comma in numbers.



  • It must be incredibly legacy if it won't run on a supported version of PHP; PHP 4 was EOLed nearly 7 years ago.



  • I guess odds are good that it'll run on 5, but some of it is pretty shitty.



  • If your query count exceeds mid-to-high single digits for any particular transaction type...

    You are probably doing it wrong.

    If your query count exceeds 3,500...

    WHAT THE FUCK ARE YOU DOING OH MY GOD!


  • BINNED

    @toon said:

    The hosting company is scrapping the hardware that the server runs on, and they're not getting us another PHP 4 server (and let's face it, why should they?), so sooner or later this box will be migrated to PHP 5. I've been trying to warn the company that these legacy applications might not run, or worse, but apparently it's nothing to worry about.

    Of course there's nothing to worry about. You can just run them all in a VM running CentOS 4. What could possibly go wrong?



  • This is the continuity plan. Just remember to take the machine image BEFORE you scrap the server.


  • :belt_onion:

    @Matches said:

    If your query count exceeds 3,500...

    WHAT THE FUCK ARE YOU DOING OH MY GOD!

    You think that's bad?

    We had a SOAP service that consumed record sets from multiple clients (many hundreds of clients) and inserted those records into the database. Each record set usually consisted of between 400 and 800 records (though sometimes we'd get ~30,000+ at once, which I'll detail later). Every record had about 4 fields that had to be looked up, translated into IDs, and inserted into the database. They created a function call in the service to translate each of those fields to an ID. Yay, you say, code reuse! Boo, I say: each call queried the database, even if the answer had just been returned. No caching was done.
    The logic went like so: open a transaction, loop through the record set, look up the 4 IDs in the database, and run a query for the single record to see if it exists; if it exists, update, otherwise insert. After the record set completed, close the transaction, rolling back on error. And about 95% of the records were updates.

    So back to our 30,000-record input set... 30,000 * 4 + 30,000 * 2 = 180,000 queries. At a somewhat OK 100 ms per query, it could take 5 hours to run.

    After we finished implementing a local in-memory cache of the tiny lookup ID fields (no more than 100 IDs per lookup data set) and converting the 30,000-row insert into a single bulk merge, the entire thing ran using a grand total of 5 queries, and completed in under 5 seconds...
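
    The bulk-merge end of that looks something like this (a sketch with invented names, written in SQL Server-style MERGE syntax, which may or may not match what we ran):

        -- One statement replaces the per-record exists/update/insert loop;
        -- staging_records is a hypothetical table holding the incoming set.
        MERGE INTO records AS target
        USING staging_records AS source
           ON target.record_key = source.record_key
        WHEN MATCHED THEN
            UPDATE SET target.value = source.value
        WHEN NOT MATCHED THEN
            INSERT (record_key, value) VALUES (source.record_key, source.value);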

    edit - oh, it was also PHP 5 :D And a kinda funny side story: the shitty ass PHP NuSoap service ships with "debugging" set to MAX, which builds a gigantic debug string while doing its work. When 30,000+ record inputs came in, it was a race to see whether the database server gave up or PHP ran out of memory from the multi-hundred-megabyte debug string it was trying to build. They used a no-max-limit ini override to keep it from getting killed by the server's default 15-minute timeout.



  • Hire me for 120k <fuck you>USD <Arantor> a year and I will make all your headaches go away.



  • Fun fact: if you cut off your head, you will never get a headache again. And that costs less than $120k/year.



  • Sure, but on the assumption he'd like to keep his sanity (and, you know, life), I feel like I'm offering a bargain, and I wouldn't touch it for a penny less than that per year.



  • @Matches said:

    and, you know, life

    Ah, my solution doesn't work in edge cases like that.


  • Discourse touched me in a no-no place

    @darkmatter said:

    the shitty ass PHP NuSoap service ships with "debugging" set to MAX, which builds a gigantic debug string while doing its work

    That sounds like something for the Bad Ideas thread.



  • Reminds me of bulk loading data in SFDC. Lots of chunking and waiting.



  • How about 120,000 yen a year? Does that sound like a deal?



  • I'm still waiting for my dislike feature to be implemented, so I'll tuck this one away for later.



  • What you really want is a 'fuck you' option.



  • No, the 'fuck you' option was me going back and ninja editing my original post.



  • Touché.



  • It's better now. Quote-reply it.



  • @Matches said:

    It's better now. Quote-reply it.

    Sounds like effort.



  • @chubertdev said:

    Reminds me of bulk loading data in SFDC. Lots of chunking and waiting.

    I'm also finding the absence of TRUNCATE TABLE a serious problem. I have to download all the IDs and use the Bulk API to delete the records. What kind of retarded invention is this?



  • DELETE FROM <table> should work.



  • drop table < table>
    exec sp_recreatetable (table)

    What the flying fuck? Type < table> (with no space) then on the second line exec sp_recreatetable (table)

    Why does the second line disappear?

    Apparently creating a malformed table tag is interpreted as HTML, and doing < table>< th doesn't even make it into view source.

    This reeks of XSS potential.



  • Yeah but that means you don't have a table afterwards. How will you put pieces of paper on it to take pictures of?



  • @Arantor said:

    DELETE FROM <table> should work.

    You haven't met SOQL, have you?



  • StackOverflow Query Language? No thanks.



  • @Arantor said:

    DELETE FROM &lt;table&gt; should work.

    LMAO!!! Try deleting 60 thousand records and come back and make your report!



  • Praise the Discourse for multiple mungings of a single statement!



  • Just disable the transaction log, you'll be fine.



  • @Matches said:

    Just disable the transaction log, you'll be fine.

    I hope that was posted as a joke. There are no transaction logs to be disabled.



  • Yes.



  • @Arantor said:

    StackOverflow Query Language? No thanks.

    It's much worse than that, worse than anything that you can imagine.



  • LIES! LIES I SAY! I refuse to believe anything could be worse than that.



  • When a technology makes you leave a job, you know it's bad.



  • Oh?


  • :belt_onion:

    I am

    well that's certainly interesting @Matches
    amused?


  • :belt_onion:

    Interesting how things in a <table>

    well that's certainly interesting @Matches
    don't show in the post at all, or the HTML on the page, but if you use the quote button in the editor, it prints the contents of that <table>more @Matches
    into the post.



  • Yep.



  • Another thing Discourse does so well. There is a reason most forum software avoids table tags, and the few packages that do support them tend to keep things very simple.



  • @Arantor said:

    DELETE FROM <table> should work.

    I feel like I'm walking into an ambush here but wouldn't you use ...

    TRUNCATE [TABLE]


  • Discourse touched me in a no-no place

    DROP TABLE Foo
    

    then load the DDL script again within the same transaction. (TRUNCATE isn't universally supported.)

