Learn to Database (article)



  • http://deliberate-software.com/Learn-To-Database/

    Two weeks later we have it running and enough data to test it. 2500 records a minute. When scaled to 45000 records it is still going 30 minutes later, so it doesn't even scale linearly. We are screwed.
    ...
The ORM-generated query selects a list of IDs from table A, then uses those IDs as an IN filter on the same table:

    SELECT * FROM A WHERE ID IN (SELECT ID FROM A WHERE ...)

    We scratch our heads a bit, but hey, that can't hurt anything, surely the database figures that sort of stuff out? Isn't that what... indexes do? Our most database-savvy developer can't stop coming back to that query. He says it doesn't smell right. Finally, he takes out the "WHERE IN" filter, shims in the query, and runs the process.

    "Guys, the whole thing, all 45k came back in 40 seconds."

    The only thing worse than ORM is a home made ORM.
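The pattern from the story is easy to reproduce in miniature with SQLite (table and column names here are invented for illustration): the self-referential IN wrapper can only ever return the same rows as the direct query, so it is pure overhead.

```python
import sqlite3

# Hypothetical stand-in for "table A": any filterable table will do.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany("INSERT INTO a VALUES (?, ?)",
                 [(i, "open" if i % 2 else "closed") for i in range(1000)])

# The ORM-generated shape: filter table A by IDs selected from
# the same table with the same predicate.
orm_style = conn.execute(
    "SELECT * FROM a WHERE id IN (SELECT id FROM a WHERE status = 'open')"
).fetchall()

# The equivalent direct query after ripping out the WHERE ... IN.
direct = conn.execute("SELECT * FROM a WHERE status = 'open'").fetchall()

# Same rows either way; the outer wrapper adds nothing but work.
assert sorted(orm_style) == sorted(direct)
```

Whether the wrapper also *costs* anything depends entirely on the engine's ability to flatten it, which is what the rest of the thread argues about.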


  • BINNED

    "We need a new architecture, with read-optimized databases and queue-based messaging."

    Yes! Throw more shit at it! Use more services! That's bound to fix it!


    Filed under: Where have we seen that before, I wonder...

    P.S. - The WHERE ... IN thing can actually be useful for offsetting on large datasets if you're dealing with a DB engine where OFFSET ... LIMIT is inherently slow. But not as a replacement for JOINs, FFS.
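The offsetting trick mentioned in the P.S. can be sketched like this (SQLite as a stand-in engine; table and column names are invented). The idea is to page through only the narrow primary-key column, then fetch the wide rows by ID:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany("INSERT INTO posts VALUES (?, ?)",
                 [(i, "post %d" % i) for i in range(1, 101)])

def page(conn, page_no, page_size=10):
    # Step 1: OFFSET/LIMIT over the narrow, indexed key column only,
    # so the engine never drags the wide rows through the offset scan.
    ids = [r[0] for r in conn.execute(
        "SELECT id FROM posts ORDER BY id LIMIT ? OFFSET ?",
        (page_size, (page_no - 1) * page_size))]
    if not ids:
        return []
    # Step 2: fetch the full rows for just those IDs.
    marks = ",".join("?" * len(ids))
    return conn.execute(
        "SELECT * FROM posts WHERE id IN (%s) ORDER BY id" % marks,
        ids).fetchall()

rows = page(conn, 3)  # the third page: rows 21-30
```

On engines where OFFSET has to materialise and discard every skipped row, keeping the offset scan on the key column alone is where the win comes from.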



  • @Onyx said:

Yes! Throw more ~~shit~~ buzzwords at it!

    Because buzzwords are like magic pixie dust, they automatically improve whatever you are talking about.


  • area_deu

    I have actually rewritten queries from JOIN to WHERE id IN because that was the only way I could find to stop SQL fucking Server from "optimizing" the query to take two orders of magnitude longer.
    Young, foolish, needed the money etc.

    That was eight years or so ago. The code is still in production.


  • BINNED

    @Scarlet_Manuka said:

Because buzzwords are like magic pixie ~~dust~~ 🐎📯 and 🌈, they automatically improve whatever you are talking about.

    FTFY

    @ChrisH said:

    I have actually rewritten queries from JOIN to WHERE id IN because that was the only way I could find to stop SQL fucking Server from "optimizing" the query to take two orders of magnitude longer.

That happens on occasion, yes. I have a few subqueries in some of my stored procs because they are faster than JOINs in that specific case. But that's one of those exceptions that prove the rule, really.



  • Sometimes you need to do that sort of thing, unfortunately.

I've been working with Oracle databases from v8 to v11 (soon to be v12), and I've noticed that the CBO has gotten much better at predicting the real cost of queries, as well as being able to transform the query in a much wider variety of ways. But, at the same time, it's gotten much, much harder to force it to choose the correct query path when it decides on a bad one. I had a query recently where I had to try all sorts of optimiser hints before I found a combination of three that together would get it to take the correct path - leave any one of them out and it would revert to the slow way.



  • @Onyx said:

    Yes! Throw more shit at it! Use more services! That's bound to fix it!

I remember sitting at a "Big Data" conference, and a bloke bragging about how they made their service super-fast by going from an RDBMS, putting in a few queues (Sparkle (or is that Sparql? Dafuq, dunno...)/Redis), map-reduce services (Hadoop?), and something still being in the RDBMS.

That kept me wondering: how in the world does making the chain even longer make the data turnover faster? There are the fucking laws of physics, after all! By putting queues everywhere they just offset their lags somewhere else.


  • ♿ (Parody)

    @cartman82 said:

    The only thing worse than ORM is a home made ORM.

    Sounds like they need a better RDBMS. I didn't see any mention of what they were using. I tried a couple of queries like that in Oracle, and the plans were identical if I just did select * instead of select id like their original query.


  • ♿ (Parody)

    @Scarlet_Manuka said:

    But, at the same time, it's gotten much, much harder to force it to choose the correct query path when it decides on a bad one. I had a query recently that I had to try all sort of optimiser hints on before I found a combination of three that together would get it to take the correct path - leave any one of them out and it would revert back to the slow way.

👍 I've had problems like this, too. The only thing I could figure out was to move it to a stored proc and run a different query based on the conditions. The problem seemed to be that it remembered the plan from before, when one of the conditions, fed into a LIKE, was or wasn't '%', and that changed the next time.

Suddenly that cross join that was only using a record from each table was pulling all or most of the table. Buh-bye, temp space!



  • You do know that sql profiles are the preferred way nowadays, instead of hints?
    Many hints are simply ignored by the optimizer in higher versions (11g and up).


  • Discourse touched me in a no-no place

    @boomzilla said:

    I tried a couple of queries like that in Oracle, and the plans were identical if I just did select * instead of select id like their original query.

    Not even Oracle can defend against this kind of stupid though:
    SELECT * FROM A WHERE ID IN (SELECT ID FROM A WHERE ...)


  • ♿ (Parody)

    @loopback0 said:

    @boomzilla said:
    I tried a couple of queries like that in Oracle, and the plans were identical if I just did select * instead of select id like their original query.

    Not even Oracle can defend against this kind of stupid though:
    SELECT * FROM A WHERE ID IN (SELECT ID FROM A WHERE ...)

    But that was exactly what I tried! I had one case that resulted, due to the conditions I used, in a full table scan and another that used an index. And like I said, explain plans were identical whether I used the given form or got rid of that outer query and just did a SELECT * instead of SELECT ID.

    Now, perhaps that's too simple, and a more complicated query, with joins, etc, inside the IN() would make a difference. You never can tell without trying. But that basic scenario as presented...yep, Oracle 11 protected me.
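The experiment is easy to rerun on other engines. A sketch using SQLite's EXPLAIN QUERY PLAN (its planner is not Oracle's, so the two plans may or may not come out identical, which is rather the point of trying):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (id INTEGER PRIMARY KEY, status TEXT)")
conn.execute("CREATE INDEX a_status ON a(status)")

wrapped = "SELECT * FROM a WHERE id IN (SELECT id FROM a WHERE status = 'x')"
plain = "SELECT * FROM a WHERE status = 'x'"

plans = {}
for name, sql in (("wrapped", wrapped), ("plain", plain)):
    # EXPLAIN QUERY PLAN rows are (id, parent, notused, detail);
    # detail is the human-readable plan step.
    plans[name] = [r[3] for r in conn.execute("EXPLAIN QUERY PLAN " + sql)]
    print(name, plans[name])
```

Comparing the two plan listings side by side shows immediately whether the engine flattened the redundant wrapper or is genuinely doing extra work for it.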


  • Discourse touched me in a no-no place

    I got different plans and different execution costs between:

    SELECT * FROM A WHERE ID IN (SELECT ID FROM A WHERE ...)

    and the less moronic:

    SELECT * FROM A WHERE ...

Even with a simple query, the wrapped form accesses the same data in the table twice. On 10g, at least.


  • BINNED

I had a similar situation to @boomzilla's, but involving OFFSET, a bunch of filtering, and ORDER BY (pagination bollocks). Something or other was messing up the index scan, but if I selected only the primary key, did an offset on the column the results were ordered by, and stuck that inside WHERE IN, it sped things up by an order of magnitude.

    I think it also had to do with the number of JOINs I had. I've since switched some data around so I can't give you an exact repro, unfortunately.

So yeah, edge case, and it might not work the same on all RDBMSes, but it happens from time to time.


  • Discourse touched me in a no-no place

    The TL;DR here is that databases be crazy.



  • 45k in 40 seconds is still pathetic performance. Unless their database is running on tinker toys.



Adding qualifiers and subscripts to identify the WHERE clauses and tables:

    SELECT * FROM A AS X1 WHERE1 ID IN ( SELECT ID FROM A AS X2 WHERE2...

    When using this syntax, I would normally expect to see a JOIN-set connector like this in WHERE2:

    WHERE2 X1.SOME_COLUMN = X2.SOME_OTHER_COLUMN

For this to work, the set for the IN clause is normally built for each X1 row, so the performance of this depends on the quality of WHERE2. Since it is being run once for each row of X1, if the X2 query doesn't have good indexing then the overall performance will be terrible:

    1. Get first X1 row; tablespace scan X2 to find out if ID is in the list...
    2. Get second X1 row; tablespace scan X2 to find out if ID is in the list...
    3. Get third X1 row; tablespace scan X2 to find out if ID is in the list...

If WHERE2 must perform poorly, it is usually better to do a JOIN, so the database engine can choose which table to filter first. (If a JOIN is even needed... it doesn't look like it is, in this example.)

    A smart database engine might treat this IN as a JOIN, but it depends a lot on the engine.
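The correlated case described above can be sketched in SQLite (invented tables; a `grp` column plays the part of SOME_COLUMN). Both forms return the same rows, but the join form leaves the engine free to choose the driving table instead of conceptually rebuilding the IN set once per outer row:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (id INTEGER PRIMARY KEY, grp INTEGER)")
conn.executemany("INSERT INTO a VALUES (?, ?)",
                 [(i, i % 5) for i in range(100)])

# Correlated form: the subquery references x1, so it is evaluated
# per outer row; without an index on the correlated column each
# probe degenerates into a scan of X2.
correlated = conn.execute("""
    SELECT * FROM a AS x1
    WHERE x1.id IN (SELECT x2.id FROM a AS x2
                    WHERE x2.grp = x1.grp AND x2.grp = 0)
""").fetchall()

# Equivalent join form: one join the planner can reorder and index.
joined = conn.execute("""
    SELECT x1.* FROM a AS x1
    JOIN a AS x2 ON x2.id = x1.id AND x2.grp = x1.grp
    WHERE x2.grp = 0
""").fetchall()

assert sorted(correlated) == sorted(joined)
```

A smart engine rewrites the first form into the second internally (a semi-join), which is exactly the "might treat this IN as a JOIN" caveat above.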



Or the dataset is in the high millions of rows (... and is updating, rather than inserting)


  • Discourse touched me in a no-no place

    @CoyneTheDup said:

Adding qualifiers and subscripts to identify the WHERE clauses and tables:

    SELECT * FROM A AS X1 WHERE1 ID IN ( SELECT ID FROM A AS X2 WHERE2...

    When using this syntax, I would normally expect to see a JOIN-set connector like this in WHERE2:

    WHERE2 X1.SOME_COLUMN = X2.SOME_OTHER_COLUMN

For this to work, the set for the IN clause is normally built for each X1 row, so the performance of this depends on the quality of WHERE2. Since it is being run once for each row of X1, if the X2 query doesn't have good indexing then the overall performance will be terrible:

    1. Get first X1 row; tablespace scan X2 to find out if ID is in the list...
    2. Get second X1 row; tablespace scan X2 to find out if ID is in the list...
    3. Get third X1 row; tablespace scan X2 to find out if ID is in the list...

If WHERE2 must perform poorly, it is usually better to do a JOIN, so the database engine can choose which table to filter first. (If a JOIN is even needed... it doesn't look like it is, in this example.)

    A smart database engine might treat this IN as a JOIN, but it depends a lot on the engine.

    :wtf:



  • This is the story of how I learned an important lesson: if you are going to database, LEARN TO DATABASE.

    Why does nearly every ORM horror story end with a variation on this advice?



  • Maybe because it's a sane piece of advice, after all.



  • @wft said:

    Maybe because it's a sane piece of advice, after all.

    I don't disagree. I'm just bemused that problems like these tend to appear so frequently.



  • Because many idiots think that this piece of advice somehow doesn't apply in their particular situation.



Also, the desire of many idiots to roll their own ORMs stems from their being actively unwilling to learn anything about the database itself, and an irrational wish to justify that ignorance.



  • Yep, I've met plenty of developers who were deathly afraid of learning or touching SQL.


  • ♿ (Parody)

    I love both SQL and my ORM (Hibernate). They are both super useful and appropriate for different use cases. But there will always be people treating either one as golden hammers.



Using Big Data techniques you can absolutely get such a performance improvement that it makes perfect sense to copy everything to the Big Data environment, run your query, and copy back the result; especially if, say, your daily query uses more than your daily data, so that you only copy over the day's new data and run the query on the full dataset. It basically depends on your data architecture, volume, and the nature of your query.
An extreme but real-world example from $WORK is a daily query that went from 18 hours to 6 minutes, for a one-time HW cost of 1/24th the yearly recurring cost of the old system (think about asking a certain DB company known to be expensive for their very best, and a certain HW company also known to be expensive for their very best high-performance cluster to run it on; the very prescient DB company made so much money they recently bought the very shiny HW company). That's a performance/cost improvement of 4320 times, not counting relief from the stress and other problems caused when the old query had issues related to the fact that it usually took 18/24ths of the available time, never less but sometimes more. The human development costs were more than recouped in the first year of operation.
Now, I'm not saying that the original SQL query wasn't written by a friend of the WTF-author in the above post, it could have been, but the high-level theoretical query scales perfectly (a complicated query run on several million totally independent chunks), and the 180-fold performance increase we saw can to a significant degree be explained by the hundredfold multiplication of computing cores and corresponding I/O power, so it can't have been all that bad.
    So, laws of physics by all means, but sometimes you can get better for less, and even much better for much less, so it would be a pity to discount the possibility. (And no, neither the sponsor/dev team nor the department got to keep even a fraction of the money we saved, but you already suspected that, right?)



  • The fundamental problem with Big Data as of today: you must plan all kinds of queries you want to run against it beforehand, as they influence how you shard your data etc. Setting shit up is much like programming in straight machine code compared to modern high level languages; and then a critical, paying customer comes up with a new facet of data you need to get them that requires you to reshuffle all your data; except that there is so much of it now that, well, you're basically fucked.

    Also, doing joins by hand and extreme denormalization (and again you're fucked if you overlook what you might need in 6 months time).

I think some PostgreSQL-based parallelizing extensions sound much more promising (there are a few, and some of them might become part of the core), as they are aimed at taking care of the low-level stuff for you. I don't want to know precisely how it will join tables for me; I want it to do it fast and give me my data, and that's it.



Also, the big smell for me was that they bragged about using blazing-fast queues upon blazing-fast queues. If you ever need a queue, you do have a bottleneck, which your queue is only going to mask for a while. You are just buying a bigger sink while your downstream pipe is too narrow. One day it won't suffice any more.

The same goes for sophisticated caches and caches for those caches (until you spend a gazillion times more on invalidation issues than on actual work): if you need more and more from them, the data source might just be screwed. Maybe it's time to upgrade the hardware of your primary data source in the first place?



  • @nerd4sale said:

You do know that sql profiles are the preferred way nowadays, instead of hints? Many hints are simply ignored by the optimizer in higher versions (11g and up).

    Depends, depends. One of the things it depends on is whether you have access to put sql profiles in the database. ;) More to the point, where I'm coming from is usually the context of interface work, where we modify the queries from time to time; generally the optimiser hint can remain as is (e.g. we're just getting an additional field from a table we're already looking up), while we'd need to drop the existing profile and create a new one when the query changed.

Another thing that happens is that certain tools don't use bind variables when they should, so a SQL profile will get you squat because the query text changes every time.

    I remember looking at a performance issue with one program – which we're in the last stages of transitioning away from – which came down to a query which asked for a bunch of data based on the operating unit and a set of participant IDs. The operating unit was a bind variable, even though it never changed, but the participant IDs weren't, even though that changed every time it ran the query (it divided the participants into groups of five and ran the query for each group). There's not much you can do when you're faced with that level of :doing_it_wrong:.
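For anyone unfamiliar with the bind-variable problem described, a sketch (SQLite and invented names; the principle is the same on Oracle): splicing literals into the statement makes every batch of five a textually new query, which defeats any plan cache or SQL profile keyed on the statement text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE participants (id INTEGER PRIMARY KEY, unit INTEGER)")
conn.executemany("INSERT INTO participants VALUES (?, ?)",
                 [(i, 1) for i in range(1, 21)])

# Literal-splicing, as the tool did: each batch of five IDs yields a
# different statement string, so nothing keyed on the text ever matches.
batch = [6, 7, 8, 9, 10]
bad_sql = ("SELECT id FROM participants WHERE unit = 1 AND id IN (%s)"
           % ",".join(str(i) for i in batch))

# Bind-variable version: for a fixed batch size the statement text is
# constant, so it can be prepared once and reused across batches.
good_sql = "SELECT id FROM participants WHERE unit = ? AND id IN (?,?,?,?,?)"

assert conn.execute(bad_sql).fetchall() == \
       conn.execute(good_sql, [1] + batch).fetchall()
```

Same rows either way; the only difference is whether the server sees one reusable statement or an endless stream of one-off ones.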


  • Java Dev

    @Scarlet_Manuka said:

    Another thing that happens is that certain tools don't use bind variables when they should, so a sql profile will get you squat because the query changes every time.

    Hey, I fixed that for the next release specifically so that tuning profiles and autolearning of plans would actually work.

    Oh, and better identification of expensive queries in AWR reports.

