Resources to learn about database schema design and optimization?
-
I find myself, for a bunch of reasons, with a lot of slack at work. And I want to actually be productive (I know, ). One place where I'm weakest is dealing with RDBMSs in a performant way. I know the basics of SQL and can avoid some of the really stupid things (queries against unindexed columns in a big table), but I want to be better at that.
What are some resources (ideally free and online) that people would suggest to learn more about writing performant queries and better structuring table definitions? Even if they're for different database products, I think it could still be useful.
-
@Benjamin-Hall I've found a fair amount of info just scanning through the questions and answers on the DBA section of StackExchange. If nothing else, they should be able to provide some links to other resources.
-
@Benjamin-Hall said in Resources to learn about database schema design and optimization?:
What are some resources (ideally free and online) that people would suggest to learn more about writing performant queries and better structuring table definitions? Even if they're for different database products, I think it could still be useful.
A good rule of thumb is to see what full table scans your queries are doing (with EXPLAIN or equivalent for your DBMS). Full table scans are expensive when you only want a small proportion of the total number of rows. Avoiding that is what you use indices for; an appropriate index turns access to a row by value from O(N) to O(log N). Getting that right, plus appropriate use of transactions, is what gives 99% of all DB performance gains. (Adding an index will slow down an insert, update, or delete, but indices usually only get rebuilt once per transaction; bunching modifications if you can is a great plan.)
Indices have other tricks, but then you're getting into more DB-specific stuff. The main one I know of is defining a uniqueness constraint across multiple columns (which is a correctness property, not just a performance property). The order of columns in a multi-column index tends to be important too, but I have to look things up to know how to handle that.
Some DBs do not need explicit indices for primary keys if they're integer autoincrement. No idea if that's universal!
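To make that concrete, here's a minimal sketch using Python's sqlite3 (the table, columns, and data are invented for illustration; EXPLAIN QUERY PLAN is SQLite's rough equivalent of EXPLAIN):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)")
con.executemany("INSERT INTO orders (customer_id, total) VALUES (?, ?)",
                [(i % 100, i * 1.5) for i in range(10_000)])

query = "SELECT * FROM orders WHERE customer_id = 42"

# Before the index: the plan reports a full table scan.
before = con.execute("EXPLAIN QUERY PLAN " + query).fetchall()
print(before)  # e.g. a row whose detail column says 'SCAN ...'

con.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")

# After the index: the plan uses the index instead (an O(log N) lookup).
after = con.execute("EXPLAIN QUERY PLAN " + query).fetchall()
print(after)   # e.g. a row whose detail column says 'SEARCH ... USING INDEX ...'
```

The same before/after comparison works with EXPLAIN in MySQL (watch the `type` column go from ALL to ref) or Oracle/MSSQL's plan output.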
-
@dkf said in Resources to learn about database schema design and optimization?:
(with EXPLAIN or equivalent for your DBMS)
Reading explain plans is a skill in itself. Make sure to run them on realistically populated databases with up-to-date statistics - if the optimiser thinks the table is near-empty, it may ignore indexes which it would use on larger tables. Check any row count estimates included in the plan and consider whether they're reasonable.
Ensuring you have up-to-date statistics on your production DB is your DBA's job, if you have one.
The above is based on Oracle and may not apply to some other DBMSes.
-
@PleegWat said in Resources to learn about database schema design and optimization?:
Ensuring you have up-to-date statistics on your production DB is your DBA's job, if you have one.
Unfortunately we're dev-ops, and the dedicated platform people are up to their eyeballs in transitioning from in-house (well, in AWS but self-managed) MySQL 5.5 (yes, that's ancient) to AWS Aurora.
I do have access to a read-only replica set aside for internal use (ie no production read traffic)--how costly is an EXPLAIN operation?
-
@Benjamin-Hall said in Resources to learn about database schema design and optimization?:
how costly is an EXPLAIN operation?
Basically nothing, as it's just an estimation. Nothing is actually executed.
8.whatever has EXPLAIN ANALYZE, which runs the query to analyze the plan (and MSSQL and Oracle have equivalents), but 5.5 only has EXPLAIN.
IME the MySQL format for explain plans isn't great, and the MySQL optimiser is insane compared to Oracle's or MSSQL's, but YMMV.
-
Well...hand entering EXPLAIN calls for some of our less-trivial (ie not a simple SELECT fields FROM table WHERE simple_condition) queries showed that we're doing full table scans on at least two tables, both of which are pretty high-traffic (for us). I'll have to load up some data into the sandbox that overhauls one of the workflows to see how much it complains when I EXPLAIN that one. In one case, there was an entry in the possible_keys column (a foreign-key related index) that it decided not to use.
But I'm thinking that there's some low-hanging fruit there.
At least it's only one full table scan per query, not multiple! So the total rows is only (big table)x1 (all the other table queries are 1 row each) instead of (big table)x(other big table)x...
-
@Benjamin-Hall said in Resources to learn about database schema design and optimization?:
In one case, there was an entry in the possible_keys column (a foreign-key related index) that it decided not to use.
As there's an index it could have used, the optimiser decided the full scan was less costly than the index scan. This could be because the stats on the table(s) are out of whack, or it could be correct if it's a small table.
Sometimes index hints help too.
-
@loopback0 said in Resources to learn about database schema design and optimization?:
@Benjamin-Hall said in Resources to learn about database schema design and optimization?:
In one case, there was an entry in the possible_keys column (a foreign-key related index) that it decided not to use.
As there's an index it could have used, the optimiser decided the full scan was less costly than the index scan. This could be because the stats on the table(s) are out of whack, or it could be correct if it's a small table.
Sometimes index hints help too.
One of the things I don't know about is what counts as "small". In this case, there are ~7k rows in the table.
-
@Benjamin-Hall said in Resources to learn about database schema design and optimization?:
@loopback0 said in Resources to learn about database schema design and optimization?:
@Benjamin-Hall said in Resources to learn about database schema design and optimization?:
In one case, there was an entry in the possible_keys column (a foreign-key related index) that it decided not to use.
As there's an index it could have used, the optimiser decided the full scan was less costly than the index scan. This could be because the stats on the table(s) are out of whack, or it could be correct if it's a small table.
Sometimes index hints help too.
One of the things I don't know about is what counts as "small". In this case, there are ~7k rows in the table.
Many years ago, I used to be responsible for a script that added test results to a database. IIRC, we had somewhere in the neighborhood of 100k rows. I thought that was big, until I started reading here about people dealing with hundreds of millions of rows.
7k is pretty small.
-
@HardwareGeek said in Resources to learn about database schema design and optimization?:
7k is pretty small.
Probably get cached in RAM and stay there, even.
-
@Benjamin-Hall said in Resources to learn about database schema design and optimization?:
What are some resources (ideally free and online) that people would suggest to learn more about writing performant queries and better structuring table definitions?
I've been planning to read Use the index, Luke! for a year or so. Table of contents seems to be exactly the kind of thing you're looking for.
-
There's a million books on database design on the interwebs. I think I bought this one, years ago. It was OK-ish, IIRC. Despite the cover.
Looks like it's still sold as a PDF here for... looks it up... €11,09.
...They've raised their prices. I paid less than €10 back then.
-
@Benjamin-Hall said in Resources to learn about database schema design and optimization?:
One of the things I don't know about is what counts as "small". In this case, there are ~7k rows in the table.
I don't know how large the storage for each row is (depends on the number and types of columns) but at 128 bytes a row, 7k rows is 917504 bytes, ie. under 1MB. Even most potato-computers from 10 years ago had many times that much space; the data will sit in RAM and never get unloaded because there's literally no reason for the DB engine to ever bother to do otherwise. That's “small” in DB terms.
The other thing to watch out for is self-joins. Self-joins get very expensive if they aren't indexed (or using something the DB can look up directly, like the integer autoincrement primary key) even with “small” tables.
-
@loopback0 said in Resources to learn about database schema design and optimization?:
@Benjamin-Hall said in Resources to learn about database schema design and optimization?:
In one case, there was an entry in the possible_keys column (a foreign-key related index) that it decided not to use.
As there's an index it could have used, the optimiser decided the full scan was less costly than the index scan. This could be because the stats on the table(s) are out of whack, or it could be correct if it's a small table.
Sometimes index hints help too.
It also depends on the query and what it looks like you'll be accessing in the table. If the table is a few joins away from something that gets an index scan based on some constraint, it could look like a better bet to do a full scan.
-
@Benjamin-Hall said in Resources to learn about database schema design and optimization?:
@loopback0 said in Resources to learn about database schema design and optimization?:
@Benjamin-Hall said in Resources to learn about database schema design and optimization?:
In one case, there was an entry in the possible_keys column (a foreign-key related index) that it decided not to use.
As there's an index it could have used, the optimiser decided the full scan was less costly than the index scan. This could be because the stats on the table(s) are out of whack, or it could be correct if it's a small table.
Sometimes index hints help too.
One of the things I don't know about is what counts as "small". In this case, there are ~7k rows in the table.
That's fairly small. But the only real test is to actually run the query and decide if it takes too long. Explain plans are still just educated guesses about how long the query will actually take. With any optimization, measure early, measure often.
-
@dkf said in Resources to learn about database schema design and optimization?:
The other thing to watch out for is self-joins. Self-joins get very expensive if they aren't indexed (or using something the DB can look up directly, like the integer autoincrement primary key) even with “small” tables.
Self joins are where I start reaching for CTEs. Get a manageable number of records first, then start the fun.
-
@Benjamin-Hall said in Resources to learn about database schema design and optimization?:
@PleegWat said in Resources to learn about database schema design and optimization?:
Ensuring you have up-to-date statistics on your production DB is your DBA's job, if you have one.
Unfortunately we're dev-ops, and the dedicated platform people are up to their eyeballs in transitioning from in-house (well, in AWS but self-managed) Mysql v5.5 (yes, that's ancient) to AWS Aurora.
I do have access to a read-only replica set aside for internal use (ie no production read traffic)--how costly is an EXPLAIN operation?
I would put it this way: database query engines today are actually non-von-Neumann virtual machines. Each query is compiled to bytecode (with heavy optimization based on the data statistics) and the EXPLAIN command just shows the bytecode; nothing more. Which is why there is no universal EXPLAIN format: DB2 is almost imperative (starts with the first scan, ends with the result) because its VM is push-based, PostgreSQL is upside down because it's pull-based, MySQL is stupidly oversimplified because everything has been bolted on as an afterthought (though it may have gotten better in the last few years).
In any case, the EXPLAIN itself just compiles the query, so the cost is literally negligible. As others mentioned, some engines have a variant that also runs the query (and compares the result to see how off the mark the statistics-based predictions were); those can be costly, of course.
-
@Benjamin-Hall said in Resources to learn about database schema design and optimization?:
One of the things I don't know about is what counts as "small". In this case, there are ~7k rows in the table.
I'd frown at fetching 7k rows into your application unnecessarily, but a full table scan on it wouldn't worry me. Particularly since you mentioned it's relatively high traffic - doing full table scans on the in-memory data may be more efficient than keeping the index in memory as well.
-
@PleegWat said in Resources to learn about database schema design and optimization?:
I'd frown at fetching 7k rows into your application unnecessarily
That's not fetching them into the application, that's fetching them into the DB engine.
-
@Benjamin-Hall I just got a marketing email the other day for this book
No experience with it personally, but the email went on and on about how great it is for understanding the output of EXPLAIN ANALYZE, so perhaps it's just the thing you're looking for!
-
@PleegWat said in Resources to learn about database schema design and optimization?:
@Benjamin-Hall said in Resources to learn about database schema design and optimization?:
One of the things I don't know about is what counts as "small". In this case, there are ~7k rows in the table.
I'd frown at fetching 7k rows into your application unnecessarily, but a full table scan on it wouldn't worry me. Particularly since you mentioned it's relatively high traffic - doing full table scans on the in-memory data may be more efficient than keeping the index in memory as well.
There's also the issue of how the DB engine will read records. Basically a similar issue to block size on disk, where it reads them in chunks, and so the vagaries of that can affect the actual execution and query planning.
-
Ok, here's a question for the gurus. I have some aggregate entities that have sub-entities. Fairly normal. When I read one of the parent entities, I also want to grab all the sub-entities.
Let's call the parent entity Occurrence, with child entities (on other tables) of OccurrenceAssignment (1..N, and yes, there are other assignments so the disambiguation is necessary) and CoverageRequest (0..N), the latter of which has sub-entities Recipient (1..N).
Each of the latter ones has a foreign key to the occurrence.id to which it belongs. So the simple case (read a single Occurrence by id) could, in theory, just do a big JOIN query. Fine. I know how to do that.
But more often I want to get all Occurrences (with all their sub-entities) which match some pattern of fields of the Occurrence, such as "belongs to client id X and falls between dates start and stop".
Right now, I'm doing the following (which is horribly sub-optimal, I think):
- Query 1: Get all the Occurrences that match the pattern: SELECT <fields> FROM occurrences WHERE <conditions>;
- Query 2: Get all OccurrenceAssignments that match the ids returned from query 1: SELECT <fields> FROM occurrence_assignments WHERE occurrence_id IN (<values>), where <values> is basically a map of the ids from query 1.
- Query 3: Get all Requests + Recipients for all the occurrence ids via a big JOIN call with some magic to parse that out and construct the entities themselves. Does use a WHERE occurrence_id IN (<values>) clause instead of doing N individual queries.
And then assembling the mess in code.
Question 1: Is this likely to matter? Is this likely to be faster as a big JOIN? As far as scale, each Occurrence generally has between 1 and 20 OccurrenceAssignments and between 0 (mode) and 50 Requests, most of which have 1-2 Recipients. And we might be querying for up to several hundred Occurrence objects at a time, unlikely to be more than a few thousand.
Question 2: The real difficulty comes in persisting this mess when something changes. A change might involve multiple pieces, and we want to persist the whole thing together (as seen from the outside) so it's atomic--the whole thing gets persisted or it all fails.
At that point, I have to do things like
- Figure out what Request entities have been deleted (easy enough) and delete those. Update the ones that still exist and create the new ones.
- Figure out how the OccurrenceAssignments have changed and sync those as well.
That mostly feels like it needs to be in a loop--deleting is easy and can happen all at once, but updates and creation feel like they need to loop through the changed elements and create/update them individually, with possibly nested loops for the Requests and their Recipients. And that feels icky.
But the bottom line question is: should I even care about this? I don't have any good tools to know how much of a cost that is and whether it matters. Or even if the increased ugliness of maintaining a complex set of queries is worth it. I don't even have any good intuition on the matter.
Probably rambling and not making any sense, but...
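For what it's worth, the fetch-then-assemble pattern described above can be sketched in Python with sqlite3 (schema and data invented to match the entity names in the post; the interesting part is building one IN (...) list from the query-1 ids instead of issuing N child queries):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE occurrences (id INTEGER PRIMARY KEY, client_id INTEGER);
CREATE TABLE occurrence_assignments (id INTEGER PRIMARY KEY, occurrence_id INTEGER, worker TEXT);
INSERT INTO occurrences VALUES (1, 7), (2, 7), (3, 9);
INSERT INTO occurrence_assignments VALUES (10, 1, 'alice'), (11, 2, 'bob');
""")

# Query 1: the parent rows matching the pattern.
parents = {row[0]: {"id": row[0], "assignments": []}
           for row in con.execute("SELECT id FROM occurrences WHERE client_id = ?", (7,))}

# Query 2: all children for those ids in one round trip, via a generated IN list.
placeholders = ",".join("?" * len(parents))
children = con.execute(
    f"SELECT occurrence_id, worker FROM occurrence_assignments "
    f"WHERE occurrence_id IN ({placeholders})", tuple(parents)).fetchall()

# Assemble the mess in code, as described.
for occ_id, worker in children:
    parents[occ_id]["assignments"].append(worker)

print(parents)
```

Whether this beats one big JOIN depends on fan-out: the JOIN duplicates each parent row once per child combination, while this ships each row once at the cost of extra round trips.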
-
Without looking too closely at the fine grain of the task, being able to break it out into stages like Queries 1, 2, 3 suggests a CTE-based approach. All three parts go into the one statement (WITH interesting_occurrences AS (SELECT <fields> FROM occurrences WHERE <conditions>), interesting_assignments AS (SELECT <fields> FROM occurrence_assignments WHERE occurrence_id IN (SELECT occurrence_id FROM interesting_occurrences))..., etc.).
You can think of the earlier queries having their results stuffed into table-valued variables that the later queries then read from.
Depending on the DBMS, it may or may not have the smarts to propagate things like statistics and conditions between the different expressions.
Even in the absence of that, though, it does keep everything in one statement and avoids the back-and-forth between it and application code.
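A runnable sketch of that shape, using Python's sqlite3 (which supports WITH; the table and column names follow the earlier posts, and the data is made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE occurrences (id INTEGER PRIMARY KEY, client_id INTEGER, start_date TEXT);
CREATE TABLE occurrence_assignments (id INTEGER PRIMARY KEY, occurrence_id INTEGER, worker TEXT);
INSERT INTO occurrences VALUES (1, 7, '2023-01-01'), (2, 7, '2023-02-01'), (3, 9, '2023-01-15');
INSERT INTO occurrence_assignments VALUES (10, 1, 'alice'), (11, 1, 'bob'), (12, 3, 'carol');
""")

# Each stage is a named expression; later stages read from earlier ones.
rows = con.execute("""
WITH interesting_occurrences AS (
    SELECT id, client_id, start_date FROM occurrences WHERE client_id = 7
),
interesting_assignments AS (
    SELECT id, occurrence_id, worker FROM occurrence_assignments
    WHERE occurrence_id IN (SELECT id FROM interesting_occurrences)
)
SELECT occurrence_id, worker FROM interesting_assignments ORDER BY id
""").fetchall()
print(rows)  # [(1, 'alice'), (1, 'bob')]
```

Only client 7's occurrences (ids 1 and 2) feed the second expression, so carol (attached to occurrence 3) never appears.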
-
@Watson Looks like CTEs (the WITH clause) are only supported in MySQL 8+. Which we'll upgrade to...sometime after the heat death of the universe, likely.
But thanks for the idea--I think if we were using a "real" RDBMS that'd be the way to go.
Although I could use temp tables or possibly views...I'd have to investigate further on that.
-
@Benjamin-Hall said in Resources to learn about database schema design and optimization?:
@Watson Looks like CTEs (the WITH clause) are only supported in MySQL 8+. Which we'll upgrade to...sometime after the heat death of the universe, likely.
But thanks for the idea--I think if we were using a "real" RDBMS that'd be the way to go.
Although I could use temp tables or possibly views...I'd have to investigate further on that.
I keep forgetting; left MySQL & never looked back. 5.?? does at least do nested queries though, right? Inline the selects? It'd be a pig to read, though.
-
@Watson said in Resources to learn about database schema design and optimization?:
@Benjamin-Hall said in Resources to learn about database schema design and optimization?:
@Watson Looks like CTEs (the WITH clause) are only supported in MySQL 8+. Which we'll upgrade to...sometime after the heat death of the universe, likely.
But thanks for the idea--I think if we were using a "real" RDBMS that'd be the way to go.
Although I could use temp tables or possibly views...I'd have to investigate further on that.
I keep forgetting; left MySQL & never looked back. 5.?? does at least do nested queries though, right? Inline the selects? It'd be a pig to read, though.
I've been told that sub selects are super slow, especially if nested more than 1 deep. Which I'd have to do. But worth investigating.
-
@Benjamin-Hall said in Resources to learn about database schema design and optimization?:
But the bottom line question is: should I even care about this?
The first step is probably to ask how to avoid pulling back any information from the DB that you don't need, as one of the slowest bits with a database server is marshalling data over the network. (Data movement is almost invariably very slow in any real problem.) Apart from that, you want to end up pulling a single row for every entity you want to make client side, i.e., one per Occurrence, OccurrenceAssignment, CoverageRequest, and Recipient. You don't want to have to do lots of fancy client-side caching (you can do that, but you don't want to have to do it).
-
@dkf said in Resources to learn about database schema design and optimization?:
The first step is probably to ask how to avoid pulling back any information from the DB that you don't need, as one of the slowest bits with a database server is marshalling data over the network.
It also depends on latency. I have far more issues with that (especially in dev, where the DB is in some remote datacenter, unlike in prod, where everything is connected by fiber) and with having multiple round trips than I do with just pulling lots of records across the network.
The ORM often ends up doing this stuff and it kills performance. At which point I tend to flatten things out (if I can; when there are multiple collections per object you have to separate them somehow), sort them by the id or whatever, and then loop through them watching for changes and assembling objects as I go. Tedious, yes, but sometimes required to get decent performance.
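The sort-and-watch-for-changes assembly described above looks roughly like this (plain Python; the row shape is invented for illustration):

```python
# Rows from one flattened JOIN, already sorted by parent id.
rows = [
    (1, "alice"), (1, "bob"),   # parent 1 with two children
    (2, "carol"),               # parent 2 with one child
]

objects = []
current_id, current_children = None, None
for parent_id, child in rows:
    if parent_id != current_id:          # id changed: start a new parent object
        current_id, current_children = parent_id, []
        objects.append({"id": parent_id, "children": current_children})
    current_children.append(child)

print(objects)
# [{'id': 1, 'children': ['alice', 'bob']}, {'id': 2, 'children': ['carol']}]
```

One pass over the sorted rows, no per-parent queries; itertools.groupby does the same thing if you prefer it.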
-
An obvious one, if no one has mentioned it: don't SELECT * every query. Sometimes it can be the difference between 100kb coming over the wire and 20mb.
"But now I can just go row[row.length - 1] and row[row.length - 2]." No you fucking won't.
*edit: ah, already covered by @dkf, and more eloquently too.
-
@DogsB If you use SELECT * in a query, the order of columns you get back should be random. After all, if you cared you'd specify it…
-
@dkf I've only come across that since I started working with Oracle DB. MySQL (inb4 but Oracle owns MySQL now) and Postgres were more dependable in that regard. I'm convinced that every questionable behaviour by a DBMS probably traces back to weird behaviour observed in Oracle DB.
-
@DogsB said in Resources to learn about database schema design and optimization?:
Don't Select * every query
Don't SELECT * in any query used in your actual application. Aside from bringing back too much data, it's asking for future problems when columns get added or removed.
@DogsB said in Resources to learn about database schema design and optimization?:
I'm convinced that every questionable behaviour by dbms is probably about weird behaviour observed in oracle db.
They all have their weird behaviours. Oracle's nothing special in that regard.
-
@DogsB said in Resources to learn about database schema design and optimization?:
@dkf I've only come across that since I started working with Oracle DB. MySQL (inb4 but Oracle owns MySQL now) and Postgres were more dependable in that regard. I'm convinced that every questionable behaviour by a DBMS probably traces back to weird behaviour observed in Oracle DB.
I'm pretty sure you get them in definition order in both MySQL and Oracle.
@loopback0 said in Resources to learn about database schema design and optimization?:
Don't SELECT * in any query used in your actual application.
In the data processing layer, I'm pretty sure the only ones are in the 'dump all the contents of this table to file' utility. In the UI layer, it's all over the place, but then they also use name-based instead of position-based referencing.
-
@Benjamin-Hall said in Resources to learn about database schema design and optimization?:
sub selects are
... a sop to savage pig-newts who cannot be troubled to learn what a JOIN is. Some optimizers try to turn them into joins for you. This is a mistake. You should suffer through every scan you have specified.
-
@dkf said in Resources to learn about database schema design and optimization?:
@DogsB If you use SELECT * in a query, the order of columns you get back should be random. After all, if you cared you'd specify it…
You can use column names instead, so the random order does not matter. This is actually a good idea anyway; it makes refactoring less error-prone.
Then again, SELECT * is not a good practice in any case.
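A quick sketch of name-based access with Python's sqlite3.Row, so the column order a SELECT happens to return doesn't matter to the consuming code (table and columns invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.row_factory = sqlite3.Row   # rows become addressable by column name
con.execute("CREATE TABLE t (a INTEGER, b TEXT)")
con.execute("INSERT INTO t VALUES (1, 'x')")

# Columns deliberately reordered in the SELECT; name-based access is unaffected.
row = con.execute("SELECT b, a FROM t").fetchone()
print(row["a"], row["b"])  # 1 x
```

Positional access (row[0]) would have silently picked up the reordering; name-based access survives it, which is why it makes refactoring less error-prone.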
-
@Gribnit said in Resources to learn about database schema design and optimization?:
@Benjamin-Hall said in Resources to learn about database schema design and optimization?:
sub selects are
... a sop to savage pig-newts who cannot be troubled to learn what a JOIN is.
Eh, no. Both EXISTS and NOT EXISTS can be replaced by joins, but then the rows are polluted by bogus rows that need to be filtered on the client side (and columns to do this filtering by). Quite a PITA.
Non-shit optimizers try to turn them into joins for you. FTTFY
Seriously, this is a baseline for a query optimizer to be considered "not a toy".
-
@loopback0 said in Resources to learn about database schema design and optimization?:
@DogsB said in Resources to learn about database schema design and optimization?:
Don't Select * every query
Don't SELECT * in any query used in your actual application. Aside from bringing back too much data, it's asking for future problems when columns get added or removed.
I only use it when I'm wrapping a subquery or reusing a CTE, which means that I've already selected the columns that I want, and keeping all that in sync if something changes sucks, plus general principles.
-
I forgot to mention: Database Design for Mere Mortals. I also learned most of what I know about SQL from SQL for Mere Mortals. They haven't been updated in a while, though.
-
@djls45 said in Resources to learn about database schema design and optimization?:
@Benjamin-Hall I've found a fair amount of info just scanning through the questions and answers on the DBA section of StackExchange. If nothing else, they should be able to provide some links to other resources.
For SQL Server, yes.
For Oracle, no. There is a lot of bad advice for Oracle out there on StackExchange.
For MySQL, I don't know.
-
@PleegWat said in Resources to learn about database schema design and optimization?:
I'm pretty sure you get them in definition order in both MySQL and Oracle.
Yes you do, at least in Oracle. Unless you redefine the order, which is possible since 12c.
The real takeaway from this is: never rely on undocumented behaviour. If you do, things will break in a future release.
I remember some ancient Oracle upgrade (from 8i to 9i maybe, not sure) when suddenly records were no longer sorted by a GROUP BY without an explicit ORDER BY.
It was never documented that they would be, but it was widely used. And then it broke.