Databases use indexes for a reason...

Mason_Wheeler

Me: "Hey Architect Guy, how bad would it be to add a new index to TBL_RBT?" (TBL_RBT is a Really Big Table in the database, holding some core information. Tens of millions of records of it, on our larger clients.)

Architect Guy: "Very bad. You add a new index, then every time the table is updated it will slow things down a little because it has to update the index. Doesn't matter much for minor tables, but the way everything's always hitting RBT..."

Me: "Yeah, that's what I thought. So take a look at this. Can you think of a better way to make this search not take 4 minutes?" I enter some search data into the UI, hit the button... and wait. And wait. And wait. And wait... well, you get the point. Any other module in the program would have brought back this much information in a matter of seconds. This one, well, didn't.

Architect Guy: "Wow. What's it doing?"

So I show him. SQL's good a lot of things, but trees aren't one of them. This module reads two levels of data, an overview level and detail level, and all the details are stored in TBL_RBT. You can probably guess where this is going. It first fired off a query to get all the top-level objects. Then it ran a loop for each top-level object, querying against TBL_RBT. The SQL contained a single column in the WHERE clause, on a non-indexed column value, meaning a full table scan for each iteration of the loop. A typical request from the UI would bring back anywhere from 20 to a few hundred top-level results.

A quick call to SVN Blame and we find the culprit: this code was checked in by a member of our outsourcing team from Elbonia. The guys that got hired to save costs by giving them minor tasks to take care of at low rates while the real team does the detail-oriented "craftsmanship" work. Well, apparently one of the details that they couldn't bother familiarizing themselves with was the difference between database and RAM. Something like this might have been acceptable (maybe!) for a query against a local cache, but the DB lives on a server. There's a non-trivial level of overhead to accessing it, overhead which grows exponentially when you don't bother adding indexed columns into your WHERE clause. And a value corresponding to the clustered index for TBL_RBT was even one of the required parameters in the UI search!

So once Architect Guy had taken in the full scope of this database disaster, he and I went to the boss to ask what to do with the mess. Boss's reaction was awesome. "Who checked this code in?" he asked. "He's gone." We may have coding WTFs here, but I've got no complaints about management!

I gave him the name. Unfortunately, (or fortunately, depending on how you look at it,) that particular developer is no longer with the Elbonian team, so he couldn't fire him. I added a second parameter to the query, based on the clustered index data already available, and the loading time fell from 4 minutes to about 3 seconds.

Ben L.

So you complained about the lack of indexes, had your idea shot down, and then fixed one line of code that was made by an employee that no longer worked with your company?

Mason_Wheeler

@Ben L. said:

So you complained about the lack of indexes, had your idea shot down, and then fixed one line of code that was made by an employee that no longer worked with your company?

Not exactly. I wasn't completely familiar with this table, so I asked about the obvious solution, even though I figured I knew what the answer would probably be. Turned out I was right. But what I really wanted was to know if there was a better way to fix this, and it turns out there was.

And unfortunately the fix required significnatly more than one line of code. To make a long story short, we rolled our own ORM back before the term existed, or at least before it really caught on. It can be a bit clunky to work with sometimes, but it's in place and it works and it drives an industry-leading product, so we don't mess with it.

blakeyrat

I'm going to bring up three words: "left outer join".

Hopefully you can fill in the blanks.

Mason_Wheeler

@blakeyrat said:

I'm going to bring up three words: "left outer join".
Hopefully you can fill in the blanks.

Yeah, if only it were that simple. See "crazy homebrew ORM", above.

notromda

@Mason Wheeler said:

SQL's good a lot of things, but trees aren't one of them

Ahem. look up the concept of a nested set. Works quite well, thank you very much.

Lorne Kates

@notromda said:

@Mason Wheeler said:
SQL's good a lot of things, but trees aren't one of them

Ahem. look up the concept of a nested set. Works quite well, thank you very much.

Great article about how to implement a nested set in SQL:

http://dev.mysql.com/tech-resources/articles/hierarchical-data.html

Any time I see a column called "parent_id", I point the developer to this article.

(Often, they say "I don't get it", and write a janky multi-query loop anyways)

Shinhan

I didnt know about the Nested Sets, and the link is a great explanation. Thanks.

henke37

It's an interesting theory. But I fear that it is trading selection performance for manipulation performance.

But hopefully the tradeoff is more than worth it.

dhromed

@Lorne Kates said:

this article.

Good.

Read now.

Thx.

token_woman

@Mason Wheeler said:

This module reads two levels of data, an overview level and detail level, and all the details are stored in TBL_RBT. You can probably guess where this is going. It first fired off a query to get all the top-level objects. Then it ran a loop for each top-level object, querying against TBL_RBT.

This is really common. In fact it's (shamefully) pretty normal in the ORM world. I think even ActiveRecord does something like this or at least did until someone fixed it.

I think it comes from trying to do too much in the application code, from a fear of putting any logic into the SQL. People are developing db-driven applications who don't know or want to know anything about SQL. I think they believe it's difficult. In fact it's orders of magnitude easier than any other sort of code, if you've studied 3 pages of elementary set theory ... which is another thing people seem to believe is difficult. There was a time when programmers were supposed to know some maths.

galgorah

@token_woman said:

@Mason Wheeler said:

This module reads two levels of data, an overview level and detail level, and all the details are stored in TBL_RBT. You can probably guess where this is going. It first fired off a query to get all the top-level objects. Then it ran a loop for each top-level object, querying against TBL_RBT.

This is really common. In fact it's (shamefully) pretty normal in the ORM world. I think even ActiveRecord does something like this or at least did until someone fixed it.

I think it comes from trying to do too much in the application code, from a fear of putting any logic into the SQL. People are developing db-driven applications who don't know or want to know anything about SQL. I think they believe it's difficult. In fact it's orders of magnitude easier than any other sort of code, if you've studied 3 pages of elementary set theory ... which is another thing people seem to believe is difficult. There was a time when programmers were supposed to know some maths.

I Don't think the issue is neccesarily ORM's as a tool. ORM's are great in certain situations. But you need to know when you should move certain logic to the DB itself. Many ORM's will allow you to execute stored procedures for example. Unfortunately I think your right that many developers shy away from sql. You're right about it being rather easy to pick up, but many devs don't want to take the time to learn or for whatever reason think it should be the sole responsibility of DBA's.

My Current company actually brought me in because of my sql experience and my experience as a Team Lead and Architect. My Job is not only to dig in and help write code, but also to teach the more junior members. I am currently also in the process of preparing to design and manage the rebuild from scratch of a massive web app written back in the classic asp days of yore. We're updating it to .net 4.0. I assure you there are many WTF's in the codebase but here management actually recognizes it, hence agreeing to the rewrite.

When I started here there were no indexes, except for 3 tables and those tables had an index for every single column. The tables themselves are often gigs in size, most of which is not even accessed that often. I worked with the dba's to reevaluate the indexes and to partition the tables. Most of this I was teaching them. They really are server admins. The company as of yet has no true DBA. This will be changing.

One of the best WTF's so far was a guy who who started and left 3 months before I was hired. He was a big fan of cursors. He would nest them 4-5 layers deep. He also loved query hints. he'd put NOLOCK on every query....

@token_woman said:

@Mason Wheeler said:
This module reads two levels of data, an overview level and detail level, and all the details are stored in TBL_RBT. You can probably guess where this is going. It first fired off a query to get all the top-level objects. Then it ran a loop for each top-level object, querying against TBL_RBT.

This is really common. In fact it's (shamefully) pretty normal in the ORM world. I think even ActiveRecord does something like this or at least did until someone fixed it.
I think it comes from trying to do too much in the application code, from a fear of putting any logic into the SQL. People are developing db-driven applications who don't know or want to know anything about SQL. I think they believe it's difficult. In fact it's orders of magnitude easier than any other sort of code, if you've studied 3 pages of elementary set theory ... which is another thing people seem to believe is difficult. There was a time when programmers were supposed to know some maths.

I don't think it's always because developers don't like SQL. I think it's often because developers don't like having to deploy changes to two environments, when you have to get separate change requests for each environment, signed by about 10 different people, and you have to organize deployment meetings with two different operations teams. From a change management perspective (at least in a bloated organization like mine), it's easier to just deploy the database stuff once, and then when you find a bug in your data access, you just change your ORM mappings or application logic.

I would personally rather do everything in stored procedures, but it's a mess when you have to work with other people, or adhere to a process.

token_woman

@galgorah said:

I Don't think the issue is neccesarily ORM's as a tool. ORM's are great in certain situations. But you need to know when you should move certain logic to the DB itself.

I totally agree. I didn't mean to suggest ORMs are bad. Actually right now I'm enjoying getting to know php-activerecord (all together now, TRWTF is .... ok, that's out of the way) because I've been impressed with Rails but have to use php at work. Turns our php-AR is a really impressive bit of code. Pretty new and as far as I can tell not that widely deployed but so far I think it is evidence that it is possible to write intelligent PHP.

CnC

Fear of moving logic to SQL or database procedure languages seems to be a common performance issue. Really companies need to just have a database team or guy who knows SQL and whatever appropriate language (IE PL/SQL) do all the data work and give the C#/JAVA/Whatever developers the proc or function calls to access it.

When the "Enterprise Shared Platform" (yes it was named that, ugg) landed on SVN I got a great fun dose of how little C# devs like using complex SQL. Insert 100 rows? Why you use 100 discrete inserts in a begin/end block and that makes it a "set operation".

db2

@Lorne Kates said:

Great article about how to implement a nested set in SQL:

MySQL :: Developer Zone

Any time I see a column called "parent_id", I point the developer to this article.
(Often, they say "I don't get it", and write a janky multi-query loop anyways)

Looked neat, until they got to the part about adding data.

UPDATE nested_category SET rgt = rgt + 2 WHERE rgt > @myRight;
UPDATE nested_category SET lft = lft + 2 WHERE lft > @myRight;

Concurrency? Why would we need that?

This might be nice for some kind of static data warehouse, but it would be complete shit for any sizable OLTP workload.

Zemm

@db2 said:

Looked neat, until they got to the part about adding data.
UPDATE nested_category SET rgt = rgt + 2 WHERE rgt > @myRight;
UPDATE nested_category SET lft = lft + 2 WHERE lft > @myRight;
Concurrency? Why would we need that?

They did lock the table for writing before doing this. However, this is actual code from my implementation of the above:

UPDATE categories SET ls = IF(ls >= @insert_right, ls+2, ls) , rs = IF (rs >= @insert_right, rs+2, rs) WHERE rs >= @insert_right

Not sure if it is better or worse? :)

Scarlet_Manuka

@Lorne Kates said:

Great article about how to implement a nested set in SQL:

http://dev.mysql.com/tech-resources/articles/hierarchical-data.html

Any time I see a column called "parent_id", I point the developer to this article.

I read it. But I think I shall continue to use CONNECT BY PRIOR.

toshir0

@Lorne Kates said:

Great article about how to implement a nested set in SQL:
http://dev.mysql.com/tech-resources/articles/hierarchical-data.html

Thanks, very good one.

Reminds me of this one (in french) that I read a few years ago :

http://sqlpro.developpez.com/cours/sqlserver/cte-recursives/

db2

@Zemm said:

@db2 said:
Looked neat, until they got to the part about adding data.
UPDATE nested_category SET rgt = rgt + 2 WHERE rgt > @myRight;
UPDATE nested_category SET lft = lft + 2 WHERE lft > @myRight;
Concurrency? Why would we need that?
They did lock the table for writing before doing this. However, this is actual code from my implementation of the above:
UPDATE categories SET ls = IF(ls >= @insert_right, ls+2, ls) , rs = IF (rs >= @insert_right, rs+2, rs) WHERE rs >= @insert_right
Not sure if it is better or worse? :)

Either implementation requires you to, on average, update 50% of the rows in the table for a simple insert/delete. That is NOT conducive to write performance, particularly if the table has "tens of millions" of rows in it, like in the original post. Plus you still need an index on your lft/rgt columns anyway to handle lookups, so you're dealing with that impacting the write performance regardless. Unless your database is close to 100% read-oriented, you're probably better off with the old fashioned adjacency list approach, and whatever recursive querying your database platform supports ('common table expressions' in SQL Server, for example).

nexekho1

And while you're all in a fluster about how you store a few measly strings, we game devs are pushing 400000 polygons to the screen sixty times per second alongside AI, physics and lighting.

(/troll)

boomzilla

@nexekho said:

And while you're all in a fluster about how you store a few measly strings, we game devs are pushing 400000 polygons to the screen sixty times per second alongside AI, physics and lighting.

Well, sure, if we could just throw away the rows once we'd written them like you do the polygons...

dhromed

@boomzilla said:

Well, sure, if we could just throw away the rows once we'd written them like you do the polygons...

That would be awesome.

danixdefcon5

@ShatteredArm said:

@token_woman said:

@Mason Wheeler said:

This module reads two levels of data, an overview level and detail level, and all the details are stored in TBL_RBT. You can probably guess where this is going. It first fired off a query to get all the top-level objects. Then it ran a loop for each top-level object, querying against TBL_RBT.

This is really common. In fact it's (shamefully) pretty normal in the ORM world. I think even ActiveRecord does something like this or at least did until someone fixed it.

I think it comes from trying to do too much in the application code, from a fear of putting any logic into the SQL. People are developing db-driven applications who don't know or want to know anything about SQL. I think they believe it's difficult. In fact it's orders of magnitude easier than any other sort of code, if you've studied 3 pages of elementary set theory ... which is another thing people seem to believe is difficult. There was a time when programmers were supposed to know some maths.

I don't think it's always because developers don't like SQL. I think it's often because developers don't like having to deploy changes to two environments, when you have to get separate change requests for each environment, signed by about 10 different people, and you have to organize deployment meetings with two different operations teams. From a change management perspective (at least in a bloated organization like mine), it's easier to just deploy the database stuff once, and then when you find a bug in your data access, you just change your ORM mappings or application logic.

I would personally rather do everything in stored procedures, but it's a mess when you have to work with other people, or adhere to a process.

When I was at college, the reasoning of doing most stuff in Java was to keep with the 3-layer model; keep all the Business Logic in the BL layer, which meant the J2EE container. SPs were avoided not because of SQL-averse devs, but because it would translate into vendor lock-in.

I currently mix Business Logic in the app itself with SPs in the DBMS side, SPs mostly used for data stuff too complex to lay on the app side, or for big queries that fetch a lot of stuff from different tables (like the big-ass UNION query I posted somewhere else).

blakeyrat

@danixdefcon5 said:

When I was at college, the reasoning of doing most stuff in Java was to keep with the 3-layer model; keep all the Business Logic in the BL layer, which meant the J2EE container. SPs were avoided not because of SQL-averse devs, but because it would translate into vendor lock-in.

But... if it's in Java, you already have vendor lock-in.

token_woman

@danixdefcon5 said:

When I was at college, the reasoning of doing most stuff in Java was to keep with the 3-layer model; keep all the Business Logic in the BS layer, which meant the J2EE container.

FTFY