The S in SQL stands for Sadism

Serpentes

So... I maintain a series of legacy applications. Not quite MUMPS-level Legacy, but there are sure some interesting code choices. Much of it is ASP Classic code of dubious origin. There's a lot of SQL that often reads like it was procedurally generated, but I've come to know better.

Today, I get informed that there's a problem with one of the error audit utilities. This struck me as particularly bad news because, after years of this, I'd never seen an anything audit utility except the ones I created. As it turns out, there are a bunch of them, which run on a nightly-replicated copy of the database, and which I'd never had access to. Fun times. In any case, one of them, having something to do with inter-application data-sharing messages, has "suddenly" started hanging. I know there haven't been code or DB deployments for awhile, so I'm dubious of the timing. After a little inquisition, the guy running the reports admits this hadn't been run for 3 years, but they just failed a Best Practices check from the Auditing Department and so it is suddenly urgent again.

It doesn't take me long to realize why. Here's the problematic SQL query. This isn't a stored procedure, it's built on-the-fly in the code. A1 through A4 are ... magic numbers, so far as I can tell, the result of hundreds of lines of incomprehensible VBScript.

SELECT M.MsgID, MsgEntity, Sender, SentFor, SentAt, SUM(CASE WHEN (StatusType = 2) THEN 1 ELSE 0 END), SUM(CASE WHEN (StatusType = 11) THEN 1 ELSE 0 END), SUM(CASE WHEN (StatusType = 12) THEN 1 ELSE 0 END)
FROM dbo.Msgs AS M LEFT JOIN (SELECT MsgID, StatusType FROM dbo.MsgStatusLog WHERE StatusBy = A1) AS S ON M.MsgID = S.MsgID
WHERE ((RType = 0 AND Rec = A1) OR (RType = 2 AND Rec IN (A2, A3, A4)))
GROUP BY M.MsgID, MsgEntity, Sender, SentFor, SentAt
HAVING SUM(CASE WHEN (StatusType = 3) THEN 1 ELSE 0 END) = 0
ORDER BY SentAt DESC, M.MsgID DESC

Each attempt to run this tool executes this query ~10 times with varying values for A1, A2, A3, and A4, then does math on the results. For some reason. This apparently worked okay, 3 years ago, when there were a few hundred entries in the two tables. There are now a few hundred thousand in dbo.Msgs, and just shy of a million in dbo.MsgStatusLog. Somehow, this didn't scale well. I can't imagine why not. Purging records isn't an option, because no one seems to have any idea what the appropriate data retention policy would be. Of course.

Honestly, I'm probably going to wind up rebuilding this from first principles, because I don't have the damnedest clue what that SQL is intended to do, or how to unroll it into a less server-molesting form. If anyone wants to take a guess, um, please feel free!

MathNerdCNU

@Serpentes said:

SUM(CASE WHEN (StatusType = 2) THEN 1 ELSE 0 END), SUM(CASE WHEN (StatusType = 11) THEN 1 ELSE 0 END), SUM(CASE WHEN (StatusType = 12) THEN 1 ELSE 0 END)

How NOT to do a bit-mask in SQL.

You are in for a treat.

TwelveBaud

Get the message ID, entity, sender, sent for, and sent at for a set of messages, as well as a rollup of how many times StatusType was 2, 11, or 12, for {something involving A1-4}, but only if StatusType was never 3 for that message.

Eldelshell

Thanks for reminding me why I try to stay a way as much as possible from SQL... BTW the S would be sadistic as in Sadistic Query Language.

MathNerdCNU

I find SQL falls into what I call, the "Zen" languages.

Don't try to fight it, you will lose. Instead become the bamboo in the breeze. Go with the flow. Or use PL/SQL and hate life, anti-life and all forms of being.

Dragnslcr

I doubt it will help the performance much, but there appears to be absolutely no reason for that subquery to exist. You can change it to dbo.Msgs AS M LEFT JOIN dbo.MsgStatusLog AS S ON M.MsgID = S.MsgID and then move the StatusBy = A1 to the query's WHERE clause.

As for the query's performance, are there any indexes for the tables? An index on both tables' MsgID should help quite a bit with the join, and indexes on RType and Rec should also help.

Serpentes

All the tables from this era of development are exclusively indexed by whatever was designated their primary key. So Msgs is indexed by MsgID, but the other table is only indexed by some (as far as I can tell) useless StatusID value. I'll probably try to get that changed, but it typically requires getting the request notarized by Jesus to get the DBAs to touch this crap. Understandably so.

blakeyrat

@Serpentes said:

SUM(CASE WHEN (StatusType = 2) THEN 1 ELSE 0 END), SUM(CASE WHEN (StatusType = 11) THEN 1 ELSE 0 END), SUM(CASE WHEN (StatusType = 12) THEN 1 ELSE 0 END)

Jesus Christ, that alone is enough to send me running.

blakeyrat

@Dragnslcr said:

I doubt it will help the performance much, but there appears to be absolutely no reason for that subquery to exist.

That's (almost) universally true of all subqueries in all SQL ever.

Serpentes

Jesus Christ, that alone is enough to send me running.

If I could, I would.

After taking this reporting team to the thumbscrews, apparently the way this system works is by creating these message... things that are passed between part of the core software and a menagerie of first- and third-party applications. The StatusTypes aren't bitmasks. They're like running commentary on what the recipient applications have reported back to core as having done with the message thing (or not done, in some cases; the same process is used to record what more enlightened software would call an "Exception").

In a moderately-tolerable world, there would be documentation of these, but this is clearly not that world. The table that was intended to provide the definitions doesn't exist, and probably never did. I'm fairly sure that StatusType 2 is something like "acknowledged". I know for a fact that StatusType 3 is closed or disregarded or done or something of that nature; once a target application submits StatusType 3 for a given message, it never sees the message packet again. But that's why there's all the A1-A4 nonsense; single messages may (or may not...) go to multiple applications. So one may have dealt with it and killfiled it before another one even acknowledges it, and those updates are tracked separately. Hell if I know what StatusType 11 and 12 are, or why the other statuses aren't handled here (SELECT DISTINCT tells me there are 43 in total).

Honestly, I cannot believe that any part of this system functions.

dkf

@Serpentes said:

there are 43 in total

A state graph with 43 nodes? Oh, I'm so sorry for you.

Serpentes

A day later, this has gotten much, much worse. Because that first query isn't by any means the most sanity-wrenching option. This, right here, is why we cannot have nice things:

SELECT M.MsgID, MsgEntity, Sender, SentFor, SentAt, SUM(CASE WHEN StatusType = 2 THEN 1 ELSE 0 END), SUM(CASE WHEN (StatusType = 11 AND StatusBy = A1) THEN 1 ELSE 0 END), SUM(CASE WHEN (StatusType = 12 AND StatusBy = A1) THEN 1 ELSE 0 END)
FROM dbo.Msgs AS M LEFT JOIN dbo.MsgStatusLog AS S ON M.MsgID = S.MsgID
WHERE (RType = 1 AND Rec = B1)
GROUP BY M.MsgID, Parent, Sender, SentFor, RType, Rec, MsgEntity, Msg, SentAt
HAVING SUM(CASE WHEN StatusType = 3 THEN 1 ELSE 0 END) = 0
ORDER BY SentAt DESC, MsgID DESC

I'm sure everyone will enjoy that the SUM/CASE "counting" method now includes two conditional checks because, hey, why not, right? But there's a hidden horror here that has my vote for greatest source of trauma. See that "Msg" field in the GROUP BY clause? Yeah. That's a LONGTEXT.

I'm no expert on SQL optimization, but I'm pretty sure you're not supposed to GROUP BY a LONGTEXT field.

The only good news I've had is that most of those 43 states are only used once, in really old data. All that seems to exist in any numbers are 2, 3, 4, 11, 12, and 99. I'm fairly sure that 2 is "acknowledged by destination system", 3 is "completed by destination system", 4 is something like "terminated by core system", and 99 is an unhandled exception. 11 and 12 are ... also numbers. Fun fact: status code 1 was apparently never used.

TwelveBaud

@Serpentes said:

I'm no expert on SQL optimization, but I'm pretty sure you're not supposed to GROUP BY a LONGTEXT field.

You're not, but there's no other way to include it in a SELECT that also has aggregations (SUM(), for example). Of course, if your SQL server supports subqueries in a JOIN, you could just GROUP BY the MsgID in the inner query and do all your summation there, and then use the outer query to attach all the other fields (like the full text of the message).

marczellm

@Serpentes said:

Fun fact: status code 1 was apparently never used.

That one thing makes complete sense. Status code 1 would mean that everything is OK.

Serpentes

You're not, but there's no other way to include it in a SELECT that also
has aggregations (SUM(), for example). Of course, if your SQL server
supports subqueries in a JOIN, you could just GROUP BY the MsgID in the
inner query and do all your summation there, and then use the outer
query to attach all the other fields (like the full text of the
message).

Sure, I'll give you that. Of course, note that this query doesn't actually select the Msg. It just groups by it...

Jaime

@TwelveBaud said:

You're not, but there's no other way to include it in a SELECT that also has aggregations (SUM(), for example).

Yes you can. Include it in the SELECT clause as MAX(Msg). It's "less clumsy" to read than putting it in the GROUP BY clause.

Weng

@Serpentes said:

but it typically requires getting the request notarized by Jesus to get the DBAs to touch this crap. Understandably so

I never have known why DBAs are so fucking afraid of index changes in WORM databases.

Yes,they consume storage and slightly increase I/O time on write. But they let us do queries that don't crush the CPU, RAM and TempDB. Which we run routinely.

antiquarian

@MathNerdCNU said:

Or use PL/SQL and hate life, anti-life and all forms of being.

That also applies to TRANSACT-SQL.

PleegWat

@Serpentes said:

I'm no expert on SQL optimization, but I'm pretty sure you're not supposed to GROUP BY a LONGTEXT field.

It may be appropriate to move the group by and aggregations into the subquery? Something like

SELECT
    msgID,
    SUM(CASE WHEN StatusType = 2 THEN 1 ELSE 0 END) twos,
    SUM(CASE WHEN (StatusType = 11 AND StatusBy = A1) THEN 1 ELSE 0 END) elevens,
    SUM(CASE WHEN (StatusType = 12 AND StatusBy = A1) THEN 1 ELSE 0 END) twelves
FROM dbo.MsgStatusLog
GROUP BY msgID
HAVING SUM(CASE WHEN StatusType = 3 THEN 1 ELSE 0 END) = 0

Then join Msg to the result of the subquery. You may also elect to select threes here and filter on it in the where clause of the outer query. Mainly a question of what reads better.

This may help significantly if the query optimizer is not smart enough to realize it can do the groupby before the join.

blakeyrat

You can checksum the LONGTEXT, do the GROUP BY on that, then do the string comparison GROUP BY on the smaller set of intersecting checksums.

This is how I've done fast URL compares in SQL in the past. The checksums were materialized in the table.

Serpentes

Okay. It's possible that these queries have simply driven me insane.

Is there any reason why I can't just remove reference to that LONGTEXT entirely? It's not selected. It's not referenced by the WHERE clause or the HAVING clause, or anything else except that it's in the laundry-list of GROUP BY fields. But the query is already grouping by MsgID, which is a unique identifier... so there's no way that grouping by anything else as well could bundle the data differently, right?

Also, after having run a ton of these... as far as I can tell, the results of those summation operations are always either 0 or 1. This query is literally doing SELECT SUM(CASE WHEN (argle bargle) THEN 1 ELSE 0 END) to test the existence or nonexistence of specific single rows of data in MsgStatusLog. I'm not sure what the best way to do this is, but I'm pretty sure this way needs to be lit on fire, along with its creator.

PleegWat

That group by includes the primary key of the Msg table, so it's only grouping MsgStatusLog, which is why I suggested moving the groupby into that subquery.

You could COUNT() instead of SUM() (and put NULL in the else), but that doesn't matter much for readability. You could do something like case when exists (select 1 from MsgStatusLog s where m.MsgID = s.MsgID and s.msgStatus = 11) then 1 else 0 end, but that doesn't gain you anything in readability and probably not in speed either since the naive plan joins the status table 4 times.

TwelveBaud

@Serpentes said:

It's not selected. It's not referenced by the WHERE clause or the HAVING clause, or anything else except that it's in the laundry-list of GROUP BY fields. But the query is already grouping by MsgID, which is a unique identifier...

Yup, it's useless. Nuke it.@Serpentes said:

as far as I can tell, the results of those summation operations are always either 0 or 1. This query is literally doing SELECT SUM(CASE WHEN (argle bargle) THEN 1 ELSE 0 END) to test the existence or nonexistence of specific single rows of data in MsgStatusLog. I'm not sure what the best way to do this is, but I'm pretty sure this way needs to be lit on fire, along with its creator.

One like is not enough.

RTapeLoadingError

@Weng said:

@Serpentes said:
I'll probably try to get that changed, but it typically requires getting the request notarized by Jesus to get the DBAs to touch this crap. Understandably so.

I never have known why DBAs are so fucking afraid of index changes in WORM databases.

Yes,they consume storage and slightly increase I/O time on write. But they let us do queries that don't crush the CPU, RAM and TempDB. Which we run routinely.

Speaking as a DBA (but not for all DBAs)
As a DBA you shouldn't be afraid of index changes, whatever the read/write profile of the database.

It is important, however, to approach this task with definite goals in terms of how much faster you are trying to make certain queries as well as being aware of the implication for other activities such as writes, index maintenance jobs and backups.

But, you're dead right. In a DB with many more reads than writes (e.g. reporting database) there's a strong case for throwing as many indexes at it as it needs.