<strong>GROUP BY ALL THE THINGS!</strong>

Groaner

In maintaining stored procedures in the codebase I work on, I frequently encounter this (anti)pattern:

select c.ID as CategoryID,
  c.Name as CategoryName,
  p.ID as ProductID,
  p.Name as ProductName,
  p.Description as ProductDescription,
  pv.ID as ProductVersionID,
  pv.Name as ProductVersionName,
  avg(pv.Price) as ProductAveragePrice
from dbo.tblCategories c
inner join dbo.tblProducts p on
(p.CategoryID = c.ID)
inner join dbo.tblProductVersions pv on
(pv.ProductID = p.ID)
group by  c.ID,
  c.Name,
  p.ID,
  p.Name,
  p.Description,
  pv.ID,
  pv.Name

This results in a pretty wide grouping key, which is going to make aggregating the results a needlessly expensive operation. Besides, when you have the unique key for a table as part of a group by, any other subset of columns in the same table aren't going to be any more unique. So one optimization is to reduce the group by to the minimal set of columns needed for a unique tuple, and then slapping aggregates on the rest of the columns, like so:

select c.ID as CategoryID,
  max(c.Name) as CategoryName,
  p.ID as ProductID,
  max(p.Name) as ProductName,
  max(p.Description) as ProductDescription,
  pv.ID as ProductVersionID,
  max(pv.Name) as ProductVersionName,
  avg(pv.Price) as ProductAveragePrice
from dbo.tblCategories c
inner join dbo.tblProducts p on
(p.CategoryID = c.ID)
inner join dbo.tblProductVersions pv on
(pv.ProductID = p.ID)
group by c.ID,
  p.ID,
  pv.ID

This is slightly better, but we can optimize further by taking advantage of the fact that p.ID will be constant for any single row in tblProductVersions, and c.ID will be constant for any single row in tblProducts. This gets us to here:

select max(c.ID) as CategoryID,
  max(c.Name) as CategoryName,
  max(p.ID) as ProductID,
  max(p.Name) as ProductName,
  max(p.Description) as ProductDescription,
  pv.ID as ProductVersionID,
  max(pv.Name) as ProductVersionName,
  avg(pv.Price) as ProductAveragePrice
from dbo.tblCategories c
inner join dbo.tblProducts p on
(p.CategoryID = c.ID)
inner join dbo.tblProductVersions pv on
(pv.ProductID = p.ID)
group by pv.ID

But wait... the only reason there was a group by to begin with was avg(pv.Price), and the fact that we are grouping by pv.ID means that avg(pv.Price) = pv.Price, and the original query optimizes to this:

select c.ID as CategoryID,
  c.Name as CategoryName,
  p.ID as ProductID,
  p.Name as ProductName,
  p.Description as ProductDescription,
  pv.ID as ProductVersionID,
  pv.Name as ProductVersionName,
  pv.Price as ProductVersionAveragePrice
from dbo.tblCategories c
inner join dbo.tblProducts p on
(p.CategoryID = c.ID)
inner join dbo.tblProductVersions pv on
(pv.ProductID = p.ID)

Cargo culting at its finest!

wft

SQL standard actually demands all non-aggregate columns be mentioned in the GROUP BY clause. Which, given ambiguities otherwise, is a sound idea.

ENOWTF

ben_lubar

They saw an error message, and instead of fixing the root problem, they fixed what the error message said.

If that's not what this site is for, I don't know what is.

wft

This site is about lojban, microaggressions, and got bashing. What else?

Kuro

@wft said:

and goat bashing

FTFY

Filed Under: Those damn s should get off my lawn!

loopback0

@Groaner said:

means that avg(pv.Price) = pv.Price

Huh? How does that work?

ben_lubar

You forgot something.

ben_lubar

What's the average age of @loopback0 right now?

loopback0

Yeah nevermind I see it now. I read it as grouping on p.Id.

RandomStranger

In the application I am currently maintaining, I see a lot of queries of the type

SELECT DISTINCT Id, {more columns} FROM {some table}

... and yes, Id is the primary key. Sure, I suppose the query optimizer is most likely smart enough to just ignore the "DISTINCT" keyword when you select a set of columns including one that is guaranteed to be unique by definition - it just bothers me seeing this used everywhere with the perpetrator apparently never really understanding what he was doing.

F(a)iled under: You keep using that keyword. I don't think it means what you think it means.

Maciejasjmj

@wft said:

SQL standard actually demands all non-aggregate columns be mentioned in the GROUP BY clause.

TRWTF is doing this and not having a proper WHATEVER() (or even CRASHONAMBIGUITY()) aggregate. Using MAX() just looks horrible and makes no sense.

wft

They'd better use window functions... Is there anything like that in the SQL server land?

Maciejasjmj

You mean things like OVER() PARTITION BY? It's there, but it's not standard SQL, and judging by my codebases not that many people know either about them or how to use them properly.

Vault_Dweller

@RandomStranger said:

SELECT DISTINCT

This is a perfect example of people fixing the error message and not the underlying problem (and I see it a lot).
Dev: "My query returns duplicate rows. Let's use distinct to get rid of them"
Dev's conscious: "But if your query returns duplicate rows, doesn't that mean that there is a more fundamental flaw with your query (wrong join condition, etc.)?"
Dev: "What was that? Oh, just the wind blowing. Here is your report, Mr. Manager!"

Jaloopa

@RandomStranger said:

SELECT DISTINCT

@Groaner said:

group by

You haven't until you've seen both of these in a single query. Usually with some actual aggregates but not always

Jaloopa

I've just looked through the 65kLOC SQL file which contains every view, stored proc, etc. that this application uses. In 233 matches for GROUP BY, 110 are in a SELECT DISTINCT, 88 in a plain SELECT, the rest in commented out code and actual comments.

Some of the comments are along the lines of --Removed pointless grouping. Every single one of these is still a SELECT DISTINCT. Some of the SELECT DISTINCTs are in UNIONs, so there are three levels of removing duplicates

Vault_Dweller

@Jaloopa said:

three levels of removing duplicates

And if it is anything like some of our clients' queries, there will still be duplicates due to some field values being different

Jaloopa

I haven't seen any dupes survive the distinct, unionised groups, but I can tell you that there will be more levels of grouping and distincting because these views love to call each other, in a hierarchy that reaches 15 levels of nested views at some points.

loopback0

@Jaloopa said:

so there are three levels of removing duplicates

SO DAWG I HEARD YOU LIKE REMOVING DUPLICATES SO I REMOVED DUPLICATES FROM YOUR DISTINCT VALUES SO YOU CAN REMOVE DUPLICATES WHILE YOU DISTINCT VALUES.