<strong>GROUP BY ALL THE THINGS!</strong>



  • In maintaining stored procedures in the codebase I work on, I frequently encounter this (anti)pattern:

    select c.ID as CategoryID,
      c.Name as CategoryName,
      p.ID as ProductID,
      p.Name as ProductName,
      p.Description as ProductDescription,
      pv.ID as ProductVersionID,
      pv.Name as ProductVersionName,
      avg(pv.Price) as ProductAveragePrice
    from dbo.tblCategories c
    inner join dbo.tblProducts p on
    (p.CategoryID = c.ID)
    inner join dbo.tblProductVersions pv on
    (pv.ProductID = p.ID)
    group by  c.ID,
      c.Name,
      p.ID,
      p.Name,
      p.Description,
      pv.ID,
      pv.Name
    

    This results in a pretty wide grouping key, which is going to make aggregating the results a needlessly expensive operation. Besides, when you have the unique key for a table as part of a group by, any other subset of columns in the same table aren't going to be any more unique. So one optimization is to reduce the group by to the minimal set of columns needed for a unique tuple, and then slapping aggregates on the rest of the columns, like so:

    select c.ID as CategoryID,
      max(c.Name) as CategoryName,
      p.ID as ProductID,
      max(p.Name) as ProductName,
      max(p.Description) as ProductDescription,
      pv.ID as ProductVersionID,
      max(pv.Name) as ProductVersionName,
      avg(pv.Price) as ProductAveragePrice
    from dbo.tblCategories c
    inner join dbo.tblProducts p on
    (p.CategoryID = c.ID)
    inner join dbo.tblProductVersions pv on
    (pv.ProductID = p.ID)
    group by c.ID,
      p.ID,
      pv.ID
    

    This is slightly better, but we can optimize further by taking advantage of the fact that p.ID will be constant for any single row in tblProductVersions, and c.ID will be constant for any single row in tblProducts. This gets us to here:

    select max(c.ID) as CategoryID,
      max(c.Name) as CategoryName,
      max(p.ID) as ProductID,
      max(p.Name) as ProductName,
      max(p.Description) as ProductDescription,
      pv.ID as ProductVersionID,
      max(pv.Name) as ProductVersionName,
      avg(pv.Price) as ProductAveragePrice
    from dbo.tblCategories c
    inner join dbo.tblProducts p on
    (p.CategoryID = c.ID)
    inner join dbo.tblProductVersions pv on
    (pv.ProductID = p.ID)
    group by pv.ID
    

    But wait... the only reason there was a group by to begin with was avg(pv.Price), and the fact that we are grouping by pv.ID means that avg(pv.Price) = pv.Price, and the original query optimizes to this:

    select c.ID as CategoryID,
      c.Name as CategoryName,
      p.ID as ProductID,
      p.Name as ProductName,
      p.Description as ProductDescription,
      pv.ID as ProductVersionID,
      pv.Name as ProductVersionName,
      pv.Price as ProductVersionAveragePrice
    from dbo.tblCategories c
    inner join dbo.tblProducts p on
    (p.CategoryID = c.ID)
    inner join dbo.tblProductVersions pv on
    (pv.ProductID = p.ID)
    

    :wtf:

    Cargo culting at its finest!



  • SQL standard actually demands all non-aggregate columns be mentioned in the GROUP BY clause. Which, given ambiguities otherwise, is a sound idea.

    ENOWTF



  • They saw an error message, and instead of fixing the root problem, they fixed what the error message said.

    If that's not what this site is for, I don't know what is.



  • This site is about lojban, microaggressions, and got bashing. What else?


  • Winner of the 2016 Presidential Election

    @wft said:

    and goat bashing

    FTFY

    Filed Under: Those damn :goat:s should get off my lawn!



  • @Groaner said:

    means that avg(pv.Price) = pv.Price

    Huh? How does that work?



  • You forgot something.



  • What's the average age of @loopback0 right now?



  • Yeah nevermind I see it now. I read it as grouping on p.Id.



  • In the application I am currently maintaining, I see a lot of queries of the type

    SELECT DISTINCT Id, {more columns} FROM {some table}

    ... and yes, Id is the primary key. Sure, I suppose the query optimizer is most likely smart enough to just ignore the "DISTINCT" keyword when you select a set of columns including one that is guaranteed to be unique by definition - it just bothers me seeing this used everywhere with the perpetrator apparently never really understanding what he was doing.


    F(a)iled under: You keep using that keyword. I don't think it means what you think it means.



  • @wft said:

    SQL standard actually demands all non-aggregate columns be mentioned in the GROUP BY clause.

    TRWTF is doing this and not having a proper WHATEVER() (or even CRASHONAMBIGUITY()) aggregate. Using MAX() just looks horrible and makes no sense.



  • They'd better use window functions... Is there anything like that in the SQL server land?



  • You mean things like OVER() PARTITION BY? It's there, but it's not standard SQL, and judging by my codebases not that many people know either about them or how to use them properly.


  • Notification Spam Recipient

    @RandomStranger said:

    SELECT DISTINCT

    This is a perfect example of people fixing the error message and not the underlying problem (and I see it a lot).
    Dev: "My query returns duplicate rows. Let's use distinct to get rid of them"
    Dev's conscious: "But if your query returns duplicate rows, doesn't that mean that there is a more fundamental flaw with your query (wrong join condition, etc.)?"
    Dev: "What was that? Oh, just the wind blowing. Here is your report, Mr. Manager!"


  • BINNED

    @RandomStranger said:

    SELECT DISTINCT

    @Groaner said:
    group by

    You haven't :headdesk: until you've seen both of these in a single query. Usually with some actual aggregates but not always


  • BINNED

    I've just looked through the 65kLOC SQL file which contains every view, stored proc, etc. that this application uses. In 233 matches for GROUP BY, 110 are in a SELECT DISTINCT, 88 in a plain SELECT, the rest in commented out code and actual comments.

    Some of the comments are along the lines of --Removed pointless grouping. Every single one of these is still a SELECT DISTINCT. Some of the SELECT DISTINCTs are in UNIONs, so there are three levels of removing duplicates


  • Notification Spam Recipient

    @Jaloopa said:

    three levels of removing duplicates

    And if it is anything like some of our clients' queries, there will still be duplicates due to some field values being different


  • BINNED

    I haven't seen any dupes survive the distinct, unionised groups, but I can tell you that there will be more levels of grouping and distincting because these views love to call each other, in a hierarchy that reaches 15 levels of nested views at some points.



  • @Jaloopa said:

    so there are three levels of removing duplicates

    SO DAWG I HEARD YOU LIKE REMOVING DUPLICATES SO I REMOVED DUPLICATES FROM YOUR DISTINCT VALUES SO YOU CAN REMOVE DUPLICATES WHILE YOU DISTINCT VALUES.

    ­



  • @Jaloopa said:

    in a hierarchy that reaches 15 levels of nested views at some points.

    :eek:


  • BINNED



  • @wft said:

    This site is about lojban, microaggressions, and got bashing. What else?

    And redundancy. Don't forget redundancy.

    Oh yeah, and redundancy, too.


  • :belt_onion:

    Could you repeat that, just in case?


  • sockdevs

    Is that related to the tautology department and were they informed in triplicate?


Log in to reply
 

Looks like your connection to What the Daily WTF? was lost, please wait while we try to reconnect.