Averaging the lag time (PostgreSQL)


  • BINNED

    I have a requirement for a report I need to generate. The table definition is rather large, but for the purposes of this the important columns are callStart (the timestamp of the start of each recorded call) and callerPhoneNumber (the caller that made the call).

    I need to extract some statistics from this, grouped by the caller. The names should be self-explanatory; I pasted a few properly working columns for reference:

    SELECT
            "callerPhoneNumber",
            COUNT(*) AS "totalCalls",
            COUNT(*) filter (WHERE "answered" = TRUE) AS "answeredCalls",
            COUNT(*) filter (WHERE "answered" = FALSE) AS "unansweredCalls",
            SUM("callTime") filter (WHERE "answered" = TRUE) AS "callTimeTotal",
            AVG("callTime") filter (WHERE "answered" = TRUE) AS "callTimeAverage",
            MAX("callTime") AS "callTimeMax",
            "callStart" - LAG("callStart") over(ORDER BY "callStart") AS "callLagAvg"
    FROM asterisk."CallsList"
    WHERE "callStart" >= '2017-04-20' AND "callStart" < '2017-04-21' 
    GROUP BY "callerPhoneNumber", "callStart"
    ORDER BY "callerPhoneNumber" ASC;
    

    (Dates hardcoded for testing porpoises on a testing copy, will be "current day" in actuality)

    The problem is using the LAG window function. Without that I get an output like:

    callerPhoneNumber | totalCalls | answeredCalls | unansweredCalls | callTimeTotal | callTimeAverage | callTimeMax 
    -------------------+------------+---------------+-----------------+---------------+-----------------+-------------
     207               |        182 |            60 |             122 | 01:08:48      | 00:01:08.8      | 00:06:23
     208               |        108 |            46 |              62 | 01:25:14      | 00:01:51.173913 | 00:13:07
     209               |         53 |            33 |              20 | 00:52:50      | 00:01:36.060606 | 00:07:37
     210               |        159 |            49 |             110 | 01:01:13      | 00:01:14.959184 | 00:06:52
     211               |        100 |            53 |              47 | 01:22:29      | 00:01:33.377358 | 00:11:41
     212               |        201 |            52 |             149 | 01:43:38      | 00:01:59.576923 | 00:10:46
     214               |        271 |            28 |             243 | 00:29:57      | 00:01:04.178571 | 00:03:06
     215               |        173 |            46 |             127 | 00:50:24      | 00:01:05.73913  | 00:05:07
     216               |        187 |            64 |             123 | 01:17:57      | 00:01:13.078125 | 00:08:27
    

    Which is exactly what I want. However, using the LAG portion requires me to group by callStart as well, meaning I get statistics per call rather than aggregates. Also, I can't wrap it in AVG; Postgres complains about using a window function inside an aggregate function.

    I'm by no means an SQL guru, so I'm stuck. CTEs? Subqueries (I tried that, killed the poor DB with looping, I was probably using the wrong approach TBQH)? I have no idea where to start... ☹


  • Notification Spam Recipient

    @Onyx AIUI you want to calculate the average time between calls? That means you want something like:

    AVG("callStart" - LAG("callStart") over(ORDER BY "callStart")) AS "callLagAvg"
    

    I'm not sure about PostgreSQL, but that is illegal in T-SQL. I would calculate the lag in a separate CTE or temp table and then run the query on that.


  • BINNED

    @Vault_Dweller yeah, sorry, I realized I explained the goal poorly, I was tangled up in stuff while writing the post...

    Your assumption is correct, and yes, that's illegal in PSQL as well, sadly.


  • Notification Spam Recipient

    @Onyx So what prevents you from what I described, i.e.

    SELECT <col_list>, "callStart" - LAG("callStart") over(ORDER BY "callStart") AS "callLag"
    into #callslist -- This is a temp table; not sure how to do them in PostgreSQL. You can use a CTE too
    FROM asterisk."CallsList"
    

    and then just something like:

    SELECT
            "callerPhoneNumber",
            COUNT(*) AS "totalCalls",
            COUNT(*) filter (WHERE "answered" = TRUE) AS "answeredCalls",
            COUNT(*) filter (WHERE "answered" = FALSE) AS "unansweredCalls",
            SUM("callTime") filter (WHERE "answered" = TRUE) AS "callTimeTotal",
            AVG("callTime") filter (WHERE "answered" = TRUE) AS "callTimeAverage",
            MAX("callTime") AS "callTimeMax",
            AVG("callLag") AS "callLagAvg"
    FROM #callslist
    WHERE "callStart" >= '2017-04-20' AND "callStart" < '2017-04-21' 
    GROUP BY "callerPhoneNumber", "callStart"
    ORDER BY "callerPhoneNumber" ASC;
    

    I'm not sure it will be possible without using CTEs/temp tables.
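
    In PostgreSQL, the temp table step would probably look something like this (untested sketch on my end, mirroring the T-SQL above; column names taken from your query):

    -- rough PostgreSQL equivalent of SELECT ... INTO #callslist (sketch, untested)
    CREATE TEMPORARY TABLE callslist AS
    SELECT
            "callerPhoneNumber",
            "callStart",
            "callTime",
            "answered",
            "callStart" - LAG("callStart") over(ORDER BY "callStart") AS "callLag"
    FROM asterisk."CallsList"
    WHERE "callStart" >= '2017-04-20' AND "callStart" < '2017-04-21';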


  • BINNED

    @Vault_Dweller I tried this:

    WITH d AS (
            SELECT avg(difference) AS diff
            FROM (
                    SELECT "callStart" - lag("callStart") OVER (ORDER BY "callStart") AS difference
                    FROM asterisk."CallsList"
            ) t
    )
    SELECT
            "callerPhoneNumber",
            COUNT(*) AS "totalCalls",
            COUNT(*) filter (WHERE "answered" = TRUE) AS "answeredCalls",
            COUNT(*) filter (WHERE "answered" = FALSE) AS "unansweredCalls",
            SUM("callTime") filter (WHERE "answered" = TRUE) AS "callTimeTotal",
            AVG("callTime") filter (WHERE "answered" = TRUE) AS "callTimeAverage",
            MAX("callTime") AS "callTimeMax",
            AVG(d.diff) AS "callLagAvg"
    FROM asterisk."CallsList"
    WHERE "callStart" >= '2017-04-20' AND "callStart" < '2017-04-21' 
    GROUP BY "callerPhoneNumber", "callStart"
    ORDER BY "callerPhoneNumber" ASC;
    

    Which gets the average... for all results, not grouped by caller. Trying to cram a GROUP BY into the CTE is a terrible performance killer - it might work like that, or as a subquery, but I'm kinda hoping against hope it can be something that gets generated rather quickly. That might not be a thing.


  • Notification Spam Recipient

    @Onyx Oops, sorry, your LAG operator should actually look something like:

    LAG("callStart") OVER (PARTITION BY "callerPhoneNumber" ORDER BY "callStart")
    

    The partition part is important.
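
    Putting that together with the CTE idea, I'd expect something roughly like this to work (untested sketch, column names from your query):

    -- sketch: per-row lag in a CTE, then a plain GROUP BY on top (untested)
    WITH lagged AS (
            SELECT
                    "callerPhoneNumber",
                    "callStart" - LAG("callStart") OVER (PARTITION BY "callerPhoneNumber" ORDER BY "callStart") AS "callLag"
            FROM asterisk."CallsList"
            WHERE "callStart" >= '2017-04-20' AND "callStart" < '2017-04-21'
    )
    SELECT
            "callerPhoneNumber",
            AVG("callLag") AS "callLagAvg" -- AVG ignores the NULL lag on each caller's first call
    FROM lagged
    GROUP BY "callerPhoneNumber"
    ORDER BY "callerPhoneNumber" ASC;

    The rest of your aggregate columns should be able to go in the outer SELECT the same as before.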


  • BINNED

    CRACKED IT!

    SELECT
            c."callerPhoneNumber",
            COUNT(*) AS "totalCalls",
            COUNT(*) filter (WHERE "answered" = TRUE) AS "answeredCalls",
            COUNT(*) filter (WHERE "answered" = FALSE) AS "unansweredCalls",
            SUM("callTime") filter (WHERE "answered" = TRUE) AS "callTimeTotal",
            AVG("callTime") filter (WHERE "answered" = TRUE) AS "callTimeAverage",
            MAX("callTime") AS "callTimeMax",
            (
                SELECT AVG(difference) filter (WHERE t."callerPhoneNumber" = c."callerPhoneNumber" ) AS diff
                FROM (
                    SELECT "callStart" - lag("callStart") OVER (PARTITION BY "callerPhoneNumber" ORDER BY "callStart") AS difference, "callerPhoneNumber"
                    FROM asterisk."CallsList"
                    WHERE "callStart" >= '2017-04-20' AND "callStart" < '2017-04-21'
                    GROUP BY "callerPhoneNumber", "callStart"
                ) t
            ) AS "avgLagTime"
    FROM asterisk."CallsList" c
    WHERE "callStart" >= '2017-04-20' AND "callStart" < '2017-04-21'
    GROUP BY c."callerPhoneNumber"
    ORDER BY c."callerPhoneNumber" ASC;
    

    I was stupid for using a WHERE in the subquery at first instead of filter. Also, adding the dates into the subquery does wonders.

    Thanks for the tips, @Vault_Dweller !


  • Notification Spam Recipient

    @Onyx Interesting that it has decent performance (I'm assuming); using subqueries within the SELECT part of a query is usually sub-optimal. I'm assuming it's the date filter that caused the actual improvement here.

    Anyway, glad I could be of help.


  • BINNED

    @Vault_Dweller said in Averaging the lag time (PostgreSQL):

    Interesting that it has decent performance (I'm assuming)

     Planning time: 5.544 ms
     Execution time: 495.325 ms
    

    For something that's going to be run probably daily, I'm pretty happy with that. Also, I'm aware subqueries are suboptimal, but I have no idea how to make the filtering per-caller work properly using CTEs — if it's doable it's out of my league; anything past JOINs is usually voodoo to me, I'm not a DB guy and I mostly lack the time to learn, even though I'd love to.


  • FoxDev

    @Onyx At half a second, I'd be happy running that every five minutes 🙂


  • Notification Spam Recipient

    @Onyx said in Averaging the lag time (PostgreSQL):

    I'm not a DB guy

    I am, which is why I notice stuff like that, but

    @Onyx said in Averaging the lag time (PostgreSQL):

    Planning time: 5.544 ms
    Execution time: 495.325 ms

    looks like "mission accomplished" to me!


  • ♿ (Parody)

    @Onyx said in Averaging the lag time (PostgreSQL):

    Trying to cram a GROUP BY into the CTE is a terrible performance killer

    My preferred solution to this is moar CTEs. So...the first one selects all of your data, then the second one would do your LAG calculations, then your main query joins the LAG stuff and that value gets added to the 'main' GROUP BY.

    Caveat: this works quite well in Oracle, because the data is fetched once and then operated on in memory. I haven't used postgres enough to know how it will react to that sort of thing.
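
    Something along these lines (sketch only, I haven't actually run this on postgres; it assumes "callStart" is unique per caller so the join doesn't fan out):

    -- sketch: filter once, compute lags per caller, then join back for the aggregates (untested)
    WITH base AS (
            SELECT *
            FROM asterisk."CallsList"
            WHERE "callStart" >= '2017-04-20' AND "callStart" < '2017-04-21'
    ), lagged AS (
            SELECT
                    "callerPhoneNumber",
                    "callStart",
                    "callStart" - LAG("callStart") OVER (PARTITION BY "callerPhoneNumber" ORDER BY "callStart") AS "callLag"
            FROM base
    )
    SELECT
            b."callerPhoneNumber",
            COUNT(*) AS "totalCalls",
            MAX(b."callTime") AS "callTimeMax",
            AVG(l."callLag") AS "callLagAvg"
    FROM base b
    JOIN lagged l
            ON l."callerPhoneNumber" = b."callerPhoneNumber"
            AND l."callStart" = b."callStart"
    GROUP BY b."callerPhoneNumber"
    ORDER BY b."callerPhoneNumber" ASC;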


  • BINNED

    @boomzilla said in Averaging the lag time (PostgreSQL):

    @Onyx said in Averaging the lag time (PostgreSQL):

    Trying to cram a GROUP BY into the CTE is a terrible performance killer

    My preferred solution to this is moar CTEs. So...the first one selects all of your data, then the second one would do your LAG calculations, then your main query joins the LAG stuff and that value gets added to the 'main' GROUP BY.

    Caveat: this works quite well in Oracle, because the data is fetched once and then operated on in memory. I haven't used postgres enough to know how it will react to that sort of thing.

    I'll give that a go tomorrow, there might be a benefit to it. You never know when larger datasets will be required; if I can get it to perform even better now, I'll have fewer headaches later...


  • ♿ (Parody)

    @Onyx Given your actual results (assuming that . is a decimal and not a thousands separator) it probably won't improve on that because the DB is already doing the optimizations. Which is the fun of optimizing SQL: you never know for sure what will happen until you try.


  • BINNED

    @boomzilla It's especially fun when the dataset you're working on is juuuuust below the threshold where certain indexes will trigger (instead of sequential scans), or it has to start materializing views because fuck you, you're over the memory limit for caching.

    Literally had a situation where we tuned a query and it got over a threshold the very next day...

