Averaging the lag time (PostgreSQL)
-
I have a requirement for a report I need to generate. The table definition is rather large, but for the purposes of this post the important columns are callStart (timestamp of the start of each recorded call) and callerPhoneNumber (the caller that made the call).

I need to extract some statistics from this, grouped by the caller. The names should be self-explanatory; I pasted a few properly working columns for reference:
SELECT "callerPhoneNumber",
       COUNT(*) AS "totalCalls",
       COUNT(*) FILTER (WHERE "answered" = TRUE) AS "answeredCalls",
       COUNT(*) FILTER (WHERE "answered" = FALSE) AS "unansweredCalls",
       SUM("callTime") FILTER (WHERE "answered" = TRUE) AS "callTimeTotal",
       AVG("callTime") FILTER (WHERE "answered" = TRUE) AS "callTimeAverage",
       MAX("callTime") AS "callTimeMax",
       "callStart" - LAG("callStart") OVER (ORDER BY "callStart") AS "callLagAvg"
FROM asterisk."CallsList"
WHERE "callStart" >= '2017-04-20' AND "callStart" < '2017-04-21'
GROUP BY "callerPhoneNumber", "callStart"
ORDER BY "callerPhoneNumber" ASC;
(Dates hardcoded for testing porpoises on a testing copy, will be "current day" in actuality)
The problem is using the LAG window function. Without it I get an output like:

 callerPhoneNumber | totalCalls | answeredCalls | unansweredCalls | callTimeTotal | callTimeAverage | callTimeMax
-------------------+------------+---------------+-----------------+---------------+-----------------+-------------
               207 |        182 |            60 |             122 | 01:08:48      | 00:01:08.8      | 00:06:23
               208 |        108 |            46 |              62 | 01:25:14      | 00:01:51.173913 | 00:13:07
               209 |         53 |            33 |              20 | 00:52:50      | 00:01:36.060606 | 00:07:37
               210 |        159 |            49 |             110 | 01:01:13      | 00:01:14.959184 | 00:06:52
               211 |        100 |            53 |              47 | 01:22:29      | 00:01:33.377358 | 00:11:41
               212 |        201 |            52 |             149 | 01:43:38      | 00:01:59.576923 | 00:10:46
               214 |        271 |            28 |             243 | 00:29:57      | 00:01:04.178571 | 00:03:06
               215 |        173 |            46 |             127 | 00:50:24      | 00:01:05.73913  | 00:05:07
               216 |        187 |            64 |             123 | 01:17:57      | 00:01:13.078125 | 00:08:27
Which is exactly what I want. However, using the LAG portion requires me to group by callStart as well, meaning I get statistics per call rather than aggregates. Also, I can't wrap it in AVG; Postgres complains about using a window function inside an aggregate function.

I'm by no means an SQL guru, so I'm stuck. CTEs? Subqueries? (I tried those and killed the poor DB with looping; I was probably using the wrong approach, TBQH.) I have no idea where to start...
-
@Onyx AIUI you want to calculate the average time between calls? That means you want something like:
AVG("callStart" - LAG("callStart") over(ORDER BY "callStart")) AS "callLagAvg"
I'm not sure about PostgreSQL, but that is illegal in T-SQL. I would calculate the lag in a separate CTE or temp table and then run the aggregate query on that.
-
@Vault_Dweller yeah, sorry, I realized I explained the goal poorly, I was tangled up in stuff while writing the post...
Your assumption is correct, and yes, that's illegal in PSQL as well, sadly.
-
@Onyx So what prevents you from what I described, i.e.
SELECT <col_list>,
       "callStart" - LAG("callStart") OVER (ORDER BY "callStart") AS "callLag"
INTO #callslist -- This is a temp table, not sure how to do them in PostgreSQL. You can use a CTE too
FROM asterisk."CallsList"
and then just something like:
SELECT "callerPhoneNumber",
       COUNT(*) AS "totalCalls",
       COUNT(*) FILTER (WHERE "answered" = TRUE) AS "answeredCalls",
       COUNT(*) FILTER (WHERE "answered" = FALSE) AS "unansweredCalls",
       SUM("callTime") FILTER (WHERE "answered" = TRUE) AS "callTimeTotal",
       AVG("callTime") FILTER (WHERE "answered" = TRUE) AS "callTimeAverage",
       MAX("callTime") AS "callTimeMax",
       AVG("callLag") AS "callLagAvg"
FROM #callslist
WHERE "callStart" >= '2017-04-20' AND "callStart" < '2017-04-21'
GROUP BY "callerPhoneNumber", "callStart"
ORDER BY "callerPhoneNumber" ASC;
I'm not sure it will be possible without using CTEs/temp tables.
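For reference, the temp-table step could be written in PostgreSQL roughly like this (a sketch, not tested against the real schema: the temp table name `callslist_lagged` and the column list are made up, and `CREATE TEMP TABLE ... AS` plays the role of T-SQL's `SELECT ... INTO #table`):

```sql
-- PostgreSQL has no #temptable syntax; CREATE TEMP TABLE ... AS is the equivalent.
-- Materialize the per-row lag once, then aggregate over it in a second query.
CREATE TEMP TABLE callslist_lagged AS
SELECT "callerPhoneNumber",
       "answered",
       "callTime",
       "callStart",
       "callStart" - LAG("callStart") OVER (ORDER BY "callStart") AS "callLag"
FROM asterisk."CallsList"
WHERE "callStart" >= '2017-04-20' AND "callStart" < '2017-04-21';

-- The aggregate query then runs against the temp table, e.g.:
SELECT "callerPhoneNumber",
       AVG("callLag") AS "callLagAvg"
FROM callslist_lagged
GROUP BY "callerPhoneNumber";
```

The temp table is session-local and dropped automatically at the end of the session.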
-
@Vault_Dweller I tried this:
WITH d AS (
    SELECT AVG(difference) AS diff
    FROM (
        SELECT "callStart" - LAG("callStart") OVER (ORDER BY "callStart") AS difference
        FROM asterisk."CallLogs"
    ) t
)
SELECT "callerPhoneNumber",
       COUNT(*) AS "totalCalls",
       COUNT(*) FILTER (WHERE "answered" = TRUE) AS "answeredCalls",
       COUNT(*) FILTER (WHERE "answered" = FALSE) AS "unansweredCalls",
       SUM("callTime") FILTER (WHERE "answered" = TRUE) AS "callTimeTotal",
       AVG("callTime") FILTER (WHERE "answered" = TRUE) AS "callTimeAverage",
       MAX("callTime") AS "callTimeMax",
       AVG(d.diff) AS "callLagAvg"
FROM asterisk."CallsList"
WHERE "callStart" >= '2017-04-20' AND "callStart" < '2017-04-21'
GROUP BY "callerPhoneNumber", "callStart"
ORDER BY "callerPhoneNumber" ASC;
Which gets the average... for all results, not grouped by caller. Trying to cram a GROUP BY into the CTE is a terrible performance killer. It might work like that, or as a subquery, but I'm kinda hoping against hope it can be something that gets generated rather quickly. That might not be a thing.
-
@Onyx Oops, sorry, your LAG operator should actually look something like:
LAG("callStart") OVER (PARTITION BY "callerPhoneNumber" ORDER BY "callStart")
The PARTITION BY part is important.
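A self-contained illustration of what the partition changes (the three rows are invented; the query is runnable as-is in PostgreSQL):

```sql
-- Without PARTITION BY, LAG looks at the previous call from ANY caller;
-- with PARTITION BY "callerPhoneNumber", only the same caller's previous call.
SELECT "callerPhoneNumber",
       "callStart",
       "callStart" - LAG("callStart")
           OVER (PARTITION BY "callerPhoneNumber" ORDER BY "callStart") AS "callLag"
FROM (VALUES
        ('207', TIMESTAMP '2017-04-20 09:00'),
        ('208', TIMESTAMP '2017-04-20 09:01'),
        ('207', TIMESTAMP '2017-04-20 09:05')
     ) AS t("callerPhoneNumber", "callStart")
ORDER BY "callerPhoneNumber", "callStart";
-- 207's 09:05 row gets callLag = 00:05:00 (measured from 207's own 09:00 call),
-- not 00:04:00 measured from 208's unrelated 09:01 call.
```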
-
CRACKED IT!
SELECT c."callerPhoneNumber",
       COUNT(*) AS "totalCalls",
       COUNT(*) FILTER (WHERE "answered" = TRUE) AS "answeredCalls",
       COUNT(*) FILTER (WHERE "answered" = FALSE) AS "unansweredCalls",
       SUM("callTime") FILTER (WHERE "answered" = TRUE) AS "callTimeTotal",
       AVG("callTime") FILTER (WHERE "answered" = TRUE) AS "callTimeAverage",
       MAX("callTime") AS "callTimeMax",
       (
           SELECT AVG(difference) FILTER (WHERE t."callerPhoneNumber" = c."callerPhoneNumber") AS diff
           FROM (
               SELECT "callStart" - LAG("callStart") OVER (PARTITION BY "callerPhoneNumber" ORDER BY "callStart") AS difference,
                      "callerPhoneNumber"
               FROM asterisk."CallsList"
               WHERE "callStart" >= '2017-04-20' AND "callStart" < '2017-04-21'
               GROUP BY "callerPhoneNumber", "callStart"
           ) t
       ) AS "avgLagTime"
FROM asterisk."CallsList" c
WHERE "callStart" >= '2017-04-20' AND "callStart" < '2017-04-21'
GROUP BY c."callerPhoneNumber"
ORDER BY c."callerPhoneNumber" ASC;
I was stupid for using a WHERE in the subquery at first instead of FILTER. Also, adding the dates into the subquery does wonders.

Thanks for the tips, @Vault_Dweller!
-
@Onyx Interesting that it has decent performance (I'm assuming); using subqueries within the SELECT part of a query is usually sub-optimal. I'm assuming it's the date filter that caused the actual improvement here.
Anyway, glad I could be of help.
-
@Vault_Dweller said in Averaging the lag time (PostgreSQL):
Interesting that it has decent performance (I'm assuming)
Planning time: 5.544 ms
Execution time: 495.325 ms
For something that's probably going to be run daily, I'm pretty happy with that. Also, I'm aware subqueries are suboptimal, but I have no idea how to make the filtering per caller work properly using CTEs. If it's doable, it's out of my league; anything past JOINs is usually voodoo to me. I'm not a DB guy and I mostly lack the time to learn, even though I'd love to.
-
@Onyx At half a second, I'd be happy running that every five minutes
-
@Onyx said in Averaging the lag time (PostgreSQL):
I'm not a DB guy
I am, which is why I notice stuff like that, but
@Onyx said in Averaging the lag time (PostgreSQL):
Planning time: 5.544 ms
Execution time: 495.325 ms

looks like "mission accomplished" to me!
-
@Onyx said in Averaging the lag time (PostgreSQL):
Trying to cram in a GROUP BY into the CTE is a terrible performance killer

My preferred solution to this is moar CTEs. So... the first one selects all of your data, then the second one would do your LAG calculations, then your main query joins the LAG stuff and that value gets added to the 'main' GROUP BY.

Caveat: this works quite well in Oracle, because the data is fetched once and then operated on in memory. I haven't used postgres enough to know how it will react to that sort of thing.
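In PostgreSQL that chain might be sketched roughly as follows (illustrative only: the CTE names `base` and `lagged` and the trimmed column list are made up, and I've kept only a couple of the thread's aggregates for brevity):

```sql
-- Chained CTEs: narrow the data once, compute the per-caller lag,
-- then aggregate everything (including the lag) in the main GROUP BY query.
WITH base AS (
    SELECT "callerPhoneNumber", "answered", "callTime", "callStart"
    FROM asterisk."CallsList"
    WHERE "callStart" >= '2017-04-20' AND "callStart" < '2017-04-21'
),
lagged AS (
    SELECT base.*,
           "callStart" - LAG("callStart")
               OVER (PARTITION BY "callerPhoneNumber" ORDER BY "callStart") AS "callLag"
    FROM base
)
SELECT "callerPhoneNumber",
       COUNT(*) AS "totalCalls",
       AVG("callTime") FILTER (WHERE "answered" = TRUE) AS "callTimeAverage",
       AVG("callLag") AS "callLagAvg"
FROM lagged
GROUP BY "callerPhoneNumber"
ORDER BY "callerPhoneNumber" ASC;
```

This avoids the correlated subquery entirely, since the window function and the aggregation live in separate query levels.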
-
@boomzilla said in Averaging the lag time (PostgreSQL):
My preferred solution to this is moar CTEs. [...]
I'll give that a go tomorrow; there might be a benefit to it. You never know when larger datasets will be required, and if I can get it to perform even better now, I'll have fewer headaches later...
-
@Onyx Given your actual results (assuming that the . is a decimal separator and not a thousands separator), it probably won't improve on that, because the DB is already doing the optimizations. Which is the fun of optimizing SQL: you never know for sure what will happen until you try.
-
@boomzilla It's especially fun when the dataset you're working on is juuuuust below the threshold where certain indexes kick in (instead of sequential scans), or it has to start materializing views because fuck you, you're over the memory limit for caching.
Literally had a situation where we tuned a query and it got over a threshold the very next day...