Run-length encoding using SQL
-
Well, in my leisure time, I'm trying to wrap my head around the following. The problem was solved in a more or less robust way by my colleague, but I still think there's room for improvement. So here goes.
We've got a data source which, once per X time units (once a day, once an hour, once a minute, doesn't really matter), fetches a value for each entity we have — imagine it's sensor data, or some report; let's just pretend it's a sensor measurement, to make things easier to picture. This data gets written to a table like this:
sensor_id   moment               value
---------------------------------------
10200101    2015-01-01 12:01:02  10.23
10200101    2015-01-01 13:01:02  10.23
10200101    2015-01-01 14:01:02  10.23
10200101    2015-01-01 15:01:02  11.25
10200101    2015-01-01 16:01:02  11.25
10200101    2015-01-01 17:01:02  11.25
10200101    2015-01-01 18:01:02  11.25
10200101    2015-01-01 19:01:02  11.25
10200101    2015-01-01 20:01:02  12.00
10200101    2015-01-01 21:01:02  12.00
10200101    2015-01-01 22:01:02  12.00
10200101    2015-01-01 23:01:02  12.00
10200101    2015-01-02 00:01:02  12.00
10200101    2015-01-02 01:01:02  12.00
(the actual values span much further, but the requirement is to poll at this interval anyway, because you never know.)
The script works okay, and it's really nobody's intention to change it.
Meanwhile, there is another table that holds the same values, but in a more efficient way:
sensor_id   start                end                  value
------------------------------------------------------------
10200101    2015-01-01 12:01:02  2015-01-01 14:01:02  10.23
10200101    2015-01-01 15:01:02  2015-01-01 19:01:02  11.25
10200101    2015-01-01 20:01:02  2015-01-02 01:01:02  12.00
And there is a stored procedure that takes all values from the first table and transfers them into the second table, to be run once in a while. It executes in something like O(N²) time, or maybe even more. The thing is, N is the number of distinct sensor_ids and is estimated at several million. Second thing: the data in the first table is considered authoritative and can be re-imported several times from backup sources, and has to fit neatly with the rest of the records in the second table (so it's not append-only, and it's not only the last records that get updated).
Thus, a month's worth of data takes a few days to transfer from one table to the other. This is where it starts to bother me.
The question is, is there an elegant and reasonably performant way, in SQL, to aggregate the data in the first table, as presented in the second table?
-
Some combination of CUBE and/or ROLLUP would be my guess
-
Seems like a minor twist on the queries we came up with for attendance badgers, no?
@Pjh has those someplace
-
https://github.com/AccaliaDeElementia/SockBot/blob/master/sock_modules/stats.yml#L31-L57
at least that's the version of the attendance query that @shadowmod uses
-
That's the O(N²) thing he is trying to get away from.
You could make it O(1) for inserts by updating the second table with a trigger. Updates and deletes might still need the O(N²) algorithm, but you might be able to put a WHERE clause on it that drastically limits N.
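To make the trigger idea concrete, here is a minimal sketch, run through SQLite from Python so it can actually be executed (the readings/intervals table and trigger names are invented, and the interval columns are called run_start/run_end to stay clear of SQLite's END keyword). Each incoming reading either extends the sensor's newest interval or opens a new one-row interval, so the insert path does a constant amount of work:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE readings(sensor_id INTEGER, moment TEXT, value REAL);
-- run_start / run_end instead of start / end, to dodge SQLite's END keyword
CREATE TABLE intervals(sensor_id INTEGER, run_start TEXT, run_end TEXT, value REAL);

-- After each reading: extend the sensor's newest interval if the value
-- continues the run, otherwise open a fresh one-row interval.
CREATE TRIGGER rle_append AFTER INSERT ON readings
BEGIN
  UPDATE intervals SET run_end = NEW.moment
   WHERE rowid = (SELECT rowid FROM intervals
                   WHERE sensor_id = NEW.sensor_id
                   ORDER BY run_start DESC LIMIT 1)
     AND value = NEW.value;
  INSERT INTO intervals(sensor_id, run_start, run_end, value)
  SELECT NEW.sensor_id, NEW.moment, NEW.moment, NEW.value
   WHERE NOT EXISTS (SELECT 1 FROM intervals
                      WHERE sensor_id = NEW.sensor_id
                        AND value = NEW.value
                        AND run_end = NEW.moment);
END;
""")

cur.executemany("INSERT INTO readings VALUES (?, ?, ?)",
                [(1, "d1", 10.0), (1, "d2", 10.0), (1, "d3", 11.0),
                 (1, "d4", 11.0), (1, "d5", 10.0)])
runs = cur.execute("SELECT sensor_id, run_start, run_end, value "
                   "FROM intervals ORDER BY run_start").fetchall()
```

MySQL trigger syntax differs a bit (DELIMITER handling, FOR EACH ROW), but the shape is the same; updates and deletes to the readings table would need their own, hairier triggers.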
-
picture me shrugging. i don't know SQL to that level of mastery.
@aliceif mentioned a query and i knew where to find it (the original came from @PJH in the first place) so i chimed in with the actual query.
-
The trick is to "collapse" contiguous intervals with the same value, so that may be it.
TRWTF is I forgot to mention this is a MySQL shop... They don't do WITH ;-)
-
Ah…
Hmm… the only real solution I can think of is grouping on two of the columns, and using MIN and MAX to get the timestamps… but I get the feeling that's what's being done already?
-
Interesting @accalia there -- 'y' and 'a' are nowhere near 'a' and 'l' on the keyboard >.>
sorry.
apparently the caffeine hasn't kicked in yet.
-
Nope.
Imagine
tstamp  value
-------------
d1      10
d2      10
d3      11
d4      11
d5      10
d6      10
So selecting MIN(tstamp), MAX(tstamp), value with GROUP BY value will give you a totally wrong result.
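To make the failure concrete, here is a minimal sketch run through SQLite from Python (table and column names match the toy example above, not any real schema). The two separate runs of 10 collapse into one bogus interval d1..d6 that swallows the 11-run sitting between them:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE t(tstamp TEXT, value INTEGER)")
cur.executemany("INSERT INTO t VALUES (?, ?)",
                [("d1", 10), ("d2", 10), ("d3", 11),
                 ("d4", 11), ("d5", 10), ("d6", 10)])
# Naive grouping by value alone merges non-adjacent runs of the same value.
rows = cur.execute("SELECT MIN(tstamp), MAX(tstamp), value "
                   "FROM t GROUP BY value ORDER BY value").fetchall()
```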
-
SELECT sensor_id, MIN(moment), MAX(moment), value
FROM table
GROUP BY sensor_id, value;
-
If you can find a way to "group" runs, this approach might work, e.g.
tstamp  value  group
--------------------
d1      10     1
d2      10     1
d3      11     2
d4      11     2
d5      10     3
d6      10     3
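One classic way to manufacture that group column is the gaps-and-islands trick: the difference between a global row number and a per-value row number is constant within each run, so it can serve as the group key. The MySQL in question has no window functions, so this is only a sketch of the idea, run through SQLite (3.25+) from Python with invented names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # needs SQLite >= 3.25 for window functions
cur = conn.cursor()
cur.execute("CREATE TABLE t(tstamp TEXT, value INTEGER)")
cur.executemany("INSERT INTO t VALUES (?, ?)",
                [("d1", 10), ("d2", 10), ("d3", 11),
                 ("d4", 11), ("d5", 10), ("d6", 10)])
runs = cur.execute("""
    SELECT MIN(tstamp), MAX(tstamp), value
    FROM (
        SELECT tstamp, value,
               -- constant within a run, changes whenever the value changes
               ROW_NUMBER() OVER (ORDER BY tstamp)
             - ROW_NUMBER() OVER (PARTITION BY value ORDER BY tstamp) AS grp
        FROM t
    )
    GROUP BY value, grp
    ORDER BY 1
""").fetchall()
```

Correctly yields the three runs, including the two separate 10-runs. A real version would also partition by sensor_id.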
-
still would group over a discontinuity....
-
Reminded me of this, for some reason:
<JonTG> Man, my penis is so big if I laid it out on a keyboard it'd go all the way from A to Z
<JonTG> wait, shit
(via bash.org)
Holy shit, Discourse actually handles <blockquote> + <cite> properly!
-
Here is a solution in MySQL which I still need to check.
I don't think it would apply to our needs cleanly — there are critical bits and pieces out of scope (like making sure the data doesn't get fucked up if we interrupt the process, and the whole updating/deleting part), but it's nice to have in the toolbox.
-
still would group over a discontinuity....
…oh yeah… Hmm… this has gone beyond my expertise now…
-
Hmm… this has gone beyond my expertise now…
we passed mine ages ago... well, about 17 posts ago really.
-
You could probably do something with ROW_NUMBER() OVER (PARTITION BY ...), stick it into a subquery, and use MIN/MAX, but that wouldn't be too different from the query originally linked above.
-
It would be different, because MySQL cannot into window functions.
Window functions make it easier on your eyes, though.
Also, if your initial table has, SUDDENLY, a few dozen billion records, a self-join is a Bad Idea™...
-
Also, if your initial table has, SUDDENLY, a few dozen billion records, a self-join is a Bad Idea™...
RDBMS doesn't scale, Doing It Wrong™, etc.
It would be different, because MySQL cannot into window functions.
Oh, right, MySQL.
-
Reminded me of this
That joke confused me when I first heard it back in the day. Because on German keyboards Z is where the Y is on an American keyboard. And while that might not be a 100000 km penis, there's still no need to say "wait, shit".
Filed Under: Different keyboards are a barrier to understanding jokes
-
They were Doing It Wrong™, actually.
They didn't think of the interval table at the beginning, and just sat with it for a few years straight.
When the tables grew to ~120 gigabytes, it started to worry them a bit.
A trigger-based solution, even though updates would be a royal PITA, would have been appropriate at the start; by now the issue was migrating these gazillions of records without losing a single one on the way. Hence a solution with no joins that runs in quadratic time but is robust as fuck: we interrupted it several times and restarted, and it got things right every time.
-
Did you consider using a cursor? Input one row at a time, aggregate in a temp table, output if the value changes.
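The cursor idea boils down to a single ordered pass: walk the readings sorted by (sensor_id, moment), extend the current run while the value holds, and emit an interval whenever it breaks. Here is a minimal sketch of that streaming logic in Python (the function name and tuple layout are invented):

```python
def collapse(readings):
    """readings: iterable of (sensor_id, moment, value), sorted by sensor then moment."""
    run = None  # current run as (sensor_id, start, end, value)
    for sensor_id, moment, value in readings:
        if run and run[0] == sensor_id and run[3] == value:
            run = (sensor_id, run[1], moment, value)   # value repeats: extend the run
        else:
            if run:
                yield run                              # run broke: emit the interval
            run = (sensor_id, moment, moment, value)   # open a new one-row run
    if run:
        yield run                                      # flush the final run
```

An actual stored-procedure cursor would do the same thing row by row into a temp table; either way it is O(N) once the input is sorted, which is exactly what the index on (sensor_id, moment) buys you.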
-
That joke confused me when I first heard it back in the day. Because on German keyboards Z is where the Y is on an American keyboard. And while that might not be a 100000 km penis <!-- I might be exaggerating here --> there's still no need to say "wait, shit".
Let's just put it like this: that joke will still work very effectively in French…
-
AZERTY, hehehe