Run-length encoding using SQL



  • Well, in my leisure time, I'm trying to wrap my head around the following. The problem was solved in a more or less robust way by my colleague, but I still think there's room for improvement. So here goes.

    We've got a data source which, once per X time units (once a day, once an hour, once a minute, doesn't really matter), fetches a value for each entity we have — imagine it's sensor data, or some report; let's just pretend it's a sensor measurement, to make things easier to imagine. This data gets written to a table like this:

    sensor_id        moment                   value
    -------------------------------------------------------
    10200101         2015-01-01 12:01:02      10.23
    10200101         2015-01-01 13:01:02      10.23
    10200101         2015-01-01 14:01:02      10.23
    10200101         2015-01-01 15:01:02      11.25
    10200101         2015-01-01 16:01:02      11.25
    10200101         2015-01-01 17:01:02      11.25
    10200101         2015-01-01 18:01:02      11.25
    10200101         2015-01-01 19:01:02      11.25
    10200101         2015-01-01 20:01:02      12.00
    10200101         2015-01-01 21:01:02      12.00
    10200101         2015-01-01 22:01:02      12.00
    10200101         2015-01-01 23:01:02      12.00
    10200101         2015-01-02 00:01:02      12.00
    10200101         2015-01-02 01:01:02      12.00
    

    (the actual values stay the same for much longer stretches, but the requirement is to poll at this interval anyway, because you never know.)

    The script works okay, and it's really nobody's intention to change it.

    Meanwhile, there is another table that holds the same values, but in a more efficient way:

    sensor_id        start                    end                    value
    -------------------------------------------------------------------------
    10200101         2015-01-01 12:01:02      2015-01-01 14:01:02    10.23
    10200101         2015-01-01 15:01:02      2015-01-01 19:01:02    11.25
    10200101         2015-01-01 20:01:02      2015-01-02 01:01:02    12.00
    

    And there is a stored procedure that takes all values from the first table and transfers them into the second table, to be run once in a while. It executes in something like O(N²) time, or maybe even worse. The thing is, N is the number of distinct sensor_ids, estimated at several million. Second thing: the data from the first table takes priority and can be re-imported several times from backup sources, and it has to fit neatly with the rest of the records in the second table (so it's not append-only, nor is it only the last records that get updated).

    Thus, a month's worth of data takes a few days to transfer from one table to the other. This is where it starts to bother me.

    The question is: is there an elegant and reasonably performant way, in SQL, to aggregate the data from the first table into the form of the second table?


  • FoxDev

    Some combination of CUBE and/or ROLLUP would be my guess


  • I survived the hour long Uno hand

    Seems like a minor twist on the queries we came up with for attendance badgers, no?

    @Pjh has those someplace


  • FoxDev



  • That's the O(N²) thing he is trying to get away from.

    You could make it O(1) for inserts by updating the second table with a trigger. Updates and deletes might still need the O(N²) algorithm, but you might be able to put a where clause on it that drastically limits N.
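
    A rough sketch of what such an insert trigger might look like in MySQL. The table and column names here (readings for the raw table, intervals for the aggregated one) are assumptions, and it only handles readings arriving in chronological order:

    DELIMITER //
    CREATE TRIGGER readings_after_insert
    AFTER INSERT ON readings
    FOR EACH ROW
    BEGIN
        DECLARE last_end   DATETIME;
        DECLARE last_value DECIMAL(10,2);

        -- look up the most recent interval for this sensor
        SELECT `end`, value INTO last_end, last_value
        FROM intervals
        WHERE sensor_id = NEW.sensor_id
        ORDER BY `end` DESC
        LIMIT 1;

        IF last_value IS NOT NULL
           AND last_value = NEW.value
           AND NEW.moment > last_end THEN
            -- same value, still in order: just extend the current run
            UPDATE intervals
            SET `end` = NEW.moment
            WHERE sensor_id = NEW.sensor_id
              AND `end` = last_end;
        ELSE
            -- first reading for this sensor, or the value changed: open a new run
            INSERT INTO intervals (sensor_id, `start`, `end`, value)
            VALUES (NEW.sensor_id, NEW.moment, NEW.moment, NEW.value);
        END IF;
    END//
    DELIMITER ;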


  • FoxDev

    picture me shrugging. i don't know SQL to that level of mastery.

    @aliceif mentioned a query and i knew where to find it (the original came from @PJH in the first place) so i chimed in with the actual query.


  • I survived the hour long Uno hand

    @accalia said:

    @aliceif

    Interesting @accalia there -- 'y' and 'a' are nowhere near 'a' and 'l' on the keyboard >.>



  • The trick is to "collapse" contiguous intervals with the same value, so that may be it.

    TRWTF is I forgot to mention this is a MySQL shop... They don't do WITH ;-)


  • FoxDev

    Ah…

    Hmm… the only real solution I can think of is grouping on two of the columns, and using MIN and MAX to get the timestamps… but I get the feeling that's what's being done already?


  • FoxDev

    @Yamikuronue said:

    Interesting @accalia there -- 'y' and 'a' are nowhere near 'a' and 'l' on the keyboard >.>

    :headdesk:

    sorry.

    apparently the caffeine hasn't kicked in yet.



  • Nope.

    Imagine

    tstamp value
    d1     10
    d2     10
    d3     11
    d4     11
    d5     10
    d6     10
    

    So selecting MIN(tstamp), MAX(tstamp), value GROUP BY value will give you a totally wrong result: it merges the two separate runs of 10 into one interval spanning d1 to d6.


  • FoxDev

    :headdesk:

    SELECT sensor_id, MIN(moment), MAX(moment), value
    FROM table
    GROUP BY sensor_id, value;
    

  • Notification Spam Recipient

    If you can find a way to "group" runs, this approach might work, e.g.

    tstamp value group
    d1     10    1
    d2     10    1
    d3     11    2
    d4     11    2
    d5     10    3
    d6     10    3
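
    One possible way to fabricate such a group column in MySQL (no window functions) is the user-variable trick. A rough sketch, assuming the raw table is called readings:

    SELECT sensor_id,
           MIN(moment) AS `start`,
           MAX(moment) AS `end`,
           value
    FROM (
        SELECT r.sensor_id, r.moment, r.value,
               -- bump the run counter whenever the sensor or the value changes
               @grp := IF(@prev_sensor = r.sensor_id AND @prev_value = r.value,
                          @grp, @grp + 1)      AS grp,
               @prev_sensor := r.sensor_id     AS prev_s,
               @prev_value  := r.value         AS prev_v
        FROM readings r
        CROSS JOIN (SELECT @grp := 0, @prev_sensor := NULL, @prev_value := NULL) init
        ORDER BY r.sensor_id, r.moment
    ) runs
    GROUP BY sensor_id, value, grp;

    The usual caveat applies: this leans on the ORDER BY inside the derived table being honoured when the variables are evaluated.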

  • FoxDev

    still would group over a discontinuity....


  • BINNED

    Reminded me of this, for some reason:

    <JonTG> Man, my penis is so big if I laid it out on a keyboard it'd go all the way from A to Z
    <JonTG> wait, shit
    bash.org

    Holy shit, Discourse actually handles <blockquote> + <cite> properly!



  • Here is a solution in MySQL which I have yet to check.

    I don't think it would apply to our needs cleanly — there are critical bits and pieces out of its scope (like making sure that, if we interrupt the process, it won't fuck the data up, and the bit about updating/deleting), but it's nice to have in the toolbox.


  • FoxDev

    @accalia said:

    still would group over a discontinuity....

    …oh yeah…

    Hmm… this has gone beyond my expertise now…


  • FoxDev

    @RaceProUK said:

    Hmm… this has gone beyond my expertise now…

    we passed mine ages ago... well, about 17 posts ago really.


  • Discourse touched me in a no-no place

    You could probably do something with row_number over partition by and then stick it into a subquery and use min/max, but that wouldn't be too different from the way originally linked above.
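
    In a dialect that does have window functions, a minimal sketch of that idea might look like this (readings is an assumed name for the raw table):

    SELECT sensor_id,
           MIN(moment) AS run_start,
           MAX(moment) AS run_end,
           value
    FROM (
        SELECT sensor_id, moment, value,
               -- difference of the two row numbers identifies each run
               ROW_NUMBER() OVER (PARTITION BY sensor_id        ORDER BY moment)
             - ROW_NUMBER() OVER (PARTITION BY sensor_id, value ORDER BY moment) AS grp
        FROM readings
    ) t
    GROUP BY sensor_id, value, grp;

    The difference of the two row numbers stays constant within a contiguous run of equal values, so grouping on it together with value splits each run out correctly.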



  • It would be different, because MySQL cannot into window functions.
    Window functions make it easier on your eyes, though.

    Also, if your initial table has, SUDDENLY, a few dozen billion records, a self-join is a Bad Idea™...


  • Discourse touched me in a no-no place

    @wft said:

    Also, if your initial table has, SUDDENLY, a few dozen billion records, a self-join is a Bad Idea™...

    RDBMS doesn't scale, Doing It Wrong™, etc.

    @wft said:

    It would be different, because MySQL cannot into window functions.

    :wtf:
    Oh, right, MySQL.


  • Winner of the 2016 Presidential Election

    @Onyx said:

    Reminded me of this

    That joke confused me when I first heard it back in the day, because on German keyboards the Z is where the Y is on an American keyboard. And while that might not be a 100000 km penis, there's still no need to say "wait, shit".

    Filed Under: Different keyboards are a barrier to understanding jokes



  • They were Doing It Wrong™, actually.

    They didn't think of an interval table at the beginning, and just sat with it for a few years straight.

    When the tables grew to ~120 gigabytes, it started to worry them a bit.

    A trigger-based solution, even though updates would be a royal PITA, would have been appropriate at the start; now the issue was migrating these gazillions of records without losing a single one of them on the way. Hence a solution with no joins that runs in quadratic time, but is also robust as fuck: we interrupted it several times and restarted it, and it got things right every time.



  • Did you consider using a cursor? Input one row at a time, aggregate in a temp table, output if the value changes.
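
    A rough sketch of that as a MySQL stored procedure (one variant that keeps the running interval in variables rather than a temp table; readings and intervals are assumed table names, and re-imports into the middle of existing data aren't handled):

    DELIMITER //
    CREATE PROCEDURE collapse_runs()
    BEGIN
        DECLARE done INT DEFAULT 0;
        DECLARE cur_sensor  BIGINT;
        DECLARE cur_moment  DATETIME;
        DECLARE cur_value   DECIMAL(10,2);
        DECLARE prev_sensor BIGINT;
        DECLARE prev_value  DECIMAL(10,2);
        DECLARE run_start   DATETIME;
        DECLARE run_end     DATETIME;
        DECLARE c CURSOR FOR
            SELECT sensor_id, moment, value
            FROM readings
            ORDER BY sensor_id, moment;
        DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = 1;

        OPEN c;
        read_loop: LOOP
            FETCH c INTO cur_sensor, cur_moment, cur_value;
            IF done THEN
                LEAVE read_loop;
            END IF;
            IF prev_sensor IS NULL
               OR cur_sensor <> prev_sensor
               OR cur_value <> prev_value THEN
                -- new sensor or the value changed: flush the previous run
                IF prev_sensor IS NOT NULL THEN
                    INSERT INTO intervals (sensor_id, `start`, `end`, value)
                    VALUES (prev_sensor, run_start, run_end, prev_value);
                END IF;
                SET run_start = cur_moment;
            END IF;
            SET run_end = cur_moment;
            SET prev_sensor = cur_sensor;
            SET prev_value = cur_value;
        END LOOP;
        -- flush the last open run
        IF prev_sensor IS NOT NULL THEN
            INSERT INTO intervals (sensor_id, `start`, `end`, value)
            VALUES (prev_sensor, run_start, run_end, prev_value);
        END IF;
        CLOSE c;
    END//
    DELIMITER ;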


  • Discourse touched me in a no-no place

    @Kuro said:

    That joke confused me when I first heard it back in the days. Because on German keyboards Z is where the Y is on an american keyboard. And while that might not be a 100000 km penis <!-- I might be exxagerating here --> it's still no need to say "wait, shit".

    Let's just put it like this: that joke will still work very effectively in French…



  • AZERTY, hehehe

