@ben_lubar tries to parse <del>ServerCooties</del> data



  • Here's the query I want to run:

    with dates as (select ((select min(created_at)::date from posts) + day * interval '1 day')::date date from generate_series(0, (select max(created_at)::date from posts) - (select min(created_at)::date from posts)) day order by date asc), counts as (select date, author_id, count(*) from dates, posts where created_at between date - interval '1 year' and date group by date, author_id order by date asc, count desc), ranked as (select *, rank() over (partition by date order by count desc) from counts) select date, rank, name, count from ranked left join users on id = author_id where rank <= 10 order by date asc, rank asc;
    

    Here's what I get when I limit it to the last 10 days:

    postgres=# explain analyze verbose with dates as (select ((select min(created_at)::date from posts) + day * interval '1 day')::date date from generate_series(0, (select max(created_at)::date from posts) - (select min(created_at)::date from posts)) day order by date desc limit 10), counts as (select date, author_id, count(*) from dates, posts where created_at between date - interval '1 year' and date group by date, author_id order by date asc, count desc), ranked as (select *, rank() over (partition by date order by count desc) from counts) select date, rank, name, count from ranked left join users on id = author_id where rank <= 10 order by date asc, rank asc;
                                                                                                                                                                                                    QUERY PLAN                                                                                                                                                                                                 
    -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
     Sort  (cost=208088.75..208103.92 rows=6067 width=31) (actual time=7513.875..7513.921 rows=100 loops=1)
       Output: ranked.date, ranked.rank, users.name, ranked.count
       Sort Key: ranked.date, ranked.rank
       Sort Method: quicksort  Memory: 32kB
       CTE dates
         ->  Limit  (cost=43.19..43.21 rows=10 width=4) (actual time=12.654..12.678 rows=10 loops=1)
               Output: ((($1 + ((day.day)::double precision * '1 day'::interval)))::date)
               InitPlan 2 (returns $1)
                 ->  Result  (cost=0.51..0.53 rows=1 width=0) (actual time=0.044..0.044 rows=1 loops=1)
                       Output: ($0)::date
                       InitPlan 1 (returns $0)
                         ->  Limit  (cost=0.42..0.51 rows=1 width=8) (actual time=0.040..0.041 rows=1 loops=1)
                               Output: posts.created_at
                               ->  Index Only Scan using posts_created_at on public.posts  (cost=0.42..78856.24 rows=901161 width=8) (actual time=0.038..0.038 rows=1 loops=1)
                                     Output: posts.created_at
                                     Index Cond: (posts.created_at IS NOT NULL)
                                     Heap Fetches: 1
               InitPlan 4 (returns $3)
                 ->  Result  (cost=0.51..0.53 rows=1 width=0) (actual time=0.095..0.096 rows=1 loops=1)
                       Output: ($2)::date
                       InitPlan 3 (returns $2)
                         ->  Limit  (cost=0.42..0.51 rows=1 width=8) (actual time=0.087..0.088 rows=1 loops=1)
                               Output: posts_1.created_at
                               ->  Index Only Scan Backward using posts_created_at on public.posts posts_1  (cost=0.42..78856.24 rows=901161 width=8) (actual time=0.085..0.085 rows=1 loops=1)
                                     Output: posts_1.created_at
                                     Index Cond: (posts_1.created_at IS NOT NULL)
                                     Heap Fetches: 1
               InitPlan 6 (returns $5)
                 ->  Result  (cost=0.51..0.53 rows=1 width=0) (actual time=0.041..0.042 rows=1 loops=1)
                       Output: ($4)::date
                       InitPlan 5 (returns $4)
                         ->  Limit  (cost=0.42..0.51 rows=1 width=8) (actual time=0.037..0.037 rows=1 loops=1)
                               Output: posts_2.created_at
                               ->  Index Only Scan using posts_created_at on public.posts posts_2  (cost=0.42..78856.24 rows=901161 width=8) (actual time=0.035..0.035 rows=1 loops=1)
                                     Output: posts_2.created_at
                                     Index Cond: (posts_2.created_at IS NOT NULL)
                                     Heap Fetches: 1
               ->  Sort  (cost=41.61..44.11 rows=1000 width=4) (actual time=12.652..12.663 rows=10 loops=1)
                     Output: ((($1 + ((day.day)::double precision * '1 day'::interval)))::date)
                     Sort Key: ((($1 + ((day.day)::double precision * '1 day'::interval)))::date) DESC
                     Sort Method: top-N heapsort  Memory: 25kB
                     ->  Function Scan on pg_catalog.generate_series day  (cost=0.01..20.01 rows=1000 width=4) (actual time=1.121..8.068 rows=5634 loops=1)
                           Output: (($1 + ((day.day)::double precision * '1 day'::interval)))::date
                           Function Call: generate_series(0, ($3 - $5))
       CTE counts
         ->  Sort  (cost=198063.98..198109.48 rows=18200 width=12) (actual time=7286.828..7291.778 rows=9868 loops=1)
               Output: dates.date, posts_3.author_id, (count(*))
               Sort Key: dates.date, (count(*)) DESC
               Sort Method: quicksort  Memory: 1155kB
               ->  HashAggregate  (cost=196594.18..196776.18 rows=18200 width=12) (actual time=7272.899..7279.037 rows=9868 loops=1)
                     Output: dates.date, posts_3.author_id, count(*)
                     Group Key: dates.date, posts_3.author_id
                     ->  Nested Loop  (cost=0.43..189084.51 rows=1001290 width=12) (actual time=12.726..5342.047 rows=2792422 loops=1)
                           Output: dates.date, posts_3.author_id
                           ->  CTE Scan on dates  (cost=0.00..0.20 rows=10 width=4) (actual time=12.655..12.710 rows=10 loops=1)
                                 Output: dates.date
                           ->  Index Scan using posts_created_at on public.posts posts_3  (cost=0.43..17907.14 rows=100129 width=16) (actual time=0.072..256.479 rows=279242 loops=10)
                                 Output: posts_3.id, posts_3.topic_id, posts_3.author_id, posts_3.author_name, posts_3.parent_post_id, posts_3.content, posts_3.created_at, posts_3.ip, posts_3.editor_id, posts_3.edited_at, posts_3.deleted
                                 Index Cond: ((posts_3.created_at >= (dates.date - '1 year'::interval)) AND (posts_3.created_at <= dates.date))
       CTE ranked
         ->  WindowAgg  (cost=1651.80..2015.80 rows=18200 width=20) (actual time=7308.258..7328.054 rows=9868 loops=1)
               Output: counts.date, counts.author_id, counts.count, rank() OVER (?)
               ->  Sort  (cost=1651.80..1697.30 rows=18200 width=20) (actual time=7308.250..7313.102 rows=9868 loops=1)
                     Output: counts.date, counts.count, counts.author_id
                     Sort Key: counts.date, counts.count DESC
                     Sort Method: quicksort  Memory: 1155kB
                     ->  CTE Scan on counts  (cost=0.00..364.00 rows=18200 width=20) (actual time=7286.832..7302.301 rows=9868 loops=1)
                           Output: counts.date, counts.count, counts.author_id
       ->  Merge Right Join  (cost=791.01..7539.04 rows=6067 width=31) (actual time=7335.604..7513.793 rows=100 loops=1)
             Output: ranked.date, ranked.rank, users.name, ranked.count
             Merge Cond: (users.id = ranked.author_id)
             ->  Index Scan using users_pkey on public.users  (cost=0.29..6303.08 rows=141698 width=19) (actual time=0.021..107.009 rows=140880 loops=1)
                   Output: users.id, users.name, users.slug, users.password, users.full_name, users.email, users.email_confirmed, users.banned, users.ban_expiration, users.join_date, users.last_online, users.picture, users.uploaded_picture, users.cover_url, users.cover_position, users.group_title, users.website, users.location, users.signature, users.about_me, users.birthday, users.profile_views
             ->  Sort  (cost=790.71..805.88 rows=6067 width=28) (actual time=7335.575..7335.622 rows=100 loops=1)
                   Output: ranked.date, ranked.rank, ranked.count, ranked.author_id
                   Sort Key: ranked.author_id
                   Sort Method: quicksort  Memory: 32kB
                   ->  CTE Scan on ranked  (cost=0.00..409.50 rows=6067 width=28) (actual time=7308.268..7335.507 rows=100 loops=1)
                         Output: ranked.date, ranked.rank, ranked.count, ranked.author_id
                         Filter: (ranked.rank <= 10)
                         Rows Removed by Filter: 9768
     Planning time: 0.690 ms
     Execution time: 7514.711 ms
    (83 rows)
    

    With these results (it's not an up-to-date backup):

        date    | rank |    name     | count 
    ------------+------+-------------+-------
     2016-06-21 |    1 | boomzilla   | 13416
     2016-06-21 |    2 | blakeyrat   | 12541
     2016-06-21 |    3 | Polygeekery |  9921
     2016-06-21 |    4 | FrostCat    |  9382
     2016-06-21 |    5 | accalia     |  9209
     2016-06-21 |    6 | loopback0   |  7961
     2016-06-21 |    7 | Fox         |  7804
     2016-06-21 |    8 | ben_lubar   |  7435
     2016-06-21 |    9 | dkf         |  7334
     2016-06-21 |   10 | abarker     |  6576
     2016-06-22 |    1 | boomzilla   | 13353
     2016-06-22 |    2 | blakeyrat   | 12521
     2016-06-22 |    3 | Polygeekery |  9919
     2016-06-22 |    4 | FrostCat    |  9377
     2016-06-22 |    5 | accalia     |  9198
     2016-06-22 |    6 | loopback0   |  7961
     2016-06-22 |    7 | Fox         |  7804
     2016-06-22 |    8 | ben_lubar   |  7420
     2016-06-22 |    9 | dkf         |  7333
     2016-06-22 |   10 | abarker     |  6569
     2016-06-23 |    1 | boomzilla   | 13235
     2016-06-23 |    2 | blakeyrat   | 12472
     2016-06-23 |    3 | Polygeekery |  9893
     2016-06-23 |    4 | FrostCat    |  9371
     2016-06-23 |    5 | accalia     |  9115
     2016-06-23 |    6 | loopback0   |  7961
     2016-06-23 |    7 | Fox         |  7804
     2016-06-23 |    8 | ben_lubar   |  7403
     2016-06-23 |    9 | dkf         |  7320
     2016-06-23 |   10 | abarker     |  6520
     2016-06-24 |    1 | boomzilla   | 13184
     2016-06-24 |    2 | blakeyrat   | 12437
     2016-06-24 |    3 | Polygeekery |  9879
     2016-06-24 |    4 | FrostCat    |  9326
     2016-06-24 |    5 | accalia     |  9076
     2016-06-24 |    6 | loopback0   |  7961
     2016-06-24 |    7 | Fox         |  7804
     2016-06-24 |    8 | ben_lubar   |  7394
     2016-06-24 |    9 | dkf         |  7309
     2016-06-24 |   10 | abarker     |  6510
     2016-06-25 |    1 | boomzilla   | 13131
     2016-06-25 |    2 | blakeyrat   | 12405
     2016-06-25 |    3 | Polygeekery |  9868
     2016-06-25 |    4 | FrostCat    |  9296
     2016-06-25 |    5 | accalia     |  9009
     2016-06-25 |    6 | loopback0   |  7961
     2016-06-25 |    7 | Fox         |  7804
     2016-06-25 |    8 | ben_lubar   |  7370
     2016-06-25 |    9 | dkf         |  7299
     2016-06-25 |   10 | abarker     |  6468
     2016-06-26 |    1 | boomzilla   | 13046
     2016-06-26 |    2 | blakeyrat   | 12363
     2016-06-26 |    3 | Polygeekery |  9863
     2016-06-26 |    4 | FrostCat    |  9256
     2016-06-26 |    5 | accalia     |  8932
     2016-06-26 |    6 | loopback0   |  7961
     2016-06-26 |    7 | Fox         |  7804
     2016-06-26 |    8 | ben_lubar   |  7354
     2016-06-26 |    9 | dkf         |  7289
     2016-06-26 |   10 | Tsaukpaetra |  6465
     2016-06-27 |    1 | boomzilla   | 12960
     2016-06-27 |    2 | blakeyrat   | 12337
     2016-06-27 |    3 | Polygeekery |  9861
     2016-06-27 |    4 | FrostCat    |  9225
     2016-06-27 |    5 | accalia     |  8856
     2016-06-27 |    6 | loopback0   |  7961
     2016-06-27 |    7 | Fox         |  7804
     2016-06-27 |    8 | ben_lubar   |  7329
     2016-06-27 |    9 | dkf         |  7271
     2016-06-27 |   10 | Tsaukpaetra |  6464
     2016-06-28 |    1 | boomzilla   | 12951
     2016-06-28 |    2 | blakeyrat   | 12316
     2016-06-28 |    3 | Polygeekery |  9859
     2016-06-28 |    4 | FrostCat    |  9219
     2016-06-28 |    5 | accalia     |  8847
     2016-06-28 |    6 | loopback0   |  7961
     2016-06-28 |    7 | Fox         |  7804
     2016-06-28 |    8 | ben_lubar   |  7322
     2016-06-28 |    9 | dkf         |  7266
     2016-06-28 |   10 | Tsaukpaetra |  6464
     2016-06-29 |    1 | boomzilla   | 12935
     2016-06-29 |    2 | blakeyrat   | 12305
     2016-06-29 |    3 | Polygeekery |  9853
     2016-06-29 |    4 | FrostCat    |  9205
     2016-06-29 |    5 | accalia     |  8822
     2016-06-29 |    6 | loopback0   |  7961
     2016-06-29 |    7 | Fox         |  7804
     2016-06-29 |    8 | ben_lubar   |  7293
     2016-06-29 |    9 | dkf         |  7265
     2016-06-29 |   10 | Tsaukpaetra |  6464
     2016-06-30 |    1 | boomzilla   | 12908
     2016-06-30 |    2 | blakeyrat   | 12265
     2016-06-30 |    3 | Polygeekery |  9834
     2016-06-30 |    4 | FrostCat    |  9195
     2016-06-30 |    5 | accalia     |  8742
     2016-06-30 |    6 | loopback0   |  7961
     2016-06-30 |    7 | Fox         |  7804
     2016-06-30 |    8 | ben_lubar   |  7272
     2016-06-30 |    9 | dkf         |  7258
     2016-06-30 |   10 | Tsaukpaetra |  6464
    (100 rows)
    

    How can I make the query run faster? There are about 6000 days between the first and last posts in the database, so taking over an hour to get this data isn't going to be fun for whoever is waiting for the chart to be generated.


  • Considered Harmful

    @ben_lubar said in @ben_lubar tries to parse <del>ServerCooties</del> data:

    How can I make the query run faster? There are about 6000 days between the first and last posts in the database, so taking over an hour to get this data isn't going to be fun for whoever is waiting for the chart to be generated.

    Try materializing some of those subqueries. Temp table or persisted view.

    Assuming you're just working with a snapshot exported from Mongo, you can denormalize that shit without too much worry about maintaining consistency.
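
    For the persisted-view route, a minimal sketch (untested; assumes PostgreSQL 9.3+ for materialized views, and it only caches the per-day/per-author counts -- the rolling-year aggregation would still run on top of it):

    create materialized view daily_author_counts as
    select created_at::date as date, author_id, count(*) as count
    from posts
    group by created_at::date, author_id;

    create index on daily_author_counts (date, author_id);

    -- after reloading a snapshot from Mongo:
    refresh materialized view daily_author_counts;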



  • @ben_lubar Well here's a few suggestions:

    1. Format your query so it's on more than one line, because WTF man
    2. Get rid of the roughly 57436 subqueries you have and just do a single join clause

    You appear to have a query within a query within a query if I'm reading that right. Jesus man.



  • begin;
    create index posts_created_at on posts (created_at);
    create index posts_author_id on posts (author_id);
    create temporary table temp_dates as select ((select min(created_at)::date from posts) + day * interval '1 day')::date date from generate_series(0, (select max(created_at)::date from posts) - (select min(created_at)::date from posts)) day;
    create temporary table temp_counts_pre as select date, author_id, count(*) from temp_dates, posts where created_at::date = date group by date, author_id;
    create index temp_counts_pre_date_author_id on temp_counts_pre (date, author_id);
    create temporary table temp_counts as select a.date, author_id, sum(count) count from temp_dates a, temp_counts_pre b where b.date between a.date - interval '1 year' and a.date group by a.date, author_id;
    create index temp_counts_date_count on temp_counts (date, count desc);
    explain analyze verbose with counts as (select *, rank() over (partition by date order by count desc) from temp_counts) select date, rank, name, count from counts left join users on id = author_id where rank <= 10 order by date asc, rank asc;
    with counts as (select *, rank() over (partition by date order by count desc) from temp_counts) select date, rank, name, count from counts left join users on id = author_id where rank <= 10 order by date asc, rank asc;
    rollback;
    

    Output:

    postgres=# begin;
    BEGIN
    Time: 0.137 ms
    postgres=# create index posts_created_at on posts (created_at);
    CREATE INDEX
    Time: 3773.139 ms
    postgres=# create index posts_author_id on posts (author_id);
    CREATE INDEX
    Time: 1269.059 ms
    postgres=# create temporary table temp_dates as select ((select min(created_at)::date from posts) + day * interval '1 day')::date date from generate_series(0, (select max(created_at)::date from posts) - (select min(created_at)::date from posts)) day;
    SELECT 5634
    Time: 24.168 ms
    postgres=# create temporary table temp_counts_pre as select date, author_id, count(*) from temp_dates, posts where created_at::date = date group by date, author_id;
    SELECT 169666
    Time: 2678.269 ms
    postgres=# create index temp_counts_pre_date_author_id on temp_counts_pre (date, author_id);
    CREATE INDEX
    Time: 73.029 ms
    postgres=# create temporary table temp_counts as select a.date, author_id, sum(count) count from temp_dates a, temp_counts_pre b where b.date between a.date - interval '1 year' and a.date group by a.date, author_id;
    SELECT 4509688
    Time: 129317.276 ms
    postgres=# create index temp_counts_date_count on temp_counts (date, count desc);
    CREATE INDEX
    Time: 9796.385 ms
    postgres=# explain analyze verbose with counts as (select *, rank() over (partition by date order by count desc) from temp_counts) select date, rank, name, count from counts left join users on id = author_id where rank <= 10 order by date asc, rank asc;
                                                                                       QUERY PLAN                                                                                    
    ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
     Sort  (cost=746886.42..750644.49 rows=1503229 width=55) (actual time=15610.866..15636.024 rows=44877 loops=1)
       Output: counts.date, counts.rank, users.name, counts.count
       Sort Key: counts.date, counts.rank
       Sort Method: external sort  Disk: 1752kB
       CTE counts
         ->  WindowAgg  (cost=0.43..330933.29 rows=4509688 width=44) (actual time=0.054..11095.926 rows=4509688 loops=1)
               Output: temp_counts.date, temp_counts.author_id, temp_counts.count, rank() OVER (?)
               ->  Index Scan using temp_counts_date_count on pg_temp_2.temp_counts  (cost=0.43..252013.75 rows=4509688 width=44) (actual time=0.044..3563.355 rows=4509688 loops=1)
                     Output: temp_counts.date, temp_counts.count, temp_counts.author_id
       ->  Hash Left Join  (cost=6636.20..158964.58 rows=1503229 width=55) (actual time=389.273..15535.957 rows=44877 loops=1)
             Output: counts.date, counts.rank, users.name, counts.count
             Hash Cond: (counts.author_id = users.id)
             ->  CTE Scan on counts  (cost=0.00..101467.98 rows=1503229 width=52) (actual time=0.061..15046.578 rows=44877 loops=1)
                   Output: counts.date, counts.rank, counts.count, counts.author_id
                   Filter: (counts.rank <= 10)
                   Rows Removed by Filter: 4464811
             ->  Hash  (cost=4033.98..4033.98 rows=141698 width=19) (actual time=389.028..389.028 rows=141427 loops=1)
                   Output: users.name, users.id
                   Buckets: 65536  Batches: 4  Memory Usage: 2422kB
                   ->  Seq Scan on public.users  (cost=0.00..4033.98 rows=141698 width=19) (actual time=28.603..288.194 rows=141427 loops=1)
                         Output: users.name, users.id
     Planning time: 34.240 ms
     Execution time: 15717.063 ms
    (23 rows)
    
    Time: 16434.860 ms
    postgres=# with counts as (select *, rank() over (partition by date order by count desc) from temp_counts) select date, rank, name, count from counts left join users on id = author_id where rank <= 10 order by date asc, rank asc;
        date    | rank |         name         | count 
    ------------+------+----------------------+-------
    [snip]
    (44877 rows)
    
    Time: 6726.870 ms
    postgres=# rollback;
    ROLLBACK
    Time: 70.226 ms
    


  • Thanks for ruining my statistics, @abarker
    https://what.thedailywtf.com/post/797354



  • @ben_lubar said in @ben_lubar tries to parse <del>ServerCooties</del> data:

    postgres=# create temporary table temp_counts as select a.date, author_id, sum(count) count from temp_dates a, temp_counts_pre b where b.date between a.date - interval '1 year' and a.date group by a.date, author_id;
    SELECT 4509688
    Time: 129317.276 ms
    
    postgres=# explain analyze verbose create temporary table temp_counts as select a.date, author_id, sum(count) count from temp_dates a, temp_counts_pre b where b.date between a.date - interval '1 year' and a.date group by a.date, author_id;
                                                                                             QUERY PLAN                                                                                         
    --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
     GroupAggregate  (cost=27137620.94..28339921.77 rows=40000 width=20) (actual time=192270.885..276757.795 rows=4509688 loops=1)
       Output: a.date, b.author_id, sum(b.count)
       Group Key: a.date, b.author_id
       ->  Sort  (cost=27137620.94..27438071.14 rows=120180083 width=20) (actual time=192270.866..238706.102 rows=56896009 loops=1)
             Output: a.date, b.author_id, b.count
             Sort Key: a.date, b.author_id
             Sort Method: external sort  Disk: 1881848kB
             ->  Nested Loop  (cost=0.42..3615179.19 rows=120180083 width=20) (actual time=0.055..91374.964 rows=56896009 loops=1)
                   Output: a.date, b.author_id, b.count
                   ->  Seq Scan on pg_temp_2.temp_dates a  (cost=0.00..88.75 rows=6375 width=4) (actual time=0.018..7.720 rows=5634 loops=1)
                         Output: a.date
                   ->  Index Scan using temp_counts_pre_date_author_id on pg_temp_2.temp_counts_pre b  (cost=0.42..378.55 rows=18852 width=20) (actual time=0.015..6.305 rows=10099 loops=5634)
                         Output: b.date, b.author_id, b.count
                         Index Cond: ((b.date >= (a.date - '1 year'::interval)) AND (b.date <= a.date))
     Planning time: 0.280 ms
     Execution time: 282411.389 ms
    (16 rows)
    
    Time: 282419.638 ms
    

    I wonder if I can rewrite that query to insert each day's data separately so it doesn't have to sort a terabyte:wtf: of data at once...





  • create temporary table temp_counts (date date not null, author_id bigint, count bigint not null);
    do $$
    declare
        a record;
    begin
        for a in select date from temp_dates order by date asc loop
            insert into temp_counts
            select a.date, author_id, sum(count) count
            from temp_counts_pre b
            where b.date between a.date - interval '1 year' and a.date
            group by a.date, author_id;
        end loop;
    end;
    $$ language plpgsql;
    

    Unfortunately, each of the INSERT queries takes about 50ms, which means instead of taking 2 minutes, now it takes 5 minutes.
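
    (A window-frame version would avoid the per-day loop entirely, but it's only a sketch: RANGE frames with an interval offset aren't supported on this Postgres version -- they landed in PostgreSQL 11 -- and it only emits rows for days an author actually posted, so it's not a drop-in replacement for the cross join against temp_dates.)

    select date,
           author_id,
           sum(count) over (
               partition by author_id
               order by date
               range between interval '1 year' preceding and current row
           ) as rolling_count
    from temp_counts_pre;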


  • Discourse touched me in a no-no place

    @ben_lubar said in @ben_lubar tries to parse <del>ServerCooties</del> data:

    I wonder if I can rewrite that query to insert each day's data separately so it doesn't have to sort a terabyte of data at once...

    It might be faster to just do one table scan, and figure out which records you want outside the SQL. Yes, even if it's the whole table. It would certainly be saner.
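
    The SQL side of that shrinks to a single ordered scan (a sketch; the rolling-year window and the per-day top-10 bookkeeping would live in application code):

    select author_id, created_at::date as post_date
    from posts
    where author_id is not null
    order by created_at;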


  • Discourse touched me in a no-no place

    @ben_lubar said in @ben_lubar tries to parse <del>ServerCooties</del> data:

    Unfortunately, each of the INSERT queries takes about 50ms

    I haven't got the spare brainpower to decipher what you're trying to accomplish at the moment. Can you explain it in English instead of SQL?



  • @FrostCat said in @ben_lubar tries to parse <del>ServerCooties</del> data:

    Can you explain it in English

    TL;DR version: Top 10 users for "posts in the last year" for each day.


  • Discourse touched me in a no-no place

    @ben_lubar Ok. Let me cogitate for a bit. I still think your best bet is going to be--based on what's been said so far--a single table scan of posts into a temp table with some of the indexes you want. And bear in mind SQL per se is not my forte so I may not be able to tell you exactly how to write the SQL.

    How would I find the db schema, such as it is? Don't actually give it to me now, I won't be able to process it.



  • @FrostCat said in @ben_lubar tries to parse <del>ServerCooties</del> data:

    How would I find the db schema, such as it is?

    When combined with the next sentence you wrote, I cannot parse what you meant.

    In any case, when you want it, it'll be here:

    postgres=# \dt           
                      List of relations
     Schema |          Name           | Type  |  Owner   
    --------+-------------------------+-------+----------
     public | categories              | table | postgres
     public | groups                  | table | postgres
     public | posts                   | table | postgres
     public | topics                  | table | postgres
     public | user_favorite_posts     | table | postgres
     public | user_groups             | table | postgres
     public | user_ips                | table | postgres
     public | user_post_votes         | table | postgres
     public | user_watched_categories | table | postgres
     public | user_watched_topics     | table | postgres
     public | user_watched_users      | table | postgres
     public | users                   | table | postgres
    (12 rows)
    
    postgres=# \d categories 
                                           Table "public.categories"
           Column       |          Type           |                        Modifiers                        
    --------------------+-------------------------+---------------------------------------------------------
     id                 | bigint                  | not null default nextval('categories_id_seq'::regclass)
     name               | text                    | not null
     slug               | text                    | not null
     parent_category_id | bigint                  | 
     position           | integer                 | 
     description        | text                    | 
     link               | character varying(2048) | 
     icon               | character varying(255)  | 
     fg_color           | character varying(255)  | 
     bg_color           | character varying(255)  | 
     recent_replies     | integer                 | 
     disabled           | boolean                 | not null default false
    Indexes:
        "categories_pkey" PRIMARY KEY, btree (id)
        "categories_slug" UNIQUE, btree (slug)
    Check constraints:
        "categories_slug_exists" CHECK (slug <> ''::text)
    Foreign-key constraints:
        "categories_parent_category_id_fkey" FOREIGN KEY (parent_category_id) REFERENCES categories(id) ON UPDATE CASCADE ON DELETE SET NULL
    Referenced by:
        TABLE "categories" CONSTRAINT "categories_parent_category_id_fkey" FOREIGN KEY (parent_category_id) REFERENCES categories(id) ON UPDATE CASCADE ON DELETE SET NULL
        TABLE "topics" CONSTRAINT "topics_category_id_fkey" FOREIGN KEY (category_id) REFERENCES categories(id) ON UPDATE CASCADE ON DELETE RESTRICT
        TABLE "user_watched_categories" CONSTRAINT "user_watched_categories_category_id_fkey" FOREIGN KEY (category_id) REFERENCES categories(id) ON UPDATE CASCADE ON DELETE CASCADE
    Triggers:
        categories_make_slug BEFORE INSERT OR UPDATE OF name, slug ON categories FOR EACH ROW EXECUTE PROCEDURE make_slug()
    
    postgres=# \d groups
                               Table "public.groups"
            Column         |           Type           |       Modifiers        
    -----------------------+--------------------------+------------------------
     name                  | text                     | not null
     slug                  | text                     | not null
     create_time           | timestamp with time zone | not null default now()
     user_title            | text                     | 
     description           | text                     | not null
     icon                  | character varying(255)   | 
     label_color           | character varying(255)   | 
     cover_url             | character varying(2048)  | 
     cover_thumb_url       | character varying(2048)  | 
     cover_position        | double precision[]       | 
     hidden                | boolean                  | not null default false
     system                | boolean                  | not null default false
     private               | boolean                  | not null default true
     disable_join_requests | boolean                  | not null default false
     user_title_enabled    | boolean                  | not null default false
    Indexes:
        "groups_pkey" PRIMARY KEY, btree (name)
        "groups_slug" UNIQUE, btree (slug)
        "groups_create_time" btree (create_time)
    Check constraints:
        "groups_slug_exists" CHECK (slug <> ''::text)
    Referenced by:
        TABLE "user_groups" CONSTRAINT "user_groups_group_name_fkey" FOREIGN KEY (group_name) REFERENCES groups(name) ON UPDATE CASCADE ON DELETE CASCADE
        TABLE "users" CONSTRAINT "users_group_title_fkey" FOREIGN KEY (group_title) REFERENCES groups(name) ON UPDATE CASCADE ON DELETE SET NULL
    Triggers:
        groups_make_slug BEFORE INSERT OR UPDATE OF name, slug ON groups FOR EACH ROW EXECUTE PROCEDURE make_slug()
    
    postgres=# \d posts
                                          Table "public.posts"
         Column     |           Type           |                     Modifiers                      
    ----------------+--------------------------+----------------------------------------------------
     id             | bigint                   | not null default nextval('posts_id_seq'::regclass)
     topic_id       | bigint                   | not null
     author_id      | bigint                   | 
     author_name    | character varying(255)   | 
     parent_post_id | bigint                   | 
     content        | text                     | not null
     created_at     | timestamp with time zone | not null default now()
     ip             | inet[]                   | 
     editor_id      | bigint                   | 
     edited_at      | timestamp with time zone | 
     deleted        | boolean                  | not null default false
    Indexes:
        "posts_pkey" PRIMARY KEY, btree (id)
    Foreign-key constraints:
        "posts_author_id_fkey" FOREIGN KEY (author_id) REFERENCES users(id) ON UPDATE CASCADE ON DELETE SET NULL
        "posts_editor_id_fkey" FOREIGN KEY (editor_id) REFERENCES users(id) ON UPDATE CASCADE ON DELETE SET NULL
        "posts_parent_post_id_fkey" FOREIGN KEY (parent_post_id) REFERENCES posts(id) ON UPDATE CASCADE ON DELETE SET NULL
        "posts_topic_id_fkey" FOREIGN KEY (topic_id) REFERENCES topics(id) ON UPDATE CASCADE ON DELETE CASCADE
    Referenced by:
        TABLE "posts" CONSTRAINT "posts_parent_post_id_fkey" FOREIGN KEY (parent_post_id) REFERENCES posts(id) ON UPDATE CASCADE ON DELETE SET NULL
        TABLE "topics" CONSTRAINT "topics_main_post_id_fkey" FOREIGN KEY (main_post_id) REFERENCES posts(id) ON UPDATE CASCADE ON DELETE SET NULL
        TABLE "user_favorite_posts" CONSTRAINT "user_favorite_posts_post_id_fkey" FOREIGN KEY (post_id) REFERENCES posts(id) ON UPDATE CASCADE ON DELETE CASCADE
        TABLE "user_post_votes" CONSTRAINT "user_post_votes_post_id_fkey" FOREIGN KEY (post_id) REFERENCES posts(id) ON UPDATE CASCADE ON DELETE CASCADE
    
    postgres=# \d topics
                                         Table "public.topics"
        Column    |           Type           |                      Modifiers                      
    --------------+--------------------------+-----------------------------------------------------
     id           | bigint                   | not null default nextval('topics_id_seq'::regclass)
     name         | text                     | not null
     slug         | text                     | not null
     category_id  | bigint                   | not null
     author_id    | bigint                   | 
     main_post_id | bigint                   | 
     created_at   | timestamp with time zone | not null default now()
     view_count   | bigint                   | not null default 0
     locked       | boolean                  | not null default false
     deleted      | boolean                  | not null default false
     pinned       | boolean                  | not null default false
    Indexes:
        "topics_pkey" PRIMARY KEY, btree (id)
        "topics_slug" btree (slug)
    Foreign-key constraints:
        "topics_author_id_fkey" FOREIGN KEY (author_id) REFERENCES users(id) ON UPDATE CASCADE ON DELETE SET NULL
        "topics_category_id_fkey" FOREIGN KEY (category_id) REFERENCES categories(id) ON UPDATE CASCADE ON DELETE RESTRICT
        "topics_main_post_id_fkey" FOREIGN KEY (main_post_id) REFERENCES posts(id) ON UPDATE CASCADE ON DELETE SET NULL
    Referenced by:
        TABLE "posts" CONSTRAINT "posts_topic_id_fkey" FOREIGN KEY (topic_id) REFERENCES topics(id) ON UPDATE CASCADE ON DELETE CASCADE
        TABLE "user_watched_topics" CONSTRAINT "user_watched_topics_topic_id_fkey" FOREIGN KEY (topic_id) REFERENCES topics(id) ON UPDATE CASCADE ON DELETE CASCADE
    Triggers:
        topics_make_slug BEFORE INSERT OR UPDATE OF name, slug ON topics FOR EACH ROW EXECUTE PROCEDURE make_slug()
    
    postgres=# \d user_favorite_posts
                    Table "public.user_favorite_posts"
        Column    |           Type           |       Modifiers        
    --------------+--------------------------+------------------------
     user_id      | bigint                   | not null
     post_id      | bigint                   | not null
     favorited_at | timestamp with time zone | not null default now()
    Indexes:
        "user_favorite_posts_pkey" PRIMARY KEY, btree (user_id, post_id)
    Foreign-key constraints:
        "user_favorite_posts_post_id_fkey" FOREIGN KEY (post_id) REFERENCES posts(id) ON UPDATE CASCADE ON DELETE CASCADE
        "user_favorite_posts_user_id_fkey" FOREIGN KEY (user_id) REFERENCES users(id) ON UPDATE CASCADE ON DELETE CASCADE
    
    postgres=# \d user_groups
                       Table "public.user_groups"
       Column   |           Type           |       Modifiers        
    ------------+--------------------------+------------------------
     user_id    | bigint                   | not null
     group_name | text                     | not null
     is_owner   | boolean                  | not null default false
     join_date  | timestamp with time zone | not null default now()
    Indexes:
        "user_groups_pkey" PRIMARY KEY, btree (user_id, group_name)
    Foreign-key constraints:
        "user_groups_group_name_fkey" FOREIGN KEY (group_name) REFERENCES groups(name) ON UPDATE CASCADE ON DELETE CASCADE
        "user_groups_user_id_fkey" FOREIGN KEY (user_id) REFERENCES users(id) ON UPDATE CASCADE ON DELETE CASCADE
    
    postgres=# \d user_ips   
                        Table "public.user_ips"
      Column   |           Type           |       Modifiers        
    -----------+--------------------------+------------------------
     user_id   | bigint                   | not null
     ip        | inet                     | not null
     last_seen | timestamp with time zone | not null default now()
    Indexes:
        "user_ips_pkey" PRIMARY KEY, btree (user_id, ip)
    Foreign-key constraints:
        "user_ips_user_id_fkey" FOREIGN KEY (user_id) REFERENCES users(id) ON UPDATE CASCADE ON DELETE CASCADE
    
    postgres=# \d user_post_votes
                      Table "public.user_post_votes"
        Column    |           Type           |       Modifiers        
    --------------+--------------------------+------------------------
     user_id      | bigint                   | not null
     post_id      | bigint                   | not null
     up           | boolean                  | 
     last_changed | timestamp with time zone | not null default now()
    Indexes:
        "user_post_votes_pkey" PRIMARY KEY, btree (user_id, post_id)
    Foreign-key constraints:
        "user_post_votes_post_id_fkey" FOREIGN KEY (post_id) REFERENCES posts(id) ON UPDATE CASCADE ON DELETE CASCADE
        "user_post_votes_user_id_fkey" FOREIGN KEY (user_id) REFERENCES users(id) ON UPDATE CASCADE ON DELETE CASCADE
    
    postgres=# \d user_watched_categories
                  Table "public.user_watched_categories"
        Column    |           Type           |       Modifiers        
    --------------+--------------------------+------------------------
     user_id      | bigint                   | not null
     category_id  | bigint                   | not null
     watched      | boolean                  | 
     last_changed | timestamp with time zone | not null default now()
    Indexes:
        "user_watched_categories_pkey" PRIMARY KEY, btree (user_id, category_id)
    Foreign-key constraints:
        "user_watched_categories_category_id_fkey" FOREIGN KEY (category_id) REFERENCES categories(id) ON UPDATE CASCADE ON DELETE CASCADE
        "user_watched_categories_user_id_fkey" FOREIGN KEY (user_id) REFERENCES users(id) ON UPDATE CASCADE ON DELETE CASCADE
    
    postgres=# \d user_watched_topics    
                    Table "public.user_watched_topics"
        Column    |           Type           |       Modifiers        
    --------------+--------------------------+------------------------
     user_id      | bigint                   | not null
     topic_id     | bigint                   | not null
     watched      | boolean                  | 
     last_changed | timestamp with time zone | not null default now()
    Indexes:
        "user_watched_topics_pkey" PRIMARY KEY, btree (user_id, topic_id)
    Foreign-key constraints:
        "user_watched_topics_topic_id_fkey" FOREIGN KEY (topic_id) REFERENCES topics(id) ON UPDATE CASCADE ON DELETE CASCADE
        "user_watched_topics_user_id_fkey" FOREIGN KEY (user_id) REFERENCES users(id) ON UPDATE CASCADE ON DELETE CASCADE
    
    postgres=# \d user_watched_users 
                     Table "public.user_watched_users"
         Column     |           Type           |       Modifiers        
    ----------------+--------------------------+------------------------
     user_id        | bigint                   | not null
     target_user_id | bigint                   | not null
     watched        | boolean                  | 
     last_changed   | timestamp with time zone | not null default now()
    Indexes:
        "user_watched_users_pkey" PRIMARY KEY, btree (user_id, target_user_id)
    Foreign-key constraints:
        "user_watched_users_target_user_id_fkey" FOREIGN KEY (target_user_id) REFERENCES users(id) ON UPDATE CASCADE ON DELETE CASCADE
        "user_watched_users_user_id_fkey" FOREIGN KEY (user_id) REFERENCES users(id) ON UPDATE CASCADE ON DELETE CASCADE
    
    postgres=# \d users             
                                           Table "public.users"
          Column      |           Type           |                     Modifiers                      
    ------------------+--------------------------+----------------------------------------------------
     id               | bigint                   | not null default nextval('users_id_seq'::regclass)
     name             | character varying(255)   | not null
     slug             | character varying(255)   | not null
     password         | character varying(60)    | 
     full_name        | text                     | 
     email            | citext                   | 
     email_confirmed  | boolean                  | not null default false
     banned           | boolean                  | not null default false
     ban_expiration   | timestamp with time zone | 
     join_date        | timestamp with time zone | not null default now()
     last_online      | timestamp with time zone | 
     picture          | character varying(2048)  | 
     uploaded_picture | character varying(2048)  | 
     cover_url        | character varying(2048)  | 
     cover_position   | double precision[]       | 
     group_title      | text                     | 
     website          | character varying(2048)  | 
     location         | text                     | 
     signature        | text                     | 
     about_me         | text                     | 
     birthday         | date                     | 
     profile_views    | bigint                   | not null default 0
    Indexes:
        "users_pkey" PRIMARY KEY, btree (id)
        "users_email" UNIQUE, btree (email)
        "users_slug" UNIQUE, btree (slug)
        "users_join_date" btree (join_date)
    Check constraints:
        "users_slug_exists" CHECK (slug::text <> ''::text)
    Foreign-key constraints:
        "users_group_title_fkey" FOREIGN KEY (group_title) REFERENCES groups(name) ON UPDATE CASCADE ON DELETE SET NULL
    Referenced by:
        TABLE "posts" CONSTRAINT "posts_author_id_fkey" FOREIGN KEY (author_id) REFERENCES users(id) ON UPDATE CASCADE ON DELETE SET NULL
        TABLE "posts" CONSTRAINT "posts_editor_id_fkey" FOREIGN KEY (editor_id) REFERENCES users(id) ON UPDATE CASCADE ON DELETE SET NULL
        TABLE "topics" CONSTRAINT "topics_author_id_fkey" FOREIGN KEY (author_id) REFERENCES users(id) ON UPDATE CASCADE ON DELETE SET NULL
        TABLE "user_favorite_posts" CONSTRAINT "user_favorite_posts_user_id_fkey" FOREIGN KEY (user_id) REFERENCES users(id) ON UPDATE CASCADE ON DELETE CASCADE
        TABLE "user_groups" CONSTRAINT "user_groups_user_id_fkey" FOREIGN KEY (user_id) REFERENCES users(id) ON UPDATE CASCADE ON DELETE CASCADE
        TABLE "user_ips" CONSTRAINT "user_ips_user_id_fkey" FOREIGN KEY (user_id) REFERENCES users(id) ON UPDATE CASCADE ON DELETE CASCADE
        TABLE "user_post_votes" CONSTRAINT "user_post_votes_user_id_fkey" FOREIGN KEY (user_id) REFERENCES users(id) ON UPDATE CASCADE ON DELETE CASCADE
        TABLE "user_watched_categories" CONSTRAINT "user_watched_categories_user_id_fkey" FOREIGN KEY (user_id) REFERENCES users(id) ON UPDATE CASCADE ON DELETE CASCADE
        TABLE "user_watched_topics" CONSTRAINT "user_watched_topics_user_id_fkey" FOREIGN KEY (user_id) REFERENCES users(id) ON UPDATE CASCADE ON DELETE CASCADE
        TABLE "user_watched_users" CONSTRAINT "user_watched_users_target_user_id_fkey" FOREIGN KEY (target_user_id) REFERENCES users(id) ON UPDATE CASCADE ON DELETE CASCADE
        TABLE "user_watched_users" CONSTRAINT "user_watched_users_user_id_fkey" FOREIGN KEY (user_id) REFERENCES users(id) ON UPDATE CASCADE ON DELETE CASCADE
    Triggers:
        users_make_slug BEFORE INSERT OR UPDATE OF name, slug ON users FOR EACH ROW EXECUTE PROCEDURE make_slug()
    

  • Discourse touched me in a no-no place

    @ben_lubar said in @ben_lubar tries to parse <del>ServerCooties</del> data:

    I cannot parse what you meant.

    I haven't got the time or extra brainpower while I'm at work to expend serious thought on it. Tonight or this weekend I bet I can make it run better, though. Hit me up if you see me on GW2.



  • postgres=# select case when join_date < timestamp '2014-02-03T11:33:49' then 'Community Server' when join_date < timestamp '2016-03-19T15:00:00' then 'Discourse' else 'NodeBB' end era, extract(year from join_date) as year, count(*) from users left join posts on users.id = posts.author_id left join user_post_votes on users.id = user_post_votes.user_id where posts.id is null and user_post_votes.user_id is null group by era, year order by year asc;
           era        | year | count 
    ------------------+------+-------
     Community Server | 2004 |   169
     Community Server | 2005 |  1935
     Community Server | 2006 |  2671
     Community Server | 2007 |  2218
     Community Server | 2008 |  2305
     Community Server | 2009 |  4565
     Community Server | 2010 | 33301
     Community Server | 2011 | 38782
     Community Server | 2012 |  7794
     Community Server | 2013 | 36990
     Community Server | 2014 |   303
     Discourse        | 2014 |  2284
     Discourse        | 2015 |   467
     Discourse        | 2016 |    59
     NodeBB           | 2016 |    92
    (15 rows)
    

  • Discourse touched me in a no-no place

    @ben_lubar Are you counting "# of unliked posts by year"? "# of orphaned likes because :doing_it_wrong: deleted the post"?



  • @FrostCat no, I'm counting the number of posts a user has made from a year before a given day to that day.


  • Discourse touched me in a no-no place

    @ben_lubar said in @ben_lubar tries to parse <del>ServerCooties</del> data:

    a user

    I didn't see a user ID in the sql, but whatever. I figured my second question would be some kind of clue to the nature of my post, but maybe @blakeyrat's borrowed your shoulder aliens.



  • @ben_lubar If you're writing loops in SQL, you're doing it wrong.



  • @ben_lubar said in @ben_lubar tries to parse <del>ServerCooties</del> data:

    no, I'm counting the number of posts a user has made from a year before a given day to that day.

    That might be a bit tricky. Give me a bit.


  • ♿ (Parody)

    @ben_lubar said in @ben_lubar tries to parse <del>ServerCooties</del> data:

    @FrostCat no, I'm counting the number of posts a user has made from a year before a given day to that day.

    Ugh...did you look at the spoon (and related) badge query from discourse? NB: I haven't looked seriously at your query because you put it all in one line, and like blakey, ain't nobody got time for that.



  • Here:

    select p.PostDate, u.UserName, topTen.PostCount
    from Post p
    	join (
    		select lastPost.PostDate as 'LastPostDate',
    			allPost.UserID as 'UserID',
    			count(allPost.PostID) as 'PostCount'
    		from Post lastPost
    			join Post allPost on allPost.PostDate > dateadd( yy, -1, lastPost.PostDate )
    		group by lastPost.PostDate, allPost.UserID
    		--order by lastPost.PostDate desc, count( allPost.PostID ) desc
    	) topTen on topTen.LastPostDate = p.PostDate
    	join [User] u on u.UserID = topTen.UserID
    order by p.PostDate desc, topTen.PostCount desc
    

    I did indeed need to use a subquery, alas. (If you wanted the whole list for each date, not just the top 10, I'm 99.9% sure you can write this without a subquery. But you'd be sending a shitload more data over the wire.)

    Considerations:

    1. This query will not "fill in blanks" (if there's a day with zero posts, it won't show up in the results. Posts near the start of the timeline won't have the full 10 people listed, etc.) That's tough to fix without a data warehouse-type "date" table.
    2. This query will not round to the nearest day-- my sample data happened to all be at timecode 00:00:00 but in real life data, of course, you want to use datepart or equivalent to ensure the group bys and joins happen only on dates and not times. In T-SQL you can cast to Date instead of DateTime. Not sure how you do it in Postgres (but see the note after this list).
    3. This query is in T-SQL, no guarantees it translates into Postgres SQL.
    4. The commented-out order by is helpful when trying to debug the inner query, but obviously useless when you're using it as an inner query. That's why it's commented out. Why make the DB engine sort when it doesn't need to?
    5. "allPost" is a badly-named table alias. It's "all posts in the year preceding lastPost's date". But that's pretty long.

  • Discourse touched me in a no-no place

    @blakeyrat I don't know SQL (or T-SQL) as well as I'd like. What's the difference[1] between "from Post lastPost" and "join [User] u" ? I'd assumed lastPost and u are tablename aliases (don't know the right term), but then why does one have [] and the other not?

    [1] regarding names, not the sql clause from vs join.


  • ♿ (Parody)

    @FrostCat That's (square brackets) how you use a keyword as a name in sql server.



  • @FrostCat said in @ben_lubar tries to parse <del>ServerCooties</del> data:

    What's the difference[1] between "from Post lastPost" and "join [User] u" ? I'd assumed lastPost and u are tablename aliases (don't know the right term), but then why does one have [] and the other not?

    User's a reserved keyword in SQL. The square braces indicate I want to use the User database object, not the reserved keyword.

    My demo database I did in like 45 seconds, and I was too lazy once I realized User was a reserved keyword to switch it to "Users" which is the normal solution to this problem.


  • ♿ (Parody)

    @blakeyrat said in @ben_lubar tries to parse <del>ServerCooties</del> data:

    join Post allPost on allPost.PostDate > dateadd( yy, -1, lastPost.PostDate )

    Seems like this should be something like:

    join Post allPost on allPost.PostDate between dateadd( yy, -1, lastPost.PostDate ) and lastPost.PostDate

    Or you're getting stuff in the future, too.



  • @boomzilla Yup, I think you're right.

    select p.PostDate, u.UserName, topTen.PostCount
    from Post p
    	join (
    		select lastPost.PostDate as 'LastPostDate',
    			allPost.UserID as 'UserID',
    			count(allPost.PostID) as 'PostCount'
    		from Post lastPost
    			join Post allPost on allPost.PostDate between dateadd( yy, -1, lastPost.PostDate ) and lastPost.PostDate
    		group by lastPost.PostDate, allPost.UserID
    		--order by lastPost.PostDate desc, count( allPost.PostID ) desc
    	) topTen on topTen.LastPostDate = p.PostDate
    	join [User] u on u.UserID = topTen.UserID
    order by p.PostDate desc, topTen.PostCount desc
    

    Good catch. SQL's tricky.

    To pontificate a bit, the trick here is to join a table to itself. Something a lot of SQL neophytes don't realize is possible/useful, but once you learn the trick you can solve millions of problems with it. There ain't nothing in the SQL spec saying you can't have 1, 2, or even 50 copies of the Post table in your joins.



  • @blakeyrat

    postgres=# explain select p.created_at::date, u.name, tt.count from posts p join (select lp.created_at::date last_post_date, ap.author_id user_id, count(ap.id) post_count from posts lp join posts ap on ap.created_at::date between lp.created_at::date - interval '1 year' and lp.created_at::date group by last_post_date, user_id) tt on tt.last_post_date = p.created_at::date join users u on u.id = tt.user_id order by p.created_at::date desc, tt.post_count desc;
    ERROR:  column "p.created_at" must appear in the GROUP BY clause or be used in an aggregate function
    LINE 1: explain select p.created_at::date, u.name, tt.count from pos...
                           ^
    

    Postgres apparently doesn't like GROUP BY inside a join.
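
    (Likely what's actually going on: the subquery's column is aliased post_count, so tt.count doesn't name a column -- and Postgres falls back to treating it as the function call count(tt), which makes the outer query an aggregate and triggers the GROUP BY complaint. Referencing the alias should sidestep it:)

    select p.created_at::date, u.name, tt.post_count
    from posts p
    join (select lp.created_at::date as last_post_date,
                 ap.author_id as user_id,
                 count(ap.id) as post_count
          from posts lp
          join posts ap
            on ap.created_at::date
               between lp.created_at::date - interval '1 year'
                   and lp.created_at::date
          group by 1, 2) tt
      on tt.last_post_date = p.created_at::date
    join users u on u.id = tt.user_id
    order by p.created_at::date desc, tt.post_count desc;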



  • @ben_lubar Ben L, if you post another SQL query on a single line I will hunt you down and kill you, I am not kidding.

    Your keyboard has an "enter" key, please use it.


    @ben_lubar said in @ben_lubar tries to parse <del>ServerCooties</del> data:

    Postgres apparently doesn't like GROUP BY inside a join.

    I can't help you with that. I use a DBMS that doesn't suck shit.

    EDIT: this StackOverflow question contains a working subquery with a group by in it:

    select n1.name, n1.author_id, cast(count_1 as numeric)/total_count
      from (select id, name, author_id, count(1) as count_1
              from names
              group by id, name, author_id) n1
    inner join (select author_id, count(1) as total_count
                  from names
                  group by author_id) n2
      on (n2.author_id = n1.author_id)
    

    So you can do it. I'm not sure what's wrong with the one-line monstrosity you posted.


  • ♿ (Parody)

    FYI, here's the query for the 🥄 :badger: :

    WITH exclusions AS ( /* Which topics to exclude from counters */
    	SELECT user_id, id, topic_id, post_number
    	FROM posts
    	WHERE raw LIKE '%c09fa970-5a9a-11e4-8ed6-0800200c9a66%' AND
    	user_id IN  (
    		SELECT gu.user_id
    		FROM group_users gu
    		WHERE group_id IN(
    			SELECT g.id
    			FROM groups g
    			WHERE g.name IN ('badge_granters_admin')
    		)
    	)
    ),
    LastMonth AS ( /* Count eligible posts from last month */
        SELECT row_number() OVER (ORDER BY count(*) DESC, bp.user_id), u.username, bp.user_id, count(*)
        FROM badge_posts bp
        JOIN users u on u.id=bp.user_id AND
            bp.user_id NOT IN ( /* ignore bots */
                    SELECT gu.user_id
                    FROM group_users gu
                    WHERE group_id IN(
                        SELECT g.id
                        FROM groups g
                        WHERE g.name IN ('bots')
                    )
            )
        WHERE topic_id NOT IN ( /* short topics */
           SELECT topic_id FROM badge_posts GROUP BY topic_id HAVING count(topic_id) <10
        ) AND topic_id NOT IN ( /* Ineligible topics */
           SELECT topic_id
           FROM exclusions
        ) AND bp.created_at > CURRENT_DATE - INTERVAL '1 month'
        GROUP BY u.username, bp.user_id
        HAVING count(*) > 1
        ORDER BY count(*) DESC, bp.user_id
    ),
    TotalUsers AS (
        SELECT max(row_number) from LastMonth
    )
    SELECT username, user_id, row_number, (row_number*100/TotalUsers.max), count, CURRENT_DATE granted_at
    FROM LastMonth, TotalUsers
    WHERE row_number = 1
    


  • @blakeyrat said in @ben_lubar tries to parse <del>ServerCooties</del> data:

    Your keyboard has an "enter" key, please use it.

    explain select p.created_at::date, u.name, tt.count
    from posts p
    join (select lp.created_at::date last_post_date, ap.author_id user_id, count(ap.id) post_count
          from posts lp
          join posts ap
          on ap.created_at::date between lp.created_at::date - interval '1 year'
          and lp.created_at::date group by last_post_date, user_id) tt
    on tt.last_post_date = p.created_at::date
    join users u
    on u.id = tt.user_id
    group by p.created_at::date, u.name, tt.post_count
    order by p.created_at::date desc, tt.post_count desc;
    
                                                                                                 QUERY PLAN                                                                                              
    -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
     GroupAggregate  (cost=3904072989514.71..4088823725666.74 rows=7390029446081 width=71)
       Group Key: ((p.created_at)::date), tt.post_count, u.name
       ->  Sort  (cost=3904072989514.71..3922548063129.91 rows=7390029446081 width=71)
             Sort Key: ((p.created_at)::date) DESC, tt.post_count DESC, u.name
             ->  Merge Join  (cost=54950161682.14..202754857645.02 rows=7390029446081 width=71)
                   Merge Cond: (((p.created_at)::date) = tt.last_post_date)
                   ->  Sort  (cost=163649.36..165902.26 rows=901161 width=8)
                         Sort Key: ((p.created_at)::date)
                         ->  Seq Scan on posts p  (cost=0.00..62194.61 rows=901161 width=8)
                   ->  Materialize  (cost=54949998032.79..54958198597.89 rows=1640113020 width=67)
                         ->  Sort  (cost=54949998032.79..54954098315.34 rows=1640113020 width=67)
                               Sort Key: tt.last_post_date
                               ->  Hash Join  (cost=53432873040.44..54429888270.11 rows=1640113020 width=67)
                                     Hash Cond: (tt.user_id = u.id)
                                     ->  Subquery Scan on tt  (cost=53432866404.24..54372092444.88 rows=1640113020 width=64)
                                           ->  GroupAggregate  (cost=53432866404.24..54355691314.68 rows=1640113020 width=24)
                                                 Group Key: ((lp.created_at)::date), ap.author_id
                                                 ->  Sort  (cost=53432866404.24..53658447278.66 rows=90232349769 width=24)
                                                       Sort Key: ((lp.created_at)::date), ap.author_id
                                                       ->  Nested Loop  (cost=0.00..27761429835.18 rows=90232349769 width=24)
                                                             Join Filter: (((ap.created_at)::date <= (lp.created_at)::date) AND ((ap.created_at)::date >= ((lp.created_at)::date - '1 year'::interval)))
                                                             ->  Seq Scan on posts ap  (cost=0.00..62194.61 rows=901161 width=24)
                                                             ->  Materialize  (cost=0.00..70221.42 rows=901161 width=8)
                                                                   ->  Seq Scan on posts lp  (cost=0.00..62194.61 rows=901161 width=8)
                                     ->  Hash  (cost=4033.98..4033.98 rows=141698 width=19)
                                           ->  Seq Scan on users u  (cost=0.00..4033.98 rows=141698 width=19)
    (26 rows)
    

    rows=7390029446081

    oh god



  • @ben_lubar It should return a maximum of 10 rows per day for each day that had at least one post.

    EDIT: oh I see the problem. My inner table NAMED top 10 at some point lost its top 10. Oops.
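
    Presumably the fix was just restoring the limit inside the subquery. A reconstruction consistent with the corrected plan in the next post (the exact text is inferred, not confirmed):

    explain select p.created_at::date, u.name, tt.post_count
    from posts p
    join (select lp.created_at::date last_post_date, ap.author_id user_id, count(ap.id) post_count
          from posts lp
          join posts ap
          on ap.created_at::date between lp.created_at::date - interval '1 year'
          and lp.created_at::date
          group by last_post_date, user_id
          order by last_post_date desc, post_count desc
          limit 10) tt
    on tt.last_post_date = p.created_at::date
    join users u
    on u.id = tt.user_id
    group by p.created_at::date, u.name, tt.post_count
    order by p.created_at::date desc, tt.post_count desc;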



  • @blakeyrat

                                                                                                    QUERY PLAN                                                                                                 
    -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
     GroupAggregate  (cost=54391206248.86..54391207375.31 rows=45058 width=71)
       Group Key: ((p.created_at)::date), tt.post_count, u.name
       ->  Sort  (cost=54391206248.86..54391206361.51 rows=45058 width=71)
             Sort Key: ((p.created_at)::date) DESC, tt.post_count DESC, u.name
             ->  Hash Join  (cost=54391133650.85..54391200914.49 rows=45058 width=71)
                   Hash Cond: ((p.created_at)::date = tt.last_post_date)
                   ->  Seq Scan on posts p  (cost=0.00..62194.61 rows=901161 width=8)
                   ->  Hash  (cost=54391133650.73..54391133650.73 rows=10 width=67)
                         ->  Nested Loop  (cost=54391133567.67..54391133650.73 rows=10 width=67)
                               ->  Subquery Scan on tt  (cost=54391133567.38..54391133567.50 rows=10 width=64)
                                     ->  Limit  (cost=54391133567.38..54391133567.40 rows=10 width=24)
                                           ->  Sort  (cost=54391133567.38..54395233849.93 rows=1640113020 width=24)
                                                 Sort Key: ((lp.created_at)::date) DESC, (count(ap.id)) DESC
                                                 ->  GroupAggregate  (cost=53432866404.24..54355691314.68 rows=1640113020 width=24)
                                                       Group Key: ((lp.created_at)::date), ap.author_id
                                                       ->  Sort  (cost=53432866404.24..53658447278.66 rows=90232349769 width=24)
                                                             Sort Key: ((lp.created_at)::date) DESC, ap.author_id
                                                             ->  Nested Loop  (cost=0.00..27761429835.18 rows=90232349769 width=24)
                                                                   Join Filter: (((ap.created_at)::date <= (lp.created_at)::date) AND ((ap.created_at)::date >= ((lp.created_at)::date - '1 year'::interval)))
                                                                   ->  Seq Scan on posts ap  (cost=0.00..62194.61 rows=901161 width=24)
                                                                   ->  Materialize  (cost=0.00..70221.42 rows=901161 width=8)
                                                                         ->  Seq Scan on posts lp  (cost=0.00..62194.61 rows=901161 width=8)
                               ->  Index Scan using users_pkey on users u  (cost=0.29..8.31 rows=1 width=19)
                                     Index Cond: (id = tt.user_id)
    (24 rows)
    

    nested nested nested loops are scary


  • ♿ (Parody)

    @ben_lubar said in @ben_lubar tries to parse <del>ServerCooties</del> data:

    oh god

    Your idea to pull up 10 users for each day for the last 12 years or whatever based on what we all did for the year previous to that was never going to be a reasonable query. Why do you want this?

    Maybe break it down by month? Make a table and put each user's post count by month in there (see the sketch below). Aggregating that would be simple. This is probably an even better idea if you really really really want to go by day.

    But either way you'll probably want / need to break the insert up or the commit / redo stuff will kill you. Literally. Dead ben_lubar.
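
    As a sketch of that summary-table idea (the table and column names here are invented for illustration; only posts.author_id and posts.created_at come from the real schema):

    -- One row per (user, month); backfill once, then aggregate cheaply.
    CREATE TABLE post_counts_by_month (
        user_id integer NOT NULL,
        month   date    NOT NULL,  -- first day of the month
        count   integer NOT NULL,
        PRIMARY KEY (user_id, month)
    );

    INSERT INTO post_counts_by_month (user_id, month, count)
    SELECT author_id, date_trunc('month', created_at)::date, count(*)
    FROM posts
    GROUP BY author_id, date_trunc('month', created_at)::date;

    Any "top posters for period X" question then becomes a scan over a few thousand small rows instead of a year-window self-join over the whole posts table.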



  • @boomzilla said in @ben_lubar tries to parse <del>ServerCooties</del> data:

    never going to be a reasonable query

    THIS DATA IS TOO BIG


  • ♿ (Parody)

    @ben_lubar Yes. You can't eat the elephant all at once.



  • @ben_lubar Hey you asked us for help. If you think you can write it better, knock yourself out.


  • ♿ (Parody)

    @ben_lubar Set up a script to do a day at a time, or something, and count all the posts for that day per user. So each day is done inside its own transaction (see the sketch below). This is going to suck no matter what, but when you get done, you'll be able to do a lot of interesting things with that data very quickly.

    Also...don't create any indexes or foreign keys on your table while you're filling it up. Do that afterwards.
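
    A minimal sketch of the day-at-a-time backfill (the table name is invented as above; a driving script would substitute each date in turn):

    -- Each day's counts get their own transaction, keeping WAL/redo pressure bounded.
    BEGIN;
    INSERT INTO post_counts_by_day (user_id, day, count)
    SELECT author_id, created_at::date, count(*)
    FROM posts
    WHERE created_at >= DATE '2016-07-09'
      AND created_at <  DATE '2016-07-10'
    GROUP BY author_id, created_at::date;
    COMMIT;

    -- Per the advice above, indexes only go on after the backfill finishes:
    CREATE INDEX ON post_counts_by_day (day, count DESC);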


  • Discourse touched me in a no-no place

    @blakeyrat said in @ben_lubar tries to parse <del>ServerCooties</del> data:

    User's a reserved keyword in SQL. The square braces indicate I want to use the User database object, not the reserved keyword

    Ah. That's kind of like how in Access, you could have a field with a space in the name, as long as you referred to it with brackets when you used it.


  • Discourse touched me in a no-no place

    @boomzilla said in @ben_lubar tries to parse <del>ServerCooties</del> data:

    count all the posts for that day per user.

    Quick question, how many posts are there, total?

    define temp-table tt
        field user as int
        field date as date
        field count as int
        .
    
    for each posts no-lock:
        find first tt where tt.user = posts.user_id
          and tt.date = posts.date
          no-error.
    
        if not available tt then do:
            /* first post for this user/day: create and key the record */
            create tt.
            assign tt.user = posts.user_id
                   tt.date = posts.date.
        end.
        tt.count = tt.count + 1.
    end.
    

    at the end of this, the temp-table is full of records that show how many posts every user had on every day in which that user had posts. That's likely to be a much nicer data set with which to work.

    One pass. No sorting. If you need a sort of some kind, put an appropriate index on the temp table. The query takes about as long to run as select user_id from posts.

    Want to know the top ten posters for a given day? select top 10 user from tt where date = whatever order by count desc. Only with the right syntax.
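
    In PostgreSQL terms, the "right syntax" against a per-day summary table (the hypothetical post_counts_by_day from above) would presumably be:

    SELECT user_id, count
    FROM post_counts_by_day
    WHERE day = DATE '2016-07-09'
    ORDER BY count DESC
    LIMIT 10;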



  • @FrostCat said in @ben_lubar tries to parse <del>ServerCooties</del> data:

    @blakeyrat said in @ben_lubar tries to parse <del>ServerCooties</del> data:

    User's a reserved keyword in SQL. The square braces indicate I want to use the User database object, not the reserved keyword

    Ah. That's kind of like how in Access, you could have a field with a space in the name, as long as you referred to it with brackets when you used it.

    T-SQL does allow spaces in the name if you delimit it with either square brackets or double quotes.

    PL/SQL allows aliases to have spaces inside double quotes. Square brackets apparently indicate something involving expressions.


  • Discourse touched me in a no-no place

    @djls45 said in @ben_lubar tries to parse <del>ServerCooties</del> data:

    T-SQL does allow spaces

    I was just trying to come up with an analogy. In Blakey's usage, I guess the brackets escape the normal meaning of the word.



  • what db is this?




  • Discourse touched me in a no-no place

    @djls45 said in @ben_lubar tries to parse <del>ServerCooties</del> data:

    PL/SQL allows aliases to have spaces inside double quotes.

    That's the syntax that the SQL standard specifies. Apparently MSSQL also supports it (as well as the bracketed form).
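
    To make the quoting variants concrete (identifiers invented for illustration):

    -- SQL standard / PostgreSQL / Oracle: double-quoted identifiers
    SELECT u."Full Name" FROM "User" u;

    -- T-SQL also accepts the bracketed form
    SELECT u.[Full Name] FROM [User] u;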



  • My progress on the !!SCIENCE!! script is here: https://gist.github.com/BenLubar/98fe068162c5ae9b11659b7029da3a76



  • 2016/07/09 19:46:18 float64 cannot be converted to int64 without losing precision: NaN
    main.Int64
            /home/ben/go/src/gist.github.com/98fe068162c5ae9b11659b7029da3a76/main.go:1491
    main.HandleTopic.func1
            /home/ben/go/src/gist.github.com/98fe068162c5ae9b11659b7029da3a76/main.go:1188
    main.ForSortedSet
            /home/ben/go/src/gist.github.com/98fe068162c5ae9b11659b7029da3a76/main.go:1561
    main.HandleTopic
            /home/ben/go/src/gist.github.com/98fe068162c5ae9b11659b7029da3a76/main.go:1195
    main.ForSortedSet
            /home/ben/go/src/gist.github.com/98fe068162c5ae9b11659b7029da3a76/main.go:1561
    main.main
            /home/ben/go/src/gist.github.com/98fe068162c5ae9b11659b7029da3a76/main.go:110
    runtime.main
            /home/ben/.golang/src/runtime/proc.go:183
    runtime.goexit
            /home/ben/.golang/src/runtime/asm_amd64.s:2086
    main.HandleTopic.func1
            /home/ben/go/src/gist.github.com/98fe068162c5ae9b11659b7029da3a76/main.go:1190: position
    main.ForSortedSet
            /home/ben/go/src/gist.github.com/98fe068162c5ae9b11659b7029da3a76/main.go:1561: ForSortedSet("tid:18672:bookmarks", ("21", NaN))
    main.HandleTopic
            /home/ben/go/src/gist.github.com/98fe068162c5ae9b11659b7029da3a76/main.go:1196: bookmarks
    main.ForSortedSet
            /home/ben/go/src/gist.github.com/98fe068162c5ae9b11659b7029da3a76/main.go:1561: ForSortedSet("topics:tid", ("18672", 9.805716e+11))
    exit status 1
    


  • 2016/07/10 11:04:40 topic:20281:"Re: the deleted post thread"
    2016/07/10 11:04:40 topic:20282:"Re: the deleted post thread"
    2016/07/10 11:04:40 topic:20283:"Fork + github + npm"
    2016/07/10 11:04:40 topic:20284:"Bot User Agents"
    2016/07/10 11:04:40 topic:20285:"Why is polygamy illegal?"
    2016/07/10 11:04:40 topic:20286:"GOOOOOGGGGGLLLLLEEEE"
    2016/07/10 11:04:40 pq: invalid byte sequence for encoding "UTF8": 0x00
    main.HandleTopic
            /home/ben/go/src/gist.github.com/98fe068162c5ae9b11659b7029da3a76/main.go:1177: insert topic into database
    main.ForSortedSet
            /home/ben/go/src/gist.github.com/98fe068162c5ae9b11659b7029da3a76/main.go:1565: ForSortedSet("topics:tid", ("20287", 1.466585097322e+12))
    exit status 1
    

    Goddamnit @aliceif