@ben_lubar tries to parse <del>ServerCooties</del> data
-
Here's the query I want to run:
```sql
with dates as (
    select ((select min(created_at)::date from posts) + day * interval '1 day')::date date
    from generate_series(0, (select max(created_at)::date from posts) - (select min(created_at)::date from posts)) day
    order by date asc
), counts as (
    select date, author_id, count(*)
    from dates, posts
    where created_at between date - interval '1 year' and date
    group by date, author_id
    order by date asc, count desc
), ranked as (
    select *, rank() over (partition by date order by count desc)
    from counts
)
select date, rank, name, count
from ranked
left join users on id = author_id
where rank <= 10
order by date asc, rank asc;
```
Here's what I get when I limit it to the last 10 days:
```
postgres=# explain analyze verbose
with dates as (
    select ((select min(created_at)::date from posts) + day * interval '1 day')::date date
    from generate_series(0, (select max(created_at)::date from posts) - (select min(created_at)::date from posts)) day
    order by date desc
    limit 10
), counts as (
    select date, author_id, count(*)
    from dates, posts
    where created_at between date - interval '1 year' and date
    group by date, author_id
    order by date asc, count desc
), ranked as (
    select *, rank() over (partition by date order by count desc)
    from counts
)
select date, rank, name, count
from ranked
left join users on id = author_id
where rank <= 10
order by date asc, rank asc;

                                                                      QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=208088.75..208103.92 rows=6067 width=31) (actual time=7513.875..7513.921 rows=100 loops=1)
   Output: ranked.date, ranked.rank, users.name, ranked.count
   Sort Key: ranked.date, ranked.rank
   Sort Method: quicksort  Memory: 32kB
   CTE dates
     ->  Limit  (cost=43.19..43.21 rows=10 width=4) (actual time=12.654..12.678 rows=10 loops=1)
           Output: ((($1 + ((day.day)::double precision * '1 day'::interval)))::date)
           InitPlan 2 (returns $1)
             ->  Result  (cost=0.51..0.53 rows=1 width=0) (actual time=0.044..0.044 rows=1 loops=1)
                   Output: ($0)::date
                   InitPlan 1 (returns $0)
                     ->  Limit  (cost=0.42..0.51 rows=1 width=8) (actual time=0.040..0.041 rows=1 loops=1)
                           Output: posts.created_at
                           ->  Index Only Scan using posts_created_at on public.posts  (cost=0.42..78856.24 rows=901161 width=8) (actual time=0.038..0.038 rows=1 loops=1)
                                 Output: posts.created_at
                                 Index Cond: (posts.created_at IS NOT NULL)
                                 Heap Fetches: 1
           InitPlan 4 (returns $3)
             ->  Result  (cost=0.51..0.53 rows=1 width=0) (actual time=0.095..0.096 rows=1 loops=1)
                   Output: ($2)::date
                   InitPlan 3 (returns $2)
                     ->  Limit  (cost=0.42..0.51 rows=1 width=8) (actual time=0.087..0.088 rows=1 loops=1)
                           Output: posts_1.created_at
                           ->  Index Only Scan Backward using posts_created_at on public.posts posts_1  (cost=0.42..78856.24 rows=901161 width=8) (actual time=0.085..0.085 rows=1 loops=1)
                                 Output: posts_1.created_at
                                 Index Cond: (posts_1.created_at IS NOT NULL)
                                 Heap Fetches: 1
           InitPlan 6 (returns $5)
             ->  Result  (cost=0.51..0.53 rows=1 width=0) (actual time=0.041..0.042 rows=1 loops=1)
                   Output: ($4)::date
                   InitPlan 5 (returns $4)
                     ->  Limit  (cost=0.42..0.51 rows=1 width=8) (actual time=0.037..0.037 rows=1 loops=1)
                           Output: posts_2.created_at
                           ->  Index Only Scan using posts_created_at on public.posts posts_2  (cost=0.42..78856.24 rows=901161 width=8) (actual time=0.035..0.035 rows=1 loops=1)
                                 Output: posts_2.created_at
                                 Index Cond: (posts_2.created_at IS NOT NULL)
                                 Heap Fetches: 1
           ->  Sort  (cost=41.61..44.11 rows=1000 width=4) (actual time=12.652..12.663 rows=10 loops=1)
                 Output: ((($1 + ((day.day)::double precision * '1 day'::interval)))::date)
                 Sort Key: ((($1 + ((day.day)::double precision * '1 day'::interval)))::date) DESC
                 Sort Method: top-N heapsort  Memory: 25kB
                 ->  Function Scan on pg_catalog.generate_series day  (cost=0.01..20.01 rows=1000 width=4) (actual time=1.121..8.068 rows=5634 loops=1)
                       Output: (($1 + ((day.day)::double precision * '1 day'::interval)))::date
                       Function Call: generate_series(0, ($3 - $5))
   CTE counts
     ->  Sort  (cost=198063.98..198109.48 rows=18200 width=12) (actual time=7286.828..7291.778 rows=9868 loops=1)
           Output: dates.date, posts_3.author_id, (count(*))
           Sort Key: dates.date, (count(*)) DESC
           Sort Method: quicksort  Memory: 1155kB
           ->  HashAggregate  (cost=196594.18..196776.18 rows=18200 width=12) (actual time=7272.899..7279.037 rows=9868 loops=1)
                 Output: dates.date, posts_3.author_id, count(*)
                 Group Key: dates.date, posts_3.author_id
                 ->  Nested Loop  (cost=0.43..189084.51 rows=1001290 width=12) (actual time=12.726..5342.047 rows=2792422 loops=1)
                       Output: dates.date, posts_3.author_id
                       ->  CTE Scan on dates  (cost=0.00..0.20 rows=10 width=4) (actual time=12.655..12.710 rows=10 loops=1)
                             Output: dates.date
                       ->  Index Scan using posts_created_at on public.posts posts_3  (cost=0.43..17907.14 rows=100129 width=16) (actual time=0.072..256.479 rows=279242 loops=10)
                             Output: posts_3.id, posts_3.topic_id, posts_3.author_id, posts_3.author_name, posts_3.parent_post_id, posts_3.content, posts_3.created_at, posts_3.ip, posts_3.editor_id, posts_3.edited_at, posts_3.deleted
                             Index Cond: ((posts_3.created_at >= (dates.date - '1 year'::interval)) AND (posts_3.created_at <= dates.date))
   CTE ranked
     ->  WindowAgg  (cost=1651.80..2015.80 rows=18200 width=20) (actual time=7308.258..7328.054 rows=9868 loops=1)
           Output: counts.date, counts.author_id, counts.count, rank() OVER (?)
           ->  Sort  (cost=1651.80..1697.30 rows=18200 width=20) (actual time=7308.250..7313.102 rows=9868 loops=1)
                 Output: counts.date, counts.count, counts.author_id
                 Sort Key: counts.date, counts.count DESC
                 Sort Method: quicksort  Memory: 1155kB
                 ->  CTE Scan on counts  (cost=0.00..364.00 rows=18200 width=20) (actual time=7286.832..7302.301 rows=9868 loops=1)
                       Output: counts.date, counts.count, counts.author_id
   ->  Merge Right Join  (cost=791.01..7539.04 rows=6067 width=31) (actual time=7335.604..7513.793 rows=100 loops=1)
         Output: ranked.date, ranked.rank, users.name, ranked.count
         Merge Cond: (users.id = ranked.author_id)
         ->  Index Scan using users_pkey on public.users  (cost=0.29..6303.08 rows=141698 width=19) (actual time=0.021..107.009 rows=140880 loops=1)
               Output: users.id, users.name, users.slug, users.password, users.full_name, users.email, users.email_confirmed, users.banned, users.ban_expiration, users.join_date, users.last_online, users.picture, users.uploaded_picture, users.cover_url, users.cover_position, users.group_title, users.website, users.location, users.signature, users.about_me, users.birthday, users.profile_views
         ->  Sort  (cost=790.71..805.88 rows=6067 width=28) (actual time=7335.575..7335.622 rows=100 loops=1)
               Output: ranked.date, ranked.rank, ranked.count, ranked.author_id
               Sort Key: ranked.author_id
               Sort Method: quicksort  Memory: 32kB
               ->  CTE Scan on ranked  (cost=0.00..409.50 rows=6067 width=28) (actual time=7308.268..7335.507 rows=100 loops=1)
                     Output: ranked.date, ranked.rank, ranked.count, ranked.author_id
                     Filter: (ranked.rank <= 10)
                     Rows Removed by Filter: 9768
 Planning time: 0.690 ms
 Execution time: 7514.711 ms
(83 rows)
```
With these results (it's not an up-to-date backup):
```
    date    | rank |    name     | count
------------+------+-------------+-------
 2016-06-21 |    1 | boomzilla   | 13416
 2016-06-21 |    2 | blakeyrat   | 12541
 2016-06-21 |    3 | Polygeekery |  9921
 2016-06-21 |    4 | FrostCat    |  9382
 2016-06-21 |    5 | accalia     |  9209
 2016-06-21 |    6 | loopback0   |  7961
 2016-06-21 |    7 | Fox         |  7804
 2016-06-21 |    8 | ben_lubar   |  7435
 2016-06-21 |    9 | dkf         |  7334
 2016-06-21 |   10 | abarker     |  6576
 2016-06-22 |    1 | boomzilla   | 13353
 2016-06-22 |    2 | blakeyrat   | 12521
 2016-06-22 |    3 | Polygeekery |  9919
 2016-06-22 |    4 | FrostCat    |  9377
 2016-06-22 |    5 | accalia     |  9198
 2016-06-22 |    6 | loopback0   |  7961
 2016-06-22 |    7 | Fox         |  7804
 2016-06-22 |    8 | ben_lubar   |  7420
 2016-06-22 |    9 | dkf         |  7333
 2016-06-22 |   10 | abarker     |  6569
 2016-06-23 |    1 | boomzilla   | 13235
 2016-06-23 |    2 | blakeyrat   | 12472
 2016-06-23 |    3 | Polygeekery |  9893
 2016-06-23 |    4 | FrostCat    |  9371
 2016-06-23 |    5 | accalia     |  9115
 2016-06-23 |    6 | loopback0   |  7961
 2016-06-23 |    7 | Fox         |  7804
 2016-06-23 |    8 | ben_lubar   |  7403
 2016-06-23 |    9 | dkf         |  7320
 2016-06-23 |   10 | abarker     |  6520
 2016-06-24 |    1 | boomzilla   | 13184
 2016-06-24 |    2 | blakeyrat   | 12437
 2016-06-24 |    3 | Polygeekery |  9879
 2016-06-24 |    4 | FrostCat    |  9326
 2016-06-24 |    5 | accalia     |  9076
 2016-06-24 |    6 | loopback0   |  7961
 2016-06-24 |    7 | Fox         |  7804
 2016-06-24 |    8 | ben_lubar   |  7394
 2016-06-24 |    9 | dkf         |  7309
 2016-06-24 |   10 | abarker     |  6510
 2016-06-25 |    1 | boomzilla   | 13131
 2016-06-25 |    2 | blakeyrat   | 12405
 2016-06-25 |    3 | Polygeekery |  9868
 2016-06-25 |    4 | FrostCat    |  9296
 2016-06-25 |    5 | accalia     |  9009
 2016-06-25 |    6 | loopback0   |  7961
 2016-06-25 |    7 | Fox         |  7804
 2016-06-25 |    8 | ben_lubar   |  7370
 2016-06-25 |    9 | dkf         |  7299
 2016-06-25 |   10 | abarker     |  6468
 2016-06-26 |    1 | boomzilla   | 13046
 2016-06-26 |    2 | blakeyrat   | 12363
 2016-06-26 |    3 | Polygeekery |  9863
 2016-06-26 |    4 | FrostCat    |  9256
 2016-06-26 |    5 | accalia     |  8932
 2016-06-26 |    6 | loopback0   |  7961
 2016-06-26 |    7 | Fox         |  7804
 2016-06-26 |    8 | ben_lubar   |  7354
 2016-06-26 |    9 | dkf         |  7289
 2016-06-26 |   10 | Tsaukpaetra |  6465
 2016-06-27 |    1 | boomzilla   | 12960
 2016-06-27 |    2 | blakeyrat   | 12337
 2016-06-27 |    3 | Polygeekery |  9861
 2016-06-27 |    4 | FrostCat    |  9225
 2016-06-27 |    5 | accalia     |  8856
 2016-06-27 |    6 | loopback0   |  7961
 2016-06-27 |    7 | Fox         |  7804
 2016-06-27 |    8 | ben_lubar   |  7329
 2016-06-27 |    9 | dkf         |  7271
 2016-06-27 |   10 | Tsaukpaetra |  6464
 2016-06-28 |    1 | boomzilla   | 12951
 2016-06-28 |    2 | blakeyrat   | 12316
 2016-06-28 |    3 | Polygeekery |  9859
 2016-06-28 |    4 | FrostCat    |  9219
 2016-06-28 |    5 | accalia     |  8847
 2016-06-28 |    6 | loopback0   |  7961
 2016-06-28 |    7 | Fox         |  7804
 2016-06-28 |    8 | ben_lubar   |  7322
 2016-06-28 |    9 | dkf         |  7266
 2016-06-28 |   10 | Tsaukpaetra |  6464
 2016-06-29 |    1 | boomzilla   | 12935
 2016-06-29 |    2 | blakeyrat   | 12305
 2016-06-29 |    3 | Polygeekery |  9853
 2016-06-29 |    4 | FrostCat    |  9205
 2016-06-29 |    5 | accalia     |  8822
 2016-06-29 |    6 | loopback0   |  7961
 2016-06-29 |    7 | Fox         |  7804
 2016-06-29 |    8 | ben_lubar   |  7293
 2016-06-29 |    9 | dkf         |  7265
 2016-06-29 |   10 | Tsaukpaetra |  6464
 2016-06-30 |    1 | boomzilla   | 12908
 2016-06-30 |    2 | blakeyrat   | 12265
 2016-06-30 |    3 | Polygeekery |  9834
 2016-06-30 |    4 | FrostCat    |  9195
 2016-06-30 |    5 | accalia     |  8742
 2016-06-30 |    6 | loopback0   |  7961
 2016-06-30 |    7 | Fox         |  7804
 2016-06-30 |    8 | ben_lubar   |  7272
 2016-06-30 |    9 | dkf         |  7258
 2016-06-30 |   10 | Tsaukpaetra |  6464
(100 rows)
```
How can I make the query run faster? There are about 6000 days between the first and last posts in the database, so taking over an hour to get this data isn't going to be fun for whoever is waiting for the chart to be generated.
-
@ben_lubar said in @ben_lubar tries to parse <del>ServerCooties</del> data:
How can I make the query run faster? There are about 6000 days between the first and last posts in the database, so taking over an hour to get this data isn't going to be fun for whoever is waiting for the chart to be generated.
Try materializing some of those subqueries. Temp table or persisted view.
Assuming you're just working with a snapshot exported from Mongo, you can denormalize that shit without too much worry about maintaining consistency.
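For the temp-table route, a minimal sketch (the `daily_counts` table and index names here are illustrative, not from the thread): collapse posts down to one row per day per author once, so that later ranking queries never have to touch the full posts table.

```sql
-- Sketch only: materialize per-day, per-author post counts once.
-- "daily_counts" is a hypothetical name, not part of the schema in this thread.
create temporary table daily_counts as
    select created_at::date as date, author_id, count(*) as count
    from posts
    group by 1, 2;

-- Index to make per-date range lookups cheap.
create index daily_counts_date_author on daily_counts (date, author_id);

-- Refresh planner statistics for the new temp table.
analyze daily_counts;
```

With ~6000 days of posts this table should be a few hundred thousand rows, so any rolling-year aggregate built on top of it reads far less data than re-scanning a year's worth of posts for every date.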
-
@ben_lubar Well here's a few suggestions:
- Format your query so it's on more than one line, because WTF man
- Get rid of the roughly 57436 subqueries you have and just do a single join clause
You appear to have a query within a query within a query if I'm reading that right. Jesus man.
-
```sql
begin;

create index posts_created_at on posts (created_at);
create index posts_author_id on posts (author_id);

create temporary table temp_dates as
    select ((select min(created_at)::date from posts) + day * interval '1 day')::date date
    from generate_series(0, (select max(created_at)::date from posts) - (select min(created_at)::date from posts)) day;

create temporary table temp_counts_pre as
    select date, author_id, count(*)
    from temp_dates, posts
    where created_at::date = date
    group by date, author_id;

create index temp_counts_pre_date_author_id on temp_counts_pre (date, author_id);

create temporary table temp_counts as
    select a.date, author_id, sum(count) count
    from temp_dates a, temp_counts_pre b
    where b.date between a.date - interval '1 year' and a.date
    group by a.date, author_id;

create index temp_counts_date_count on temp_counts (date, count desc);

explain analyze verbose
with counts as (
    select *, rank() over (partition by date order by count desc)
    from temp_counts
)
select date, rank, name, count
from counts
left join users on id = author_id
where rank <= 10
order by date asc, rank asc;

with counts as (
    select *, rank() over (partition by date order by count desc)
    from temp_counts
)
select date, rank, name, count
from counts
left join users on id = author_id
where rank <= 10
order by date asc, rank asc;

rollback;
```
Output:
```
postgres=# begin;
BEGIN
Time: 0.137 ms
postgres=# create index posts_created_at on posts (created_at);
CREATE INDEX
Time: 3773.139 ms
postgres=# create index posts_author_id on posts (author_id);
CREATE INDEX
Time: 1269.059 ms
postgres=# create temporary table temp_dates as select ((select min(created_at)::date from posts) + day * interval '1 day')::date date from generate_series(0, (select max(created_at)::date from posts) - (select min(created_at)::date from posts)) day;
SELECT 5634
Time: 24.168 ms
postgres=# create temporary table temp_counts_pre as select date, author_id, count(*) from temp_dates, posts where created_at::date = date group by date, author_id;
SELECT 169666
Time: 2678.269 ms
postgres=# create index temp_counts_pre_date_author_id on temp_counts_pre (date, author_id);
CREATE INDEX
Time: 73.029 ms
postgres=# create temporary table temp_counts as select a.date, author_id, sum(count) count from temp_dates a, temp_counts_pre b where b.date between a.date - interval '1 year' and a.date group by a.date, author_id;
SELECT 4509688
Time: 129317.276 ms
postgres=# create index temp_counts_date_count on temp_counts (date, count desc);
CREATE INDEX
Time: 9796.385 ms
postgres=# explain analyze verbose with counts as (select *, rank() over (partition by date order by count desc) from temp_counts) select date, rank, name, count from counts left join users on id = author_id where rank <= 10 order by date asc, rank asc;
                                                                      QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=746886.42..750644.49 rows=1503229 width=55) (actual time=15610.866..15636.024 rows=44877 loops=1)
   Output: counts.date, counts.rank, users.name, counts.count
   Sort Key: counts.date, counts.rank
   Sort Method: external sort  Disk: 1752kB
   CTE counts
     ->  WindowAgg  (cost=0.43..330933.29 rows=4509688 width=44) (actual time=0.054..11095.926 rows=4509688 loops=1)
           Output: temp_counts.date, temp_counts.author_id, temp_counts.count, rank() OVER (?)
           ->  Index Scan using temp_counts_date_count on pg_temp_2.temp_counts  (cost=0.43..252013.75 rows=4509688 width=44) (actual time=0.044..3563.355 rows=4509688 loops=1)
                 Output: temp_counts.date, temp_counts.count, temp_counts.author_id
   ->  Hash Left Join  (cost=6636.20..158964.58 rows=1503229 width=55) (actual time=389.273..15535.957 rows=44877 loops=1)
         Output: counts.date, counts.rank, users.name, counts.count
         Hash Cond: (counts.author_id = users.id)
         ->  CTE Scan on counts  (cost=0.00..101467.98 rows=1503229 width=52) (actual time=0.061..15046.578 rows=44877 loops=1)
               Output: counts.date, counts.rank, counts.count, counts.author_id
               Filter: (counts.rank <= 10)
               Rows Removed by Filter: 4464811
         ->  Hash  (cost=4033.98..4033.98 rows=141698 width=19) (actual time=389.028..389.028 rows=141427 loops=1)
               Output: users.name, users.id
               Buckets: 65536  Batches: 4  Memory Usage: 2422kB
               ->  Seq Scan on public.users  (cost=0.00..4033.98 rows=141698 width=19) (actual time=28.603..288.194 rows=141427 loops=1)
                     Output: users.name, users.id
 Planning time: 34.240 ms
 Execution time: 15717.063 ms
(23 rows)

Time: 16434.860 ms
postgres=# with counts as (select *, rank() over (partition by date order by count desc) from temp_counts) select date, rank, name, count from counts left join users on id = author_id where rank <= 10 order by date asc, rank asc;
    date    | rank |         name         | count
------------+------+----------------------+-------
[snip]
(44877 rows)

Time: 6726.870 ms
postgres=# rollback;
ROLLBACK
Time: 70.226 ms
```
-
Thanks for ruining my statistics, @abarker
https://what.thedailywtf.com/post/797354
-
@ben_lubar said in @ben_lubar tries to parse <del>ServerCooties</del> data:
```
postgres=# create temporary table temp_counts as select a.date, author_id, sum(count) count from temp_dates a, temp_counts_pre b where b.date between a.date - interval '1 year' and a.date group by a.date, author_id;
SELECT 4509688
Time: 129317.276 ms
```
```
postgres=# explain analyze verbose create temporary table temp_counts as select a.date, author_id, sum(count) count from temp_dates a, temp_counts_pre b where b.date between a.date - interval '1 year' and a.date group by a.date, author_id;
                                                                      QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=27137620.94..28339921.77 rows=40000 width=20) (actual time=192270.885..276757.795 rows=4509688 loops=1)
   Output: a.date, b.author_id, sum(b.count)
   Group Key: a.date, b.author_id
   ->  Sort  (cost=27137620.94..27438071.14 rows=120180083 width=20) (actual time=192270.866..238706.102 rows=56896009 loops=1)
         Output: a.date, b.author_id, b.count
         Sort Key: a.date, b.author_id
         Sort Method: external sort  Disk: 1881848kB
         ->  Nested Loop  (cost=0.42..3615179.19 rows=120180083 width=20) (actual time=0.055..91374.964 rows=56896009 loops=1)
               Output: a.date, b.author_id, b.count
               ->  Seq Scan on pg_temp_2.temp_dates a  (cost=0.00..88.75 rows=6375 width=4) (actual time=0.018..7.720 rows=5634 loops=1)
                     Output: a.date
               ->  Index Scan using temp_counts_pre_date_author_id on pg_temp_2.temp_counts_pre b  (cost=0.42..378.55 rows=18852 width=20) (actual time=0.015..6.305 rows=10099 loops=5634)
                     Output: b.date, b.author_id, b.count
                     Index Cond: ((b.date >= (a.date - '1 year'::interval)) AND (b.date <= a.date))
 Planning time: 0.280 ms
 Execution time: 282411.389 ms
(16 rows)

Time: 282419.638 ms
```
I wonder if I can rewrite that query to insert each day's data separately so it doesn't have to sort a terabyte of data at once...
-
```sql
create temporary table temp_counts (
    date date not null,
    author_id bigint,
    count bigint not null
);

do $$
declare
    a record;
begin
    for a in select date from temp_dates order by date asc loop
        insert into temp_counts
        select a.date, author_id, sum(count) count
        from temp_counts_pre b
        where b.date between a.date - interval '1 year' and a.date
        group by a.date, author_id;
    end loop;
end;
$$ language plpgsql;
```
Unfortunately, each of the ~5600 per-day INSERT queries takes about 50ms, which means instead of taking 2 minutes, it now takes about 5 minutes.
-
@ben_lubar said in @ben_lubar tries to parse <del>ServerCooties</del> data:
I wonder if I can rewrite that query to insert each day's data separately so it doesn't have to sort a terabyte of data at once...
It might be faster to just do one table scan, and figure out which records you want outside the SQL. Yes, even if it's the whole table. It would certainly be saner.
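Read literally, that suggestion is a single grouped scan plus application-side bookkeeping; a sketch of the SQL half (the rolling-window part would live in whatever app code consumes the rows):

```sql
-- One pass over posts: emits one (day, author, count) row per group,
-- in date order. The consuming application keeps a per-author running
-- total, adding each new day's counts and subtracting days that fall
-- out of the one-year window, then picks the top 10 per day.
select created_at::date as day, author_id, count(*) as posts
from posts
group by 1, 2
order by 1;
```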
-
@ben_lubar said in @ben_lubar tries to parse <del>ServerCooties</del> data:
Unfortunately, each of the INSERT queries takes about 50ms
I haven't got the spare brainpower to decipher what you're trying to accomplish at the moment. Can you explain it in English instead of SQL?
-
@FrostCat said in @ben_lubar tries to parse <del>ServerCooties</del> data:
Can you explain it in English
TL;DR version: Top 10 users for "posts in the last year" for each day.
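Given that goal, the expensive part is re-summing a whole year for every one of ~6000 days. If per-day counts are materialized first (like temp_counts_pre above), newer PostgreSQL (11+) can express the rolling year directly with a RANGE window frame; a sketch, with the caveat that it only emits rows for days an author actually posted, so densifying against the dates series is still needed for an exact top-10-per-day answer:

```sql
-- Assumes a table of per-day, per-author counts like temp_counts_pre
-- (date, author_id, count). RANGE frames with an interval offset
-- require PostgreSQL 11 or later.
select date, author_id,
       sum(count) over (
           partition by author_id
           order by date
           range between interval '1 year' preceding and current row
       ) as rolling_year_count
from temp_counts_pre;
```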
-
@ben_lubar Ok. Let me cogitate for a bit. I still think your best bet is going to be--based on what's been said so far--a single table scan of posts into a temp table with some of the indexes you want. And bear in mind SQL per se is not my forte so I may not be able to tell you exactly how to write the SQL.
How would I find the db schema, such as it is? Don't actually give it to me now, I won't be able to process it.
-
@FrostCat said in @ben_lubar tries to parse <del>ServerCooties</del> data:
How would I find the db schema, such as it is?
When combined with the next sentence you wrote, I cannot parse what you meant.
In any case, when you want it, it'll be here:
```
postgres=# \dt
                List of relations
 Schema |          Name           | Type  |  Owner
--------+-------------------------+-------+----------
 public | categories              | table | postgres
 public | groups                  | table | postgres
 public | posts                   | table | postgres
 public | topics                  | table | postgres
 public | user_favorite_posts     | table | postgres
 public | user_groups             | table | postgres
 public | user_ips                | table | postgres
 public | user_post_votes         | table | postgres
 public | user_watched_categories | table | postgres
 public | user_watched_topics     | table | postgres
 public | user_watched_users      | table | postgres
 public | users                   | table | postgres
(12 rows)

postgres=# \d categories
                            Table "public.categories"
       Column       |          Type           | Modifiers
--------------------+-------------------------+---------------------------------------------------------
 id                 | bigint                  | not null default nextval('categories_id_seq'::regclass)
 name               | text                    | not null
 slug               | text                    | not null
 parent_category_id | bigint                  |
 position           | integer                 |
 description        | text                    |
 link               | character varying(2048) |
 icon               | character varying(255)  |
 fg_color           | character varying(255)  |
 bg_color           | character varying(255)  |
 recent_replies     | integer                 |
 disabled           | boolean                 | not null default false
Indexes:
    "categories_pkey" PRIMARY KEY, btree (id)
    "categories_slug" UNIQUE, btree (slug)
Check constraints:
    "categories_slug_exists" CHECK (slug <> ''::text)
Foreign-key constraints:
    "categories_parent_category_id_fkey" FOREIGN KEY (parent_category_id) REFERENCES categories(id) ON UPDATE CASCADE ON DELETE SET NULL
Referenced by:
    TABLE "categories" CONSTRAINT "categories_parent_category_id_fkey" FOREIGN KEY (parent_category_id) REFERENCES categories(id) ON UPDATE CASCADE ON DELETE SET NULL
    TABLE "topics" CONSTRAINT "topics_category_id_fkey" FOREIGN KEY (category_id) REFERENCES categories(id) ON UPDATE CASCADE ON DELETE RESTRICT
    TABLE "user_watched_categories" CONSTRAINT "user_watched_categories_category_id_fkey" FOREIGN KEY (category_id) REFERENCES categories(id) ON UPDATE CASCADE ON DELETE CASCADE
Triggers:
    categories_make_slug BEFORE INSERT OR UPDATE OF name, slug ON categories FOR EACH ROW EXECUTE PROCEDURE make_slug()

postgres=# \d groups
                      Table "public.groups"
        Column         |           Type           | Modifiers
-----------------------+--------------------------+------------------------
 name                  | text                     | not null
 slug                  | text                     | not null
 create_time           | timestamp with time zone | not null default now()
 user_title            | text                     |
 description           | text                     | not null
 icon                  | character varying(255)   |
 label_color           | character varying(255)   |
 cover_url             | character varying(2048)  |
 cover_thumb_url       | character varying(2048)  |
 cover_position        | double precision[]       |
 hidden                | boolean                  | not null default false
 system                | boolean                  | not null default false
 private               | boolean                  | not null default true
 disable_join_requests | boolean                  | not null default false
 user_title_enabled    | boolean                  | not null default false
Indexes:
    "groups_pkey" PRIMARY KEY, btree (name)
    "groups_slug" UNIQUE, btree (slug)
    "groups_create_time" btree (create_time)
Check constraints:
    "groups_slug_exists" CHECK (slug <> ''::text)
Referenced by:
    TABLE "user_groups" CONSTRAINT "user_groups_group_name_fkey" FOREIGN KEY (group_name) REFERENCES groups(name) ON UPDATE CASCADE ON DELETE CASCADE
    TABLE "users" CONSTRAINT "users_group_title_fkey" FOREIGN KEY (group_title) REFERENCES groups(name) ON UPDATE CASCADE ON DELETE SET NULL
Triggers:
    groups_make_slug BEFORE INSERT OR UPDATE OF name, slug ON groups FOR EACH ROW EXECUTE PROCEDURE make_slug()

postgres=# \d posts
                          Table "public.posts"
     Column     |           Type           | Modifiers
----------------+--------------------------+----------------------------------------------------
 id             | bigint                   | not null default nextval('posts_id_seq'::regclass)
 topic_id       | bigint                   | not null
 author_id      | bigint                   |
 author_name    | character varying(255)   |
 parent_post_id | bigint                   |
 content        | text                     | not null
 created_at     | timestamp with time zone | not null default now()
 ip             | inet[]                   |
 editor_id      | bigint                   |
 edited_at      | timestamp with time zone |
 deleted        | boolean                  | not null default false
Indexes:
    "posts_pkey" PRIMARY KEY, btree (id)
Foreign-key constraints:
    "posts_author_id_fkey" FOREIGN KEY (author_id) REFERENCES users(id) ON UPDATE CASCADE ON DELETE SET NULL
    "posts_editor_id_fkey" FOREIGN KEY (editor_id) REFERENCES users(id) ON UPDATE CASCADE ON DELETE SET NULL
    "posts_parent_post_id_fkey" FOREIGN KEY (parent_post_id) REFERENCES posts(id) ON UPDATE CASCADE ON DELETE SET NULL
    "posts_topic_id_fkey" FOREIGN KEY (topic_id) REFERENCES topics(id) ON UPDATE CASCADE ON DELETE CASCADE
Referenced by:
    TABLE "posts" CONSTRAINT "posts_parent_post_id_fkey" FOREIGN KEY (parent_post_id) REFERENCES posts(id) ON UPDATE CASCADE ON DELETE SET NULL
    TABLE "topics" CONSTRAINT "topics_main_post_id_fkey" FOREIGN KEY (main_post_id) REFERENCES posts(id) ON UPDATE CASCADE ON DELETE SET NULL
    TABLE "user_favorite_posts" CONSTRAINT "user_favorite_posts_post_id_fkey" FOREIGN KEY (post_id) REFERENCES posts(id) ON UPDATE CASCADE ON DELETE CASCADE
    TABLE "user_post_votes" CONSTRAINT "user_post_votes_post_id_fkey" FOREIGN KEY (post_id) REFERENCES posts(id) ON UPDATE CASCADE ON DELETE CASCADE

postgres=# \d topics
                         Table "public.topics"
    Column    |           Type           | Modifiers
--------------+--------------------------+-----------------------------------------------------
 id           | bigint                   | not null default nextval('topics_id_seq'::regclass)
 name         | text                     | not null
 slug         | text                     | not null
 category_id  | bigint                   | not null
 author_id    | bigint                   |
 main_post_id | bigint                   |
 created_at   | timestamp with time zone | not null default now()
 view_count   | bigint                   | not null default 0
 locked       | boolean                  | not null default false
 deleted      | boolean                  | not null default false
 pinned       | boolean                  | not null default false
Indexes:
    "topics_pkey" PRIMARY KEY, btree (id)
    "topics_slug" btree (slug)
Foreign-key constraints:
    "topics_author_id_fkey" FOREIGN KEY (author_id) REFERENCES users(id) ON UPDATE CASCADE ON DELETE SET NULL
    "topics_category_id_fkey" FOREIGN KEY (category_id) REFERENCES categories(id) ON UPDATE CASCADE ON DELETE RESTRICT
    "topics_main_post_id_fkey" FOREIGN KEY (main_post_id) REFERENCES posts(id) ON UPDATE CASCADE ON DELETE SET NULL
Referenced by:
    TABLE "posts" CONSTRAINT "posts_topic_id_fkey" FOREIGN KEY (topic_id) REFERENCES topics(id) ON UPDATE CASCADE ON DELETE CASCADE
    TABLE "user_watched_topics" CONSTRAINT "user_watched_topics_topic_id_fkey" FOREIGN KEY (topic_id) REFERENCES topics(id) ON UPDATE CASCADE ON DELETE CASCADE
Triggers:
    topics_make_slug BEFORE INSERT OR UPDATE OF name, slug ON topics FOR EACH ROW EXECUTE PROCEDURE make_slug()

postgres=# \d user_favorite_posts
              Table "public.user_favorite_posts"
    Column    |           Type           | Modifiers
--------------+--------------------------+------------------------
 user_id      | bigint                   | not null
 post_id      | bigint                   | not null
 favorited_at | timestamp with time zone | not null default now()
Indexes:
    "user_favorite_posts_pkey" PRIMARY KEY, btree (user_id, post_id)
Foreign-key constraints:
    "user_favorite_posts_post_id_fkey" FOREIGN KEY (post_id) REFERENCES posts(id) ON UPDATE CASCADE ON DELETE CASCADE
    "user_favorite_posts_user_id_fkey" FOREIGN KEY (user_id) REFERENCES users(id) ON UPDATE CASCADE ON DELETE CASCADE

postgres=# \d user_groups
                 Table "public.user_groups"
   Column   |           Type           | Modifiers
------------+--------------------------+------------------------
 user_id    | bigint                   | not null
 group_name | text                     | not null
 is_owner   | boolean                  | not null default false
 join_date  | timestamp with time zone | not null default now()
Indexes:
    "user_groups_pkey" PRIMARY KEY, btree (user_id, group_name)
Foreign-key constraints:
    "user_groups_group_name_fkey" FOREIGN KEY (group_name) REFERENCES groups(name) ON UPDATE CASCADE ON DELETE CASCADE
    "user_groups_user_id_fkey" FOREIGN KEY (user_id) REFERENCES users(id) ON UPDATE CASCADE ON DELETE CASCADE

postgres=# \d user_ips
                  Table "public.user_ips"
  Column   |           Type           | Modifiers
-----------+--------------------------+------------------------
 user_id   | bigint                   | not null
 ip        | inet                     | not null
 last_seen | timestamp with time zone | not null default now()
Indexes:
    "user_ips_pkey" PRIMARY KEY, btree (user_id, ip)
Foreign-key constraints:
    "user_ips_user_id_fkey" FOREIGN KEY (user_id) REFERENCES users(id) ON UPDATE CASCADE ON DELETE CASCADE

postgres=# \d user_post_votes
               Table "public.user_post_votes"
    Column    |           Type           | Modifiers
--------------+--------------------------+------------------------
 user_id      | bigint                   | not null
 post_id      | bigint                   | not null
 up           | boolean                  |
 last_changed | timestamp with time zone | not null default now()
Indexes:
    "user_post_votes_pkey" PRIMARY KEY, btree (user_id, post_id)
Foreign-key constraints:
    "user_post_votes_post_id_fkey" FOREIGN KEY (post_id) REFERENCES posts(id) ON UPDATE CASCADE ON DELETE CASCADE
    "user_post_votes_user_id_fkey" FOREIGN KEY (user_id) REFERENCES users(id) ON UPDATE CASCADE ON DELETE CASCADE

postgres=# \d user_watched_categories
           Table "public.user_watched_categories"
    Column    |           Type           | Modifiers
--------------+--------------------------+------------------------
 user_id      | bigint                   | not null
 category_id  | bigint                   | not null
 watched      | boolean                  |
 last_changed | timestamp with time zone | not null default now()
Indexes:
    "user_watched_categories_pkey" PRIMARY KEY, btree (user_id, category_id)
Foreign-key constraints:
    "user_watched_categories_category_id_fkey" FOREIGN KEY (category_id) REFERENCES categories(id) ON UPDATE CASCADE ON DELETE CASCADE
    "user_watched_categories_user_id_fkey" FOREIGN KEY (user_id) REFERENCES users(id) ON UPDATE CASCADE ON DELETE CASCADE

postgres=# \d user_watched_topics
             Table "public.user_watched_topics"
    Column    |           Type           | Modifiers
--------------+--------------------------+------------------------
 user_id      | bigint                   | not null
 topic_id     | bigint                   | not null
 watched      | boolean                  |
 last_changed | timestamp with time zone | not null default now()
Indexes:
    "user_watched_topics_pkey" PRIMARY KEY, btree (user_id, topic_id)
Foreign-key constraints:
    "user_watched_topics_topic_id_fkey" FOREIGN KEY (topic_id) REFERENCES topics(id) ON UPDATE CASCADE ON DELETE CASCADE
    "user_watched_topics_user_id_fkey" FOREIGN KEY (user_id) REFERENCES users(id) ON UPDATE CASCADE ON DELETE CASCADE

postgres=# \d user_watched_users
             Table "public.user_watched_users"
     Column     |           Type           | Modifiers
----------------+--------------------------+------------------------
 user_id        | bigint                   | not null
 target_user_id | bigint                   | not null
 watched        | boolean                  |
 last_changed   | timestamp with time zone | not null default now()
Indexes:
    "user_watched_users_pkey" PRIMARY KEY, btree (user_id, target_user_id)
Foreign-key constraints:
    "user_watched_users_target_user_id_fkey" FOREIGN KEY (target_user_id) REFERENCES users(id) ON UPDATE CASCADE ON DELETE CASCADE
    "user_watched_users_user_id_fkey" FOREIGN KEY (user_id) REFERENCES users(id) ON UPDATE CASCADE ON DELETE CASCADE

postgres=# \d users
                          Table "public.users"
      Column      |           Type           | Modifiers
------------------+--------------------------+----------------------------------------------------
 id               | bigint                   | not null default nextval('users_id_seq'::regclass)
 name             | character varying(255)   | not null
 slug             | character varying(255)   | not null
 password         | character varying(60)    |
 full_name        | text                     |
 email            | citext                   |
 email_confirmed  | boolean                  | not null default false
 banned           | boolean                  | not null default false
 ban_expiration   | timestamp with time zone |
 join_date        | timestamp with time zone | not null default now()
 last_online      | timestamp with time zone |
 picture          | character varying(2048)  |
 uploaded_picture | character varying(2048)  |
 cover_url        | character varying(2048)  |
 cover_position   | double precision[]       |
 group_title      | text                     |
 website          | character varying(2048)  |
 location         | text                     |
 signature        | text                     |
 about_me         | text                     |
 birthday         | date                     |
 profile_views    | bigint                   | not null default 0
Indexes:
    "users_pkey" PRIMARY KEY, btree (id)
    "users_email" UNIQUE, btree (email)
    "users_slug" UNIQUE, btree (slug)
    "users_join_date" btree (join_date)
Check constraints:
    "users_slug_exists" CHECK (slug::text <> ''::text)
Foreign-key constraints:
    "users_group_title_fkey" FOREIGN KEY (group_title) REFERENCES groups(name) ON UPDATE CASCADE ON DELETE SET NULL
Referenced by:
    TABLE "posts" CONSTRAINT "posts_author_id_fkey" FOREIGN KEY (author_id) REFERENCES users(id) ON UPDATE CASCADE ON DELETE SET NULL
    TABLE "posts" CONSTRAINT "posts_editor_id_fkey" FOREIGN KEY (editor_id) REFERENCES users(id) ON UPDATE CASCADE ON DELETE SET NULL
    TABLE "topics" CONSTRAINT "topics_author_id_fkey" FOREIGN KEY (author_id) REFERENCES users(id) ON UPDATE CASCADE ON DELETE SET NULL
    TABLE "user_favorite_posts" CONSTRAINT "user_favorite_posts_user_id_fkey" FOREIGN KEY (user_id) REFERENCES users(id) ON UPDATE CASCADE ON DELETE CASCADE
    TABLE "user_groups" CONSTRAINT "user_groups_user_id_fkey" FOREIGN KEY (user_id) REFERENCES users(id) ON UPDATE CASCADE ON DELETE CASCADE
    TABLE "user_ips" CONSTRAINT "user_ips_user_id_fkey" FOREIGN KEY (user_id) REFERENCES users(id) ON UPDATE CASCADE ON DELETE CASCADE
    TABLE "user_post_votes" CONSTRAINT "user_post_votes_user_id_fkey" FOREIGN KEY (user_id) REFERENCES users(id) ON UPDATE CASCADE ON DELETE CASCADE
    TABLE "user_watched_categories" CONSTRAINT "user_watched_categories_user_id_fkey" FOREIGN KEY (user_id) REFERENCES users(id) ON UPDATE CASCADE ON DELETE CASCADE
    TABLE "user_watched_topics" CONSTRAINT "user_watched_topics_user_id_fkey" FOREIGN KEY (user_id) REFERENCES users(id) ON UPDATE CASCADE ON DELETE CASCADE
    TABLE "user_watched_users" CONSTRAINT "user_watched_users_target_user_id_fkey" FOREIGN KEY (target_user_id) REFERENCES users(id) ON UPDATE CASCADE ON DELETE CASCADE
    TABLE "user_watched_users" CONSTRAINT "user_watched_users_user_id_fkey" FOREIGN KEY (user_id) REFERENCES users(id) ON UPDATE CASCADE ON DELETE CASCADE
Triggers:
    users_make_slug BEFORE INSERT OR UPDATE OF name, slug ON users FOR EACH ROW EXECUTE PROCEDURE make_slug()
```
-
@ben_lubar said in @ben_lubar tries to parse <del>ServerCooties</del> data:
I cannot parse what you meant.
I haven't got the time or extra brainpower while I'm at work to expend serious thought on it. Tonight or this weekend I bet I can make it run better, though. Hit me up if you see me on GW2.
-
postgres=# select case when join_date < timestamp '2014-02-03T11:33:49' then 'Community Server'
                       when join_date < timestamp '2016-03-19T15:00:00' then 'Discourse'
                       else 'NodeBB' end era,
                  extract(year from join_date) as year,
                  count(*)
           from users
           left join posts on users.id = posts.author_id
           left join user_post_votes on users.id = user_post_votes.user_id
           where posts.id is null and user_post_votes.user_id is null
           group by era, year
           order by year asc;
       era        | year | count
------------------+------+-------
 Community Server | 2004 |   169
 Community Server | 2005 |  1935
 Community Server | 2006 |  2671
 Community Server | 2007 |  2218
 Community Server | 2008 |  2305
 Community Server | 2009 |  4565
 Community Server | 2010 | 33301
 Community Server | 2011 | 38782
 Community Server | 2012 |  7794
 Community Server | 2013 | 36990
 Community Server | 2014 |   303
 Discourse        | 2014 |  2284
 Discourse        | 2015 |   467
 Discourse        | 2016 |    59
 NodeBB           | 2016 |    92
(15 rows)
-
@ben_lubar Are you counting "# of unliked posts by year"? "# of likes orphaned because the post was deleted"?
-
@FrostCat no, I'm counting the number of posts a user has made from a year before a given day to that day.
-
@ben_lubar said in @ben_lubar tries to parse <del>ServerCooties</del> data:
a user
I didn't see a user ID in the sql, but whatever. I figured my second question would be some kind of clue to the nature of my post, but maybe @blakeyrat's borrowed your shoulder aliens.
-
@ben_lubar If you're writing loops in SQL, you're doing it wrong.
-
@ben_lubar said in @ben_lubar tries to parse <del>ServerCooties</del> data:
no, I'm counting the number of posts a user has made from a year before a given day to that day.
That might be a bit tricky. Give me a bit.
-
@ben_lubar said in @ben_lubar tries to parse <del>ServerCooties</del> data:
@FrostCat no, I'm counting the number of posts a user has made from a year before a given day to that day.
Ugh...did you look at the spoon (and related) badge query from discourse? NB: I haven't looked seriously at your query because you put it all in one line, and like blakey, ain't nobody got time for that.
-
Here:
select p.PostDate, u.UserName, topTen.PostCount
from Post p
join (
    select lastPost.PostDate as 'LastPostDate',
           allPost.UserID as 'UserID',
           count(allPost.PostID) as 'PostCount'
    from Post lastPost
    join Post allPost on allPost.PostDate > dateadd( yy, -1, lastPost.PostDate )
    group by lastPost.PostDate, allPost.UserID
    --order by lastPost.PostDate desc, count( allPost.PostID ) desc
) topTen on topTen.LastPostDate = p.PostDate
join [User] u on u.UserID = topTen.UserID
order by p.PostDate desc, topTen.PostCount desc
I did indeed need to use a subquery, alas. (If you wanted the whole list for each date, not just the top 10, I'm 99.9% sure you can write this without a subquery. But you'd be sending a shitload more data over the wire.)
Considerations:
- This query will not "fill in blanks" (if there's a day with zero posts, it won't show up in the results. Posts near the start of the timeline won't have the full 10 people listed, etc.) That's tough to fix without a data warehouse-type "date" table.
- This query will not round to the nearest day -- my sample data happened to all be at timecode 00:00:00, but in real life data, of course, you want to use datepart or equivalent to ensure the "group by"s and "join"s happen only on dates and not times. In T-SQL you can cast to Date instead of DateTime. Not sure how you do it in Postgres.
- This query is in T-SQL, no guarantees it translates into Postgres SQL.
- The commented-out "order by" is helpful when trying to debug the inner query, but obviously useless when you're using it as an inner query. That's why it's commented out. Why make the DB engine sort when it doesn't need to?
- "allPost" is a badly-named table alias. It's "all posts in the year preceding lastPost's date". But that's pretty long.
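The "group only on dates, not times" point above is the key trick: in Postgres you'd cast with created_at::date, and the same idea can be sketched with Python's stdlib sqlite3 (table name and sample data are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (author_id INTEGER, created_at TEXT)")
conn.executemany(
    "INSERT INTO posts VALUES (?, ?)",
    [(1, "2016-07-09 08:15:00"), (1, "2016-07-09 19:46:18"), (2, "2016-07-10 11:04:40")],
)
# date() truncates the timestamp to a day, so both of author 1's
# posts land in the same group regardless of their times.
rows = conn.execute(
    "SELECT date(created_at) AS day, author_id, count(*) "
    "FROM posts GROUP BY day, author_id ORDER BY day, author_id"
).fetchall()
print(rows)  # [('2016-07-09', 1, 2), ('2016-07-10', 2, 1)]
```

Without the date() truncation, each distinct timestamp would be its own group and every count would be 1.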
-
@blakeyrat I don't know SQL (or T-SQL) as well as I'd like. What's the difference[1] between "from Post lastPost" and "join [User] u"? I'd assumed lastPost and u are tablename aliases (don't know the right term), but then why does one have [] and the other not?

[1] regarding names, not the sql clause "from" vs "join".
-
@FrostCat That's (square brackets) how you use a keyword as a name in sql server.
-
@FrostCat said in @ben_lubar tries to parse <del>ServerCooties</del> data:
What's the difference[1] between "from Post lastPost" and "join [User] u" ? I'd assumed lastPost and u are tablename aliases (don't know the right term), but then why does one have [] and the other not?
"User" is a reserved keyword in SQL. The square braces indicate I want to use the User database object, not the reserved keyword.

My demo database I did in like 45 seconds, and I was too lazy once I realized "User" was a reserved keyword to switch it to "Users", which is the normal solution to this problem.
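For what it's worth, SQLite accepts both quoting styles, so the distinction can be demonstrated with Python's stdlib sqlite3 (SQLite doesn't actually reserve USER, but the quoting forms parse the same way):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "user" (standard double quotes) and [user] (T-SQL-style brackets)
# both escape the identifier; SQLite accepts either form for
# compatibility with the SQL standard and with SQL Server/Access.
conn.execute('CREATE TABLE "user" (id INTEGER, name TEXT)')
conn.execute("INSERT INTO [user] VALUES (1, 'ben')")
rows = conn.execute('SELECT name FROM "user"').fetchall()
print(rows)  # [('ben',)]
```

Double quotes are the portable choice; square brackets are a T-SQL-ism.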
-
@blakeyrat said in @ben_lubar tries to parse <del>ServerCooties</del> data:
join Post allPost on allPost.PostDate > dateadd( yy, -1, lastPost.PostDate )
Seems like this should be something like:
join Post allPost on allPost.PostDate between dateadd( yy, -1, lastPost.PostDate ) and lastPost.PostDate
Or you're getting stuff in the future, too.
-
@boomzilla Yup, I think you're right.
select p.PostDate, u.UserName, topTen.PostCount
from Post p
join (
    select lastPost.PostDate as 'LastPostDate',
           allPost.UserID as 'UserID',
           count(allPost.PostID) as 'PostCount'
    from Post lastPost
    join Post allPost on allPost.PostDate between dateadd( yy, -1, lastPost.PostDate ) and lastPost.PostDate
    group by lastPost.PostDate, allPost.UserID
    --order by lastPost.PostDate desc, count( allPost.PostID ) desc
) topTen on topTen.LastPostDate = p.PostDate
join [User] u on u.UserID = topTen.UserID
order by p.PostDate desc, topTen.PostCount desc
Good catch. SQL's tricky.
To pontificate a bit, the trick here is to join a table to itself. Something a lot of SQL neophytes don't realize is possible/useful, but once you learn the trick you can solve millions of problems with it. There ain't nothing in the SQL spec saying you can't have 1, 2, or even 50 copies of the Post table in your joins.
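The self-join trick can be shown on a toy data set with Python's stdlib sqlite3 (table name and rows are made up; the join condition mirrors the trailing-year window above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE post (user_id INTEGER, post_date TEXT)")
conn.executemany("INSERT INTO post VALUES (?, ?)", [
    (1, "2015-01-01"), (1, "2015-06-01"), (1, "2016-01-01"),
    (2, "2016-01-02"),
])
# Two copies of the same table: lp supplies the "as of" dates,
# ap supplies the posts counted in the year ending on each date.
rows = conn.execute("""
    SELECT lp.post_date, ap.user_id, count(*) AS post_count
    FROM post lp
    JOIN post ap
      ON ap.post_date BETWEEN date(lp.post_date, '-1 year') AND lp.post_date
    GROUP BY lp.post_date, ap.user_id
    ORDER BY lp.post_date, ap.user_id
""").fetchall()
for r in rows:
    print(r)
```

Note that duplicate dates in lp would double-count; with a real posts table you'd drive lp from a list of distinct dates.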
-
postgres=# explain select p.created_at::date, u.name, tt.count
           from posts p
           join (
               select lp.created_at::date last_post_date,
                      ap.author_id user_id,
                      count(ap.id) post_count
               from posts lp
               join posts ap on ap.created_at::date between lp.created_at::date - interval '1 year' and lp.created_at::date
               group by last_post_date, user_id
           ) tt on tt.last_post_date = p.created_at::date
           join users u on u.id = tt.user_id
           order by p.created_at::date desc, tt.post_count desc;
ERROR:  column "p.created_at" must appear in the GROUP BY clause or be used in an aggregate function
LINE 1: explain select p.created_at::date, u.name, tt.count from pos...
                       ^
Postgres apparently doesn't like GROUP BY inside a join.
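The query at the top of the thread already sidesteps this with a window function (rank() over a partition), which Postgres supports natively and SQLite has since 3.25. A minimal sketch of that approach with Python's stdlib sqlite3, using a made-up pre-aggregated counts table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE counts (day TEXT, author_id INTEGER, n INTEGER)")
conn.executemany("INSERT INTO counts VALUES (?, ?, ?)", [
    ("2016-07-09", 1, 5), ("2016-07-09", 2, 9), ("2016-07-09", 3, 9),
    ("2016-07-10", 1, 4),
])
# rank() restarts for each day and gives ties the same rank, so a
# "WHERE rnk <= 10" filter would keep everyone tied at the cutoff.
rows = conn.execute("""
    SELECT day, author_id,
           rank() OVER (PARTITION BY day ORDER BY n DESC) AS rnk
    FROM counts
    ORDER BY day, rnk, author_id
""").fetchall()
print(rows)
```

Unlike the join-back-to-the-table approach, the window function never forces the outer query into a GROUP BY of its own.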
-
@ben_lubar Ben L, if you post another SQL query on a single line I will hunt you down and kill you, I am not kidding.
Your keyboard has an "enter" key, please use it.
@ben_lubar said in @ben_lubar tries to parse <del>ServerCooties</del> data:
Postgres apparently doesn't like GROUP BY inside a join.
I can't help you with that. I use a DBMS that doesn't suck shit.
EDIT: this StackOverflow question contains a working subquery with a "group by" in it:

select n1.name, n1.author_id, cast(count_1 as numeric)/total_count
from (
    select id, name, author_id, count(1) as count_1
    from names
    group by id, name, author_id
) n1
inner join (
    select author_id, count(1) as total_count
    from names
    group by author_id
) n2 on (n2.author_id = n1.author_id)
So you can do it. I'm not sure what's wrong with the one-line monstrosity you posted.
-
FYI, here's the query for the :
WITH exclusions AS (
    /* Which topics to exclude from counters */
    SELECT user_id, id, topic_id, post_number
    FROM posts
    WHERE raw LIKE '%c09fa970-5a9a-11e4-8ed6-0800200c9a66%'
    AND user_id IN (
        SELECT gu.user_id FROM group_users gu
        WHERE group_id IN (
            SELECT g.id FROM groups g WHERE g.name IN ('badge_granters_admin')
        )
    )
),
LastMonth AS (
    /* Count eligible posts from last month */
    SELECT row_number() OVER (ORDER BY count(*) DESC, bp.user_id),
           u.username, bp.user_id, count(*)
    FROM badge_posts bp
    JOIN users u on u.id=bp.user_id
    AND bp.user_id NOT IN (
        /* ignore bots */
        SELECT gu.user_id FROM group_users gu
        WHERE group_id IN (
            SELECT g.id FROM groups g WHERE g.name IN ('bots')
        )
    )
    WHERE topic_id NOT IN (
        /* short topics */
        SELECT topic_id FROM badge_posts GROUP BY topic_id HAVING count(topic_id) < 10
    )
    AND topic_id NOT IN (
        /* Ineligible topics */
        SELECT topic_id FROM exclusions
    )
    AND bp.created_at > CURRENT_DATE - INTERVAL '1 month'
    GROUP BY u.username, bp.user_id
    HAVING count(*) > 1
    ORDER BY count(*) DESC, bp.user_id
),
TotalUsers AS (
    SELECT max(row_number) from LastMonth
)
SELECT username, user_id, row_number, (row_number*100/TotalUsers.max), count, CURRENT_DATE granted_at
FROM LastMonth, TotalUsers
WHERE row_number = 1
-
@blakeyrat said in @ben_lubar tries to parse <del>ServerCooties</del> data:
Your keyboard has an "enter" key, please use it.
explain select p.created_at::date, u.name, tt.count
from posts p
join (
    select lp.created_at::date last_post_date,
           ap.author_id user_id,
           count(ap.id) post_count
    from posts lp
    join posts ap on ap.created_at::date between lp.created_at::date - interval '1 year' and lp.created_at::date
    group by last_post_date, user_id
) tt on tt.last_post_date = p.created_at::date
join users u on u.id = tt.user_id
group by p.created_at::date, u.name, tt.post_count
order by p.created_at::date desc, tt.post_count desc;
                                                                             QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=3904072989514.71..4088823725666.74 rows=7390029446081 width=71)
   Group Key: ((p.created_at)::date), tt.post_count, u.name
   ->  Sort  (cost=3904072989514.71..3922548063129.91 rows=7390029446081 width=71)
         Sort Key: ((p.created_at)::date) DESC, tt.post_count DESC, u.name
         ->  Merge Join  (cost=54950161682.14..202754857645.02 rows=7390029446081 width=71)
               Merge Cond: (((p.created_at)::date) = tt.last_post_date)
               ->  Sort  (cost=163649.36..165902.26 rows=901161 width=8)
                     Sort Key: ((p.created_at)::date)
                     ->  Seq Scan on posts p  (cost=0.00..62194.61 rows=901161 width=8)
               ->  Materialize  (cost=54949998032.79..54958198597.89 rows=1640113020 width=67)
                     ->  Sort  (cost=54949998032.79..54954098315.34 rows=1640113020 width=67)
                           Sort Key: tt.last_post_date
                           ->  Hash Join  (cost=53432873040.44..54429888270.11 rows=1640113020 width=67)
                                 Hash Cond: (tt.user_id = u.id)
                                 ->  Subquery Scan on tt  (cost=53432866404.24..54372092444.88 rows=1640113020 width=64)
                                       ->  GroupAggregate  (cost=53432866404.24..54355691314.68 rows=1640113020 width=24)
                                             Group Key: ((lp.created_at)::date), ap.author_id
                                             ->  Sort  (cost=53432866404.24..53658447278.66 rows=90232349769 width=24)
                                                   Sort Key: ((lp.created_at)::date), ap.author_id
                                                   ->  Nested Loop  (cost=0.00..27761429835.18 rows=90232349769 width=24)
                                                         Join Filter: (((ap.created_at)::date <= (lp.created_at)::date) AND ((ap.created_at)::date >= ((lp.created_at)::date - '1 year'::interval)))
                                                         ->  Seq Scan on posts ap  (cost=0.00..62194.61 rows=901161 width=24)
                                                         ->  Materialize  (cost=0.00..70221.42 rows=901161 width=8)
                                                               ->  Seq Scan on posts lp  (cost=0.00..62194.61 rows=901161 width=8)
                                 ->  Hash  (cost=4033.98..4033.98 rows=141698 width=19)
                                       ->  Seq Scan on users u  (cost=0.00..4033.98 rows=141698 width=19)
(26 rows)
rows=7390029446081
oh god
-
@ben_lubar It should return a maximum of 10 rows per day for each day that had at least one post.
EDIT: oh I see the problem. My inner table NAMED top 10 at some point lost its "top 10". Oops.
-
                                                                                QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=54391206248.86..54391207375.31 rows=45058 width=71)
   Group Key: ((p.created_at)::date), tt.post_count, u.name
   ->  Sort  (cost=54391206248.86..54391206361.51 rows=45058 width=71)
         Sort Key: ((p.created_at)::date) DESC, tt.post_count DESC, u.name
         ->  Hash Join  (cost=54391133650.85..54391200914.49 rows=45058 width=71)
               Hash Cond: ((p.created_at)::date = tt.last_post_date)
               ->  Seq Scan on posts p  (cost=0.00..62194.61 rows=901161 width=8)
               ->  Hash  (cost=54391133650.73..54391133650.73 rows=10 width=67)
                     ->  Nested Loop  (cost=54391133567.67..54391133650.73 rows=10 width=67)
                           ->  Subquery Scan on tt  (cost=54391133567.38..54391133567.50 rows=10 width=64)
                                 ->  Limit  (cost=54391133567.38..54391133567.40 rows=10 width=24)
                                       ->  Sort  (cost=54391133567.38..54395233849.93 rows=1640113020 width=24)
                                             Sort Key: ((lp.created_at)::date) DESC, (count(ap.id)) DESC
                                             ->  GroupAggregate  (cost=53432866404.24..54355691314.68 rows=1640113020 width=24)
                                                   Group Key: ((lp.created_at)::date), ap.author_id
                                                   ->  Sort  (cost=53432866404.24..53658447278.66 rows=90232349769 width=24)
                                                         Sort Key: ((lp.created_at)::date) DESC, ap.author_id
                                                         ->  Nested Loop  (cost=0.00..27761429835.18 rows=90232349769 width=24)
                                                               Join Filter: (((ap.created_at)::date <= (lp.created_at)::date) AND ((ap.created_at)::date >= ((lp.created_at)::date - '1 year'::interval)))
                                                               ->  Seq Scan on posts ap  (cost=0.00..62194.61 rows=901161 width=24)
                                                               ->  Materialize  (cost=0.00..70221.42 rows=901161 width=8)
                                                                     ->  Seq Scan on posts lp  (cost=0.00..62194.61 rows=901161 width=8)
                           ->  Index Scan using users_pkey on users u  (cost=0.29..8.31 rows=1 width=19)
                                 Index Cond: (id = tt.user_id)
(24 rows)
nested nested nested loops are scary
-
@ben_lubar said in @ben_lubar tries to parse <del>ServerCooties</del> data:
oh god
Your idea to pull up 10 users for each day for the last 12 years or whatever based on what we all did for the year previous to that was never going to be a reasonable query. Why do you want this?
Maybe break it down by month? Make a table and put each user's post count by month in there. Aggregating that would be simple. This is probably an even better idea if you really really really want to go by day.
But either way you'll probably want / need to break the insert up, or the commit / redo stuff will kill you. Literally. Dead.
-
@boomzilla said in @ben_lubar tries to parse <del>ServerCooties</del> data:
never going to be a reasonable query
THIS DATA IS TOO BIG
-
@ben_lubar Yes. You can't eat the elephant all at once.
-
@ben_lubar Hey you asked us for help. If you think you can write it better, knock yourself out.
-
@ben_lubar Set up a script to do a day at a time, or something, and count all the posts for that day per user. So each day is done inside its own transaction. This is going to suck no matter what, but when you get done, you'll be able to do a lot of interesting things with that data very quickly.
Also...don't create any indexes or foreign keys on your table while you're filling it up. Do that afterwards.
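The advice above (one day per transaction, indexes only after the load) can be sketched with Python's stdlib sqlite3; table names, columns, and data here are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (author_id INTEGER, created_at TEXT)")
conn.executemany("INSERT INTO posts VALUES (?, ?)", [
    (1, "2016-07-09 08:00:00"), (1, "2016-07-09 09:00:00"),
    (2, "2016-07-10 11:00:00"),
])
# Summary table, deliberately created WITHOUT indexes for the load.
conn.execute("CREATE TABLE daily_counts (day TEXT, author_id INTEGER, n INTEGER)")

days = [r[0] for r in conn.execute(
    "SELECT DISTINCT date(created_at) FROM posts ORDER BY 1")]
for day in days:
    # One transaction per day keeps each unit of work small.
    with conn:
        conn.execute(
            "INSERT INTO daily_counts "
            "SELECT date(created_at), author_id, count(*) FROM posts "
            "WHERE date(created_at) = ? GROUP BY author_id", (day,))

# Index added only after the bulk load is done, per the advice above.
conn.execute("CREATE INDEX daily_counts_day ON daily_counts (day)")
result = conn.execute(
    "SELECT * FROM daily_counts ORDER BY day, author_id").fetchall()
print(result)
```

Once daily_counts exists, the "top posters over a trailing year" question becomes a sum over at most 365 small rows per user instead of a self-join over every post.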
-
@blakeyrat said in @ben_lubar tries to parse <del>ServerCooties</del> data:
User's a reserved keyword in SQL. The square braces indicate I want to use the User database object, not the reserved keyword
Ah. That's kind of like how in Access, you could have a field with a space in the name, as long as you referred to it with brackets when you used it.
-
@boomzilla said in @ben_lubar tries to parse <del>ServerCooties</del> data:
count all the posts for that day per user.
Quick question, how many posts are there, total?
define temp-table tt
    field user  as int
    field date  as date
    field count as int
    .

for each posts no-lock:
    find first tt where tt.user = posts.user_id and tt.date = posts.date no-error.
    if not available tt then do:
        create tt.
        assign tt.user = posts.user_id
               tt.date = posts.date.
    end.
    tt.count = tt.count + 1.
end.
at the end of this, the temp-table is full of records that show how many posts every user had on every day in which that user had posts. That's likely to be a much nicer data set with which to work.
One pass. No sorting. If you need a sort of some kind, put an appropriate index on the temp table. The query takes about as long to run as "select user_id from posts".

Want to know the top ten posters for each day? "select top 10 user from posts where date=whatever order by count". Only with the right syntax.
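In Postgres (and SQLite) "the right syntax" for TOP 10 is ORDER BY ... LIMIT 10. A quick sketch against a made-up pre-aggregated counts table, using Python's stdlib sqlite3:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_counts (day TEXT, user_id INTEGER, n INTEGER)")
conn.executemany("INSERT INTO daily_counts VALUES (?, ?, ?)", [
    ("2016-07-09", 1, 5), ("2016-07-09", 2, 9), ("2016-07-09", 3, 7),
])
# T-SQL's "select top 10 ..." becomes "... ORDER BY n DESC LIMIT 10".
rows = conn.execute(
    "SELECT user_id, n FROM daily_counts WHERE day = ? "
    "ORDER BY n DESC LIMIT 10", ("2016-07-09",)).fetchall()
print(rows)  # [(2, 9), (3, 7), (1, 5)]
```

With an index on (day, n) this is a cheap lookup per day rather than a scan.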
-
@FrostCat said in @ben_lubar tries to parse <del>ServerCooties</del> data:
@blakeyrat said in @ben_lubar tries to parse <del>ServerCooties</del> data:
User's a reserved keyword in SQL. The square braces indicate I want to use the User database object, not the reserved keyword
Ah. That's kind of like how in Access, you could have a field with a space in the name, as long as you referred to it with brackets when you used it.
T-SQL does allow spaces in the name if you delimit it with either square brackets or double quotes.
PL/SQL allows aliases to have spaces inside double quotes. Square brackets apparently indicate something involving expressions.
-
@djls45 said in @ben_lubar tries to parse <del>ServerCooties</del> data:
T-SQL does allow spaces
I was just trying to come up with an analogy. In Blakey's usage, I guess the brackets escape the normal meaning of the word.
-
what db is this?
-
@clippy said in @ben_lubar tries to parse <del>ServerCooties</del> data:
what db is this?
postgresql
-
@djls45 said in @ben_lubar tries to parse <del>ServerCooties</del> data:
PL/SQL allows aliases to have spaces inside double quotes.
That's the syntax that the SQL standard specifies. Apparently MSSQL also supports it (as well as the bracketed form).
-
My progress on the !!SCIENCE!! script is here: https://gist.github.com/BenLubar/98fe068162c5ae9b11659b7029da3a76
-
2016/07/09 19:46:18 float64 cannot be converted to int64 without losing precision: NaN
main.Int64
    /home/ben/go/src/gist.github.com/98fe068162c5ae9b11659b7029da3a76/main.go:1491
main.HandleTopic.func1
    /home/ben/go/src/gist.github.com/98fe068162c5ae9b11659b7029da3a76/main.go:1188
main.ForSortedSet
    /home/ben/go/src/gist.github.com/98fe068162c5ae9b11659b7029da3a76/main.go:1561
main.HandleTopic
    /home/ben/go/src/gist.github.com/98fe068162c5ae9b11659b7029da3a76/main.go:1195
main.ForSortedSet
    /home/ben/go/src/gist.github.com/98fe068162c5ae9b11659b7029da3a76/main.go:1561
main.main
    /home/ben/go/src/gist.github.com/98fe068162c5ae9b11659b7029da3a76/main.go:110
runtime.main
    /home/ben/.golang/src/runtime/proc.go:183
runtime.goexit
    /home/ben/.golang/src/runtime/asm_amd64.s:2086
main.HandleTopic.func1
    /home/ben/go/src/gist.github.com/98fe068162c5ae9b11659b7029da3a76/main.go:1190: position
main.ForSortedSet
    /home/ben/go/src/gist.github.com/98fe068162c5ae9b11659b7029da3a76/main.go:1561: ForSortedSet("tid:18672:bookmarks", ("21", NaN))
main.HandleTopic
    /home/ben/go/src/gist.github.com/98fe068162c5ae9b11659b7029da3a76/main.go:1196: bookmarks
main.ForSortedSet
    /home/ben/go/src/gist.github.com/98fe068162c5ae9b11659b7029da3a76/main.go:1561: ForSortedSet("topics:tid", ("18672", 9.805716e+11))
exit status 1
-
2016/07/10 11:04:40 topic:20281:"Re: the deleted post thread"
2016/07/10 11:04:40 topic:20282:"Re: the deleted post thread"
2016/07/10 11:04:40 topic:20283:"Fork + github + npm"
2016/07/10 11:04:40 topic:20284:"Bot User Agents"
2016/07/10 11:04:40 topic:20285:"Why is polygamy illegal?"
2016/07/10 11:04:40 topic:20286:"GOOOOOGGGGGLLLLLEEEE"
2016/07/10 11:04:40 pq: invalid byte sequence for encoding "UTF8": 0x00
main.HandleTopic
    /home/ben/go/src/gist.github.com/98fe068162c5ae9b11659b7029da3a76/main.go:1177: insert topic into database
main.ForSortedSet
    /home/ben/go/src/gist.github.com/98fe068162c5ae9b11659b7029da3a76/main.go:1565: ForSortedSet("topics:tid", ("20287", 1.466585097322e+12))
exit status 1
Goddamnit @aliceif
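That "invalid byte sequence for encoding UTF8: 0x00" error is Postgres rejecting a NUL byte: Postgres text values can contain any valid UTF-8 except 0x00. The thread's script is Go, but the usual fix in any language is to strip NULs from user-supplied strings before inserting; a hypothetical sketch in Python (the sanitize helper is made up, not from the gist):

```python
# Postgres TEXT columns reject NUL (0x00) bytes, so strip them
# from any user-supplied strings before handing them to an INSERT.
def sanitize(s: str) -> str:
    return s.replace("\x00", "")

title = "Fork + github + npm\x00"
clean = sanitize(title)
assert "\x00" not in clean
print(clean)  # Fork + github + npm
```

Whether silently dropping the byte is acceptable depends on the data; the alternative is rejecting the record and logging it, which is effectively what the import script did here.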