@ben_lubar tries to parse <del>ServerCooties</del> data
-
Here's the query I want to run:
```sql
with dates as (
    select ((select min(created_at)::date from posts) + day * interval '1 day')::date date
    from generate_series(0, (select max(created_at)::date from posts) - (select min(created_at)::date from posts)) day
    order by date asc
), counts as (
    select date, author_id, count(*)
    from dates, posts
    where created_at between date - interval '1 year' and date
    group by date, author_id
    order by date asc, count desc
), ranked as (
    select *, rank() over (partition by date order by count desc)
    from counts
)
select date, rank, name, count
from ranked
left join users on id = author_id
where rank <= 10
order by date asc, rank asc;
```
Here's what I get when I limit it to the last 10 days:
```
postgres=# explain analyze verbose
with dates as (
    select ((select min(created_at)::date from posts) + day * interval '1 day')::date date
    from generate_series(0, (select max(created_at)::date from posts) - (select min(created_at)::date from posts)) day
    order by date desc
    limit 10
), counts as (
    select date, author_id, count(*)
    from dates, posts
    where created_at between date - interval '1 year' and date
    group by date, author_id
    order by date asc, count desc
), ranked as (
    select *, rank() over (partition by date order by count desc)
    from counts
)
select date, rank, name, count
from ranked
left join users on id = author_id
where rank <= 10
order by date asc, rank asc;

                                                                      QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=208088.75..208103.92 rows=6067 width=31) (actual time=7513.875..7513.921 rows=100 loops=1)
   Output: ranked.date, ranked.rank, users.name, ranked.count
   Sort Key: ranked.date, ranked.rank
   Sort Method: quicksort  Memory: 32kB
   CTE dates
     ->  Limit  (cost=43.19..43.21 rows=10 width=4) (actual time=12.654..12.678 rows=10 loops=1)
           Output: ((($1 + ((day.day)::double precision * '1 day'::interval)))::date)
           InitPlan 2 (returns $1)
             ->  Result  (cost=0.51..0.53 rows=1 width=0) (actual time=0.044..0.044 rows=1 loops=1)
                   Output: ($0)::date
                   InitPlan 1 (returns $0)
                     ->  Limit  (cost=0.42..0.51 rows=1 width=8) (actual time=0.040..0.041 rows=1 loops=1)
                           Output: posts.created_at
                           ->  Index Only Scan using posts_created_at on public.posts  (cost=0.42..78856.24 rows=901161 width=8) (actual time=0.038..0.038 rows=1 loops=1)
                                 Output: posts.created_at
                                 Index Cond: (posts.created_at IS NOT NULL)
                                 Heap Fetches: 1
           InitPlan 4 (returns $3)
             ->  Result  (cost=0.51..0.53 rows=1 width=0) (actual time=0.095..0.096 rows=1 loops=1)
                   Output: ($2)::date
                   InitPlan 3 (returns $2)
                     ->  Limit  (cost=0.42..0.51 rows=1 width=8) (actual time=0.087..0.088 rows=1 loops=1)
                           Output: posts_1.created_at
                           ->  Index Only Scan Backward using posts_created_at on public.posts posts_1  (cost=0.42..78856.24 rows=901161 width=8) (actual time=0.085..0.085 rows=1 loops=1)
                                 Output: posts_1.created_at
                                 Index Cond: (posts_1.created_at IS NOT NULL)
                                 Heap Fetches: 1
           InitPlan 6 (returns $5)
             ->  Result  (cost=0.51..0.53 rows=1 width=0) (actual time=0.041..0.042 rows=1 loops=1)
                   Output: ($4)::date
                   InitPlan 5 (returns $4)
                     ->  Limit  (cost=0.42..0.51 rows=1 width=8) (actual time=0.037..0.037 rows=1 loops=1)
                           Output: posts_2.created_at
                           ->  Index Only Scan using posts_created_at on public.posts posts_2  (cost=0.42..78856.24 rows=901161 width=8) (actual time=0.035..0.035 rows=1 loops=1)
                                 Output: posts_2.created_at
                                 Index Cond: (posts_2.created_at IS NOT NULL)
                                 Heap Fetches: 1
           ->  Sort  (cost=41.61..44.11 rows=1000 width=4) (actual time=12.652..12.663 rows=10 loops=1)
                 Output: ((($1 + ((day.day)::double precision * '1 day'::interval)))::date)
                 Sort Key: ((($1 + ((day.day)::double precision * '1 day'::interval)))::date) DESC
                 Sort Method: top-N heapsort  Memory: 25kB
                 ->  Function Scan on pg_catalog.generate_series day  (cost=0.01..20.01 rows=1000 width=4) (actual time=1.121..8.068 rows=5634 loops=1)
                       Output: (($1 + ((day.day)::double precision * '1 day'::interval)))::date
                       Function Call: generate_series(0, ($3 - $5))
   CTE counts
     ->  Sort  (cost=198063.98..198109.48 rows=18200 width=12) (actual time=7286.828..7291.778 rows=9868 loops=1)
           Output: dates.date, posts_3.author_id, (count(*))
           Sort Key: dates.date, (count(*)) DESC
           Sort Method: quicksort  Memory: 1155kB
           ->  HashAggregate  (cost=196594.18..196776.18 rows=18200 width=12) (actual time=7272.899..7279.037 rows=9868 loops=1)
                 Output: dates.date, posts_3.author_id, count(*)
                 Group Key: dates.date, posts_3.author_id
                 ->  Nested Loop  (cost=0.43..189084.51 rows=1001290 width=12) (actual time=12.726..5342.047 rows=2792422 loops=1)
                       Output: dates.date, posts_3.author_id
                       ->  CTE Scan on dates  (cost=0.00..0.20 rows=10 width=4) (actual time=12.655..12.710 rows=10 loops=1)
                             Output: dates.date
                       ->  Index Scan using posts_created_at on public.posts posts_3  (cost=0.43..17907.14 rows=100129 width=16) (actual time=0.072..256.479 rows=279242 loops=10)
                             Output: posts_3.id, posts_3.topic_id, posts_3.author_id, posts_3.author_name, posts_3.parent_post_id, posts_3.content, posts_3.created_at, posts_3.ip, posts_3.editor_id, posts_3.edited_at, posts_3.deleted
                             Index Cond: ((posts_3.created_at >= (dates.date - '1 year'::interval)) AND (posts_3.created_at <= dates.date))
   CTE ranked
     ->  WindowAgg  (cost=1651.80..2015.80 rows=18200 width=20) (actual time=7308.258..7328.054 rows=9868 loops=1)
           Output: counts.date, counts.author_id, counts.count, rank() OVER (?)
           ->  Sort  (cost=1651.80..1697.30 rows=18200 width=20) (actual time=7308.250..7313.102 rows=9868 loops=1)
                 Output: counts.date, counts.count, counts.author_id
                 Sort Key: counts.date, counts.count DESC
                 Sort Method: quicksort  Memory: 1155kB
                 ->  CTE Scan on counts  (cost=0.00..364.00 rows=18200 width=20) (actual time=7286.832..7302.301 rows=9868 loops=1)
                       Output: counts.date, counts.count, counts.author_id
   ->  Merge Right Join  (cost=791.01..7539.04 rows=6067 width=31) (actual time=7335.604..7513.793 rows=100 loops=1)
         Output: ranked.date, ranked.rank, users.name, ranked.count
         Merge Cond: (users.id = ranked.author_id)
         ->  Index Scan using users_pkey on public.users  (cost=0.29..6303.08 rows=141698 width=19) (actual time=0.021..107.009 rows=140880 loops=1)
               Output: users.id, users.name, users.slug, users.password, users.full_name, users.email, users.email_confirmed, users.banned, users.ban_expiration, users.join_date, users.last_online, users.picture, users.uploaded_picture, users.cover_url, users.cover_position, users.group_title, users.website, users.location, users.signature, users.about_me, users.birthday, users.profile_views
         ->  Sort  (cost=790.71..805.88 rows=6067 width=28) (actual time=7335.575..7335.622 rows=100 loops=1)
               Output: ranked.date, ranked.rank, ranked.count, ranked.author_id
               Sort Key: ranked.author_id
               Sort Method: quicksort  Memory: 32kB
               ->  CTE Scan on ranked  (cost=0.00..409.50 rows=6067 width=28) (actual time=7308.268..7335.507 rows=100 loops=1)
                     Output: ranked.date, ranked.rank, ranked.count, ranked.author_id
                     Filter: (ranked.rank <= 10)
                     Rows Removed by Filter: 9768
 Planning time: 0.690 ms
 Execution time: 7514.711 ms
(83 rows)
```
With these results (it's not an up-to-date backup):
```
    date    | rank |    name     | count
------------+------+-------------+-------
 2016-06-21 |    1 | boomzilla   | 13416
 2016-06-21 |    2 | blakeyrat   | 12541
 2016-06-21 |    3 | Polygeekery |  9921
 2016-06-21 |    4 | FrostCat    |  9382
 2016-06-21 |    5 | accalia     |  9209
 2016-06-21 |    6 | loopback0   |  7961
 2016-06-21 |    7 | Fox         |  7804
 2016-06-21 |    8 | ben_lubar   |  7435
 2016-06-21 |    9 | dkf         |  7334
 2016-06-21 |   10 | abarker     |  6576
 2016-06-22 |    1 | boomzilla   | 13353
 2016-06-22 |    2 | blakeyrat   | 12521
 2016-06-22 |    3 | Polygeekery |  9919
 2016-06-22 |    4 | FrostCat    |  9377
 2016-06-22 |    5 | accalia     |  9198
 2016-06-22 |    6 | loopback0   |  7961
 2016-06-22 |    7 | Fox         |  7804
 2016-06-22 |    8 | ben_lubar   |  7420
 2016-06-22 |    9 | dkf         |  7333
 2016-06-22 |   10 | abarker     |  6569
 2016-06-23 |    1 | boomzilla   | 13235
 2016-06-23 |    2 | blakeyrat   | 12472
 2016-06-23 |    3 | Polygeekery |  9893
 2016-06-23 |    4 | FrostCat    |  9371
 2016-06-23 |    5 | accalia     |  9115
 2016-06-23 |    6 | loopback0   |  7961
 2016-06-23 |    7 | Fox         |  7804
 2016-06-23 |    8 | ben_lubar   |  7403
 2016-06-23 |    9 | dkf         |  7320
 2016-06-23 |   10 | abarker     |  6520
 2016-06-24 |    1 | boomzilla   | 13184
 2016-06-24 |    2 | blakeyrat   | 12437
 2016-06-24 |    3 | Polygeekery |  9879
 2016-06-24 |    4 | FrostCat    |  9326
 2016-06-24 |    5 | accalia     |  9076
 2016-06-24 |    6 | loopback0   |  7961
 2016-06-24 |    7 | Fox         |  7804
 2016-06-24 |    8 | ben_lubar   |  7394
 2016-06-24 |    9 | dkf         |  7309
 2016-06-24 |   10 | abarker     |  6510
 2016-06-25 |    1 | boomzilla   | 13131
 2016-06-25 |    2 | blakeyrat   | 12405
 2016-06-25 |    3 | Polygeekery |  9868
 2016-06-25 |    4 | FrostCat    |  9296
 2016-06-25 |    5 | accalia     |  9009
 2016-06-25 |    6 | loopback0   |  7961
 2016-06-25 |    7 | Fox         |  7804
 2016-06-25 |    8 | ben_lubar   |  7370
 2016-06-25 |    9 | dkf         |  7299
 2016-06-25 |   10 | abarker     |  6468
 2016-06-26 |    1 | boomzilla   | 13046
 2016-06-26 |    2 | blakeyrat   | 12363
 2016-06-26 |    3 | Polygeekery |  9863
 2016-06-26 |    4 | FrostCat    |  9256
 2016-06-26 |    5 | accalia     |  8932
 2016-06-26 |    6 | loopback0   |  7961
 2016-06-26 |    7 | Fox         |  7804
 2016-06-26 |    8 | ben_lubar   |  7354
 2016-06-26 |    9 | dkf         |  7289
 2016-06-26 |   10 | Tsaukpaetra |  6465
 2016-06-27 |    1 | boomzilla   | 12960
 2016-06-27 |    2 | blakeyrat   | 12337
 2016-06-27 |    3 | Polygeekery |  9861
 2016-06-27 |    4 | FrostCat    |  9225
 2016-06-27 |    5 | accalia     |  8856
 2016-06-27 |    6 | loopback0   |  7961
 2016-06-27 |    7 | Fox         |  7804
 2016-06-27 |    8 | ben_lubar   |  7329
 2016-06-27 |    9 | dkf         |  7271
 2016-06-27 |   10 | Tsaukpaetra |  6464
 2016-06-28 |    1 | boomzilla   | 12951
 2016-06-28 |    2 | blakeyrat   | 12316
 2016-06-28 |    3 | Polygeekery |  9859
 2016-06-28 |    4 | FrostCat    |  9219
 2016-06-28 |    5 | accalia     |  8847
 2016-06-28 |    6 | loopback0   |  7961
 2016-06-28 |    7 | Fox         |  7804
 2016-06-28 |    8 | ben_lubar   |  7322
 2016-06-28 |    9 | dkf         |  7266
 2016-06-28 |   10 | Tsaukpaetra |  6464
 2016-06-29 |    1 | boomzilla   | 12935
 2016-06-29 |    2 | blakeyrat   | 12305
 2016-06-29 |    3 | Polygeekery |  9853
 2016-06-29 |    4 | FrostCat    |  9205
 2016-06-29 |    5 | accalia     |  8822
 2016-06-29 |    6 | loopback0   |  7961
 2016-06-29 |    7 | Fox         |  7804
 2016-06-29 |    8 | ben_lubar   |  7293
 2016-06-29 |    9 | dkf         |  7265
 2016-06-29 |   10 | Tsaukpaetra |  6464
 2016-06-30 |    1 | boomzilla   | 12908
 2016-06-30 |    2 | blakeyrat   | 12265
 2016-06-30 |    3 | Polygeekery |  9834
 2016-06-30 |    4 | FrostCat    |  9195
 2016-06-30 |    5 | accalia     |  8742
 2016-06-30 |    6 | loopback0   |  7961
 2016-06-30 |    7 | Fox         |  7804
 2016-06-30 |    8 | ben_lubar   |  7272
 2016-06-30 |    9 | dkf         |  7258
 2016-06-30 |   10 | Tsaukpaetra |  6464
(100 rows)
```
How can I make the query run faster? There are about 6000 days between the first and last posts in the database, so taking over an hour to get this data isn't going to be fun for whoever is waiting for the chart to be generated.
-
@ben_lubar said in @ben_lubar tries to parse <del>ServerCooties</del> data:
How can I make the query run faster? There are about 6000 days between the first and last posts in the database, so taking over an hour to get this data isn't going to be fun for whoever is waiting for the chart to be generated.
Try materializing some of those subqueries. Temp table or persisted view.
Assuming you're just working with a snapshot exported from Mongo, you can denormalize that shit without too much worry about maintaining consistency.
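For the temp-table route, a minimal sketch (the `daily_counts` table and index names here are illustrative, not from the thread): collapse posts down to one row per day per author once, so that later ranking queries never have to touch the full posts table.

```sql
-- Sketch only: materialize per-day, per-author post counts once.
-- "daily_counts" is a hypothetical name, not part of the schema in this thread.
create temporary table daily_counts as
    select created_at::date as date, author_id, count(*) as count
    from posts
    group by 1, 2;

-- Index to make per-date range lookups cheap.
create index daily_counts_date_author on daily_counts (date, author_id);

-- Refresh planner statistics for the new temp table.
analyze daily_counts;
```

With ~6000 days of posts this table should be a few hundred thousand rows, so any rolling-year aggregate built on top of it reads far less data than re-scanning a year's worth of posts for every date.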
-
@ben_lubar Well here's a few suggestions:
- Format your query so it's on more than one line, because WTF man
- Get rid of the roughly 57436 subqueries you have and just do a single join clause
You appear to have a query within a query within a query if I'm reading that right. Jesus man.
-
```sql
begin;

create index posts_created_at on posts (created_at);
create index posts_author_id on posts (author_id);

create temporary table temp_dates as
    select ((select min(created_at)::date from posts) + day * interval '1 day')::date date
    from generate_series(0, (select max(created_at)::date from posts) - (select min(created_at)::date from posts)) day;

create temporary table temp_counts_pre as
    select date, author_id, count(*)
    from temp_dates, posts
    where created_at::date = date
    group by date, author_id;

create index temp_counts_pre_date_author_id on temp_counts_pre (date, author_id);

create temporary table temp_counts as
    select a.date, author_id, sum(count) count
    from temp_dates a, temp_counts_pre b
    where b.date between a.date - interval '1 year' and a.date
    group by a.date, author_id;

create index temp_counts_date_count on temp_counts (date, count desc);

explain analyze verbose
with counts as (
    select *, rank() over (partition by date order by count desc)
    from temp_counts
)
select date, rank, name, count
from counts
left join users on id = author_id
where rank <= 10
order by date asc, rank asc;

with counts as (
    select *, rank() over (partition by date order by count desc)
    from temp_counts
)
select date, rank, name, count
from counts
left join users on id = author_id
where rank <= 10
order by date asc, rank asc;

rollback;
```
Output:
```
postgres=# begin;
BEGIN
Time: 0.137 ms
postgres=# create index posts_created_at on posts (created_at);
CREATE INDEX
Time: 3773.139 ms
postgres=# create index posts_author_id on posts (author_id);
CREATE INDEX
Time: 1269.059 ms
postgres=# create temporary table temp_dates as select ((select min(created_at)::date from posts) + day * interval '1 day')::date date from generate_series(0, (select max(created_at)::date from posts) - (select min(created_at)::date from posts)) day;
SELECT 5634
Time: 24.168 ms
postgres=# create temporary table temp_counts_pre as select date, author_id, count(*) from temp_dates, posts where created_at::date = date group by date, author_id;
SELECT 169666
Time: 2678.269 ms
postgres=# create index temp_counts_pre_date_author_id on temp_counts_pre (date, author_id);
CREATE INDEX
Time: 73.029 ms
postgres=# create temporary table temp_counts as select a.date, author_id, sum(count) count from temp_dates a, temp_counts_pre b where b.date between a.date - interval '1 year' and a.date group by a.date, author_id;
SELECT 4509688
Time: 129317.276 ms
postgres=# create index temp_counts_date_count on temp_counts (date, count desc);
CREATE INDEX
Time: 9796.385 ms
postgres=# explain analyze verbose with counts as (select *, rank() over (partition by date order by count desc) from temp_counts) select date, rank, name, count from counts left join users on id = author_id where rank <= 10 order by date asc, rank asc;
                                                                      QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=746886.42..750644.49 rows=1503229 width=55) (actual time=15610.866..15636.024 rows=44877 loops=1)
   Output: counts.date, counts.rank, users.name, counts.count
   Sort Key: counts.date, counts.rank
   Sort Method: external sort  Disk: 1752kB
   CTE counts
     ->  WindowAgg  (cost=0.43..330933.29 rows=4509688 width=44) (actual time=0.054..11095.926 rows=4509688 loops=1)
           Output: temp_counts.date, temp_counts.author_id, temp_counts.count, rank() OVER (?)
           ->  Index Scan using temp_counts_date_count on pg_temp_2.temp_counts  (cost=0.43..252013.75 rows=4509688 width=44) (actual time=0.044..3563.355 rows=4509688 loops=1)
                 Output: temp_counts.date, temp_counts.count, temp_counts.author_id
   ->  Hash Left Join  (cost=6636.20..158964.58 rows=1503229 width=55) (actual time=389.273..15535.957 rows=44877 loops=1)
         Output: counts.date, counts.rank, users.name, counts.count
         Hash Cond: (counts.author_id = users.id)
         ->  CTE Scan on counts  (cost=0.00..101467.98 rows=1503229 width=52) (actual time=0.061..15046.578 rows=44877 loops=1)
               Output: counts.date, counts.rank, counts.count, counts.author_id
               Filter: (counts.rank <= 10)
               Rows Removed by Filter: 4464811
         ->  Hash  (cost=4033.98..4033.98 rows=141698 width=19) (actual time=389.028..389.028 rows=141427 loops=1)
               Output: users.name, users.id
               Buckets: 65536  Batches: 4  Memory Usage: 2422kB
               ->  Seq Scan on public.users  (cost=0.00..4033.98 rows=141698 width=19) (actual time=28.603..288.194 rows=141427 loops=1)
                     Output: users.name, users.id
 Planning time: 34.240 ms
 Execution time: 15717.063 ms
(23 rows)

Time: 16434.860 ms
postgres=# with counts as (select *, rank() over (partition by date order by count desc) from temp_counts) select date, rank, name, count from counts left join users on id = author_id where rank <= 10 order by date asc, rank asc;
    date    | rank |         name         | count
------------+------+----------------------+-------
[snip]
(44877 rows)

Time: 6726.870 ms
postgres=# rollback;
ROLLBACK
Time: 70.226 ms
```
-
Thanks for ruining my statistics, @abarker
https://what.thedailywtf.com/post/797354
-
@ben_lubar said in @ben_lubar tries to parse <del>ServerCooties</del> data:
```
postgres=# create temporary table temp_counts as select a.date, author_id, sum(count) count from temp_dates a, temp_counts_pre b where b.date between a.date - interval '1 year' and a.date group by a.date, author_id;
SELECT 4509688
Time: 129317.276 ms
```
```
postgres=# explain analyze verbose create temporary table temp_counts as select a.date, author_id, sum(count) count from temp_dates a, temp_counts_pre b where b.date between a.date - interval '1 year' and a.date group by a.date, author_id;
                                                                      QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=27137620.94..28339921.77 rows=40000 width=20) (actual time=192270.885..276757.795 rows=4509688 loops=1)
   Output: a.date, b.author_id, sum(b.count)
   Group Key: a.date, b.author_id
   ->  Sort  (cost=27137620.94..27438071.14 rows=120180083 width=20) (actual time=192270.866..238706.102 rows=56896009 loops=1)
         Output: a.date, b.author_id, b.count
         Sort Key: a.date, b.author_id
         Sort Method: external sort  Disk: 1881848kB
         ->  Nested Loop  (cost=0.42..3615179.19 rows=120180083 width=20) (actual time=0.055..91374.964 rows=56896009 loops=1)
               Output: a.date, b.author_id, b.count
               ->  Seq Scan on pg_temp_2.temp_dates a  (cost=0.00..88.75 rows=6375 width=4) (actual time=0.018..7.720 rows=5634 loops=1)
                     Output: a.date
               ->  Index Scan using temp_counts_pre_date_author_id on pg_temp_2.temp_counts_pre b  (cost=0.42..378.55 rows=18852 width=20) (actual time=0.015..6.305 rows=10099 loops=5634)
                     Output: b.date, b.author_id, b.count
                     Index Cond: ((b.date >= (a.date - '1 year'::interval)) AND (b.date <= a.date))
 Planning time: 0.280 ms
 Execution time: 282411.389 ms
(16 rows)

Time: 282419.638 ms
```
I wonder if I can rewrite that query to insert each day's data separately so it doesn't have to sort a terabyte of data at once...
-
```sql
create temporary table temp_counts (
    date date not null,
    author_id bigint,
    count bigint not null
);

do $$
declare
    a record;
begin
    for a in select date from temp_dates order by date asc loop
        insert into temp_counts
        select a.date, author_id, sum(count) count
        from temp_counts_pre b
        where b.date between a.date - interval '1 year' and a.date
        group by a.date, author_id;
    end loop;
end;
$$ language plpgsql;
```
Unfortunately, each of the ~5600 per-day INSERT queries takes about 50ms, which means instead of taking 2 minutes, it now takes about 5 minutes.
-
@ben_lubar said in @ben_lubar tries to parse <del>ServerCooties</del> data:
I wonder if I can rewrite that query to insert each day's data separately so it doesn't have to sort a terabyte of data at once...
It might be faster to just do one table scan, and figure out which records you want outside the SQL. Yes, even if it's the whole table. It would certainly be saner.
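Read literally, that suggestion is a single grouped scan plus application-side bookkeeping; a sketch of the SQL half (the rolling-window part would live in whatever app code consumes the rows):

```sql
-- One pass over posts: emits one (day, author, count) row per group,
-- in date order. The consuming application keeps a per-author running
-- total, adding each new day's counts and subtracting days that fall
-- out of the one-year window, then picks the top 10 per day.
select created_at::date as day, author_id, count(*) as posts
from posts
group by 1, 2
order by 1;
```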
-
@ben_lubar said in @ben_lubar tries to parse <del>ServerCooties</del> data:
Unfortunately, each of the INSERT queries takes about 50ms
I haven't got the spare brainpower to decipher what you're trying to accomplish at the moment. Can you explain it in English instead of SQL?
-
@FrostCat said in @ben_lubar tries to parse <del>ServerCooties</del> data:
Can you explain it in English
TL;DR version: Top 10 users for "posts in the last year" for each day.
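Given that goal, the expensive part is re-summing a whole year for every one of ~6000 days. If per-day counts are materialized first (like temp_counts_pre above), newer PostgreSQL (11+) can express the rolling year directly with a RANGE window frame; a sketch, with the caveat that it only emits rows for days an author actually posted, so densifying against the dates series is still needed for an exact top-10-per-day answer:

```sql
-- Assumes a table of per-day, per-author counts like temp_counts_pre
-- (date, author_id, count). RANGE frames with an interval offset
-- require PostgreSQL 11 or later.
select date, author_id,
       sum(count) over (
           partition by author_id
           order by date
           range between interval '1 year' preceding and current row
       ) as rolling_year_count
from temp_counts_pre;
```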
-
@ben_lubar Ok. Let me cogitate for a bit. I still think your best bet is going to be--based on what's been said so far--a single table scan of posts into a temp table with some of the indexes you want. And bear in mind SQL per se is not my forte so I may not be able to tell you exactly how to write the SQL.
How would I find the db schema, such as it is? Don't actually give it to me now, I won't be able to process it.
-
@FrostCat said in @ben_lubar tries to parse <del>ServerCooties</del> data:
How would I find the db schema, such as it is?
When combined with the next sentence you wrote, I cannot parse what you meant.
In any case, when you want it, it'll be here:
```
postgres=# \dt
                List of relations
 Schema |          Name           | Type  |  Owner
--------+-------------------------+-------+----------
 public | categories              | table | postgres
 public | groups                  | table | postgres
 public | posts                   | table | postgres
 public | topics                  | table | postgres
 public | user_favorite_posts     | table | postgres
 public | user_groups             | table | postgres
 public | user_ips                | table | postgres
 public | user_post_votes         | table | postgres
 public | user_watched_categories | table | postgres
 public | user_watched_topics     | table | postgres
 public | user_watched_users      | table | postgres
 public | users                   | table | postgres
(12 rows)

postgres=# \d categories
                            Table "public.categories"
       Column       |          Type           | Modifiers
--------------------+-------------------------+---------------------------------------------------------
 id                 | bigint                  | not null default nextval('categories_id_seq'::regclass)
 name               | text                    | not null
 slug               | text                    | not null
 parent_category_id | bigint                  |
 position           | integer                 |
 description        | text                    |
 link               | character varying(2048) |
 icon               | character varying(255)  |
 fg_color           | character varying(255)  |
 bg_color           | character varying(255)  |
 recent_replies     | integer                 |
 disabled           | boolean                 | not null default false
Indexes:
    "categories_pkey" PRIMARY KEY, btree (id)
    "categories_slug" UNIQUE, btree (slug)
Check constraints:
    "categories_slug_exists" CHECK (slug <> ''::text)
Foreign-key constraints:
    "categories_parent_category_id_fkey" FOREIGN KEY (parent_category_id) REFERENCES categories(id) ON UPDATE CASCADE ON DELETE SET NULL
Referenced by:
    TABLE "categories" CONSTRAINT "categories_parent_category_id_fkey" FOREIGN KEY (parent_category_id) REFERENCES categories(id) ON UPDATE CASCADE ON DELETE SET NULL
    TABLE "topics" CONSTRAINT "topics_category_id_fkey" FOREIGN KEY (category_id) REFERENCES categories(id) ON UPDATE CASCADE ON DELETE RESTRICT
    TABLE "user_watched_categories" CONSTRAINT "user_watched_categories_category_id_fkey" FOREIGN KEY (category_id) REFERENCES categories(id) ON UPDATE CASCADE ON DELETE CASCADE
Triggers:
    categories_make_slug BEFORE INSERT OR UPDATE OF name, slug ON categories FOR EACH ROW EXECUTE PROCEDURE make_slug()

postgres=# \d groups
                      Table "public.groups"
        Column         |           Type           | Modifiers
-----------------------+--------------------------+------------------------
 name                  | text                     | not null
 slug                  | text                     | not null
 create_time           | timestamp with time zone | not null default now()
 user_title            | text                     |
 description           | text                     | not null
 icon                  | character varying(255)   |
 label_color           | character varying(255)   |
 cover_url             | character varying(2048)  |
 cover_thumb_url       | character varying(2048)  |
 cover_position        | double precision[]       |
 hidden                | boolean                  | not null default false
 system                | boolean                  | not null default false
 private               | boolean                  | not null default true
 disable_join_requests | boolean                  | not null default false
 user_title_enabled    | boolean                  | not null default false
Indexes:
    "groups_pkey" PRIMARY KEY, btree (name)
    "groups_slug" UNIQUE, btree (slug)
    "groups_create_time" btree (create_time)
Check constraints:
    "groups_slug_exists" CHECK (slug <> ''::text)
Referenced by:
    TABLE "user_groups" CONSTRAINT "user_groups_group_name_fkey" FOREIGN KEY (group_name) REFERENCES groups(name) ON UPDATE CASCADE ON DELETE CASCADE
    TABLE "users" CONSTRAINT "users_group_title_fkey" FOREIGN KEY (group_title) REFERENCES groups(name) ON UPDATE CASCADE ON DELETE SET NULL
Triggers:
    groups_make_slug BEFORE INSERT OR UPDATE OF name, slug ON groups FOR EACH ROW EXECUTE PROCEDURE make_slug()

postgres=# \d posts
                          Table "public.posts"
     Column     |           Type           | Modifiers
----------------+--------------------------+----------------------------------------------------
 id             | bigint                   | not null default nextval('posts_id_seq'::regclass)
 topic_id       | bigint                   | not null
 author_id      | bigint                   |
 author_name    | character varying(255)   |
 parent_post_id | bigint                   |
 content        | text                     | not null
 created_at     | timestamp with time zone | not null default now()
 ip             | inet[]                   |
 editor_id      | bigint                   |
 edited_at      | timestamp with time zone |
 deleted        | boolean                  | not null default false
Indexes:
    "posts_pkey" PRIMARY KEY, btree (id)
Foreign-key constraints:
    "posts_author_id_fkey" FOREIGN KEY (author_id) REFERENCES users(id) ON UPDATE CASCADE ON DELETE SET NULL
    "posts_editor_id_fkey" FOREIGN KEY (editor_id) REFERENCES users(id) ON UPDATE CASCADE ON DELETE SET NULL
    "posts_parent_post_id_fkey" FOREIGN KEY (parent_post_id) REFERENCES posts(id) ON UPDATE CASCADE ON DELETE SET NULL
    "posts_topic_id_fkey" FOREIGN KEY (topic_id) REFERENCES topics(id) ON UPDATE CASCADE ON DELETE CASCADE
Referenced by:
    TABLE "posts" CONSTRAINT "posts_parent_post_id_fkey" FOREIGN KEY (parent_post_id) REFERENCES posts(id) ON UPDATE CASCADE ON DELETE SET NULL
    TABLE "topics" CONSTRAINT "topics_main_post_id_fkey" FOREIGN KEY (main_post_id) REFERENCES posts(id) ON UPDATE CASCADE ON DELETE SET NULL
    TABLE "user_favorite_posts" CONSTRAINT "user_favorite_posts_post_id_fkey" FOREIGN KEY (post_id) REFERENCES posts(id) ON UPDATE CASCADE ON DELETE CASCADE
    TABLE "user_post_votes" CONSTRAINT "user_post_votes_post_id_fkey" FOREIGN KEY (post_id) REFERENCES posts(id) ON UPDATE CASCADE ON DELETE CASCADE

postgres=# \d topics
                         Table "public.topics"
    Column    |           Type           | Modifiers
--------------+--------------------------+-----------------------------------------------------
 id           | bigint                   | not null default nextval('topics_id_seq'::regclass)
 name         | text                     | not null
 slug         | text                     | not null
 category_id  | bigint                   | not null
 author_id    | bigint                   |
 main_post_id | bigint                   |
 created_at   | timestamp with time zone | not null default now()
 view_count   | bigint                   | not null default 0
 locked       | boolean                  | not null default false
 deleted      | boolean                  | not null default false
 pinned       | boolean                  | not null default false
Indexes:
    "topics_pkey" PRIMARY KEY, btree (id)
    "topics_slug" btree (slug)
Foreign-key constraints:
    "topics_author_id_fkey" FOREIGN KEY (author_id) REFERENCES users(id) ON UPDATE CASCADE ON DELETE SET NULL
    "topics_category_id_fkey" FOREIGN KEY (category_id) REFERENCES categories(id) ON UPDATE CASCADE ON DELETE RESTRICT
    "topics_main_post_id_fkey" FOREIGN KEY (main_post_id) REFERENCES posts(id) ON UPDATE CASCADE ON DELETE SET NULL
Referenced by:
    TABLE "posts" CONSTRAINT "posts_topic_id_fkey" FOREIGN KEY (topic_id) REFERENCES topics(id) ON UPDATE CASCADE ON DELETE CASCADE
    TABLE "user_watched_topics" CONSTRAINT "user_watched_topics_topic_id_fkey" FOREIGN KEY (topic_id) REFERENCES topics(id) ON UPDATE CASCADE ON DELETE CASCADE
Triggers:
    topics_make_slug BEFORE INSERT OR UPDATE OF name, slug ON topics FOR EACH ROW EXECUTE PROCEDURE make_slug()

postgres=# \d user_favorite_posts
              Table "public.user_favorite_posts"
    Column    |           Type           | Modifiers
--------------+--------------------------+------------------------
 user_id      | bigint                   | not null
 post_id      | bigint                   | not null
 favorited_at | timestamp with time zone | not null default now()
Indexes:
    "user_favorite_posts_pkey" PRIMARY KEY, btree (user_id, post_id)
Foreign-key constraints:
    "user_favorite_posts_post_id_fkey" FOREIGN KEY (post_id) REFERENCES posts(id) ON UPDATE CASCADE ON DELETE CASCADE
    "user_favorite_posts_user_id_fkey" FOREIGN KEY (user_id) REFERENCES users(id) ON UPDATE CASCADE ON DELETE CASCADE

postgres=# \d user_groups
                 Table "public.user_groups"
   Column   |           Type           | Modifiers
------------+--------------------------+------------------------
 user_id    | bigint                   | not null
 group_name | text                     | not null
 is_owner   | boolean                  | not null default false
 join_date  | timestamp with time zone | not null default now()
Indexes:
    "user_groups_pkey" PRIMARY KEY, btree (user_id, group_name)
Foreign-key constraints:
    "user_groups_group_name_fkey" FOREIGN KEY (group_name) REFERENCES groups(name) ON UPDATE CASCADE ON DELETE CASCADE
    "user_groups_user_id_fkey" FOREIGN KEY (user_id) REFERENCES users(id) ON UPDATE CASCADE ON DELETE CASCADE

postgres=# \d user_ips
                  Table "public.user_ips"
  Column   |           Type           | Modifiers
-----------+--------------------------+------------------------
 user_id   | bigint                   | not null
 ip        | inet                     | not null
 last_seen | timestamp with time zone | not null default now()
Indexes:
    "user_ips_pkey" PRIMARY KEY, btree (user_id, ip)
Foreign-key constraints:
    "user_ips_user_id_fkey" FOREIGN KEY (user_id) REFERENCES users(id) ON UPDATE CASCADE ON DELETE CASCADE

postgres=# \d user_post_votes
               Table "public.user_post_votes"
    Column    |           Type           | Modifiers
--------------+--------------------------+------------------------
 user_id      | bigint                   | not null
 post_id      | bigint                   | not null
 up           | boolean                  |
 last_changed | timestamp with time zone | not null default now()
Indexes:
    "user_post_votes_pkey" PRIMARY KEY, btree (user_id, post_id)
Foreign-key constraints:
    "user_post_votes_post_id_fkey" FOREIGN KEY (post_id) REFERENCES posts(id) ON UPDATE CASCADE ON DELETE CASCADE
    "user_post_votes_user_id_fkey" FOREIGN KEY (user_id) REFERENCES users(id) ON UPDATE CASCADE ON DELETE CASCADE

postgres=# \d user_watched_categories
           Table "public.user_watched_categories"
    Column    |           Type           | Modifiers
--------------+--------------------------+------------------------
 user_id      | bigint                   | not null
 category_id  | bigint                   | not null
 watched      | boolean                  |
 last_changed | timestamp with time zone | not null default now()
Indexes:
    "user_watched_categories_pkey" PRIMARY KEY, btree (user_id, category_id)
Foreign-key constraints:
    "user_watched_categories_category_id_fkey" FOREIGN KEY (category_id) REFERENCES categories(id) ON UPDATE CASCADE ON DELETE CASCADE
    "user_watched_categories_user_id_fkey" FOREIGN KEY (user_id) REFERENCES users(id) ON UPDATE CASCADE ON DELETE CASCADE

postgres=# \d user_watched_topics
             Table "public.user_watched_topics"
    Column    |           Type           | Modifiers
--------------+--------------------------+------------------------
 user_id      | bigint                   | not null
 topic_id     | bigint                   | not null
 watched      | boolean                  |
 last_changed | timestamp with time zone | not null default now()
Indexes:
    "user_watched_topics_pkey" PRIMARY KEY, btree (user_id, topic_id)
Foreign-key constraints:
    "user_watched_topics_topic_id_fkey" FOREIGN KEY (topic_id) REFERENCES topics(id) ON UPDATE CASCADE ON DELETE CASCADE
    "user_watched_topics_user_id_fkey" FOREIGN KEY (user_id) REFERENCES users(id) ON UPDATE CASCADE ON DELETE CASCADE

postgres=# \d user_watched_users
             Table "public.user_watched_users"
     Column     |           Type           | Modifiers
----------------+--------------------------+------------------------
 user_id        | bigint                   | not null
 target_user_id | bigint                   | not null
 watched        | boolean                  |
 last_changed   | timestamp with time zone | not null default now()
Indexes:
    "user_watched_users_pkey" PRIMARY KEY, btree (user_id, target_user_id)
Foreign-key constraints:
    "user_watched_users_target_user_id_fkey" FOREIGN KEY (target_user_id) REFERENCES users(id) ON UPDATE CASCADE ON DELETE CASCADE
    "user_watched_users_user_id_fkey" FOREIGN KEY (user_id) REFERENCES users(id) ON UPDATE CASCADE ON DELETE CASCADE

postgres=# \d users
                          Table "public.users"
      Column      |           Type           | Modifiers
------------------+--------------------------+----------------------------------------------------
 id               | bigint                   | not null default nextval('users_id_seq'::regclass)
 name             | character varying(255)   | not null
 slug             | character varying(255)   | not null
 password         | character varying(60)    |
 full_name        | text                     |
 email            | citext                   |
 email_confirmed  | boolean                  | not null default false
 banned           | boolean                  | not null default false
 ban_expiration   | timestamp with time zone |
 join_date        | timestamp with time zone | not null default now()
 last_online      | timestamp with time zone |
 picture          | character varying(2048)  |
 uploaded_picture | character varying(2048)  |
 cover_url        | character varying(2048)  |
 cover_position   | double precision[]       |
 group_title      | text                     |
 website          | character varying(2048)  |
 location         | text                     |
 signature        | text                     |
 about_me         | text                     |
 birthday         | date                     |
 profile_views    | bigint                   | not null default 0
Indexes:
    "users_pkey" PRIMARY KEY, btree (id)
    "users_email" UNIQUE, btree (email)
    "users_slug" UNIQUE, btree (slug)
    "users_join_date" btree (join_date)
Check constraints:
    "users_slug_exists" CHECK (slug::text <> ''::text)
Foreign-key constraints:
    "users_group_title_fkey" FOREIGN KEY (group_title) REFERENCES groups(name) ON UPDATE CASCADE ON DELETE SET NULL
Referenced by:
    TABLE "posts" CONSTRAINT "posts_author_id_fkey" FOREIGN KEY (author_id) REFERENCES users(id) ON UPDATE CASCADE ON DELETE SET NULL
    TABLE "posts" CONSTRAINT "posts_editor_id_fkey" FOREIGN KEY (editor_id) REFERENCES users(id) ON UPDATE CASCADE ON DELETE SET NULL
    TABLE "topics" CONSTRAINT "topics_author_id_fkey" FOREIGN KEY (author_id) REFERENCES users(id) ON UPDATE CASCADE ON DELETE SET NULL
    TABLE "user_favorite_posts" CONSTRAINT "user_favorite_posts_user_id_fkey" FOREIGN KEY (user_id) REFERENCES users(id) ON UPDATE CASCADE ON DELETE CASCADE
    TABLE "user_groups" CONSTRAINT "user_groups_user_id_fkey" FOREIGN KEY (user_id) REFERENCES users(id) ON UPDATE CASCADE ON DELETE CASCADE
    TABLE "user_ips" CONSTRAINT "user_ips_user_id_fkey" FOREIGN KEY (user_id) REFERENCES users(id) ON UPDATE CASCADE ON DELETE CASCADE
    TABLE "user_post_votes" CONSTRAINT "user_post_votes_user_id_fkey" FOREIGN KEY (user_id) REFERENCES users(id) ON UPDATE CASCADE ON DELETE CASCADE
    TABLE "user_watched_categories" CONSTRAINT "user_watched_categories_user_id_fkey" FOREIGN KEY (user_id) REFERENCES users(id) ON UPDATE CASCADE ON DELETE CASCADE
    TABLE "user_watched_topics" CONSTRAINT "user_watched_topics_user_id_fkey" FOREIGN KEY (user_id) REFERENCES users(id) ON UPDATE CASCADE ON DELETE CASCADE
    TABLE "user_watched_users" CONSTRAINT "user_watched_users_target_user_id_fkey" FOREIGN KEY (target_user_id) REFERENCES users(id) ON UPDATE CASCADE ON DELETE CASCADE
    TABLE "user_watched_users" CONSTRAINT "user_watched_users_user_id_fkey" FOREIGN KEY (user_id) REFERENCES users(id) ON UPDATE CASCADE ON DELETE CASCADE
Triggers:
    users_make_slug BEFORE INSERT OR UPDATE OF name, slug ON users FOR EACH ROW EXECUTE PROCEDURE make_slug()
```
-
@ben_lubar said in @ben_lubar tries to parse <del>ServerCooties</del> data:
I cannot parse what you meant.
I haven't got the time or extra brainpower while I'm at work to expend serious thought on it. Tonight or this weekend I bet I can make it run better, though. Hit me up if you see me on GW2.
-
postgres=# select case when join_date < timestamp '2014-02-03T11:33:49' then 'Community Server'
                       when join_date < timestamp '2016-03-19T15:00:00' then 'Discourse'
                       else 'NodeBB' end era,
                  extract(year from join_date) as year,
                  count(*)
           from users
           left join posts on users.id = posts.author_id
           left join user_post_votes on users.id = user_post_votes.user_id
           where posts.id is null and user_post_votes.user_id is null
           group by era, year
           order by year asc;
       era        | year | count
------------------+------+-------
 Community Server | 2004 |   169
 Community Server | 2005 |  1935
 Community Server | 2006 |  2671
 Community Server | 2007 |  2218
 Community Server | 2008 |  2305
 Community Server | 2009 |  4565
 Community Server | 2010 | 33301
 Community Server | 2011 | 38782
 Community Server | 2012 |  7794
 Community Server | 2013 | 36990
 Community Server | 2014 |   303
 Discourse        | 2014 |  2284
 Discourse        | 2015 |   467
 Discourse        | 2016 |    59
 NodeBB           | 2016 |    92
(15 rows)
-
@ben_lubar Are you counting "# of unliked posts by year"? "# of likes orphaned because the post was deleted"?
-
@FrostCat no, I'm counting the number of posts a user has made from a year before a given day to that day.
-
@ben_lubar said in @ben_lubar tries to parse <del>ServerCooties</del> data:
a user
I didn't see a user ID in the sql, but whatever. I figured my second question would be some kind of clue to the nature of my post, but maybe @blakeyrat's borrowed your shoulder aliens.
-
@ben_lubar If you're writing loops in SQL, you're doing it wrong.
-
@ben_lubar said in @ben_lubar tries to parse <del>ServerCooties</del> data:
no, I'm counting the number of posts a user has made from a year before a given day to that day.
That might be a bit tricky. Give me a bit.
-
@ben_lubar said in @ben_lubar tries to parse <del>ServerCooties</del> data:
@FrostCat no, I'm counting the number of posts a user has made from a year before a given day to that day.
Ugh...did you look at the spoon (and related) badge query from discourse? NB: I haven't looked seriously at your query because you put it all in one line, and like blakey, ain't nobody got time for that.
-
Here:
select p.PostDate, u.UserName, topTen.PostCount
from Post p
join (
    select lastPost.PostDate as 'LastPostDate',
           allPost.UserID as 'UserID',
           count(allPost.PostID) as 'PostCount'
    from Post lastPost
    join Post allPost on allPost.PostDate > dateadd( yy, -1, lastPost.PostDate )
    group by lastPost.PostDate, allPost.UserID
    --order by lastPost.PostDate desc, count( allPost.PostID ) desc
) topTen on topTen.LastPostDate = p.PostDate
join [User] u on u.UserID = topTen.UserID
order by p.PostDate desc, topTen.PostCount desc
I did indeed need to use a subquery, alas. (If you wanted the whole list for each date, not just the top 10, I'm 99.9% sure you can write this without a subquery. But you'd be sending a shitload more data over the wire.)
Considerations:
- This query will not "fill in blanks" (if there's a day with zero posts, it won't show up in the results. Posts near the start of the timeline won't have the full 10 people listed, etc.) That's tough to fix without a data warehouse-type "date" table.
- This query will not round to the nearest day -- my sample data happened to all be at timecode 00:00:00, but in real life data, of course, you want to use datepart or equivalent to ensure the "group by"s and "join"s happen only on dates and not times. In T-SQL you can cast to Date instead of DateTime. Not sure how you do it in Postgres.
- This query is in T-SQL, no guarantees it translates into Postgres SQL.
- The commented-out "order by" is helpful when trying to debug the inner query, but obviously useless when you're using it as an inner query. That's why it's commented out. Why make the DB engine sort when it doesn't need to?
- "allPost" is a badly-named table alias. It's "all posts in the year preceding lastPost's date". But that's pretty long.
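The "group only on dates, not times" point above is the key trick: in Postgres you'd cast with created_at::date, and the same idea can be sketched with Python's stdlib sqlite3 (table name and sample data are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (author_id INTEGER, created_at TEXT)")
conn.executemany(
    "INSERT INTO posts VALUES (?, ?)",
    [(1, "2016-07-09 08:15:00"), (1, "2016-07-09 19:46:18"), (2, "2016-07-10 11:04:40")],
)
# date() truncates the timestamp to a day, so both of author 1's
# posts land in the same group regardless of their times.
rows = conn.execute(
    "SELECT date(created_at) AS day, author_id, count(*) "
    "FROM posts GROUP BY day, author_id ORDER BY day, author_id"
).fetchall()
print(rows)  # [('2016-07-09', 1, 2), ('2016-07-10', 2, 1)]
```

Without the date() truncation, each distinct timestamp would be its own group and every count would be 1.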
-
@blakeyrat I don't know SQL (or T-SQL) as well as I'd like. What's the difference[1] between "from Post lastPost" and "join [User] u"? I'd assumed lastPost and u are tablename aliases (don't know the right term), but then why does one have [] and the other not?

[1] regarding names, not the sql clause "from" vs "join".
-
@FrostCat That's (square brackets) how you use a keyword as a name in sql server.
-
@FrostCat said in @ben_lubar tries to parse <del>ServerCooties</del> data:
What's the difference[1] between "from Post lastPost" and "join [User] u" ? I'd assumed lastPost and u are tablename aliases (don't know the right term), but then why does one have [] and the other not?
"User" is a reserved keyword in SQL. The square braces indicate I want to use the User database object, not the reserved keyword.

My demo database I did in like 45 seconds, and I was too lazy once I realized "User" was a reserved keyword to switch it to "Users", which is the normal solution to this problem.
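For what it's worth, SQLite accepts both quoting styles, so the distinction can be demonstrated with Python's stdlib sqlite3 (SQLite doesn't actually reserve USER, but the quoting forms parse the same way):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "user" (standard double quotes) and [user] (T-SQL-style brackets)
# both escape the identifier; SQLite accepts either form for
# compatibility with the SQL standard and with SQL Server/Access.
conn.execute('CREATE TABLE "user" (id INTEGER, name TEXT)')
conn.execute("INSERT INTO [user] VALUES (1, 'ben')")
rows = conn.execute('SELECT name FROM "user"').fetchall()
print(rows)  # [('ben',)]
```

Double quotes are the portable choice; square brackets are a T-SQL-ism.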
-
@blakeyrat said in @ben_lubar tries to parse <del>ServerCooties</del> data:
join Post allPost on allPost.PostDate > dateadd( yy, -1, lastPost.PostDate )
Seems like this should be something like:
join Post allPost on allPost.PostDate between dateadd( yy, -1, lastPost.PostDate ) and lastPost.PostDate
Or you're getting stuff in the future, too.
-
@boomzilla Yup, I think you're right.
select p.PostDate, u.UserName, topTen.PostCount
from Post p
join (
    select lastPost.PostDate as 'LastPostDate',
           allPost.UserID as 'UserID',
           count(allPost.PostID) as 'PostCount'
    from Post lastPost
    join Post allPost on allPost.PostDate between dateadd( yy, -1, lastPost.PostDate ) and lastPost.PostDate
    group by lastPost.PostDate, allPost.UserID
    --order by lastPost.PostDate desc, count( allPost.PostID ) desc
) topTen on topTen.LastPostDate = p.PostDate
join [User] u on u.UserID = topTen.UserID
order by p.PostDate desc, topTen.PostCount desc
Good catch. SQL's tricky.
To pontificate a bit, the trick here is to join a table to itself. Something a lot of SQL neophytes don't realize is possible/useful, but once you learn the trick you can solve millions of problems with it. There ain't nothing in the SQL spec saying you can't have 1, 2, or even 50 copies of the Post table in your joins.
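The self-join trick can be shown on a toy data set with Python's stdlib sqlite3 (table name and rows are made up; the join condition mirrors the trailing-year window above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE post (user_id INTEGER, post_date TEXT)")
conn.executemany("INSERT INTO post VALUES (?, ?)", [
    (1, "2015-01-01"), (1, "2015-06-01"), (1, "2016-01-01"),
    (2, "2016-01-02"),
])
# Two copies of the same table: lp supplies the "as of" dates,
# ap supplies the posts counted in the year ending on each date.
rows = conn.execute("""
    SELECT lp.post_date, ap.user_id, count(*) AS post_count
    FROM post lp
    JOIN post ap
      ON ap.post_date BETWEEN date(lp.post_date, '-1 year') AND lp.post_date
    GROUP BY lp.post_date, ap.user_id
    ORDER BY lp.post_date, ap.user_id
""").fetchall()
for r in rows:
    print(r)
```

Note that duplicate dates in lp would double-count; with a real posts table you'd drive lp from a list of distinct dates.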
-
postgres=# explain select p.created_at::date, u.name, tt.count
           from posts p
           join (
               select lp.created_at::date last_post_date,
                      ap.author_id user_id,
                      count(ap.id) post_count
               from posts lp
               join posts ap on ap.created_at::date between lp.created_at::date - interval '1 year' and lp.created_at::date
               group by last_post_date, user_id
           ) tt on tt.last_post_date = p.created_at::date
           join users u on u.id = tt.user_id
           order by p.created_at::date desc, tt.post_count desc;
ERROR:  column "p.created_at" must appear in the GROUP BY clause or be used in an aggregate function
LINE 1: explain select p.created_at::date, u.name, tt.count from pos...
                       ^
Postgres apparently doesn't like GROUP BY inside a join.
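The query at the top of the thread already sidesteps this with a window function (rank() over a partition), which Postgres supports natively and SQLite has since 3.25. A minimal sketch of that approach with Python's stdlib sqlite3, using a made-up pre-aggregated counts table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE counts (day TEXT, author_id INTEGER, n INTEGER)")
conn.executemany("INSERT INTO counts VALUES (?, ?, ?)", [
    ("2016-07-09", 1, 5), ("2016-07-09", 2, 9), ("2016-07-09", 3, 9),
    ("2016-07-10", 1, 4),
])
# rank() restarts for each day and gives ties the same rank, so a
# "WHERE rnk <= 10" filter would keep everyone tied at the cutoff.
rows = conn.execute("""
    SELECT day, author_id,
           rank() OVER (PARTITION BY day ORDER BY n DESC) AS rnk
    FROM counts
    ORDER BY day, rnk, author_id
""").fetchall()
print(rows)
```

Unlike the join-back-to-the-table approach, the window function never forces the outer query into a GROUP BY of its own.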
-
@ben_lubar Ben L, if you post another SQL query on a single line I will hunt you down and kill you, I am not kidding.
Your keyboard has an "enter" key, please use it.
@ben_lubar said in @ben_lubar tries to parse <del>ServerCooties</del> data:
Postgres apparently doesn't like GROUP BY inside a join.
I can't help you with that. I use a DBMS that doesn't suck shit.
EDIT: this StackOverflow question contains a working subquery with a "group by" in it:

select n1.name, n1.author_id, cast(count_1 as numeric)/total_count
from (
    select id, name, author_id, count(1) as count_1
    from names
    group by id, name, author_id
) n1
inner join (
    select author_id, count(1) as total_count
    from names
    group by author_id
) n2 on (n2.author_id = n1.author_id)
So you can do it. I'm not sure what's wrong with the one-line monstrosity you posted.
-
FYI, here's the query for the :
WITH exclusions AS (
    /* Which topics to exclude from counters */
    SELECT user_id, id, topic_id, post_number
    FROM posts
    WHERE raw LIKE '%c09fa970-5a9a-11e4-8ed6-0800200c9a66%'
    AND user_id IN (
        SELECT gu.user_id FROM group_users gu
        WHERE group_id IN (
            SELECT g.id FROM groups g WHERE g.name IN ('badge_granters_admin')
        )
    )
),
LastMonth AS (
    /* Count eligible posts from last month */
    SELECT row_number() OVER (ORDER BY count(*) DESC, bp.user_id),
           u.username, bp.user_id, count(*)
    FROM badge_posts bp
    JOIN users u on u.id=bp.user_id
    AND bp.user_id NOT IN (
        /* ignore bots */
        SELECT gu.user_id FROM group_users gu
        WHERE group_id IN (
            SELECT g.id FROM groups g WHERE g.name IN ('bots')
        )
    )
    WHERE topic_id NOT IN (
        /* short topics */
        SELECT topic_id FROM badge_posts GROUP BY topic_id HAVING count(topic_id) < 10
    )
    AND topic_id NOT IN (
        /* Ineligible topics */
        SELECT topic_id FROM exclusions
    )
    AND bp.created_at > CURRENT_DATE - INTERVAL '1 month'
    GROUP BY u.username, bp.user_id
    HAVING count(*) > 1
    ORDER BY count(*) DESC, bp.user_id
),
TotalUsers AS (
    SELECT max(row_number) from LastMonth
)
SELECT username, user_id, row_number, (row_number*100/TotalUsers.max), count, CURRENT_DATE granted_at
FROM LastMonth, TotalUsers
WHERE row_number = 1
-
@blakeyrat said in @ben_lubar tries to parse <del>ServerCooties</del> data:
Your keyboard has an "enter" key, please use it.
explain select p.created_at::date, u.name, tt.count
from posts p
join (
    select lp.created_at::date last_post_date,
           ap.author_id user_id,
           count(ap.id) post_count
    from posts lp
    join posts ap on ap.created_at::date between lp.created_at::date - interval '1 year' and lp.created_at::date
    group by last_post_date, user_id
) tt on tt.last_post_date = p.created_at::date
join users u on u.id = tt.user_id
group by p.created_at::date, u.name, tt.post_count
order by p.created_at::date desc, tt.post_count desc;
                                                                             QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=3904072989514.71..4088823725666.74 rows=7390029446081 width=71)
   Group Key: ((p.created_at)::date), tt.post_count, u.name
   ->  Sort  (cost=3904072989514.71..3922548063129.91 rows=7390029446081 width=71)
         Sort Key: ((p.created_at)::date) DESC, tt.post_count DESC, u.name
         ->  Merge Join  (cost=54950161682.14..202754857645.02 rows=7390029446081 width=71)
               Merge Cond: (((p.created_at)::date) = tt.last_post_date)
               ->  Sort  (cost=163649.36..165902.26 rows=901161 width=8)
                     Sort Key: ((p.created_at)::date)
                     ->  Seq Scan on posts p  (cost=0.00..62194.61 rows=901161 width=8)
               ->  Materialize  (cost=54949998032.79..54958198597.89 rows=1640113020 width=67)
                     ->  Sort  (cost=54949998032.79..54954098315.34 rows=1640113020 width=67)
                           Sort Key: tt.last_post_date
                           ->  Hash Join  (cost=53432873040.44..54429888270.11 rows=1640113020 width=67)
                                 Hash Cond: (tt.user_id = u.id)
                                 ->  Subquery Scan on tt  (cost=53432866404.24..54372092444.88 rows=1640113020 width=64)
                                       ->  GroupAggregate  (cost=53432866404.24..54355691314.68 rows=1640113020 width=24)
                                             Group Key: ((lp.created_at)::date), ap.author_id
                                             ->  Sort  (cost=53432866404.24..53658447278.66 rows=90232349769 width=24)
                                                   Sort Key: ((lp.created_at)::date), ap.author_id
                                                   ->  Nested Loop  (cost=0.00..27761429835.18 rows=90232349769 width=24)
                                                         Join Filter: (((ap.created_at)::date <= (lp.created_at)::date) AND ((ap.created_at)::date >= ((lp.created_at)::date - '1 year'::interval)))
                                                         ->  Seq Scan on posts ap  (cost=0.00..62194.61 rows=901161 width=24)
                                                         ->  Materialize  (cost=0.00..70221.42 rows=901161 width=8)
                                                               ->  Seq Scan on posts lp  (cost=0.00..62194.61 rows=901161 width=8)
                                 ->  Hash  (cost=4033.98..4033.98 rows=141698 width=19)
                                       ->  Seq Scan on users u  (cost=0.00..4033.98 rows=141698 width=19)
(26 rows)
rows=7390029446081
oh god
-
@ben_lubar It should return a maximum of 10 rows per day for each day that had at least one post.
EDIT: oh I see the problem. My inner table NAMED top 10 at some point lost its "top 10". Oops.
-
                                                                                QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=54391206248.86..54391207375.31 rows=45058 width=71)
   Group Key: ((p.created_at)::date), tt.post_count, u.name
   ->  Sort  (cost=54391206248.86..54391206361.51 rows=45058 width=71)
         Sort Key: ((p.created_at)::date) DESC, tt.post_count DESC, u.name
         ->  Hash Join  (cost=54391133650.85..54391200914.49 rows=45058 width=71)
               Hash Cond: ((p.created_at)::date = tt.last_post_date)
               ->  Seq Scan on posts p  (cost=0.00..62194.61 rows=901161 width=8)
               ->  Hash  (cost=54391133650.73..54391133650.73 rows=10 width=67)
                     ->  Nested Loop  (cost=54391133567.67..54391133650.73 rows=10 width=67)
                           ->  Subquery Scan on tt  (cost=54391133567.38..54391133567.50 rows=10 width=64)
                                 ->  Limit  (cost=54391133567.38..54391133567.40 rows=10 width=24)
                                       ->  Sort  (cost=54391133567.38..54395233849.93 rows=1640113020 width=24)
                                             Sort Key: ((lp.created_at)::date) DESC, (count(ap.id)) DESC
                                             ->  GroupAggregate  (cost=53432866404.24..54355691314.68 rows=1640113020 width=24)
                                                   Group Key: ((lp.created_at)::date), ap.author_id
                                                   ->  Sort  (cost=53432866404.24..53658447278.66 rows=90232349769 width=24)
                                                         Sort Key: ((lp.created_at)::date) DESC, ap.author_id
                                                         ->  Nested Loop  (cost=0.00..27761429835.18 rows=90232349769 width=24)
                                                               Join Filter: (((ap.created_at)::date <= (lp.created_at)::date) AND ((ap.created_at)::date >= ((lp.created_at)::date - '1 year'::interval)))
                                                               ->  Seq Scan on posts ap  (cost=0.00..62194.61 rows=901161 width=24)
                                                               ->  Materialize  (cost=0.00..70221.42 rows=901161 width=8)
                                                                     ->  Seq Scan on posts lp  (cost=0.00..62194.61 rows=901161 width=8)
                           ->  Index Scan using users_pkey on users u  (cost=0.29..8.31 rows=1 width=19)
                                 Index Cond: (id = tt.user_id)
(24 rows)
nested nested nested loops are scary
-
@ben_lubar said in @ben_lubar tries to parse <del>ServerCooties</del> data:
oh god
Your idea to pull up 10 users for each day for the last 12 years or whatever based on what we all did for the year previous to that was never going to be a reasonable query. Why do you want this?
Maybe break it down by month? Make a table and put each user's post count by month in there. Aggregating that would be simple. This is probably an even better idea if you really really really want to go by day.
But either way you'll probably want / need to break the insert up, or the commit / redo stuff will kill you. Literally. Dead.
-
@boomzilla said in @ben_lubar tries to parse <del>ServerCooties</del> data:
never going to be a reasonable query
THIS DATA IS TOO BIG
-
@ben_lubar Yes. You can't eat the elephant all at once.
-
@ben_lubar Hey you asked us for help. If you think you can write it better, knock yourself out.
-
@ben_lubar Set up a script to do a day at a time, or something, and count all the posts for that day per user. So each day is done inside its own transaction. This is going to suck no matter what, but when you get done, you'll be able to do a lot of interesting things with that data very quickly.
Also...don't create any indexes or foreign keys on your table while you're filling it up. Do that afterwards.
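The advice above (one day per transaction, indexes only after the load) can be sketched with Python's stdlib sqlite3; table names, columns, and data here are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (author_id INTEGER, created_at TEXT)")
conn.executemany("INSERT INTO posts VALUES (?, ?)", [
    (1, "2016-07-09 08:00:00"), (1, "2016-07-09 09:00:00"),
    (2, "2016-07-10 11:00:00"),
])
# Summary table, deliberately created WITHOUT indexes for the load.
conn.execute("CREATE TABLE daily_counts (day TEXT, author_id INTEGER, n INTEGER)")

days = [r[0] for r in conn.execute(
    "SELECT DISTINCT date(created_at) FROM posts ORDER BY 1")]
for day in days:
    # One transaction per day keeps each unit of work small.
    with conn:
        conn.execute(
            "INSERT INTO daily_counts "
            "SELECT date(created_at), author_id, count(*) FROM posts "
            "WHERE date(created_at) = ? GROUP BY author_id", (day,))

# Index added only after the bulk load is done, per the advice above.
conn.execute("CREATE INDEX daily_counts_day ON daily_counts (day)")
result = conn.execute(
    "SELECT * FROM daily_counts ORDER BY day, author_id").fetchall()
print(result)
```

Once daily_counts exists, the "top posters over a trailing year" question becomes a sum over at most 365 small rows per user instead of a self-join over every post.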
-
@blakeyrat said in @ben_lubar tries to parse <del>ServerCooties</del> data:
User's a reserved keyword in SQL. The square braces indicate I want to use the User database object, not the reserved keyword
Ah. That's kind of like how in Access, you could have a field with a space in the name, as long as you referred to it with brackets when you used it.
-
@boomzilla said in @ben_lubar tries to parse <del>ServerCooties</del> data:
count all the posts for that day per user.
Quick question, how many posts are there, total?
define temp-table tt
    field user  as int
    field date  as date
    field count as int
    .

for each posts no-lock:
    find first tt where tt.user = posts.user_id and tt.date = posts.date no-error.
    if not available tt then do:
        create tt.
        assign tt.user = posts.user_id
               tt.date = posts.date.
    end.
    tt.count = tt.count + 1.
end.
at the end of this, the temp-table is full of records that show how many posts every user had on every day in which that user had posts. That's likely to be a much nicer data set with which to work.
One pass. No sorting. If you need a sort of some kind, put an appropriate index on the temp table. The query takes about as long to run as "select user_id from posts".

Want to know the top ten posters for each day? "select top 10 user from posts where date=whatever order by count". Only with the right syntax.
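In Postgres (and SQLite) "the right syntax" for TOP 10 is ORDER BY ... LIMIT 10. A quick sketch against a made-up pre-aggregated counts table, using Python's stdlib sqlite3:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_counts (day TEXT, user_id INTEGER, n INTEGER)")
conn.executemany("INSERT INTO daily_counts VALUES (?, ?, ?)", [
    ("2016-07-09", 1, 5), ("2016-07-09", 2, 9), ("2016-07-09", 3, 7),
])
# T-SQL's "select top 10 ..." becomes "... ORDER BY n DESC LIMIT 10".
rows = conn.execute(
    "SELECT user_id, n FROM daily_counts WHERE day = ? "
    "ORDER BY n DESC LIMIT 10", ("2016-07-09",)).fetchall()
print(rows)  # [(2, 9), (3, 7), (1, 5)]
```

With an index on (day, n) this is a cheap lookup per day rather than a scan.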
-
@FrostCat said in @ben_lubar tries to parse <del>ServerCooties</del> data:
@blakeyrat said in @ben_lubar tries to parse <del>ServerCooties</del> data:
User's a reserved keyword in SQL. The square braces indicate I want to use the User database object, not the reserved keyword
Ah. That's kind of like how in Access, you could have a field with a space in the name, as long as you referred to it with brackets when you used it.
T-SQL does allow spaces in the name if you delimit it with either square brackets or double quotes.
PL/SQL allows aliases to have spaces inside double quotes. Square brackets apparently indicate something involving expressions.
-
@djls45 said in @ben_lubar tries to parse <del>ServerCooties</del> data:
T-SQL does allow spaces
I was just trying to come up with an analogy. In Blakey's usage, I guess the brackets escape the normal meaning of the word.
-
what db is this?
-
@clippy said in @ben_lubar tries to parse <del>ServerCooties</del> data:
what db is this?
postgresql
-
@djls45 said in @ben_lubar tries to parse <del>ServerCooties</del> data:
PL/SQL allows aliases to have spaces inside double quotes.
That's the syntax that the SQL standard specifies. Apparently MSSQL also supports it (as well as the bracketed form).
-
My progress on the !!SCIENCE!! script is here: https://gist.github.com/BenLubar/98fe068162c5ae9b11659b7029da3a76
-
2016/07/09 19:46:18 float64 cannot be converted to int64 without losing precision: NaN
main.Int64
    /home/ben/go/src/gist.github.com/98fe068162c5ae9b11659b7029da3a76/main.go:1491
main.HandleTopic.func1
    /home/ben/go/src/gist.github.com/98fe068162c5ae9b11659b7029da3a76/main.go:1188
main.ForSortedSet
    /home/ben/go/src/gist.github.com/98fe068162c5ae9b11659b7029da3a76/main.go:1561
main.HandleTopic
    /home/ben/go/src/gist.github.com/98fe068162c5ae9b11659b7029da3a76/main.go:1195
main.ForSortedSet
    /home/ben/go/src/gist.github.com/98fe068162c5ae9b11659b7029da3a76/main.go:1561
main.main
    /home/ben/go/src/gist.github.com/98fe068162c5ae9b11659b7029da3a76/main.go:110
runtime.main
    /home/ben/.golang/src/runtime/proc.go:183
runtime.goexit
    /home/ben/.golang/src/runtime/asm_amd64.s:2086
main.HandleTopic.func1
    /home/ben/go/src/gist.github.com/98fe068162c5ae9b11659b7029da3a76/main.go:1190: position
main.ForSortedSet
    /home/ben/go/src/gist.github.com/98fe068162c5ae9b11659b7029da3a76/main.go:1561: ForSortedSet("tid:18672:bookmarks", ("21", NaN))
main.HandleTopic
    /home/ben/go/src/gist.github.com/98fe068162c5ae9b11659b7029da3a76/main.go:1196: bookmarks
main.ForSortedSet
    /home/ben/go/src/gist.github.com/98fe068162c5ae9b11659b7029da3a76/main.go:1561: ForSortedSet("topics:tid", ("18672", 9.805716e+11))
exit status 1
-
2016/07/10 11:04:40 topic:20281:"Re: the deleted post thread"
2016/07/10 11:04:40 topic:20282:"Re: the deleted post thread"
2016/07/10 11:04:40 topic:20283:"Fork + github + npm"
2016/07/10 11:04:40 topic:20284:"Bot User Agents"
2016/07/10 11:04:40 topic:20285:"Why is polygamy illegal?"
2016/07/10 11:04:40 topic:20286:"GOOOOOGGGGGLLLLLEEEE"
2016/07/10 11:04:40 pq: invalid byte sequence for encoding "UTF8": 0x00
main.HandleTopic
    /home/ben/go/src/gist.github.com/98fe068162c5ae9b11659b7029da3a76/main.go:1177: insert topic into database
main.ForSortedSet
    /home/ben/go/src/gist.github.com/98fe068162c5ae9b11659b7029da3a76/main.go:1565: ForSortedSet("topics:tid", ("20287", 1.466585097322e+12))
exit status 1
Goddamnit @aliceif
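That "invalid byte sequence for encoding UTF8: 0x00" error is Postgres rejecting a NUL byte: Postgres text values can contain any valid UTF-8 except 0x00. The thread's script is Go, but the usual fix in any language is to strip NULs from user-supplied strings before inserting; a hypothetical sketch in Python (the sanitize helper is made up, not from the gist):

```python
# Postgres TEXT columns reject NUL (0x00) bytes, so strip them
# from any user-supplied strings before handing them to an INSERT.
def sanitize(s: str) -> str:
    return s.replace("\x00", "")

title = "Fork + github + npm\x00"
clean = sanitize(title)
assert "\x00" not in clean
print(clean)  # Fork + github + npm
```

Whether silently dropping the byte is acceptable depends on the data; the alternative is rejecting the record and logging it, which is effectively what the import script did here.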