Discourse SQL 1 minute query...
-
Continuing the discussion from Sockbot needs lessons in etiquette:
Background - trying to come up with some new badges for reading (akin to the 2^n series for writing), and found that just counting publicly visible posts takes my query over a minute.
Anyone able to offer any suggestions on how to speed up this query, or am I basically stuffed?
#SELECT count(1)
+FROM posts p
+JOIN post_timings pt ON
+    pt.topic_id = p.topic_id AND
+    pt.post_number = p.post_number
+;
  count
---------
 4796682
(1 row)

Time: 49197.787 ms
Database is postgres; the two tables concerned:
#\d posts
                                     Table "public.posts"
         Column          |            Type             |              Modifiers
-------------------------+-----------------------------+----------------------------------------------------
 id                      | integer                     | not null default nextval('posts_id_seq'::regclass)
 user_id                 | integer                     |
 topic_id                | integer                     | not null
 post_number             | integer                     | not null
 raw                     | text                        | not null
 cooked                  | text                        | not null
 created_at              | timestamp without time zone | not null
 updated_at              | timestamp without time zone | not null
 reply_to_post_number    | integer                     |
 reply_count             | integer                     | not null default 0
 quote_count             | integer                     | not null default 0
 deleted_at              | timestamp without time zone |
 off_topic_count         | integer                     | not null default 0
 like_count              | integer                     | not null default 0
 incoming_link_count     | integer                     | not null default 0
 bookmark_count          | integer                     | not null default 0
 avg_time                | integer                     |
 score                   | double precision            |
 reads                   | integer                     | not null default 0
 post_type               | integer                     | not null default 1
 vote_count              | integer                     | not null default 0
 sort_order              | integer                     |
 last_editor_id          | integer                     |
 hidden                  | boolean                     | not null default false
 hidden_reason_id        | integer                     |
 notify_moderators_count | integer                     | not null default 0
 spam_count              | integer                     | not null default 0
 illegal_count           | integer                     | not null default 0
 inappropriate_count     | integer                     | not null default 0
 last_version_at         | timestamp without time zone | not null
 user_deleted            | boolean                     | not null default false
 reply_to_user_id        | integer                     |
 percent_rank            | double precision            | default 1.0
 notify_user_count       | integer                     | not null default 0
 like_score              | integer                     | not null default 0
 deleted_by_id           | integer                     |
 edit_reason             | character varying(255)      |
 word_count              | integer                     |
 version                 | integer                     | not null default 1
 cook_method             | integer                     | not null default 1
 wiki                    | boolean                     | not null default false
 baked_at                | timestamp without time zone |
 baked_version           | integer                     |
 hidden_at               | timestamp without time zone |
 self_edits              | integer                     | not null default 0
 reply_quoted            | boolean                     | not null default false
 via_email               | boolean                     | not null default false
Indexes:
    "posts_pkey" PRIMARY KEY, btree (id)
    "index_posts_on_topic_id_and_post_number" UNIQUE, btree (topic_id, post_number)
    "idx_posts_created_at_topic_id" btree (created_at, topic_id) WHERE deleted_at IS NULL
    "idx_posts_user_id_deleted_at" btree (user_id) WHERE deleted_at IS NULL
    "index_posts_on_reply_to_post_number" btree (reply_to_post_number)

#\d post_timings
 Table "public.post_timings"
   Column    |  Type   | Modifiers
-------------+---------+-----------
 topic_id    | integer | not null
 post_number | integer | not null
 user_id     | integer | not null
 msecs       | integer | not null
Indexes:
    "post_timings_unique" UNIQUE, btree (topic_id, post_number, user_id)
    "index_post_timings_on_user_id" btree (user_id)
    "post_timings_summary" btree (topic_id, post_number)
And the idiocy:
#select count(1) from post_timings;
  count
---------
 4805859
(1 row)

Time: 3498.096 ms

#select count(1) from posts;
 count
--------
 105585
(1 row)

Time: 207.639 ms
Short of trying to find a proxy for post_timings, I'm just making sure I haven't missed anything obvious...
-
Sounds like actual I/O may be the bottleneck here. Just a lot of data to deal with. A materialized view might be the way to go. Of course, you probably can't do that, and I'm not familiar enough with them in postgres to know if it would be a reasonable solution.
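For reference, a minimal sketch of what that might look like - hedged heavily, since I haven't tried this against a Discourse schema, and the view name read_counts is made up:
[code]
-- Sketch only: precompute the per-user read counts once,
-- then let the badge queries hit the small result table.
CREATE MATERIALIZED VIEW read_counts AS
SELECT pt.user_id, count(*) AS posts_read
FROM posts p
JOIN post_timings pt ON
    pt.topic_id = p.topic_id AND
    pt.post_number = p.post_number
GROUP BY pt.user_id;

-- Badge queries become trivial lookups...
SELECT posts_read FROM read_counts WHERE user_id = 20;

-- ...and a scheduled job pays the expensive join once per refresh.
REFRESH MATERIALIZED VIEW read_counts;
[/code]
(Needs PostgreSQL 9.3 or later, if memory serves.)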
-
Part of your problem is that there are multiple entries in the post_timings table for each post. It might help speed up your query if you add a post_timings.user_id condition to the query.
Looking at your query results, it appears that there may be orphaned records in the post_timings table. Basically, records for posts that were read and later deleted.
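Something like this is what I have in mind - a sketch against the schema posted above, with user id 20 picked arbitrarily as an example:
[code]
-- Per-user count: the user_id condition lets the planner use the
-- post_timings_unique index instead of touching every timing row.
SELECT count(1)
FROM posts p
JOIN post_timings pt ON
    pt.topic_id = p.topic_id AND
    pt.post_number = p.post_number
WHERE pt.user_id = 20;

-- And to hunt for the suspected orphans: timings with no matching post.
SELECT count(1)
FROM post_timings pt
LEFT JOIN posts p ON
    p.topic_id = pt.topic_id AND
    p.post_number = pt.post_number
WHERE p.id IS NULL;
[/code]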
-
select count(1) from post_timings;
Disclaimer: I am not as familiar with postgres as I am with other databases.
This is going to force a full table scan, which will be I/O bound.
You need to get the query to use an index by introducing a where clause. If you can, then you can count index nodes, which will be considerably faster.

EDIT: After doing some research, it appears I am not correct:
PostgreSQL will still need to read the resulting rows to verify that they exist; other database systems may only need to reference the index in this situation.
Taken from: http://wiki.postgresql.org/wiki/Slow_Counting
-
Count is slow in postgres. Things might improve once you add more conditions.
Do you have any idea what other conditions you might use?
-
Basically, records for posts that were read and later deleted.
There is no such thing as a deleted post; they're simply made inaccessible. I suspect bugs.
Away from PC atm, but I'll see if I can find out the cause of them.
Do you have any idea what other conditions you might use?
No more on post_timings. I'll need to add more joins to replicate the badge_posts table, but I narrowed down the bulk of the time to the part of the query in the OP here. See posts in the linked topic for more background.
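(For anyone following along: badge_posts in stock Discourse is, as far as I can tell, a view over posts joined to topics with the non-public stuff filtered out, so replicating it means joins roughly along these lines. This is only a guess at its shape; the real definition lives in the Discourse source:)
[code]
-- Rough guess at what badge_posts filters on; check the
-- Discourse source for the actual view definition.
SELECT p.*
FROM posts p
JOIN topics t ON t.id = p.topic_id
WHERE p.deleted_at IS NULL
  AND t.deleted_at IS NULL
  AND t.visible;
[/code]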
-
SELECT count(1)
FROM posts p
JOIN post_timings pt ON
    pt.topic_id = p.topic_id AND
    pt.post_number = p.post_number

-->

SELECT count(1)
FROM (
    select p.topic_id, p.post_number
    from posts as p
    inner join post_timings as pt
        on pt.topic_id = p.topic_id
        and pt.post_number = p.post_number
) as a
-
All my nontrivial experience being with Oracle: is there an equivalent of explain plan in postgres? It may not be using indexes to speed up the join.
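(For reference: postgres spells it EXPLAIN, and EXPLAIN (ANALYZE, VERBOSE) actually runs the query, reporting real row counts and timings next to the estimates. For the query in the OP that would be:)
[code]
-- ANALYZE executes the query for real; VERBOSE adds the output columns.
EXPLAIN (ANALYZE, VERBOSE)
SELECT count(1)
FROM posts p
JOIN post_timings pt ON
    pt.topic_id = p.topic_id AND
    pt.post_number = p.post_number;
[/code]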
-
I'll post EXPLAIN tomorrow.
Thinking about this, the sheer size of post_timings is a major problem here. I'm going to have to find a suitable proxy for getting public posts.
Paging @Sam...
-
Request for @PJH
I don't suppose we could have a new bronze level badge that's awarded to the posters of exact powers of 2 in one of our "spam" topics? (i'm thinking the likes thread and the questions thread)
I propose the name "Too much time on my hands"
-
Powers of 2? Too conventional. Try powers of 7:
7, 49, 343, 2401, 16807, …
-
Didn't @PJH say - only today - that the counts for TL3 were based on things that were public (i.e. didn't include flagging and by extension PM topics)?
TL3 has post read requirements, maybe that query has found a way to handle the massive amounts of reads that we have?
-
I'll post EXPLAIN tomorrow.
#EXPLAIN (ANALYZE, VERBOSE) SELECT count(1)
FROM posts p
JOIN post_timings pt ON
    pt.topic_id = p.topic_id AND
    pt.post_number = p.post_number
;
                                                           QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=893684.14..893684.15 rows=1 width=0) (actual time=53583.520..53583.520 rows=1 loops=1)
   Output: count(1)
   ->  Nested Loop  (cost=0.00..674005.99 rows=87871262 width=0) (actual time=0.169..50128.018 rows=4796682 loops=1)
         ->  Seq Scan on public.posts p  (cost=0.00..11641.85 rows=105585 width=8) (actual time=0.031..474.069 rows=105585 loops=1)
               Output: p.id, p.user_id, p.topic_id, p.post_number, p.raw, p.cooked, p.created_at, p.updated_at, p.reply_to_post_number, p.reply_count, p.quote_count, p.deleted_at, p.off_topic_count, p.like_count, p.incoming_link_count, p.bookmark_count, p.avg_time, p.score, p.reads, p.post_type, p.vote_count, p.sort_order, p.last_editor_id, p.hidden, p.hidden_reason_id, p.notify_moderators_count, p.spam_count, p.illegal_count, p.inappropriate_count, p.last_version_at, p.user_deleted, p.reply_to_user_id, p.percent_rank, p.notify_user_count, p.like_score, p.deleted_by_id, p.edit_reason, p.word_count, p.version, p.cook_method, p.wiki, p.baked_at, p.baked_version, p.hidden_at, p.self_edits, p.reply_quoted, p.via_email
         ->  Index Scan using post_timings_summary on public.post_timings pt  (cost=0.00..6.26 rows=1 width=8) (actual time=0.047..0.433 rows=45 loops=105585)
               Output: pt.topic_id, pt.post_number, pt.user_id, pt.msecs
               Index Cond: ((pt.topic_id = p.topic_id) AND (pt.post_number = p.post_number))
 Total runtime: 53584.016 ms
(9 rows)

Time: 53621.326 ms
-
TL3 has post read requirements, maybe that query has found a way to handle the massive amounts of reads that we have?
Hmm... that's a point...
-
Hm, seems to be doing a nested loops join. Wouldn't expect that; the join condition is on topic and post IDs, and both sides have btree indexes on those two columns (in the same order). I'd expect it to pick an index join in that case.
Have statistics been gathered on these tables/indexes? Assuming that applies to postgres. The plan suggests it thinks the post_timings_summary index is tiny.
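(If they haven't been, something like this would be the postgres way to refresh them and to check when that last happened - assuming I've got the catalog view right:)
[code]
-- Refresh the planner statistics for both tables...
ANALYZE posts;
ANALYZE post_timings;

-- ...and see when they were last analyzed, manually or by autovacuum.
SELECT relname, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname IN ('posts', 'post_timings');
[/code]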
-
OK, lots of internal consistency here...
[postgres@dellxps-pjh ~]$ sql_tdwtf posts_days
# Posts read and days visited
SELECT row_number() OVER (ORDER BY sum(posts_read) DESC) as n,
       sum(posts_read) as "Posts read",
       count(*) as "Days visited",
       u.username as "User id"
FROM user_visits v
JOIN users u on u.id = v.user_id
GROUP BY u.username
ORDER BY sum(posts_read) DESC
LIMIT 25

  n | Posts read | Days visited | User id
----+------------+--------------+-----------------
  1 |     101605 |          106 | VinDuv
  2 |     101568 |          138 | ben_lubar
  3 |     101005 |            8 | ben_lubot
  4 |      98009 |          137 | PJH
  5 |      97499 |          136 | dkf
  6 |      97341 |          134 | Keith
  7 |      97286 |           47 | accalia
  8 |      94076 |          135 | ChaosTheEternal
  9 |      92930 |          130 | Maciejasjmj
 10 |      92877 |           17 | sockbot
 11 |      91209 |          124 | darkmatter
 12 |      90628 |          138 | blakeyrat
 13 |      89868 |           97 | Yamikuronue
 14 |      88346 |          113 | Arantor
 15 |      87455 |           78 | Spencer
 16 |      86699 |           94 | M_Adams
 17 |      83613 |          129 | Luhmann
 18 |      83119 |          122 | antiquarian
 19 |      82497 |          116 | HardwareGeek
 20 |      81140 |          118 | mott555
 21 |      81120 |          129 | boomzilla
 22 |      81006 |          109 | cartman82
 23 |      77616 |           94 | aliceif
 24 |      72656 |          116 | jaloopa
 25 |      72275 |           45 | tarunik
(25 rows)

Elapsed: 0.032s
Backup taken: 2014-10-03 08:17:40.262149
[postgres@dellxps-pjh ~]$ sql_tdwtf posts_read3
# Number of publicly visible posts read
SELECT us.posts_read_count, u.username
FROM user_stats us
JOIN users u on u.id=us.user_id
ORDER BY us.posts_read_count DESC
LIMIT 25

 posts_read_count | username
------------------+-----------------
           101011 | PJH
           100523 | accalia
           100295 | ben_lubot
            99778 | boomzilla
            98381 | darkmatter
            96547 | sockbot
            93485 | ChaosTheEternal
            93483 | HardwareGeek
            90534 | Luhmann
            89244 | Arantor
            85636 | VinDuv
            83549 | Spencer
            83030 | Maciejasjmj
            76173 | abarker
            75426 | M_Adams
            73939 | Keith
            73362 | dkf
            70778 | blakeyrat
            70665 | ben_lubar
            70074 | mott555
            68360 | aliceif
            68048 | antiquarian
            60780 | Yamikuronue
            57964 | loopback0
            57828 | Onyx
(25 rows)

Elapsed: 0.009s
Backup taken: 2014-10-03 08:17:40.262149
:sigh:
-
OK, lots of internal consistency here...
this is discourse we're talking about....
what a rabbit hole we stepped down here isn't it?
-
TL3 has post read requirements, maybe that query has found a way to handle the massive amounts of reads that we have?
Having had a little dig, I'm not sure that the TL3 requirements do actually ignore non-public posts/topics....
-
what a rabbit hole we stepped down here isn't it?
Quite. I think I'm going to have to knock this one on the head.
-
Having had a little dig, I'm not sure that the TL3 requirements do actually ignore non-public posts/topics....
i just looked at the requirements.
if it's put together the way i think it's put together, it can use that troublesome table because a lot of things will be filtered out. one would select the posts that were made within the last 100 days, and then join in the timings to see how many of them were read.
or it could be doing something else that means it really doesn't work that way.
discourse.
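something like this is what i'm imagining - sketch only, no idea if the real TL3 query looks anything like it:
[code]
-- sketch: only join timings for posts made in the last 100 days,
-- which keeps the post_timings side of the join much smaller.
SELECT pt.user_id, count(*) AS recent_posts_read
FROM posts p
JOIN post_timings pt ON
    pt.topic_id = p.topic_id AND
    pt.post_number = p.post_number
WHERE p.created_at >= now() - interval '100 days'
GROUP BY pt.user_id;
[/code]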
Quite. I think I'm going to have to knock this one on the head.
want a hitting stick?
-
i just looked at the requirements.
I've been grubbing around the source. Counts seem to be using the user_stats and user_visits tables. Which, as you can see above, don't appear to be doing any filtering.
-
I've been grubbing around the source. Counts seem to be using the user_stats and user_visits tables. Which, as you can see above, don't appear to be doing any filtering.
wouldn't be the first time the code didn't match the documentation
-
91209 | 124 | darkmatter
98381 | darkmatter
sweet i read 7000 posts apparently on a day i didn't visit the site since it didn't get tracked into user_visits
-
Did you try my variant?
-
I apparently read 30903 more posts on days I visited than in total.
-
Did you try my variant?
[postgres@sofa ~]$ sql_tdwtf posts_read4
# @Matches variant
SELECT count(1)
FROM (
    select p.topic_id, p.post_number
    from posts as p
    inner join post_timings as pt
        on pt.topic_id = p.topic_id
        and pt.post_number = p.post_number
) as a
LIMIT 25

  count
---------
 4542370
(1 row)

Elapsed: 125.788s
Backup taken: 2014-10-01 14:05:50.514682
-
Yay - another formatting bug :)
-
Yes! Slower than the original. Crushed it.
That formatting bug is old news
-
Yup - back here..
[postgres@sofa ~]$ sql_tdwtf posts_read5
# Another attempt
WITH exclusions AS (
    /* Which categories to exclude from counters */
    SELECT user_id, id, topic_id, post_number
    FROM posts
    WHERE raw LIKE '%[UUID removed to prevent this thread accidentally being marked as excluded]%'
    AND user_id IN (
        SELECT gu.user_id
        FROM group_users gu
        WHERE group_id IN (
            SELECT g.id
            FROM groups g
            WHERE g.name IN ('admins')
        )
    )
), posts_read AS (
    SELECT u.username, u.id user_id, count(*),
           log(2, count(*)) ln2, log(3.0, count(*)) ln3
    FROM post_timings pt
    JOIN users u on u.id=pt.user_id
    WHERE pt.topic_id NOT IN (
        SELECT topic_id FROM exclusions
    )
    GROUP BY u.username, u.id
)
SELECT * from posts_read WHERE ln3 >= 10 and ln3 < 11

    username     | user_id | count  |         ln2         |         ln3
-----------------+---------+--------+---------------------+---------------------
 PleegWat        |     997 |  81622 | 16.3166704412203324 | 10.2946728605858280
 hungrier        |     681 |  91820 | 16.4865208115997284 | 10.4018365129133211
 chubertdev      |      43 | 104475 | 16.6727982330971338 | 10.5193644805546056
 Keith           |     617 | 121576 | 16.8914989326727485 | 10.6573492590437535
 boomzilla       |      18 | 165544 | 17.3368551968774856 | 10.9383377770699532
 Zoidberg        |    1407 | 156347 | 17.2543920114058973 | 10.8863092997816466
 sockbot         |    1223 | 163990 | 17.3232483173882633 | 10.9297527919469417
 Spencer         |     697 | 104947 | 16.6793014009812850 | 10.5234675226651869
 loopback0       |     606 | 134806 | 17.0405251844403095 | 10.7513743553471388
 ijij            |     958 |  65278 | 15.9943092359660440 | 10.0912855847937418
 abarker         |     603 | 104756 | 16.6766733524649746 | 10.5218094086624174
 Zecc            |      30 |  81605 | 16.3163699294694114 | 10.2944832587808741
 blakeyrat       |      17 | 122657 | 16.9042700448762408 | 10.6654069337191363
 reverendryan    |     565 |  61113 | 15.8991916833348286 | 10.0312730907518090
 accalia         |     671 | 166889 | 17.3485293412497326 | 10.9457033421018926
 ookami          |     616 |  85153 | 16.3777697375082411 | 10.3332222245361477
 aliceif         |     922 | 117233 | 16.8390192066115114 | 10.6242382384124396
 Buddy           |     676 |  93011 | 16.5051137270281438 | 10.4135673364627462
 mott555         |      69 | 116734 | 16.8328652967271107 | 10.6203555535655737
 Luhmann         |     628 | 144016 | 17.1358695766485217 | 10.8115299692274865
 OffByOne        |    1163 |  95534 | 16.5437266509361405 | 10.4379293790286919
 Onyx            |     579 |  79690 | 16.2821110767192360 | 10.2728683292575651
 another_sam     |     914 |  66931 | 16.0303869489738806 | 10.1140480873711979
 ben_lubar       |     558 | 108956 | 16.7333861195542528 | 10.5575911808264111
 Kuro            |     785 |  80357 | 16.2941360839154187 | 10.2804552640845477
 CodingHorrorBot |    1253 |  79045 | 16.2703865871888297 | 10.2654709999673947
 ben_lubot       |    1318 |  78217 | 16.2551945823535843 | 10.2558859121004369
 Yamikuronue     |     788 | 143019 | 17.1258472954608542 | 10.8052066138275276
 antiquarian     |     598 | 145035 | 17.1460415693892198 | 10.8179477821007058
 PJH             |      20 | 166822 | 17.3479500341904008 | 10.9453378400417062
 HardwareGeek    |     123 | 147903 | 17.1742917901680985 | 10.8357716869350630
 riking          |     561 |  65304 | 15.9948837419708348 | 10.0916480577257698
 Maciejasjmj     |     261 | 142144 | 17.1169936775469807 | 10.7996206028589115
 ChaosTheEternal |     589 | 143121 | 17.1268758472215564 | 10.8058555577364429
 RaceProUK       |     294 | 151514 | 17.2090915803291070 | 10.8577279099656865
 Arantor         |     762 | 109022 | 16.7342597662122499 | 10.5581423904970498
 cvi             |     588 |  68534 | 16.0645322721776405 | 10.1355913877257640
 antipattern     |     619 |  76693 | 16.2268072841857539 | 10.2379755210628482
 Polygeekery     |     824 |  70057 | 16.0962415895308388 | 10.1555977395093365
 cartman82       |     817 |  90238 | 16.4614474723076888 | 10.3860169971325810
 Jaloopa         |     587 | 131139 | 17.0007372731871906 | 10.7262709783050855
 M_Adams         |     622 |  72132 | 16.1383518054427747 | 10.1821663276574951
 VinDuv          |     602 | 165061 | 17.3326397603262237 | 10.9356781327254694
 tarunik         |    1096 | 134175 | 17.0337563625477891 | 10.7471037042185218
 FrostCat        |     110 | 109448 | 16.7398860660827176 | 10.5616921904878431
 jaming          |    1070 |  62798 | 15.9384309922327776 | 10.0560303382451062
 NetBot          |    1862 | 154844 | 17.2404559560096026 | 10.8775166277847042
 JBert           |    1004 |  63237 | 15.9484813065786794 | 10.0623713805986817
 izzion          |    1094 |  61037 | 15.8973964342598232 | 10.0301404146953164
 darkmatter      |     564 | 163208 | 17.3163522503580768 | 10.9254018580749738
 locallunatic    |      83 | 102943 | 16.6514862062172384 | 10.5059180886871649
 dkf             |      33 | 156285 | 17.2538197917117665 | 10.8859482693510399
(52 rows)

Elapsed: 17.084s
Backup taken: 2015-02-13 03:59:31.196191
[postgres@sofa ~]$
Better than a minute, but this query is still going to have to be run for each of the badges (so that's at least 10, and coming up to 11, times - making it 3 minutes just for those badges). The :backfill queries already running (of which these will, currently, necessarily be part) already take 2 minutes, and the query doesn't exclude PMs (though not sure that's really a problem..)

[postgres@sofa ~]$ sql_tdwtf posts_read5
# Another attempt
WITH exclusions AS (
    /* Which categories to exclude from counters */
    SELECT user_id, id, topic_id, post_number
    FROM posts
    WHERE raw LIKE '%[UUID removed to prevent this thread accidentally being marked as excluded]%'
    AND user_id IN (
        SELECT gu.user_id
        FROM group_users gu
        WHERE group_id IN (
            SELECT g.id
            FROM groups g
            WHERE g.name IN ('admins')
        )
    )
), posts_read AS (
    SELECT u.username, u.id user_id, count(*),
           log(2, count(*)) ln2, log(3.0, count(*)) ln3
    FROM post_timings pt
    JOIN users u on u.id=pt.user_id
    WHERE pt.topic_id NOT IN (
        SELECT topic_id FROM exclusions
    )
    GROUP BY u.username, u.id
)
SELECT * from posts_read WHERE ln3 >= 10 and ln3 < 11

                                                           QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 CTE Scan on posts_read  (cost=402912.02..402956.07 rows=9 width=124) (actual time=72608.869..73218.880 rows=52 loops=1)
   Filter: ((ln3 >= 10::numeric) AND (ln3 < 11::numeric))
   Rows Removed by Filter: 1288
   CTE exclusions
     ->  Nested Loop Semi Join  (cost=0.28..26400.83 rows=20 width=16) (actual time=22.527..797.612 rows=7 loops=1)
           ->  Seq Scan on posts  (cost=0.00..26181.46 rows=23 width=16) (actual time=22.438..797.083 rows=12 loops=1)
                 Filter: (raw ~~ '%[UUID removed to prevent this thread accidentally being marked as excluded]%'::text)
                 Rows Removed by Filter: 229465
           ->  Nested Loop  (cost=0.28..9.53 rows=1 width=4) (actual time=0.032..0.032 rows=1 loops=12)
                 ->  Seq Scan on groups g  (cost=0.00..1.21 rows=1 width=4) (actual time=0.008..0.009 rows=1 loops=12)
                       Filter: ((name)::text = 'admins'::text)
                       Rows Removed by Filter: 11
                 ->  Index Only Scan using index_group_users_on_group_id_and_user_id on group_users gu  (cost=0.28..8.30 rows=1 width=8) (actual time=0.014..0.014 rows=1 loops=12)
                       Index Cond: ((group_id = g.id) AND (user_id = posts.user_id))
                       Heap Fetches: 7
   CTE posts_read
     ->  HashAggregate  (cost=376475.95..376511.19 rows=1762 width=13) (actual time=72605.438..73214.686 rows=1340 loops=1)
           ->  Hash Join  (cost=107.10..302166.23 rows=5944778 width=13) (actual time=808.886..52969.495 rows=10074744 loops=1)
                 Hash Cond: (pt.user_id = u.id)
                 ->  Seq Scan on post_timings pt  (cost=0.45..212887.91 rows=5944778 width=4) (actual time=797.745..20098.182 rows=10074744 loops=1)
                       Filter: (NOT (hashed SubPlan 2))
                       Rows Removed by Filter: 1814792
                       SubPlan 2
                         ->  CTE Scan on exclusions  (cost=0.00..0.40 rows=20 width=4) (actual time=22.533..797.663 rows=7 loops=1)
                 ->  Hash  (cost=84.62..84.62 rows=1762 width=13) (actual time=11.116..11.116 rows=1762 loops=1)
                       Buckets: 1024  Batches: 1  Memory Usage: 60kB
                       ->  Seq Scan on users u  (cost=0.00..84.62 rows=1762 width=13) (actual time=0.008..8.554 rows=1762 loops=1)
 Total runtime: 73219.277 ms
(28 rows)

Elapsed: 73.259s
Backup taken: 2015-02-13 03:59:31.196191
[postgres@sofa ~]$
Paging @hhaamu if you're still around, and interested in prodding this one.
-
I'll take a look at it, but my brains are all out of juice right now; might take a day or two.
-
Better than a minute, but this query is still going to have to be run for each of the badges (so, that's at least 10, and coming up to 11 times, making it 3 minutes just for those badges
Seems like this is probably I/O bound. Probably acceptable if it's a daily run? I dunno...
-
On short notice: if all the queries start with those two CTEs, you would benefit the most by shoving their results into temporary tables at the start of a run and querying those instead.
CREATE TEMPORARY TABLE exclusions AS SELECT ...
etc.
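Filled out with the exclusions query from upthread, that would be along these lines (sketch; run once per connection at the start of the badge run):
[code]
-- Materialise the CTE once; every badge query then reads the tiny
-- temp table instead of re-scanning posts.raw each time.
CREATE TEMPORARY TABLE exclusions AS
SELECT user_id, id, topic_id, post_number
FROM posts
WHERE raw LIKE '%[UUID removed to prevent this thread accidentally being marked as excluded]%'
AND user_id IN (
    SELECT gu.user_id
    FROM group_users gu
    WHERE group_id IN (SELECT g.id FROM groups g WHERE g.name IN ('admins'))
);

-- Probably unnecessary at seven rows, but cheap to add:
CREATE INDEX ON exclusions (topic_id);
[/code]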
-
I know...
Sadly each query is run independently (think sandboxed) and I'm not allowed to modify the db environment like that.
I wish...
-
Frankly, I'm not at all surprised that reading ten million rows from post_timings (takes 19 seconds), joining them against users (takes 52 seconds), and finally aggregating them (takes 20 seconds) takes that long.

One flaw you have is doing the join against users before aggregating. The aggregating criteria are all in the post_timings table. Move the join a bit further out:

WITH posts_read AS (
    SELECT u.username, u.id AS user_id, agg.count, agg.ln2, agg.ln3
    FROM (
        SELECT user_id, count(*),
               log(2, count(*)) AS ln2,
               log(3.0, count(*)) AS ln3
        FROM post_timings pt
        WHERE topic_id NOT IN (SELECT topic_id FROM exclusions)
        GROUP BY user_id
    ) agg
    JOIN users u ON u.id = agg.user_id
)

While I'm at it, let's convert the NOT IN () into a left anti-join. (I doubt there's a performance gain, but why not.)

WITH posts_read AS (
    SELECT u.username, u.id AS user_id, agg.count, agg.ln2, agg.ln3
    FROM (
        SELECT user_id, count(*),
               log(2, count(*)) AS ln2,
               log(3.0, count(*)) AS ln3
        FROM post_timings pt
        LEFT JOIN exclusions e ON pt.topic_id = e.topic_id
        WHERE e.topic_id IS NULL
        GROUP BY user_id
    ) agg
    JOIN users u ON u.id = agg.user_id
)
-
Depending on the DB, left anti-join performance can suck hard... on those I've had more luck with not exists() clauses like:
[code]
not exists (select 1 from exclusions e where e.topic_id = pt.topic_id)
[/code]
though that method probably won't be any better here either, given the table stats.
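Dropped into the inner aggregate from @hhaamu's version, that would read something like (untested sketch; exclusions is still the CTE/temp table defined earlier):
[code]
-- Same aggregate as before, with the exclusion test as NOT EXISTS.
SELECT user_id, count(*),
       log(2, count(*)) AS ln2,
       log(3.0, count(*)) AS ln3
FROM post_timings pt
WHERE NOT EXISTS (
    SELECT 1 FROM exclusions e WHERE e.topic_id = pt.topic_id
)
GROUP BY user_id;
[/code]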
-
Sadly each query is run independently (think sandboxed) and I'm not allowed to modify the db environment like that.
temp tables are per-connection, so if you're not tearing down and reconnecting between each run, it should be doable. (Of course it depends on what you are 'allowed' to do.)
-
Also, exclusions would need an index on topic_id for it to work optimally.
-
Also, exclusions would need an index on topic_id for it to work optimally.
That should be pretty small, so I'd think an index would be useless. Gotta be less than 10 topics with the magic UUID.
-
That should be pretty small, so I'd think an index would be useless. Gotta be less than 10 topics with the magic UUID.
and it has to be matched against at least 60k posts. that's where the index savings would be (unless i very much misunderstand postgres..... which is quite possible)
-
Seven topics. Overhead from navigating a binary tree is way higher than just matching against those seven numbers over and over again.
-
and it has to be matched against at least 60k posts. that's where the index savings would be (unless i very much misunderstand postgres..... which is quite possible)
That seems to be handled. @PJH looked at that and it turns out it's really fast. I think there's some kind of full text search thingy going on. But finding the UUID itself isn't related to indexing topic_id on the exclusions stuff, which is what @darkmatter was talking about.
Overhead from navigating a binary tree is way higher than just matching against those seven numbers over and over again.
Exactly. I'd assume the DB will ignore an index, so you're just wasting your time.
-
But finding the UUID itself isn't related to indexing topic_id on the exclusions stuff, which is what @darkmatter was talking about.
Beat me to it.
I think there's some kind of full text search thingy going on.
If it's like Oracle then a like with the wildcard at either end is going to mean that any index on that field (raw, here) will be ignored anyway so it's a full scan.
-
If it's like Oracle then a like with the wildcard at either end is going to mean that any index on that field (raw, here) will be ignored anyway so it's a full scan.
No, it's definitely not something horrible like that. I mean... @PJH actually tested it and it was fractions of a second or something, so there's something else going on.
-
Fair point.
At some point, though, the answer may well end up being that we need more hamsters in the server.
-
Something has to eat the cooties.
-
That should be pretty small, so I'd think an index would be useless. Gotta be less than 10 topics with the magic UUID.
That's true; with that few rows, an index is a waste and won't even be used.
I am interested as to whether the left anti-join or the not exists method runs faster. I'm almost certain that the not exists is faster than a not in in almost every scenario especially where you are searching key fields like this, so if @hhaamu thinks the not in and the left anti-join would perform about the same, then surely a not exists is better.
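(The lazy way to settle it would be to run all three formulations under EXPLAIN ANALYZE and compare the timings directly - untested sketch, assuming exclusions has been materialised as a temp table per @hhaamu's earlier suggestion:)
[code]
-- Same aggregate three times; only the exclusion test differs.
EXPLAIN ANALYZE
SELECT user_id, count(*) FROM post_timings pt
WHERE pt.topic_id NOT IN (SELECT topic_id FROM exclusions)
GROUP BY user_id;

EXPLAIN ANALYZE
SELECT pt.user_id, count(*) FROM post_timings pt
LEFT JOIN exclusions e ON e.topic_id = pt.topic_id
WHERE e.topic_id IS NULL
GROUP BY pt.user_id;

EXPLAIN ANALYZE
SELECT user_id, count(*) FROM post_timings pt
WHERE NOT EXISTS (SELECT 1 FROM exclusions e WHERE e.topic_id = pt.topic_id)
GROUP BY user_id;
[/code]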
-
I am interested as to whether the left anti-join or the not exists method runs faster. I'm almost certain that the not exists is faster than a not in in almost every scenario especially where you are searching key fields like this, so if @hhaamu thinks the not in and the left anti-join would perform about the same, then surely a not exists is better.
Eh...I've never actually touched postgres, so I couldn't say.
-
But my experience is mainly at the macro level - as in, search a 2 billion row table for items that are not in a 100 million row table. Left anti-joining to do that is server homicide. Not In a 100 million row selection is also going to murder the server. Not Exists on an indexed column / set of columns is the only sane way to go.

But where you have a full table scan for the not exists and are using the same full table for the Not in.... that I couldn't tell you which would be faster.