Postgres full-text search
-
Continuing the discussion from Did Discourse just fall over?:
For the record, our slow query logs are WAY quieter post this change.
Maybe just change that to use fulltext search, it should find the guid quick smart without a table scan.
Just learning about full-text searches and either I or the database are doing it wrong. Given the 'full text' search appears to be
- taking longer (Sam says it should be a lot quicker than using
LIKE
) - to find fewer rows (I'm sure I've got the syntax wrong)
... I'm fully prepared to concede it's both.
I'm after
post.id
's that have the search string inpost.raw
. The current query, with the correct results, is at the bottom of this post.[pjh@sofa ~]$ time psql -d discourse -c "SELECT id from posts where to_tsvector(raw) @@ to_tsquery('b3fe22f0-a01d-11e4-bcd8-0800200c9a6') limit 10" NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. id ---- (0 rows) real 1m21.435s user 0m0.003s sys 0m0.008s
[pjh@sofa ~]$ time psql -d discourse -c "SELECT id from posts where raw like '%b3fe22f0-a01d-11e4-bcd8-0800200c9a6%' limit 10" id -------- 247013 186601 131276 260217 410075 (5 rows) real 0m1.264s user 0m0.002s sys 0m0.000s
- taking longer (Sam says it should be a lot quicker than using
-
Yeah you are going to need to query a table with a fts index like
posts_search_data
-
posts_search_data
Like this?:
[pjh@sofa ~]$ time psql -d discourse -c "SELECT post_id from post_search_data where search_data @@ to_tsquery('b3fe22f0-a01d-11e4-bcd8-0800200c9a6') limit 10" post_id --------- (0 rows) real 0m0.165s user 0m0.001s sys 0m0.002s [pjh@sofa ~]$
-
As I posted elsewhere, I think the better (or easier / more straightforward) solution is just to filter the posts better (except on backfill, but 600ms once per day per badge seems ok:
SELECT p.user_id, p.id post_id, p.created_at granted_at, ap.id FROM posts p join posts ap -- admin's post on ap.reply_to_post_number = p.post_number and ap.topic_id = p.topic_id WHERE ap.raw LIKE '%87c69ccc-3c6f-47d3-8c66-4f3fdb9ghi26%' AND (:backfill OR ap.id IN (:post_ids) ) AND ap.user_id IN ( SELECT gu.user_id FROM group_users gu WHERE group_id IN( SELECT g.id FROM groups g WHERE g.name IN ('staff') )) ;
-
NOTICE: word is too long to be indexed
DETAIL: Words longer than 2047 characters are ignored.It seems to be what it says on the tin. http://www.postgresql.org/message-id/Pine.LNX.4.64.0902012044110.9554@sn.sai.msu.ru
Very long words are ignored. Can you do a select on word length and see which words it is? It may be some zalgo text or something like that. If that's true we can ignore it. Also we know the word we're looking for (guid) is way shorter than 2k chars, so er can ignore the warning. Except if something removes the whitespace around the guid
-
Yeah something like that but it's probably scraped out of cooked
-
I think the better (or easier / more straightforward) solution is just to filter the posts better
Not much change actually, unless I'm missing some subtlety in your proposal. Original followed by proposed, followed by
EXPLAIN ANALYZE
for both (I know the columns have shifted - noticed after I'd posted and CBA to change in this one):[pjh@sofa ~]$ sql_tdwtf uuid1 # Original query SELECT user_id, id, topic_id, post_number, reply_to_post_number FROM posts WHERE raw LIKE '%[UUID removed to prevent this thread accidentally being marked as excluded]%' AND reply_to_post_number IS NOT NULL AND user_id IN ( SELECT gu.user_id FROM group_users gu WHERE group_id IN( SELECT g.id FROM groups g WHERE g.name IN ('staff') ) ) user_id | id | topic_id | post_number | reply_to_post_number ---------+--------+----------+-------------+---------------------- 18 | 247013 | 8513 | 4 | 2 18 | 186601 | 5668 | 24 | 19 18 | 260217 | 8913 | 88 | 60 603 | 410075 | 7333 | 1350 | 1345 (4 rows) Elapsed: 1.076s Backup taken: 2015-05-20 03:55:28.64354
[pjh@sofa ~]$ sql_tdwtf uuid2 # Proposed change SELECT p.user_id, p.id, p.topic_id, p.post_number, p.reply_to_post_number FROM posts p JOIN posts bg /* badge granter */ ON bg.reply_to_post_number = p.post_number AND bg.topic_id = p.topic_id WHERE bg.raw LIKE '%[UUID removed to prevent this thread accidentally being marked as excluded]%' AND bg.user_id IN ( SELECT gu.user_id FROM group_users gu WHERE group_id IN( SELECT g.id FROM groups g WHERE g.name IN ('staff') ) ) user_id | id | topic_id | post_number | reply_to_post_number ---------+--------+----------+-------------+---------------------- 922 | 247005 | 8513 | 2 | 1758 | 186566 | 5668 | 19 | 18 564 | 260147 | 8913 | 60 | 40 606 | 410008 | 7333 | 1345 | 1340 (4 rows) Elapsed: 1.22s Backup taken: 2015-05-20 03:55:28.64354 [pjh@sofa ~]$
[pjh@sofa ~]$ EXPLAIN="EXPLAIN ANALYZE" sql_tdwtf uuid1 # Original query SELECT user_id, id, topic_id, post_number, reply_to_post_number FROM posts WHERE raw LIKE '%[UUID removed to prevent this thread accidentally being marked as excluded]%' AND reply_to_post_number IS NOT NULL AND user_id IN ( SELECT gu.user_id FROM group_users gu WHERE group_id IN( SELECT g.id FROM groups g WHERE g.name IN ('staff') ) ) QUERY PLAN --------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Nested Loop Semi Join (cost=0.42..51579.57 rows=34 width=20) (actual time=3.641..1087.817 rows=4 loops=1) -> Seq Scan on posts (cost=0.00..51250.45 rows=34 width=20) (actual time=3.552..1087.591 rows=4 loops=1) Filter: ((reply_to_post_number IS NOT NULL) AND (raw ~~ '%[UUID removed to prevent this thread accidentally being marked as excluded]%'::text)) Rows Removed by Filter: 420672 -> Nested Loop (cost=0.42..9.67 rows=1 width=4) (actual time=0.041..0.041 rows=1 loops=4) -> Seq Scan on groups g (cost=0.00..1.21 rows=1 width=4) (actual time=0.011..0.011 rows=1 loops=4) Filter: ((name)::text = 'staff'::text) Rows Removed by Filter: 13 -> Index Only Scan using index_group_users_on_group_id_and_user_id on group_users gu (cost=0.42..8.45 rows=1 width=8) (actual time=0.021..0.021 rows=1 loops=4) Index Cond: ((group_id = g.id) AND (user_id = posts.user_id)) Heap Fetches: 4 Total runtime: 1087.940 ms (12 rows) Elapsed: 1.16s Backup taken: 2015-05-20 03:55:28.64354
[pjh@sofa ~]$ EXPLAIN="EXPLAIN ANALYZE" sql_tdwtf uuid2 # Proposed change SELECT p.user_id, p.id, p.topic_id, p.post_number, p.reply_to_post_number FROM posts p JOIN posts bg /* badge granter */ ON bg.reply_to_post_number = p.post_number AND bg.topic_id = p.topic_id WHERE bg.raw LIKE '%[UUID removed to prevent this thread accidentally being marked as excluded]%' AND bg.user_id IN ( SELECT gu.user_id FROM group_users gu WHERE group_id IN( SELECT g.id FROM groups g WHERE g.name IN ('staff') ) ) QUERY PLAN --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Nested Loop (cost=0.84..52012.22 rows=1480 width=20) (actual time=3.915..1253.336 rows=4 loops=1) -> Nested Loop Semi Join (cost=0.42..51657.01 rows=42 width=8) (actual time=3.884..1253.215 rows=5 loops=1) -> Seq Scan on posts bg (cost=0.00..51250.45 rows=42 width=12) (actual time=3.786..1252.883 rows=5 loops=1) Filter: (raw ~~ '%[UUID removed to prevent this thread accidentally being marked as excluded]%'::text) Rows Removed by Filter: 420671 -> Nested Loop (cost=0.42..9.67 rows=1 width=4) (actual time=0.042..0.042 rows=1 loops=5) -> Seq Scan on groups g (cost=0.00..1.21 rows=1 width=4) (actual time=0.012..0.012 rows=1 loops=5) Filter: ((name)::text = 'staff'::text) Rows Removed by Filter: 13 -> Index Only Scan using index_group_users_on_group_id_and_user_id on group_users gu (cost=0.42..8.45 rows=1 width=8) (actual time=0.022..0.022 rows=1 loops=5) Index Cond: ((group_id = g.id) AND (user_id = bg.user_id)) Heap Fetches: 5 -> Index Scan using index_posts_on_topic_id_and_post_number on posts p (cost=0.42..8.45 rows=1 width=20) (actual time=0.014..0.016 rows=1 loops=5) Index Cond: ((topic_id = bg.topic_id) AND (post_number = bg.reply_to_post_number)) Total runtime: 1253.514 ms (15 rows) Elapsed: 1.28s Backup taken: 2015-05-20 03:55:28.64354 [pjh@sofa ~]$
-
I know the columns have shifted - noticed after I'd posted and CBA to change in this one
With change - same results:
[pjh@sofa ~]$ sql_tdwtf uuid2 # Proposed change SELECT bg.user_id, bg.id, p.topic_id, bg.post_number, bg.reply_to_post_number FROM posts p JOIN posts bg /* badge granter */ ON bg.reply_to_post_number = p.post_number AND bg.topic_id = p.topic_id WHERE bg.raw LIKE '%[UUID removed to prevent this thread accidentally being marked as excluded]%' AND bg.user_id IN ( SELECT gu.user_id FROM group_users gu WHERE group_id IN( SELECT g.id FROM groups g WHERE g.name IN ('staff') ) ) user_id | id | topic_id | post_number | reply_to_post_number ---------+--------+----------+-------------+---------------------- 18 | 247013 | 8513 | 4 | 2 18 | 186601 | 5668 | 24 | 19 18 | 260217 | 8913 | 88 | 60 603 | 410075 | 7333 | 1350 | 1345 (4 rows) Elapsed: 1.193s
-
Not much change actually, unless I'm missing some subtlety in your proposal.
The queries you posted are what would be run on a backfill, AIUI. My proposal fixes the case where the query is run after a post is made / edited and the
:post_ids
parameter is used. It removes the Seq Scan (full table scan) on theposts
table.
-
:post_ids
parameter is usedThat was the subtlety. Artificially constructing one for this query:
[pjh@sofa ~]$ sql_tdwtf uuid2 # Proposed change SELECT bg.user_id, bg.id, p.topic_id, bg.post_number, bg.reply_to_post_number FROM posts p JOIN posts bg /* badge granter */ ON bg.reply_to_post_number = p.post_number AND bg.topic_id = p.topic_id WHERE bg.raw LIKE '%[UUID removed to prevent this thread accidentally being marked as excluded]%' AND p.id IN (247005, 186566, 260147, 410008) AND bg.user_id IN ( SELECT gu.user_id FROM group_users gu WHERE group_id IN( SELECT g.id FROM groups g WHERE g.name IN ('staff') ) ) user_id | id | topic_id | post_number | reply_to_post_number ---------+--------+----------+-------------+---------------------- 18 | 186601 | 5668 | 24 | 19 18 | 247013 | 8513 | 4 | 2 18 | 260217 | 8913 | 88 | 60 603 | 410075 | 7333 | 1350 | 1345 (4 rows) Elapsed: 0.023s Backup taken: 2015-05-20 03:55:28.64354
It's a bit faster...
-
The second query I posted (first / only in this topic) is also the full query. It doesn't use a CTE at all.