Postgres full-text search


  • Discourse touched me in a no-no place

    Continuing the discussion from Did Discourse just fall over?:

    @sam said:

    For the record, our slow query logs are WAY quieter post this change.

    Maybe just change that to use fulltext search, it should find the guid quick smart without a table scan.


    Just learning about full-text searches and either I or the database are doing it wrong. Given the 'full text' search appears to be

    • taking longer (Sam says it should be a lot quicker than using LIKE)
    • to find fewer rows (I'm sure I've got the syntax wrong)

    ... I'm fully prepared to concede it's both.

    I'm after post.id's that have the search string in post.raw. The current query, with the correct results, is at the bottom of this post.

    [pjh@sofa ~]$ time psql -d discourse -c "SELECT id from posts where to_tsvector(raw) @@ to_tsquery('b3fe22f0-a01d-11e4-bcd8-0800200c9a6') limit 10"
    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.
    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.
    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.
    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.
    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.
    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.
    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.
    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.
    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.
    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.
    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.
    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.
    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.
    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.
    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.
    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.
    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.
    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.
    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.
    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.
    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.
    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.
    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.
    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.
    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.
    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.
    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.
    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.
    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.
    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.
    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.
    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.
    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.
    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.
    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.
    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.
    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.
    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.
    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.
    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.
    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.
    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.
    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.
    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.
    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.
    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.
    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.
    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.
    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.
    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.
    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.
    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.
    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.
    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.
    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.
    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.
    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.
    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.
    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.
    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.
    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.
    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.
    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.
    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.
    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.
    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.
    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.
    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.
    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.
    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.
    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.
    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.
    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.
    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.
    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.
    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.
    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.
    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.
    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.
    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.
    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.
    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.
    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.
    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.
    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.
    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.
    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.
    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.
     id
    ----
    (0 rows)
    
    
    real    1m21.435s
    user    0m0.003s
    sys     0m0.008s
    
    
    [pjh@sofa ~]$ time psql -d discourse -c "SELECT id from posts where raw like '%b3fe22f0-a01d-11e4-bcd8-0800200c9a6%' limit 10"
       id
    --------
     247013
     186601
     131276
     260217
     410075
    (5 rows)
    
    
    real    0m1.264s
    user    0m0.002s
    sys     0m0.000s
    

  • Banned

    Yeah you are going to need to query a table with a fts index like
    posts_search_data


  • Discourse touched me in a no-no place

    @sam said:

    posts_search_data

    Like this?:

    [pjh@sofa ~]$ time psql -d discourse -c "SELECT post_id from post_search_data where search_data @@ to_tsquery('b3fe22f0-a01d-11e4-bcd8-0800200c9a6') limit 10"
     post_id
    ---------
    (0 rows)
    
    
    real    0m0.165s
    user    0m0.001s
    sys     0m0.002s
    [pjh@sofa ~]$
    

  • ♿ (Parody)

    As I posted elsewhere, I think the better (or easier / more straightforward) solution is just to filter the posts better (except on backfill, but 600ms once per day per badge seems ok:

    SELECT p.user_id, p.id post_id, p.created_at granted_at, ap.id
    FROM posts p
    join posts ap -- admin's post
      on ap.reply_to_post_number = p.post_number
      and ap.topic_id = p.topic_id
    WHERE ap.raw LIKE '%87c69ccc-3c6f-47d3-8c66-4f3fdb9ghi26%'
      AND (:backfill OR ap.id IN (:post_ids) )
      AND ap.user_id IN  (
    		SELECT gu.user_id
    		FROM group_users gu
    		WHERE group_id IN(
    			SELECT g.id
    			FROM groups g
    			WHERE g.name IN ('staff')
    		))
    ;
    


  • @PJH said:

    NOTICE: word is too long to be indexed
    DETAIL: Words longer than 2047 characters are ignored.

    It seems to be what it says on the tin. http://www.postgresql.org/message-id/Pine.LNX.4.64.0902012044110.9554@sn.sai.msu.ru
    Very long words are ignored. Can you do a select on word length and see which words it is? It may be some zalgo text or something like that. If that's true we can ignore it. Also we know the word we're looking for (guid) is way shorter than 2k chars, so er can ignore the warning. Except if something removes the whitespace around the guid


  • Banned

    Yeah something like that but it's probably scraped out of cooked


  • Discourse touched me in a no-no place

    @boomzilla said:

    I think the better (or easier / more straightforward) solution is just to filter the posts better

    Not much change actually, unless I'm missing some subtlety in your proposal. Original followed by proposed, followed by EXPLAIN ANALYZE for both (I know the columns have shifted - noticed after I'd posted and CBA to change in this one):

    [pjh@sofa ~]$ sql_tdwtf uuid1
    # Original query
            SELECT user_id, id, topic_id, post_number, reply_to_post_number
                    FROM posts
                    WHERE raw LIKE '%[UUID removed to prevent this thread accidentally being marked as excluded]%' AND
                    reply_to_post_number IS NOT NULL AND
                    user_id IN  (
                            SELECT gu.user_id
                            FROM group_users gu
                            WHERE group_id IN(
                                    SELECT g.id
                                    FROM groups g
                                    WHERE g.name IN ('staff')
                            )
                    )
    
    
     user_id |   id   | topic_id | post_number | reply_to_post_number
    ---------+--------+----------+-------------+----------------------
          18 | 247013 |     8513 |           4 |                    2
          18 | 186601 |     5668 |          24 |                   19
          18 | 260217 |     8913 |          88 |                   60
         603 | 410075 |     7333 |        1350 |                 1345
    (4 rows)
    
    Elapsed: 1.076s
    Backup taken:  2015-05-20 03:55:28.64354
    
    [pjh@sofa ~]$ sql_tdwtf uuid2
    # Proposed change
            SELECT p.user_id, p.id, p.topic_id, p.post_number, p.reply_to_post_number
                    FROM posts p
                    JOIN posts bg /* badge granter */
                            ON bg.reply_to_post_number = p.post_number
                            AND bg.topic_id = p.topic_id
                    WHERE bg.raw LIKE '%[UUID removed to prevent this thread accidentally being marked as excluded]%' AND
                    bg.user_id IN  (
                            SELECT gu.user_id
                            FROM group_users gu
                            WHERE group_id IN(
                                    SELECT g.id
                                    FROM groups g
                                    WHERE g.name IN ('staff')
                            )
                    )
    
    
     user_id |   id   | topic_id | post_number | reply_to_post_number
    ---------+--------+----------+-------------+----------------------
         922 | 247005 |     8513 |           2 |
        1758 | 186566 |     5668 |          19 |                   18
         564 | 260147 |     8913 |          60 |                   40
         606 | 410008 |     7333 |        1345 |                 1340
    (4 rows)
    
    Elapsed: 1.22s
    Backup taken:  2015-05-20 03:55:28.64354
    [pjh@sofa ~]$
    
    [pjh@sofa ~]$ EXPLAIN="EXPLAIN ANALYZE" sql_tdwtf uuid1 
    # Original query
            SELECT user_id, id, topic_id, post_number, reply_to_post_number
                    FROM posts
                    WHERE raw LIKE '%[UUID removed to prevent this thread accidentally being marked as excluded]%' AND
                    reply_to_post_number IS NOT NULL AND
                    user_id IN  (
                            SELECT gu.user_id
                            FROM group_users gu
                            WHERE group_id IN(
                                    SELECT g.id
                                    FROM groups g
                                    WHERE g.name IN ('staff')
                            )
                    )
    
    
                                                                                    QUERY PLAN                                                                                 
    ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
     Nested Loop Semi Join  (cost=0.42..51579.57 rows=34 width=20) (actual time=3.641..1087.817 rows=4 loops=1)
       ->  Seq Scan on posts  (cost=0.00..51250.45 rows=34 width=20) (actual time=3.552..1087.591 rows=4 loops=1)
             Filter: ((reply_to_post_number IS NOT NULL) AND (raw ~~ '%[UUID removed to prevent this thread accidentally being marked as excluded]%'::text))
             Rows Removed by Filter: 420672
       ->  Nested Loop  (cost=0.42..9.67 rows=1 width=4) (actual time=0.041..0.041 rows=1 loops=4)
             ->  Seq Scan on groups g  (cost=0.00..1.21 rows=1 width=4) (actual time=0.011..0.011 rows=1 loops=4)
                   Filter: ((name)::text = 'staff'::text)
                   Rows Removed by Filter: 13
             ->  Index Only Scan using index_group_users_on_group_id_and_user_id on group_users gu  (cost=0.42..8.45 rows=1 width=8) (actual time=0.021..0.021 rows=1 loops=4)
                   Index Cond: ((group_id = g.id) AND (user_id = posts.user_id))
                   Heap Fetches: 4
     Total runtime: 1087.940 ms
    (12 rows)
    
    Elapsed: 1.16s
    Backup taken:  2015-05-20 03:55:28.64354
    
    [pjh@sofa ~]$ EXPLAIN="EXPLAIN ANALYZE" sql_tdwtf uuid2
    # Proposed change
            SELECT p.user_id, p.id, p.topic_id, p.post_number, p.reply_to_post_number
                    FROM posts p
                    JOIN posts bg /* badge granter */
                            ON bg.reply_to_post_number = p.post_number
                            AND bg.topic_id = p.topic_id
                    WHERE bg.raw LIKE '%[UUID removed to prevent this thread accidentally being marked as excluded]%' AND
                    bg.user_id IN  (
                            SELECT gu.user_id
                            FROM group_users gu
                            WHERE group_id IN(
                                    SELECT g.id
                                    FROM groups g
                                    WHERE g.name IN ('staff')
                            )
                    )
    
    
                                                                                       QUERY PLAN
    ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
     Nested Loop  (cost=0.84..52012.22 rows=1480 width=20) (actual time=3.915..1253.336 rows=4 loops=1)
       ->  Nested Loop Semi Join  (cost=0.42..51657.01 rows=42 width=8) (actual time=3.884..1253.215 rows=5 loops=1)
             ->  Seq Scan on posts bg  (cost=0.00..51250.45 rows=42 width=12) (actual time=3.786..1252.883 rows=5 loops=1)
                   Filter: (raw ~~ '%[UUID removed to prevent this thread accidentally being marked as excluded]%'::text)
                   Rows Removed by Filter: 420671
             ->  Nested Loop  (cost=0.42..9.67 rows=1 width=4) (actual time=0.042..0.042 rows=1 loops=5)
                   ->  Seq Scan on groups g  (cost=0.00..1.21 rows=1 width=4) (actual time=0.012..0.012 rows=1 loops=5)
                         Filter: ((name)::text = 'staff'::text)
                         Rows Removed by Filter: 13
                   ->  Index Only Scan using index_group_users_on_group_id_and_user_id on group_users gu  (cost=0.42..8.45 rows=1 width=8) (actual time=0.022..0.022 rows=1 loops=5)
                         Index Cond: ((group_id = g.id) AND (user_id = bg.user_id))
                         Heap Fetches: 5
       ->  Index Scan using index_posts_on_topic_id_and_post_number on posts p  (cost=0.42..8.45 rows=1 width=20) (actual time=0.014..0.016 rows=1 loops=5)
             Index Cond: ((topic_id = bg.topic_id) AND (post_number = bg.reply_to_post_number))
     Total runtime: 1253.514 ms
    (15 rows)
    
    Elapsed: 1.28s
    Backup taken:  2015-05-20 03:55:28.64354
    [pjh@sofa ~]$
    

  • Discourse touched me in a no-no place

    @PJH said:

    I know the columns have shifted - noticed after I'd posted and CBA to change in this one

    With change - same results:

    [pjh@sofa ~]$ sql_tdwtf uuid2
    # Proposed change
            SELECT bg.user_id, bg.id, p.topic_id, bg.post_number, bg.reply_to_post_number
                    FROM posts p
                    JOIN posts bg /* badge granter */
                            ON bg.reply_to_post_number = p.post_number
                            AND bg.topic_id = p.topic_id
                    WHERE bg.raw LIKE '%[UUID removed to prevent this thread accidentally being marked as excluded]%' AND
                    bg.user_id IN  (
                            SELECT gu.user_id
                            FROM group_users gu
                            WHERE group_id IN(
                                    SELECT g.id
                                    FROM groups g
                                    WHERE g.name IN ('staff')
                            )
                    )
    
    
     user_id |   id   | topic_id | post_number | reply_to_post_number
    ---------+--------+----------+-------------+----------------------
          18 | 247013 |     8513 |           4 |                    2
          18 | 186601 |     5668 |          24 |                   19
          18 | 260217 |     8913 |          88 |                   60
         603 | 410075 |     7333 |        1350 |                 1345
    (4 rows)
    
    Elapsed: 1.193s
    

  • ♿ (Parody)

    @PJH said:

    Not much change actually, unless I'm missing some subtlety in your proposal.

    The queries you posted are what would be run on a backfill, AIUI. My proposal fixes the case where the query is run after a post is made / edited and the :post_ids parameter is used. It removes the Seq Scan (full table scan) on the posts table.


  • Discourse touched me in a no-no place

    @boomzilla said:

    :post_ids parameter is used

    That was the subtlety. Artificially constructing one for this query:

    [pjh@sofa ~]$ sql_tdwtf uuid2
    # Proposed change
            SELECT bg.user_id, bg.id,  p.topic_id, bg.post_number, bg.reply_to_post_number
                    FROM posts p
                    JOIN posts bg /* badge granter */
                            ON bg.reply_to_post_number = p.post_number
                            AND bg.topic_id = p.topic_id
                    WHERE bg.raw LIKE '%[UUID removed to prevent this thread accidentally being marked as excluded]%' AND
                    p.id IN (247005, 186566, 260147, 410008) AND
                    bg.user_id IN  (
                            SELECT gu.user_id
                            FROM group_users gu
                            WHERE group_id IN(
                                    SELECT g.id
                                    FROM groups g
                                    WHERE g.name IN ('staff')
                            )
                    )
    
    
     user_id |   id   | topic_id | post_number | reply_to_post_number
    ---------+--------+----------+-------------+----------------------
          18 | 186601 |     5668 |          24 |                   19
          18 | 247013 |     8513 |           4 |                    2
          18 | 260217 |     8913 |          88 |                   60
         603 | 410075 |     7333 |        1350 |                 1345
    (4 rows)
    
    Elapsed: 0.023s
    Backup taken:  2015-05-20 03:55:28.64354
    

    It's a bit faster... 🚎


  • ♿ (Parody)

    The second query I posted (first / only in this topic) is also the full query. It doesn't use a CTE at all.


Log in to reply