Discourse SQL 1 minute query...


  • Discourse touched me in a no-no place

    Continuing the discussion from Sockbot needs lessons in etiquite:

    Background: I'm trying to come up with some new badges for reading (akin to the 2^n series for writing), and found that just counting publicly visible posts takes my query over a minute.

    Anyone able to offer any suggestions on how to speed up this query, or am I basically stuffed?:

    #SELECT count(1)
    +FROM posts p
    +JOIN post_timings pt ON
    +   pt.topic_id = p.topic_id AND
    +   pt.post_number = p.post_number
    +;
      count  
    ---------
     4796682
    (1 row)
    
    Time: 49197.787 ms
    

    Database is postgres, the two tables concerned:

    #\d posts
                                                Table "public.posts"
             Column          |            Type             |                     Modifiers
    -------------------------+-----------------------------+----------------------------------------------------
     id                      | integer                     | not null default nextval('posts_id_seq'::regclass)
     user_id                 | integer                     | 
     topic_id                | integer                     | not null
     post_number             | integer                     | not null
     raw                     | text                        | not null
     cooked                  | text                        | not null
     created_at              | timestamp without time zone | not null
     updated_at              | timestamp without time zone | not null
     reply_to_post_number    | integer                     | 
     reply_count             | integer                     | not null default 0
     quote_count             | integer                     | not null default 0
     deleted_at              | timestamp without time zone | 
     off_topic_count         | integer                     | not null default 0
     like_count              | integer                     | not null default 0
     incoming_link_count     | integer                     | not null default 0
     bookmark_count          | integer                     | not null default 0
     avg_time                | integer                     | 
     score                   | double precision            | 
     reads                   | integer                     | not null default 0
     post_type               | integer                     | not null default 1
     vote_count              | integer                     | not null default 0
     sort_order              | integer                     | 
     last_editor_id          | integer                     | 
     hidden                  | boolean                     | not null default false
     hidden_reason_id        | integer                     | 
     notify_moderators_count | integer                     | not null default 0
     spam_count              | integer                     | not null default 0
     illegal_count           | integer                     | not null default 0
     inappropriate_count     | integer                     | not null default 0
     last_version_at         | timestamp without time zone | not null
     user_deleted            | boolean                     | not null default false
     reply_to_user_id        | integer                     | 
     percent_rank            | double precision            | default 1.0
     notify_user_count       | integer                     | not null default 0
     like_score              | integer                     | not null default 0
     deleted_by_id           | integer                     | 
     edit_reason             | character varying(255)      | 
     word_count              | integer                     | 
     version                 | integer                     | not null default 1
     cook_method             | integer                     | not null default 1
     wiki                    | boolean                     | not null default false
     baked_at                | timestamp without time zone | 
     baked_version           | integer                     | 
     hidden_at               | timestamp without time zone | 
     self_edits              | integer                     | not null default 0
     reply_quoted            | boolean                     | not null default false
     via_email               | boolean                     | not null default false
    Indexes:
        "posts_pkey" PRIMARY KEY, btree (id)
        "index_posts_on_topic_id_and_post_number" UNIQUE, btree (topic_id, post_number)
        "idx_posts_created_at_topic_id" btree (created_at, topic_id) WHERE deleted_at IS NULL
        "idx_posts_user_id_deleted_at" btree (user_id) WHERE deleted_at IS NULL
        "index_posts_on_reply_to_post_number" btree (reply_to_post_number)
    
    #\d post_timings
        Table "public.post_timings"
       Column    |  Type   | Modifiers 
    -------------+---------+-----------
     topic_id    | integer | not null
     post_number | integer | not null
     user_id     | integer | not null
     msecs       | integer | not null
    Indexes:
        "post_timings_unique" UNIQUE, btree (topic_id, post_number, user_id)
        "index_post_timings_on_user_id" btree (user_id)
        "post_timings_summary" btree (topic_id, post_number)
    

    And the idiocy:

    #select count(1) from post_timings;
      count  
    ---------
     4805859
    (1 row)
    
    Time: 3498.096 ms
    #select count(1) from posts;
     count  
    --------
     105585
    (1 row)
    
    Time: 207.639 ms
    

    Short of trying to find a proxy for post_timings, I'm just making sure I haven't missed anything obvious...



  • Sounds like actual I/O may be the bottleneck here. Just a lot of data to deal with. A materialized view might be the way to go. Of course, you probably can't do that, and I'm not familiar enough with them in postgres to know if it would be a reasonable solution.
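    For reference, the postgres version (9.3+) would be roughly this (untested sketch; `post_read_counts` is an invented name):

    ```sql
    -- Precompute the expensive join once; refresh it on a schedule
    CREATE MATERIALIZED VIEW post_read_counts AS
    SELECT p.topic_id, p.post_number, count(*) AS reads
    FROM posts p
    JOIN post_timings pt ON pt.topic_id = p.topic_id
                        AND pt.post_number = p.post_number
    GROUP BY p.topic_id, p.post_number;

    -- Counting is then cheap, at the cost of staleness:
    REFRESH MATERIALIZED VIEW post_read_counts;
    SELECT sum(reads) FROM post_read_counts;
    ```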


  • mod

    Part of your problem is that there are multiple entries in the post_timings table for each post. It might speed things up if you add a post_timings.user_id condition to the query.

    Looking at your query results, it appears that there may be orphaned records in the post_timings table. Basically, records for posts that were read and later deleted.
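    Something like this, as a rough sketch (untested; pinning the timing row to the post's author means each post is counted at most once):

    ```sql
    -- Sketch: restrict post_timings by user_id as well, so the unique index
    -- (topic_id, post_number, user_id) fully covers the lookup
    SELECT count(1)
    FROM posts p
    JOIN post_timings pt ON
       pt.topic_id = p.topic_id AND
       pt.post_number = p.post_number AND
       pt.user_id = p.user_id
    ;
    ```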



  • @PJH said:

    select count(1) from post_timings;

    Disclaimer: I am not as familiar with postgres as I am with other databases.

    This is going to force a full table scan, which will be I/O bound. You need to get the query to use an index by introducing a WHERE clause. If you can, then you can count index nodes, which will be considerably faster.

    EDIT: After doing some research, it appears I am not correct:

    PostgreSQL will still need to read the resulting rows to verify that they exist; other database systems may only need to reference the index in this situation.

    Taken from: http://wiki.postgresql.org/wiki/Slow_Counting



  • Count is slow in postgres. Things might improve once you add more conditions.
    Do you have any idea what other conditions you might use?


  • Discourse touched me in a no-no place

    @abarker said:

    Basically, records for posts that were read and later deleted.

    There is no such thing as a deleted post; they're simply made inaccessible. I suspect bugs.

    Away from PC atm, but I'll see if I can find out the cause of them.

    @cartman82 said:

    Do you have any idea what other conditions you might use?

    No more on post_timings. I'll need to add more joins to replicate the badge_posts table, but I narrowed down the bulk of the time to the part of the query in the OP here. See posts in the linked topic for more background.



  • SELECT count(1)
    FROM posts p
    JOIN post_timings pt ON
       pt.topic_id = p.topic_id AND
       pt.post_number = p.post_number

    -->

    SELECT count(1)
    FROM
    (select p.topic_id, p.post_number from posts as p inner join post_timings as pt on pt.topic_id = p.topic_id and pt.post_number = p.post_number) as a



  • All my nontrivial experience is with oracle; is there an equivalent of explain plan in postgres? It may not be using indexes to speed up the join.
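    For reference, the postgres equivalent is EXPLAIN; EXPLAIN ANALYZE actually runs the query and reports real row counts and timings:

    ```sql
    -- EXPLAIN shows the estimated plan; ANALYZE executes it and adds actual timings
    EXPLAIN ANALYZE
    SELECT count(1)
    FROM posts p
    JOIN post_timings pt ON pt.topic_id = p.topic_id
                        AND pt.post_number = p.post_number;
    ```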


  • Discourse touched me in a no-no place

    I'll post EXPLAIN tomorrow.

    Thinking about this, the sheer size of post_timings is a major problem here. I'm going to have to find a suitable proxy for getting public posts.

    Paging @Sam...


  • sockdevs

    Request for @PJH

    I don't suppose we could have a new bronze level badge that's awarded to the posters of exact powers of 2 in one of our "spam" topics? (i'm thinking the likes thread and the questions thread)

    I propose the name "Too much time on my hands"


  • Discourse touched me in a no-no place

    Powers of 2? Too conventional. Try powers of 7:
    7, 49, 343, 2401, 16807, …


  • sockdevs

    but that would make it too rare!

    also i really like the name!

    and maybe so does @pjh?


  • sockdevs

    @Arantor said:

    Didn't @PJH say - only today - that the counts for TL3 were based on things that were public (i.e. didn't include flagging and by extension PM topics)?

    TL3 has post read requirements, maybe that query has found a way to handle the massive amounts of reads that we have?


  • Discourse touched me in a no-no place

    @PJH said:

    I'll post EXPLAIN tomorrow.

    EXPLAIN (ANALYZE, VERBOSE)
    SELECT count(1)
    FROM posts p
    JOIN post_timings pt ON
       pt.topic_id = p.topic_id AND
       pt.post_number = p.post_number
    ;
                                                      QUERY PLAN
    ------------------------------------------------------------------------------------------------------------------------
     Aggregate  (cost=893684.14..893684.15 rows=1 width=0) (actual time=53583.520..53583.520 rows=1 loops=1)
       Output: count(1)
       ->  Nested Loop  (cost=0.00..674005.99 rows=87871262 width=0) (actual time=0.169..50128.018 rows=4796682 loops=1)
             ->  Seq Scan on public.posts p  (cost=0.00..11641.85 rows=105585 width=8) (actual time=0.031..474.069 rows=105585 loops=1)
                   Output: p.id, p.user_id, p.topic_id, p.post_number, p.raw, p.cooked, p.created_at, p.updated_at, p.reply_to_post_number, p.reply_count, p.quote_count, p.deleted_at, p.off_topic_count, p.like_count, p.incoming_link_count, p.bookmark_count, p.avg_time, p.score, p.reads, p.post_type, p.vote_count, p.sort_order, p.last_editor_id, p.hidden, p.hidden_reason_id, p.notify_moderators_count, p.spam_count, p.illegal_count, p.inappropriate_count, p.last_version_at, p.user_deleted, p.reply_to_user_id, p.percent_rank, p.notify_user_count, p.like_score, p.deleted_by_id, p.edit_reason, p.word_count, p.version, p.cook_method, p.wiki, p.baked_at, p.baked_version, p.hidden_at, p.self_edits, p.reply_quoted, p.via_email
             ->  Index Scan using post_timings_summary on public.post_timings pt  (cost=0.00..6.26 rows=1 width=8) (actual time=0.047..0.433 rows=45 loops=105585)
                   Output: pt.topic_id, pt.post_number, pt.user_id, pt.msecs
                   Index Cond: ((pt.topic_id = p.topic_id) AND (pt.post_number = p.post_number))
     Total runtime: 53584.016 ms
    (9 rows)
    
    Time: 53621.326 ms
    
    

  • Discourse touched me in a no-no place

    @accalia said:

    TL3 has post read requirements, maybe that query has found a way to handle the massive amounts of reads that we have?

    Hmm... that's a point...



  • Hm, seems to be doing a nested loop join. I wouldn't expect that: the join condition is on topic and post IDs, and both sides have btree indexes on those two columns (in the same order). I'd expect it to pick an index join in that case.

    Have statistics been gathered on these tables/indexes? Assuming that applies to postgres. The plan suggests it thinks the post_timings_summary index is tiny.
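    If not, refreshing them is a one-liner in postgres (ANALYZE recomputes the planner statistics):

    ```sql
    -- Refresh planner statistics so row-count estimates are realistic
    ANALYZE posts;
    ANALYZE post_timings;
    ```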


  • Discourse touched me in a no-no place

    OK, lots of internal consistency here...

    [postgres@dellxps-pjh ~]$ sql_tdwtf posts_days
    # Posts read and days visited
    SELECT row_number() OVER (ORDER BY sum(posts_read) DESC) as n,
       sum(posts_read) as "Posts read",
       count(*) as "Days visited",
       u.username as "User id"
    FROM user_visits v
    JOIN users u on u.id = v.user_id
    GROUP BY u.username
    ORDER BY sum(posts_read) DESC
    LIMIT 25
    
     n  | Posts read | Days visited |     User id     
    ----+------------+--------------+-----------------
      1 |     101605 |          106 | VinDuv
      2 |     101568 |          138 | ben_lubar
      3 |     101005 |            8 | ben_lubot
      4 |      98009 |          137 | PJH
      5 |      97499 |          136 | dkf
      6 |      97341 |          134 | Keith
      7 |      97286 |           47 | accalia
      8 |      94076 |          135 | ChaosTheEternal
      9 |      92930 |          130 | Maciejasjmj
     10 |      92877 |           17 | sockbot
     11 |      91209 |          124 | darkmatter
     12 |      90628 |          138 | blakeyrat
     13 |      89868 |           97 | Yamikuronue
     14 |      88346 |          113 | Arantor
     15 |      87455 |           78 | Spencer
     16 |      86699 |           94 | M_Adams
     17 |      83613 |          129 | Luhmann
     18 |      83119 |          122 | antiquarian
     19 |      82497 |          116 | HardwareGeek
     20 |      81140 |          118 | mott555
     21 |      81120 |          129 | boomzilla
     22 |      81006 |          109 | cartman82
     23 |      77616 |           94 | aliceif
     24 |      72656 |          116 | jaloopa
     25 |      72275 |           45 | tarunik
    (25 rows)
    
    Elapsed: 0.032s
    Backup taken:  2014-10-03 08:17:40.262149
    
    [postgres@dellxps-pjh ~]$ sql_tdwtf posts_read3
    # Number of publicly visible posts read
    SELECT us.posts_read_count, u.username
    FROM user_stats us
    JOIN users u on u.id=us.user_id
    ORDER BY us.posts_read_count DESC
    LIMIT 25
    
     posts_read_count |    username     
    ------------------+-----------------
               101011 | PJH
               100523 | accalia
               100295 | ben_lubot
                99778 | boomzilla
                98381 | darkmatter
                96547 | sockbot
                93485 | ChaosTheEternal
                93483 | HardwareGeek
                90534 | Luhmann
                89244 | Arantor
                85636 | VinDuv
                83549 | Spencer
                83030 | Maciejasjmj
                76173 | abarker
                75426 | M_Adams
                73939 | Keith
                73362 | dkf
                70778 | blakeyrat
                70665 | ben_lubar
                70074 | mott555
                68360 | aliceif
                68048 | antiquarian
                60780 | Yamikuronue
                57964 | loopback0
                57828 | Onyx
    (25 rows)
    
    Elapsed: 0.009s
    Backup taken:  2014-10-03 08:17:40.262149
    

    :sigh:


  • sockdevs

    @PJH said:

    OK, lots of internal consistency here...

    this is discourse we're talking about....

    what a rabbit hole we stepped down here isn't it?


  • Discourse touched me in a no-no place

    @accalia said:

    TL3 has post read requirements, maybe that query has found a way to handle the massive amounts of reads that we have?

    Having had a little dig, I'm no longer sure that the TL3 requirements actually do ignore non-public posts/topics....


  • Discourse touched me in a no-no place

    @accalia said:

    what a rabbit hole we stepped down here isn't it?

    Quite. I think I'm going to have to knock this one on the head.


  • sockdevs

    @PJH said:

    Having had a little dig, I'm not sure about the fact that the TL3 requirements do actually ignore non-public posts/topics....

    i just looked at the requirements.

    if its put together the way i think its put together it can use that troublesome table because a lot of things will be filtered out. one would select the posts that were made within the last 100 days, and then join in the timings to see how many of them were read.

    or it could be doing something else that means it really doesn't work that way.

    discourse.

    @PJH said:

    Quite. I think I'm going to have to knock this one on the head.

    want a hitting stick?


  • Discourse touched me in a no-no place

    @accalia said:

    i just looked at the requirements.

    I've been grubbing around the source. Counts seem to be using the user_stats and user_visits tables. Which, as you can see above, doesn't appear to be doing any filtering.


  • sockdevs

    @PJH said:

    I've been grubbing around the source. Counts seem to be using the user_stats and user_visits tables. Which, as you can see above, doesn't appear to be doing any filtering.

    wouldn't be the first time the code didn't match the documentation

    :laughing:


  • :belt_onion:

    @PJH said:

    91209 | 124 | darkmatter

    @PJH said:

    98381 | darkmatter

    sweet i read 7000 posts apparently on a day i didn't visit the site since it didn't get tracked into user_visits :trollface:



  • Did you try my variant?



  • I apparently read 30903 more posts on days I visited than in total.


  • Discourse touched me in a no-no place

    @Matches said:

    Did you try my variant?

    [postgres@sofa ~]$ sql_tdwtf posts_read4 
    # @Matches variant
    SELECT count(1)
    FROM
    (select p.topic_id, p.post_number from posts as p inner join post_timings as pt on pt.topic_id = p.topic_id and pt.post_number = p.post_number) as a
    LIMIT 25
    
      count  
    ---------
     4542370
    (1 row)
    
    Elapsed: 125.788s
    Backup taken:  2014-10-01 14:05:50.514682
    
    

  • Discourse touched me in a no-no place

    Yay - another formatting bug :)



  • Yes! Slower than the original. Crushed it.

    That formatting bug is old news


  • Discourse touched me in a no-no place

    Yup - back here..

    [postgres@sofa ~]$ sql_tdwtf posts_read5
    # Another attempt
            WITH exclusions AS ( /* Which categories to exclude from counters */
            SELECT user_id, id, topic_id, post_number
            FROM posts
            WHERE raw LIKE '%[UUID removed to prevent this thread accidentally being marked as excluded]%' AND
            user_id IN  (
                    SELECT gu.user_id
                    FROM group_users gu
                    WHERE group_id IN(
                            SELECT g.id
                            FROM groups g
                            WHERE g.name IN ('admins')
                    )
            )
    ),
    posts_read AS (
            SELECT u.username, u.id user_id, count(*), log(2, count(*)) ln2, log(3.0, count(*)) ln3
            FROM post_timings pt
            JOIN users u on u.id=pt.user_id
            WHERE pt.topic_id NOT IN (
                    SELECT topic_id 
                    FROM exclusions
            )
            GROUP BY u.username, u.id
    )
    SELECT * from posts_read
    WHERE ln3 >= 10 and ln3 <11
    
    
        username     | user_id | count  |         ln2         |         ln3         
    -----------------+---------+--------+---------------------+---------------------
     PleegWat        |     997 |  81622 | 16.3166704412203324 | 10.2946728605858280
     hungrier        |     681 |  91820 | 16.4865208115997284 | 10.4018365129133211
     chubertdev      |      43 | 104475 | 16.6727982330971338 | 10.5193644805546056
     Keith           |     617 | 121576 | 16.8914989326727485 | 10.6573492590437535
     boomzilla       |      18 | 165544 | 17.3368551968774856 | 10.9383377770699532
     Zoidberg        |    1407 | 156347 | 17.2543920114058973 | 10.8863092997816466
     sockbot         |    1223 | 163990 | 17.3232483173882633 | 10.9297527919469417
     Spencer         |     697 | 104947 | 16.6793014009812850 | 10.5234675226651869
     loopback0       |     606 | 134806 | 17.0405251844403095 | 10.7513743553471388
     ijij            |     958 |  65278 | 15.9943092359660440 | 10.0912855847937418
     abarker         |     603 | 104756 | 16.6766733524649746 | 10.5218094086624174
     Zecc            |      30 |  81605 | 16.3163699294694114 | 10.2944832587808741
     blakeyrat       |      17 | 122657 | 16.9042700448762408 | 10.6654069337191363
     reverendryan    |     565 |  61113 | 15.8991916833348286 | 10.0312730907518090
     accalia         |     671 | 166889 | 17.3485293412497326 | 10.9457033421018926
     ookami          |     616 |  85153 | 16.3777697375082411 | 10.3332222245361477
     aliceif         |     922 | 117233 | 16.8390192066115114 | 10.6242382384124396
     Buddy           |     676 |  93011 | 16.5051137270281438 | 10.4135673364627462
     mott555         |      69 | 116734 | 16.8328652967271107 | 10.6203555535655737
     Luhmann         |     628 | 144016 | 17.1358695766485217 | 10.8115299692274865
     OffByOne        |    1163 |  95534 | 16.5437266509361405 | 10.4379293790286919
     Onyx            |     579 |  79690 | 16.2821110767192360 | 10.2728683292575651
     another_sam     |     914 |  66931 | 16.0303869489738806 | 10.1140480873711979
     ben_lubar       |     558 | 108956 | 16.7333861195542528 | 10.5575911808264111
     Kuro            |     785 |  80357 | 16.2941360839154187 | 10.2804552640845477
     CodingHorrorBot |    1253 |  79045 | 16.2703865871888297 | 10.2654709999673947
     ben_lubot       |    1318 |  78217 | 16.2551945823535843 | 10.2558859121004369
     Yamikuronue     |     788 | 143019 | 17.1258472954608542 | 10.8052066138275276
     antiquarian     |     598 | 145035 | 17.1460415693892198 | 10.8179477821007058
     PJH             |      20 | 166822 | 17.3479500341904008 | 10.9453378400417062
     HardwareGeek    |     123 | 147903 | 17.1742917901680985 | 10.8357716869350630
     riking          |     561 |  65304 | 15.9948837419708348 | 10.0916480577257698
     Maciejasjmj     |     261 | 142144 | 17.1169936775469807 | 10.7996206028589115
     ChaosTheEternal |     589 | 143121 | 17.1268758472215564 | 10.8058555577364429
     RaceProUK       |     294 | 151514 | 17.2090915803291070 | 10.8577279099656865
     Arantor         |     762 | 109022 | 16.7342597662122499 | 10.5581423904970498
     cvi             |     588 |  68534 | 16.0645322721776405 | 10.1355913877257640
     antipattern     |     619 |  76693 | 16.2268072841857539 | 10.2379755210628482
     Polygeekery     |     824 |  70057 | 16.0962415895308388 | 10.1555977395093365
     cartman82       |     817 |  90238 | 16.4614474723076888 | 10.3860169971325810
     Jaloopa         |     587 | 131139 | 17.0007372731871906 | 10.7262709783050855
     M_Adams         |     622 |  72132 | 16.1383518054427747 | 10.1821663276574951
     VinDuv          |     602 | 165061 | 17.3326397603262237 | 10.9356781327254694
     tarunik         |    1096 | 134175 | 17.0337563625477891 | 10.7471037042185218
     FrostCat        |     110 | 109448 | 16.7398860660827176 | 10.5616921904878431
     jaming          |    1070 |  62798 | 15.9384309922327776 | 10.0560303382451062
     NetBot          |    1862 | 154844 | 17.2404559560096026 | 10.8775166277847042
     JBert           |    1004 |  63237 | 15.9484813065786794 | 10.0623713805986817
     izzion          |    1094 |  61037 | 15.8973964342598232 | 10.0301404146953164
     darkmatter      |     564 | 163208 | 17.3163522503580768 | 10.9254018580749738
     locallunatic    |      83 | 102943 | 16.6514862062172384 | 10.5059180886871649
     dkf             |      33 | 156285 | 17.2538197917117665 | 10.8859482693510399
    (52 rows)
    
    Elapsed: 17.084s
    Backup taken:  2015-02-13 03:59:31.196191
    [postgres@sofa ~]$
    

    Better than a minute, but this query is still going to have to be run for each of the badges (that's at least 10, and coming up to 11, times, making it about 3 minutes just for those badges). The :backfill queries (of which these will, currently, necessarily be part) already take 2 minutes, and the query doesn't exclude PMs (though I'm not sure that's really a problem...)

    [postgres@sofa ~]$ sql_tdwtf posts_read5
    # Another attempt
            WITH exclusions AS ( /* Which categories to exclude from counters */
            SELECT user_id, id, topic_id, post_number
            FROM posts
            WHERE raw LIKE '%[UUID removed to prevent this thread accidentally being marked as excluded]%' AND
            user_id IN  (
                    SELECT gu.user_id
                    FROM group_users gu
                    WHERE group_id IN(
                            SELECT g.id
                            FROM groups g
                            WHERE g.name IN ('admins')
                    )
            )
    ),
    posts_read AS (
            SELECT u.username, u.id user_id, count(*), log(2, count(*)) ln2, log(3.0, count(*)) ln3
            FROM post_timings pt
            JOIN users u on u.id=pt.user_id
            WHERE pt.topic_id NOT IN (
                    SELECT topic_id
                    FROM exclusions
            )
            GROUP BY u.username, u.id
    )
    SELECT * from posts_read
    WHERE ln3 >= 10 and ln3 <11
    
    
                                                                                         QUERY PLAN
    ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
     CTE Scan on posts_read  (cost=402912.02..402956.07 rows=9 width=124) (actual time=72608.869..73218.880 rows=52 loops=1)
       Filter: ((ln3 >= 10::numeric) AND (ln3 < 11::numeric))
       Rows Removed by Filter: 1288
       CTE exclusions
         ->  Nested Loop Semi Join  (cost=0.28..26400.83 rows=20 width=16) (actual time=22.527..797.612 rows=7 loops=1)
               ->  Seq Scan on posts  (cost=0.00..26181.46 rows=23 width=16) (actual time=22.438..797.083 rows=12 loops=1)
                     Filter: (raw ~~ '%[UUID removed to prevent this thread accidentally being marked as excluded]%'::text)
                     Rows Removed by Filter: 229465
               ->  Nested Loop  (cost=0.28..9.53 rows=1 width=4) (actual time=0.032..0.032 rows=1 loops=12)
                     ->  Seq Scan on groups g  (cost=0.00..1.21 rows=1 width=4) (actual time=0.008..0.009 rows=1 loops=12)
                           Filter: ((name)::text = 'admins'::text)
                           Rows Removed by Filter: 11
                     ->  Index Only Scan using index_group_users_on_group_id_and_user_id on group_users gu  (cost=0.28..8.30 rows=1 width=8) (actual time=0.014..0.014 rows=1 loops=12)
                           Index Cond: ((group_id = g.id) AND (user_id = posts.user_id))
                           Heap Fetches: 7
       CTE posts_read
         ->  HashAggregate  (cost=376475.95..376511.19 rows=1762 width=13) (actual time=72605.438..73214.686 rows=1340 loops=1)
               ->  Hash Join  (cost=107.10..302166.23 rows=5944778 width=13) (actual time=808.886..52969.495 rows=10074744 loops=1)
                     Hash Cond: (pt.user_id = u.id)
                     ->  Seq Scan on post_timings pt  (cost=0.45..212887.91 rows=5944778 width=4) (actual time=797.745..20098.182 rows=10074744 loops=1)
                           Filter: (NOT (hashed SubPlan 2))
                           Rows Removed by Filter: 1814792
                           SubPlan 2
                             ->  CTE Scan on exclusions  (cost=0.00..0.40 rows=20 width=4) (actual time=22.533..797.663 rows=7 loops=1)
                     ->  Hash  (cost=84.62..84.62 rows=1762 width=13) (actual time=11.116..11.116 rows=1762 loops=1)
                           Buckets: 1024  Batches: 1  Memory Usage: 60kB
                           ->  Seq Scan on users u  (cost=0.00..84.62 rows=1762 width=13) (actual time=0.008..8.554 rows=1762 loops=1)
     Total runtime: 73219.277 ms
    (28 rows)
    
    Elapsed: 73.259s
    Backup taken:  2015-02-13 03:59:31.196191
    [postgres@sofa ~]$
    

    Paging @hhaamu if you're still around, and interested in prodding this one.



  • I'll take a look at it, but my brains are all out of juice right now; might take a day or two.



  • @PJH said:

    Better than a minute, but this query is still going to have to be run for each of the badges (so, that's at least 10, and coming up to 11 times, making it 3 minutes just for those badges

    Seems like this is probably I/O bound. Probably acceptable if it's a daily run? I dunno...



  • On short notice: if all the queries start with those two CTEs, you would benefit most by shoving their results into temporary tables at the start of a run and querying those instead.

    CREATE TEMPORARY TABLE exclusions AS SELECT ... etc.
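    Roughly like this (untested sketch; `tmp_exclusions` is an invented name, and the UUID filter stays elided as in the original query):

    ```sql
    -- Materialise the exclusions CTE once per run, then reuse it
    CREATE TEMPORARY TABLE tmp_exclusions AS
    SELECT user_id, id, topic_id, post_number
    FROM posts
    WHERE raw LIKE '%...%'  -- the marker UUID, elided here
      AND user_id IN (
            SELECT gu.user_id
            FROM group_users gu
            JOIN groups g ON g.id = gu.group_id
            WHERE g.name = 'admins'
      );

    ANALYZE tmp_exclusions;  -- give the planner stats on the temp table
    ```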


  • Discourse touched me in a no-no place

    I know...

    Sadly each query is run independently (think sandboxed), and I'm not allowed to modify the db environment like that.

    I wish...


  • Discourse touched me in a no-no place



  • Frankly, I'm not at all surprised that reading ten million rows from post_timings (takes 19 seconds), joining them against users (takes 52 seconds), and finally aggregating them (takes 20 seconds) takes that long.

    One flaw you have is doing the join against users before aggregating. The grouping criteria are all in the post_timings table. Move the join a bit further out:

    WITH posts_read AS (
    	SELECT u.username, u.id AS user_id, agg.count, agg.ln2, agg.ln3
    	FROM (
    		SELECT user_id, count(*), log(2, count(*)) AS ln2, log(3.0, count(*)) AS ln3
    		FROM post_timings pt
    		WHERE topic_id NOT IN (
    			SELECT topic_id FROM exclusions)
    		GROUP BY user_id
    	) agg
    	JOIN users u ON u.id = agg.user_id
    )
    

    While I'm at it, let's convert the NOT IN () into a left anti-join. (I doubt there's performance gain, but why not.)

    WITH posts_read AS (
    	SELECT u.username, u.id AS user_id, agg.count, agg.ln2, agg.ln3
    	FROM (
    		SELECT user_id, count(*), log(2, count(*)) AS ln2, log(3.0, count(*)) AS ln3
    		FROM post_timings pt
    		LEFT JOIN exclusions e ON pt.topic_id = e.topic_id
    		WHERE e.topic_id IS NULL
    		GROUP BY user_id
    	) agg
    	JOIN users u ON u.id = agg.user_id
    )

  • :belt_onion:

    Depending on the DB, left anti-joins performance can suck hard... on those I've had more luck with not exists() clauses like:
    [code]
    not exists (select 1 from exclusions e where e.topic_id = pt.topic_id)
    [/code]

    though that method probably won't be any better here either, given the table stats.
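
    In context, the whole aggregate with that predicate would look something like this (same shape as the CTE above, just swapping the exclusion test):
    [code]
    WITH posts_read AS (
    	SELECT u.username, u.id AS user_id, agg.count, agg.ln2, agg.ln3
    	FROM (
    		SELECT user_id, count(*), log(2, count(*)) AS ln2, log(3.0, count(*)) AS ln3
    		FROM post_timings pt
    		WHERE NOT EXISTS (
    			SELECT 1 FROM exclusions e WHERE e.topic_id = pt.topic_id)
    		GROUP BY user_id
    	) agg
    	JOIN users u ON u.id = agg.user_id
    )
    [/code]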



  • @PJH said:

    Sadly each query is run independantly (think sandboxed) and I'm not allowed to modify the db environment like that.

    temp tables are per-connection, so if you're not tearing down and reconnecting between each run, it should be doable. (Of course it depends on what you are 'allowed' to do.)


  • :belt_onion:

    Also, exclusions would need an index on topic_id for it to work optimally.



  • @darkmatter said:

    Also, exclusions would need an index on topic_id for it to work optimally.

    That should be pretty small, so I'd think an index would be useless. Gotta be less than 10 topics with the magic UUID.


  • sockdevs

    @boomzilla said:

    That should be pretty small, so I'd think an index would be useless. Gotta be less than 10 topics with the magic UUID.

    and it has to be matched against at least 60k posts. that's where the index savings would be (unless i very much misunderstand postgres..... which is quite possible)



  • Seven topics. Overhead from navigating a binary tree is way higher than just matching against those seven numbers over and over again.
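
    One way to check is to just ask the planner; assuming a psql session against the live data, something like:

        EXPLAIN ANALYZE
        SELECT count(*)
        FROM post_timings pt
        WHERE NOT EXISTS (
        	SELECT 1 FROM exclusions e WHERE e.topic_id = pt.topic_id);
        -- with only ~7 rows in exclusions, expect a Seq Scan on it even if an
        -- index exists; the planner knows the index isn't worth the trip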



  • @accalia said:

    and it has to be matched against at least 60k posts. that's where the index savings would be (unless i very much misunderstand postgres..... which is quite possible)

    That seems to be handled. @PJH looked at that and it turns out it's really fast. I think there's some kind of full text search thingy going on. But finding the UUID itself isn't related to indexing topic_id on the exclusions stuff, which is what @darkmatter was talking about.

    @hhaamu said:

    Overhead from navigating a binary tree is way higher than just matching against those seven numbers over and over again.

    Exactly. I'd assume the DB will ignore an index, so you're just wasting your time.



  • @boomzilla said:

    But finding the UUID itself isn't related to indexing topic_id on the exclusions stuff, which is what @darkmatter was talking about.

    Beat me to it.

    @boomzilla said:

    I think there's some kind of full text search thingy going on.

    If it's like Oracle, then a LIKE with a wildcard at either end means any index on that field (raw, here) will be ignored anyway, so it's a full scan.



  • @loopback0 said:

    If it's like Oracle then a like with the wildcard at either end is going to mean that any index on that field (raw, here) will be ignored anyway so it's a full scan.

    No, it's definitely not something horrible like that. I mean...@PJH actually tested it and it was fractions of a second or something, so there's something else going on.



  • Fair point.

    At some point, though, the answer may well end up being that we need more hamsters in the server.



  • Something has to eat the cooties.


  • :belt_onion:

    @boomzilla said:

    That should be pretty small, so I'd think an index would be useless. Gotta be less than 10 topics with the magic UUID.

    That's true, with that few rows, an index is a waste and won't even be used.

    I am interested in whether the left anti-join or the NOT EXISTS method runs faster. I'm almost certain that NOT EXISTS is faster than NOT IN in almost every scenario, especially where you're searching key fields like this; so if @hhaamu thinks the NOT IN and the left anti-join would perform about the same, then surely a NOT EXISTS is better.



  • @darkmatter said:

    I am interested in whether the left anti-join or the NOT EXISTS method runs faster. I'm almost certain that NOT EXISTS is faster than NOT IN in almost every scenario, especially where you're searching key fields like this; so if @hhaamu thinks the NOT IN and the left anti-join would perform about the same, then surely a NOT EXISTS is better.

    Eh...I've never actually touched postgres, so I couldn't say.


  • :belt_onion:

    But my experience is mainly at the macro level - as in, search a 2 billion row table for items that are not in a 100 million row table. Left anti-joining to do that is server homicide. Not In a 100 million row selection is also going to murder the server. Not Exists on an indexed column / set of columns is the only sane way to go.

    But where you have a full table scan for the NOT EXISTS and are using the same full table for the NOT IN... there I couldn't tell you which would be faster.
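
    For the record, the three equivalent forms under discussion; the honest way to pick a winner on a given dataset is to run each under EXPLAIN ANALYZE and compare:

        -- 1. NOT IN (beware: yields no rows at all if the subquery returns a NULL)
        WHERE pt.topic_id NOT IN (SELECT topic_id FROM exclusions)

        -- 2. left anti-join
        LEFT JOIN exclusions e ON pt.topic_id = e.topic_id
        WHERE e.topic_id IS NULL

        -- 3. NOT EXISTS (NULL-safe; postgres usually plans it as an anti-join anyway)
        WHERE NOT EXISTS (SELECT 1 FROM exclusions e WHERE e.topic_id = pt.topic_id)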

