Discourse SQL 1 minute query...
-
I always test stuff like this, because I have a difficult time predicting how the DB will react. So... I'm interested in how it comes out, but honestly, I think this query is I/O bound, looking up all those posts.
-
hm... I have DBs with tens of millions of rows, and keys in tables of 10 rows or fewer... I could just simulate PJH's query over those to see performance.
-
SQL Server can get me a result over very similarly sized tables using NOT IN in 11 seconds. Using NOT EXISTS takes 10 seconds.
But about 7 of those seconds are I/O; there are ~55,000 results in the particular DB I'm querying. So it's more like a 3s vs 4s runtime for the actual query, with 7s of I/O following.
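For the interested, the two shapes being compared look roughly like this. This is only a sketch with made-up table and column names (timings, exclusions), not the actual query against that DB:

    -- Sketch only; names are invented, the real query is against the poster's own DB.

    -- NOT IN form (the ~11s variant above)
    SELECT t.user_id, COUNT(*) AS c
    FROM timings t
    WHERE t.topic_id NOT IN (SELECT e.topic_id FROM exclusions e)
    GROUP BY t.user_id;

    -- NOT EXISTS form (the ~10s variant above)
    SELECT t.user_id, COUNT(*) AS c
    FROM timings t
    WHERE NOT EXISTS (SELECT 1 FROM exclusions e WHERE e.topic_id = t.topic_id)
    GROUP BY t.user_id;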
I don't have a Postgres DB to test on though.
-
I don't have a Postgres DB to test on though.
Yeah. For this sort of optimization, I don't see the point, because it seems very unlikely to transfer.
-
I don't have a Postgres DB to test on though.
I have a small Discourse dump (not from this forum), but I don't think it's big enough for results to be relevant.
-
My previous query as reference, 15.148s: http://pastebin.com/y1EPVSJw
One flaw you have is doing the join against users before aggregating. The aggregation criteria are all in the post_timings table. Move the join a bit further out:
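The pastebin below has the actual rewritten query; as a rough illustration only (column names assumed from the Discourse schema used elsewhere in this thread), the reordering amounts to aggregating post_timings first and only then joining users onto the already-small result:

    -- Illustrative sketch, not the pastebin contents: aggregate first, join later.
    SELECT u.id AS user_id, u.username, foo.c
    FROM (
        SELECT pt.user_id, COUNT(*) AS c
        FROM post_timings pt
        GROUP BY pt.user_id
    ) foo
    JOIN users u ON u.id = foo.user_id;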
Down to 8.988 seconds: http://pastebin.com/RDND64m9
While I'm at it, let's convert the NOT IN () into a left anti-join. (I doubt there's performance gain, but why not.)
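Again, the pastebin has the real query; the rewrite replaces the NOT IN subquery with a LEFT JOIN that keeps only unmatched rows. Illustrative only, assuming exclusions.topic_id is never NULL (exclusions stands for the CTE shown later in this thread):

    -- Illustrative left anti-join: keep only post_timings rows with no matching exclusion.
    SELECT pt.user_id, COUNT(*) AS c
    FROM post_timings pt
    LEFT JOIN exclusions e ON e.topic_id = pt.topic_id
    WHERE e.topic_id IS NULL
    GROUP BY pt.user_id;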
Comparable at 8.839: http://pastebin.com/UHzgxj2w
temp tables are per-connection, so if you're not tearing down and reconnecting between each run, it should be doable. (Of course it depends on what you are 'allowed' to do.)
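For illustration only (nothing from the thread; the table name cached_exclusions and the marker are invented), the session-scoped idea would look something like this:

    -- Sketch: a temp table is private to the current session/connection and is
    -- dropped when that connection closes, so it only helps if the badge queries
    -- share a connection between runs. (IF NOT EXISTS on CREATE TABLE ... AS
    -- requires PostgreSQL 9.5 or later.)
    CREATE TEMP TABLE IF NOT EXISTS cached_exclusions AS
    SELECT id, topic_id
    FROM posts
    WHERE raw LIKE '%<exclusion marker>%';   -- placeholder for the real UUID marker

    -- Later queries on the same connection could then anti-join against it:
    --   LEFT JOIN cached_exclusions e ON e.topic_id = pt.topic_id
    --   WHERE e.topic_id IS NULL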
I've no idea, but I suspect a new connection. This is where it happens, I think:
With post_clause being the badge query itself.
and it has to be matched against at least 60k posts. that's where the index savings would be (unless i very much misunderstand postgres..... which is quite possible)
The part of the query doing the search for the UUID is surprisingly fast:

[postgres@sofa ~]$ sql_tdwtf exclusions
# Excluded posts
WITH exclusions AS (
    /* Which categories to exclude from counters */
    SELECT user_id, id, topic_id, post_number
    FROM posts
    WHERE raw LIKE '%[UUID removed to prevent this thread accidentally being marked as excluded]%'
      AND user_id IN (
        SELECT gu.user_id
        FROM group_users gu
        WHERE group_id IN (
            SELECT g.id
            FROM groups g
            WHERE g.name IN ('admins')
        )
      )
)
SELECT * FROM exclusions

 user_id |   id   | topic_id | post_number
---------+--------+----------+-------------
      20 | 136772 |     2929 |          17
      20 | 203068 |     6979 |        1186
      20 | 131170 |     3125 |        4585
      20 | 131125 |     1673 |        3632
      20 | 131276 |     4292 |           1
      20 | 131160 |     1673 |        3634
      20 | 131167 |     1000 |       27836
(7 rows)

Elapsed: 0.575s
Backup taken: 2015-02-14 03:59:29.758072
-
Comparable at 8.839: http://pastebin.com/UHzgxj2w
EXPLAIN ANALYZE for the interested: http://pastebin.com/FcL3xBgJ
-
Moving that JOIN seems to have been a good win.
-
well then....
of course @pjh might want to exclude bots from getting that badger. ;-)
-
I went with a simpler query.
[postgres@sofa ~]$ sql_tdwtf posts_read8
# SELECT pt.user_id, count(*) c, CURRENT_DATE as granted_at
  FROM post_timings pt
  GROUP BY pt.user_id
  HAVING count(*) >= pow(3, 11)
     AND count(*) < pow(3, 12)
  ORDER BY c DESC

 user_id |   c    | granted_at
---------+--------+------------
     671 | 219634 | 2015-02-14
      20 | 218481 | 2015-02-14
     602 | 218154 | 2015-02-14
      18 | 217475 | 2015-02-14
     564 | 216551 | 2015-02-14
    1223 | 216301 | 2015-02-14
    1862 | 208096 | 2015-02-14
    1407 | 207994 | 2015-02-14
     294 | 203474 | 2015-02-14
     123 | 196786 | 2015-02-14
     628 | 196037 | 2015-02-14
     589 | 194404 | 2015-02-14
     606 | 181145 | 2015-02-14
(13 rows)

Elapsed: 6.669s
Backup taken: 2015-02-14 03:59:29.758072
-
hmm.... fair enough.
-
If he was going to exclude bots, he should probably have excluded people with reader bots too ;)
Luckily it's a moo point.
-
i'll have you know that as far back as my logs go my cyborg has read 10 posts for me.
so there! :-P
-
That works, but of course doesn't exclude the spam topics.
Anyway, reading post_timings and aggregating over it will take the 20+ seconds no matter what, it'll scale linearly, and you'll likely be in this mire again once the table doubles in size within 6-12 months or so. Better to try to software-engineer your way out -- you can't fix it by tweaking a single SQL query.
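One hedged illustration (not something proposed in the thread; the view name is invented) of what engineering your way out could look like: precompute the per-user aggregate out of band, so the badge query stops rescanning post_timings on every run.

    -- Sketch only, names invented: keep the per-user read counts precomputed.
    CREATE MATERIALIZED VIEW user_read_counts AS
    SELECT pt.user_id, COUNT(*) AS c
    FROM post_timings pt
    GROUP BY pt.user_id;

    -- Refresh from a scheduled job, outside the badge query itself:
    REFRESH MATERIALIZED VIEW user_read_counts;

    -- The badge query then becomes a cheap scan of a few thousand rows:
    SELECT user_id, c, CURRENT_DATE AS granted_at
    FROM user_read_counts
    WHERE c >= pow(3, 11) AND c < pow(3, 12)
    ORDER BY c DESC;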
-
The only recommendation on my mind would have been to make the exclusions join happen on as few rows as possible (i.e. not 10M). That requires splitting the aggregation into two steps:
WITH posts_read AS (
    SELECT u.username, u.id AS user_id, log(2, foo.count)
    FROM (
        SELECT agg.user_id, sum(agg.count) AS count
        FROM (
            SELECT pt.*
            FROM (
                SELECT pt.user_id, pt.topic_id, Count(*) AS count
                FROM post_timings pt
                GROUP BY 1, 2
            ) pt
            LEFT JOIN exclusions e ON e.topic_id = pt.topic_id
            WHERE e.topic_id IS NULL
        ) agg
        GROUP BY agg.user_id
    ) foo
    JOIN users u ON u.id = foo.user_id
)
I rather doubt it'll speed things up at all -- it might actually slow it down.
-
That works, but of course doesn't exclude the spam topics.
It doesn't exclude quite a few things I'd normally have liked to exclude, like restricted categories, PMs and flags.
Anyway, reading post_timings and aggregating over it will take the 20+ seconds no matter what, it'll scale linearly, and you'll likely be in this mire again once the table doubles in size within 6-12 months or so. Better to try to software-engineer your way out -- you can't fix it by tweaking a single SQL query.
I'll just have to keep an eye on it every couple of months to see if it behaves. Maybe request a schema change of some sort over at meta.d...
-
Did you try the earlier query with exclusions without u.username? Grouping on a string column and an int column, rather than just an int column, may be more expensive.
-
I removed users as an interim (but not documented here) step, since the final query isn't exactly interested in usernames. Not that it had any discernible impact on the query time.
-
I've no idea, but I suspect a new connection. This is where it happens, I think:
I think that the only new connections are made in RailsMultisite::ConnectionManagement. And all badges are awarded in a loop in the backfill job, so all the 2^n and 3^n badges should be sharing a connection during a single backfill run.
-
So could I get away with a temp table? A create if not exists... sort of thing? That said, I get the impression that might still not work. @sam?
-
Can you control which query runs first?
-
I'm guessing (no, I know) no. We currently have over 30 queries over and above the stock ones...
Which brings me back to the rather suboptimal badge admin interface...
-
Nope, subqueries can't modify data! ~~ wheeee ~~
-
Then even if you could use a temp table, unless you could ensure which query runs first, it wouldn't work anyway.
But it's moo, because subqueries can't modify data!
-
The badge infrastructure is not going to allow you to run DML there. If you need this level of caching, it's probably better to define a plugin that runs a task to populate the temp table. It sounds all very elaborate, but it can be pulled off in a single file.