Discourse SQL 1 minute query...
-
I always test stuff like this, because I have a difficult time predicting how the DB will react. So... I'm interested in how it comes out, but honestly, I think this query is I/O bound, looking up all those posts.
-
hm... I have DBs with tens of millions of rows, and keys in tables of 10 rows or fewer... I could just simulate PJH's query over those to see performance.
-
SQL Server can get me a result over very similarly sized tables using NOT IN in 11 seconds. Using NOT EXISTS takes 10 seconds.
But about 7 of those seconds are I/O; there are ~55,000 results in the particular DB I'm querying. So it's more like a 3s vs 4s runtime for the actual query, with 7s of I/O following.
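For the interested, the two shapes being compared look roughly like this. This is only a sketch with made-up table and column names (timings, exclusions), not the actual query against that DB:

    -- Sketch only; names are invented, the real query is against the poster's own DB.

    -- NOT IN form (the ~11s variant above)
    SELECT t.user_id, COUNT(*) AS c
    FROM timings t
    WHERE t.topic_id NOT IN (SELECT e.topic_id FROM exclusions e)
    GROUP BY t.user_id;

    -- NOT EXISTS form (the ~10s variant above)
    SELECT t.user_id, COUNT(*) AS c
    FROM timings t
    WHERE NOT EXISTS (SELECT 1 FROM exclusions e WHERE e.topic_id = t.topic_id)
    GROUP BY t.user_id;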
I don't have a Postgres DB to test on though.
-
I don't have a Postgres DB to test on though.
Yeah. For this sort of optimization, I don't see the point, because it seems very unlikely to transfer.
-
I don't have a Postgres DB to test on though.
I have a small Discourse dump (not from this forum), but I don't think it's big enough for results to be relevant.
-
My previous query as reference, 15.148s: http://pastebin.com/y1EPVSJw
One flaw you have is doing the join against users before aggregating. The aggregation criteria are all in the post_timings table. Move the join a bit further out:
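The pastebin below has the actual rewritten query; as a rough illustration only (column names assumed from the Discourse schema used elsewhere in this thread), the reordering amounts to aggregating post_timings first and only then joining users onto the already-small result:

    -- Illustrative sketch, not the pastebin contents: aggregate first, join later.
    SELECT u.id AS user_id, u.username, foo.c
    FROM (
        SELECT pt.user_id, COUNT(*) AS c
        FROM post_timings pt
        GROUP BY pt.user_id
    ) foo
    JOIN users u ON u.id = foo.user_id;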
Down to 8.988 seconds: http://pastebin.com/RDND64m9
While I'm at it, let's convert the NOT IN () into a left anti-join. (I doubt there's performance gain, but why not.)
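Again, the pastebin has the real query; the rewrite replaces the NOT IN subquery with a LEFT JOIN that keeps only unmatched rows. Illustrative only, assuming exclusions.topic_id is never NULL (exclusions stands for the CTE shown later in this thread):

    -- Illustrative left anti-join: keep only post_timings rows with no matching exclusion.
    SELECT pt.user_id, COUNT(*) AS c
    FROM post_timings pt
    LEFT JOIN exclusions e ON e.topic_id = pt.topic_id
    WHERE e.topic_id IS NULL
    GROUP BY pt.user_id;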
Comparable at 8.839: http://pastebin.com/UHzgxj2w
temp tables are per-connection, so if you're not tearing down and reconnecting between each run, it should be doable. (Of course it depends on what you are 'allowed' to do.)
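For illustration only (nothing from the thread; the table name cached_exclusions and the marker are invented), the session-scoped idea would look something like this:

    -- Sketch: a temp table is private to the current session/connection and is
    -- dropped when that connection closes, so it only helps if the badge queries
    -- share a connection between runs. (IF NOT EXISTS on CREATE TABLE ... AS
    -- requires PostgreSQL 9.5 or later.)
    CREATE TEMP TABLE IF NOT EXISTS cached_exclusions AS
    SELECT id, topic_id
    FROM posts
    WHERE raw LIKE '%<exclusion marker>%';   -- placeholder for the real UUID marker

    -- Later queries on the same connection could then anti-join against it:
    --   LEFT JOIN cached_exclusions e ON e.topic_id = pt.topic_id
    --   WHERE e.topic_id IS NULL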
I've no idea, but I suspect a new connection. This is where it happens, I think:
With post_clause being the badge query itself.
and it has to be matched against at least 60k posts. that's where the index savings would be (unless i very much misunderstand postgres..... which is quite possible)
The part of the query doing the search for the UUID is surprisingly fast:

[postgres@sofa ~]$ sql_tdwtf exclusions
# Excluded posts
WITH exclusions AS (
    /* Which categories to exclude from counters */
    SELECT user_id, id, topic_id, post_number
    FROM posts
    WHERE raw LIKE '%[UUID removed to prevent this thread accidentally being marked as excluded]%'
      AND user_id IN (
        SELECT gu.user_id
        FROM group_users gu
        WHERE group_id IN (
            SELECT g.id
            FROM groups g
            WHERE g.name IN ('admins')
        )
      )
)
SELECT * FROM exclusions

 user_id |   id   | topic_id | post_number
---------+--------+----------+-------------
      20 | 136772 |     2929 |          17
      20 | 203068 |     6979 |        1186
      20 | 131170 |     3125 |        4585
      20 | 131125 |     1673 |        3632
      20 | 131276 |     4292 |           1
      20 | 131160 |     1673 |        3634
      20 | 131167 |     1000 |       27836
(7 rows)

Elapsed: 0.575s
Backup taken: 2015-02-14 03:59:29.758072
-
Comparable at 8.839: http://pastebin.com/UHzgxj2w
EXPLAIN ANALYZE for the interested: http://pastebin.com/FcL3xBgJ
-
Moving that JOIN seems to have been a good win.
-
well then....
of course @pjh might want to exclude bots from getting that badger. ;-)
-
I went with a simpler query.
[postgres@sofa ~]$ sql_tdwtf posts_read8
# SELECT pt.user_id, count(*) c, CURRENT_DATE as granted_at
  FROM post_timings pt
  GROUP BY pt.user_id
  HAVING count(*) >= pow(3, 11)
     AND count(*) < pow(3, 12)
  ORDER BY c DESC

 user_id |   c    | granted_at
---------+--------+------------
     671 | 219634 | 2015-02-14
      20 | 218481 | 2015-02-14
     602 | 218154 | 2015-02-14
      18 | 217475 | 2015-02-14
     564 | 216551 | 2015-02-14
    1223 | 216301 | 2015-02-14
    1862 | 208096 | 2015-02-14
    1407 | 207994 | 2015-02-14
     294 | 203474 | 2015-02-14
     123 | 196786 | 2015-02-14
     628 | 196037 | 2015-02-14
     589 | 194404 | 2015-02-14
     606 | 181145 | 2015-02-14
(13 rows)

Elapsed: 6.669s
Backup taken: 2015-02-14 03:59:29.758072
-
hmm.... fair enough.
-
If he was going to exclude bots, he should probably have excluded people with reader bots too ;)
Luckily it's a moo point.
-
i'll have you know that as far back as my logs go my cyborg has read 10 posts for me.
so there! :-P
-
That works, but of course doesn't exclude the spam topics.
Anyway, reading post_timings and aggregating over it will take the 20+ seconds no matter what, it'll scale linearly, and you'll likely be in this mire again once the table doubles in size within 6-12 months or so. Better to try to software-engineer your way out -- you can't fix it by tweaking a single SQL query.
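One hedged illustration (not something proposed in the thread; the view name is invented) of what engineering your way out could look like: precompute the per-user aggregate out of band, so the badge query stops rescanning post_timings on every run.

    -- Sketch only, names invented: keep the per-user read counts precomputed.
    CREATE MATERIALIZED VIEW user_read_counts AS
    SELECT pt.user_id, COUNT(*) AS c
    FROM post_timings pt
    GROUP BY pt.user_id;

    -- Refresh from a scheduled job, outside the badge query itself:
    REFRESH MATERIALIZED VIEW user_read_counts;

    -- The badge query then becomes a cheap scan of a few thousand rows:
    SELECT user_id, c, CURRENT_DATE AS granted_at
    FROM user_read_counts
    WHERE c >= pow(3, 11) AND c < pow(3, 12)
    ORDER BY c DESC;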
-
The only recommendation on my mind would have been to make the exclusions join happen on as few rows as possible (i.e. not 10M). That requires splitting the aggregation into two steps:
WITH posts_read AS (
    SELECT u.username, u.id AS user_id, log(2, foo.count)
    FROM (
        SELECT agg.user_id, sum(agg.count) AS count
        FROM (
            SELECT pt.*
            FROM (
                SELECT pt.user_id, pt.topic_id, Count(*) AS count
                FROM post_timings pt
                GROUP BY 1, 2
            ) pt
            LEFT JOIN exclusions e ON e.topic_id = pt.topic_id
            WHERE e.topic_id IS NULL
        ) agg
        GROUP BY agg.user_id
    ) foo
    JOIN users u ON u.id = foo.user_id
)
I rather doubt it'll speed things up at all -- it might actually slow it down.
-
That works, but of course doesn't exclude the spam topics.
It doesn't exclude quite a few things I'd normally have liked to exclude, like restricted categories, PMs and flags.
Anyway, reading post_timings and aggregating over it will take the 20+ seconds no matter what, it'll scale linearly, and you'll likely be in this mire again once the table doubles in size within 6-12 months or so. Better to try to software-engineer your way out -- you can't fix it by tweaking a single SQL query.
I'll just have to keep an eye on it every couple of months to see if it behaves. Maybe request a schema change of some sort over at meta.d...
-
Did you try the earlier query with exclusions without u.username? Grouping on a string column and an int column, rather than just an int column, may be more expensive.
-
I removed users as an interim (but not documented here) step, since the final query isn't exactly interested in usernames. Not that it had any discernible impact on the query time.
-
I've no idea, but I suspect a new connection. This is where it happens, I think:
I think that the only new connections are made in RailsMultisite::ConnectionManagement. And all badges are awarded in a loop in the backfill job, so all the 2^n and 3^n badges should be sharing a connection during a single backfill run.
-
So could I get away with a temp table? A create if not exists... sort of thing? That said, I get the impression that might still not work. @sam?
-
Can you control which query runs first?
-
I'm guessing (no, I know) no. We currently have over 30 queries over and above the stock ones...
Which brings me back to the rather suboptimal badge admin interface...
-
Nope, subqueries can't modify data! ~~ wheeee ~~
-
Then even if you could use a temp table, unless you could ensure which query runs first, it wouldn't work anyway.
But it's moo, because subqueries can't modify data!
-
The badge infrastructure is not going to allow you to run DML there. If you need this level of caching, it's probably better to define a plugin that runs a task to populate the temp table. It sounds all very elaborate, but it can be pulled off in a single file.