I converted NodeBB to use a PostgreSQL database. AMA.



  • @blakeyrat said in I converted NodeBB to use a PostgreSQL database. AMA.:

    @raceprouk said in I converted NodeBB to use a PostgreSQL database. AMA.:

    You want Postgres to index something that isn't a column? I'm not sure that's po-

    In SQL Server you can add a computed column and index that.

    @onyx said in I converted NodeBB to use a PostgreSQL database. AMA.:

    Yes, you can create indexes on expressions:

    And of course SQL Server will do that too, although I don't know how complex the expressions can be. EDIT: no, nevermind, it basically just directs you to make an index on a computed column, like I first suggested. Same thing really, just the computed column has a name and will show up in SELECT *. You can put expressions in constraints, but not in indexes.

    You can emulate a computed column in Postgres using triggers, but I'm probably just going to write an upgrade script that splits out the database into separate tables with at least somewhat less crazy schemas.


  • Impossible Mission Players - A

    @ben_lubar said in I converted NodeBB to use a PostgreSQL database. AMA.:

    tables with at least somewhat less crazy schemas.

    We've discussed this before, there is no schema right now, so literally anything with a schema would be less crazy.



  • postgres=# explain analyze verbose select 'post:' || value from sorted_sets where _key = 'posts:pid' order by score desc;
                                                                             QUERY PLAN                                                                         
    ------------------------------------------------------------------------------------------------------------------------------------------------------------
     Sort  (cost=930634.60..930637.73 rows=1253 width=64) (actual time=4944.430..6550.536 rows=1200692 loops=1)
       Output: (('post:'::text || (objects.data ->> 'value'::text))), (((objects.data ->> 'score'::text))::numeric)
       Sort Key: (((objects.data ->> 'score'::text))::numeric) DESC
       Sort Method: external merge  Disk: 38808kB
       ->  Bitmap Heap Scan on public.objects  (cost=37587.90..930570.13 rows=1253 width=64) (actual time=592.502..3089.097 rows=1200692 loops=1)
             Output: ('post:'::text || (objects.data ->> 'value'::text)), ((objects.data ->> 'score'::text))::numeric
             Recheck Cond: ((objects.data ->> '_key'::text) = 'posts:pid'::text)
             Filter: (objects.data ? 'score'::text)
             Heap Blocks: exact=15839
             ->  Bitmap Index Scan on uniq__objects__key__value  (cost=0.00..37587.59 rows=1252670 width=0) (actual time=571.343..571.343 rows=1200692 loops=1)
                   Index Cond: ((objects.data ->> '_key'::text) = 'posts:pid'::text)
     Planning time: 161.706 ms
     Execution time: 7135.741 ms
    (13 rows)
    
    Time: 7383.248 ms
    
    postgres=# explain analyze verbose select 'post:' || value from sorted_sets_materialized where _key = 'posts:pid' order by score desc;
                                                                                                   QUERY PLAN                                                                                               
    --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
     Index Only Scan Backward using idx__sorted_sets__key__score__value on public.sorted_sets_materialized  (cost=0.56..382219.03 rows=1170064 width=41) (actual time=0.079..1073.163 rows=1200692 loops=1)
       Output: ('post:'::text || value), score
       Index Cond: (sorted_sets_materialized._key = 'posts:pid'::text)
       Heap Fetches: 1200692
     Planning time: 0.151 ms
     Execution time: 1633.143 ms
    (6 rows)
    
    Time: 1633.691 ms
    


  • @tsaukpaetra said in I converted NodeBB to use a PostgreSQL database. AMA.:

    @ben_lubar said in I converted NodeBB to use a PostgreSQL database. AMA.:

    @tsaukpaetra Now to fix that!

    How are you going to spay a database so that it performs better??? That seems counter-intuitive...

    Why wouldn't it perform better? It'll have more cycles for databasing if it's not chasing skirt


  • Impossible Mission Players - A

    @hungrier said in I converted NodeBB to use a PostgreSQL database. AMA.:

    @tsaukpaetra said in I converted NodeBB to use a PostgreSQL database. AMA.:

    @ben_lubar said in I converted NodeBB to use a PostgreSQL database. AMA.:

    @tsaukpaetra Now to fix that!

    How are you going to spay a database so that it performs better??? That seems counter-intuitive...

    Why wouldn't it perform better? It'll have more cycles for databasing if it's not chasing skirt

    Are you insinuating that the database is gay?



  • @onyx said in I converted NodeBB to use a PostgreSQL database. AMA.:

    @ben_lubar said in I converted NodeBB to use a PostgreSQL database. AMA.:

    Does anyone know if there's a way to tell Postgres to index something like rank() over (partition by foo order by bar asc)?

    Yes, you can create indexes on expressions:

    Ah, yes, coming from Oracland I'd call those Function Based Indexes. The view aspect of this was a red herring.


  • area_can

    @hungrier said in I converted NodeBB to use a PostgreSQL database. AMA.:

    chasing skirt

    Good ol' college days


  • BINNED

    @boomzilla I'm pretty sure you can actually set them on actual user-defined functions as well, but I can't find the docs for it for the life of me right now, and I'm not sure what limitations (if any) are set on your functions when doing so, so I didn't want to mention it without being able to back it up.



  • @tsaukpaetra said in I converted NodeBB to use a PostgreSQL database. AMA.:

    We've discussed this before, there is no schema right now, so literally anything with a schema would be less crazy.

    Even full EAV?


  • Impossible Mission Players - A

    @pleegwat said in I converted NodeBB to use a PostgreSQL database. AMA.:

    @tsaukpaetra said in I converted NodeBB to use a PostgreSQL database. AMA.:

    We've discussed this before, there is no schema right now, so literally anything with a schema would be less crazy.

    Even full EAV?

    I said less crazy, not less stupid.



  • @onyx said in I converted NodeBB to use a PostgreSQL database. AMA.:

    @boomzilla I'm pretty sure you can actually set them on actual user-defined functions as well, but I can't find the docs for it for the life of me right now, and I'm not sure what limitations (if any) are set on your functions when doing so, so I didn't want to mention it without being able to back it up.

    I think the function has to be declared IMMUTABLE for it to work.



  • Representative error message:

    ERROR: invalid input syntax for type date: "I ALREADY TOLD YOU THAT I AM NOT A COMPUTER PERSON, YOU'RE REFUSING TO HELP ME SO I'M GOING TO HANG UP"


  • SockDev

    @pleegwat said in I converted NodeBB to use a PostgreSQL database. AMA.:

    @tsaukpaetra said in I converted NodeBB to use a PostgreSQL database. AMA.:

    We've discussed this before, there is no schema right now, so literally anything with a schema would be less crazy.

    Even full EAV?

    It has legitimate uses for when you do stupid shit like letting your users arbitrarily add data to entities but not let them play with DDL.


  • Impossible Mission - B

    @arantor so what you're saying is, the legitimate use for this undefined solution is to fix a undefined problem you got yourself into?

    Isn't that literally the class of scenario that the term "technical debt" was coined to describe? 🚎


  • SockDev

    @masonwheeler no, I'm saying that there are times when you let the end user add arbitrary data to your entities and having the application do its own DDL to modify its own schema might be bad. But sure, you know best, right?

    It might be stupid from an ivory tower perspective but sometimes you just can't model everything in advance, and you might transfer the work away from your developers to other people so developers can get on building other stuff. But sure, go with your original interpretation because you do everything so perfectly and never have to deal with users who change their minds, or changes in requirements, and having a solution that meets the bulk of needs and use cases while having a supposedly WTF design... yeah, thought so.



  • Which one of these runs faster?

    A

    UPDATE "foo"
       SET "field_that_references_bar" = NULL
     WHERE "field_that_references_bar" NOT IN (SELECT "field_referenced_by_foo" FROM "bar");
    

    B

    CREATE TEMPORARY TABLE "temp_1" AS
    	SELECT "field_referenced_by_foo"
    	  FROM "bar"
    	 ORDER BY "field_referenced_by_foo";
    CREATE TEMPORARY TABLE "temp_2" AS
    	SELECT "id", "field_that_references_bar"
    	  FROM "foo"
    	 WHERE "field_that_references_bar" IS NOT NULL
    	 ORDER BY "id";
    ANALYZE "temp_1";
    ANALYZE "temp_2";
    WITH p AS (
    	SELECT t2."id"
    	  FROM "temp_2" t2
    	  LEFT OUTER JOIN "temp_1" t1
    	    ON t1."field_referenced_by_foo" = t2."field_that_references_bar"
    	 WHERE t1."field_referenced_by_foo" IS NULL
    )
    UPDATE "foo"
       SET "field_that_references_bar" = NULL
     WHERE "id" IN (SELECT "id" FROM p);
    DROP TABLE "temp_1", "temp_2";
    
    AnswerB finishes in minutes while A does not finish over 24 hours.


  • @ben_lubar said in I converted NodeBB to use a PostgreSQL database. AMA.:

    Which one of these runs faster?

    How does this one compare?

    UPDATE "foo"
       SET "field_that_references_bar" = NULL
    FROM "foo" AS "f1"
    LEFT OUTER JOIN "bar" ON "f1"."field_that_references_bar" = "bar"."field_referenced_by_foo"
    WHERE "f1"."id" = "foo"."id"
      AND "bar"."field_referenced_by_foo" IS NULL;
    


  • @djls45 said in I converted NodeBB to use a PostgreSQL database. AMA.:

    @ben_lubar said in I converted NodeBB to use a PostgreSQL database. AMA.:

    Which one of these runs faster?

    How does this one compare?

    UPDATE "foo"
       SET "field_that_references_bar" = NULL
    FROM "foo" AS "f1"
    LEFT OUTER JOIN "bar" ON "f1"."field_that_references_bar" = "bar"."field_referenced_by_foo"
    WHERE "f1"."id" = "foo"."id"
      AND "bar"."field_referenced_by_foo" IS NULL;
    

    Right now I'm re-running the convert-from-a-twenty-nine-million-row-one-column-table-to-tables-with-actual-useful-columns step, and I've added ROLLBACK to the end of the foreign keys part for when I run that.

    $ for i in users categories topics posts; do ./local.sh $i & done; wait
    [1] 3812
    [2] 3813
    [3] 3814
    [4] 3815
    2017/08/21 09:38:46 starting transaction (posts)
    2017/08/21 09:38:46 creating posts table
    2017/08/21 09:38:46 starting transaction (users)
    2017/08/21 09:38:46 starting transaction (topics)
    2017/08/21 09:38:46 creating topics table
    2017/08/21 09:38:46 creating users table
    2017/08/21 09:38:46 starting transaction (categories)
    2017/08/21 09:38:46 creating categories table
    2017/08/21 09:38:46 clustering for improved performance
    2017/08/21 09:38:46 clustering for improved performance
    2017/08/21 09:38:46 clustering for improved performance
    2017/08/21 09:38:47 extracting posts
    2017/08/21 09:38:47 clustering table for improved performance
    2017/08/21 09:38:47 extracting topics
    2017/08/21 09:38:47 extracting users
    2017/08/21 09:38:47 extracting categories
    2017/08/21 09:38:48 setting next category ID
    2017/08/21 09:38:48 creating index on (order)
    2017/08/21 09:38:48 creating index on (parentCid)
    2017/08/21 09:38:48 committing transaction (categories)
    2017/08/21 09:39:34 setting next topic ID
    2017/08/21 09:39:34 creating index on (cid, pinned)
    2017/08/21 09:39:35 creating index on (uid)
    2017/08/21 09:39:35 creating index on (timestamp)
    2017/08/21 09:39:36 creating index on (lastposttime)
    2017/08/21 09:39:37 creating full text index for topics
    2017/08/21 09:39:38 committing transaction (topics)
    

    If anyone has a big NodeBB database at home and would like to follow along, I posted my code here yesterday: https://gist.github.com/BenLubar/8d43f7aac304c42e2e5c6fc728038a51

    Add tx.Rollback() to the end of the first function in step3_foreignkeys.go and adjust the giant queries in the two immediately following functions to test stuff. You'll probably also need to edit local.sh, but if you don't use Docker with a private network and dynamic IPs, you can just put the connection strings in directly.



  • @ben_lubar

    2017/08/21 10:45:08 setting next user ID
    2017/08/21 10:45:09 committing transaction (users)
    


  • @ben_lubar

    2017/08/21 10:49:43 setting next post ID
    2017/08/21 10:49:44 creating index on (tid)
    2017/08/21 10:49:47 creating index on (timestamp)
    2017/08/21 10:49:49 creating index on (uid)
    2017/08/21 10:49:51 creating index on (toPid)
    2017/08/21 10:49:55 creating index on (ip)
    2017/08/21 10:49:56 creating full text index for posts
    2017/08/21 10:52:57 committing transaction (posts)
    

    SCIENCE TIME:

    CREATE TEMPORARY TABLE "temp_1" AS SELECT "pid" FROM "posts" ORDER BY "pid";
    CREATE TEMPORARY TABLE "temp_2" AS SELECT "pid", "toPid" FROM "posts" WHERE "toPid" IS NOT NULL ORDER BY "pid";
    ANALYZE "temp_1";
    ANALYZE "temp_2";
    WITH p AS (
    	SELECT t2."pid"
    	  FROM "temp_2" t2
    	  LEFT OUTER JOIN "temp_1" t1
    	    ON t1."pid" = t2."toPid"
    	 WHERE t1."pid" IS NULL
    )
    UPDATE "posts"
       SET "toPid" = NULL
     WHERE "pid" IN (SELECT "pid" FROM p);
    DROP TABLE "temp_1", "temp_2";
    
                                                                                                                                QUERY PLAN                                                                                                                             
    -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
     Update on public.posts  (cost=66139.47..66147.51 rows=1 width=957) (actual time=2883.228..2883.228 rows=0 loops=1)
       CTE p
         ->  Hash Anti Join  (cost=37200.49..66139.02 rows=1 width=8) (actual time=2277.304..2652.917 rows=15 loops=1)
               Output: t2.pid
               Hash Cond: (t2."toPid" = t1.pid)
               ->  Seq Scan on pg_temp_3.temp_2 t2  (cost=0.00..12903.66 rows=837566 width=16) (actual time=0.025..451.031 rows=837566 loops=1)
                     Output: t2.pid, t2."toPid"
               ->  Hash  (cost=17404.55..17404.55 rows=1206555 width=8) (actual time=1361.781..1361.781 rows=1206555 loops=1)
                     Output: t1.pid
                     Buckets: 131072  Batches: 16  Memory Usage: 3978kB
                     ->  Seq Scan on pg_temp_3.temp_1 t1  (cost=0.00..17404.55 rows=1206555 width=8) (actual time=0.024..643.426 rows=1206555 loops=1)
                           Output: t1.pid
       ->  Nested Loop  (cost=0.45..8.49 rows=1 width=957) (actual time=2653.135..2653.403 rows=15 loops=1)
             Output: posts.pid, posts.uid, posts.tid, NULL::bigint, posts."timestamp", posts.ip, posts.content, posts.votes, posts.upvotes, posts.downvotes, posts.replies, posts.deleted, posts."deleterUid", posts.edited, posts.editor, posts.data, posts.ctid, p.*
             ->  HashAggregate  (cost=0.02..0.03 rows=1 width=40) (actual time=2653.065..2653.087 rows=15 loops=1)
                   Output: p.*, p.pid
                   Group Key: p.pid
                   ->  CTE Scan on p  (cost=0.00..0.02 rows=1 width=40) (actual time=2277.345..2653.014 rows=15 loops=1)
                         Output: p.*, p.pid
             ->  Index Scan using posts_pkey on public.posts  (cost=0.43..8.45 rows=1 width=917) (actual time=0.016..0.018 rows=1 loops=15)
                   Output: posts.pid, posts.uid, posts.tid, posts."timestamp", posts.ip, posts.content, posts.votes, posts.upvotes, posts.downvotes, posts.replies, posts.deleted, posts."deleterUid", posts.edited, posts.editor, posts.data, posts.ctid
                   Index Cond: (posts.pid = p.pid)
     Planning time: 0.470 ms
     Execution time: 2883.425 ms
    
    UPDATE "posts"
       SET "toPid" = NULL
      FROM "posts" t1
      LEFT OUTER JOIN "posts" t2
        ON t1."toPid" = t2."pid"
     WHERE t1."pid" = "posts"."pid"
       AND t2."pid" IS NULL
       AND t1."toPid" IS NOT NULL;
    
                                                                                                                                       QUERY PLAN                                                                                                                                   
    --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
     Update on public.posts  (cost=651169.59..997114.06 rows=470471 width=937) (actual time=6932.368..6932.368 rows=0 loops=1)
       ->  Merge Join  (cost=651169.59..997114.06 rows=470471 width=937) (actual time=5539.097..6929.416 rows=15 loops=1)
             Output: posts.pid, posts.uid, posts.tid, NULL::bigint, posts."timestamp", posts.ip, posts.content, posts.votes, posts.upvotes, posts.downvotes, posts.replies, posts.deleted, posts."deleterUid", posts.edited, posts.editor, posts.data, posts.ctid, t1.ctid, t2.ctid
             Merge Cond: (posts.pid = t1.pid)
             ->  Index Scan using posts_pkey on public.posts  (cost=0.43..332166.06 rows=2218242 width=917) (actual time=0.010..2333.617 rows=1146876 loops=1)
                   Output: posts.pid, posts.uid, posts.tid, posts."timestamp", posts.ip, posts.content, posts.votes, posts.upvotes, posts.downvotes, posts.replies, posts.deleted, posts."deleterUid", posts.edited, posts.editor, posts.data, posts.ctid
             ->  Materialize  (cost=651169.16..653521.51 rows=470471 width=20) (actual time=4040.664..4040.711 rows=15 loops=1)
                   Output: t1.ctid, t1.pid, t2.ctid
                   ->  Sort  (cost=651169.16..652345.33 rows=470471 width=20) (actual time=4040.660..4040.685 rows=15 loops=1)
                         Output: t1.ctid, t1.pid, t2.ctid
                         Sort Key: t1.pid
                         Sort Method: quicksort  Memory: 25kB
                         ->  Merge Anti Join  (cost=5101.45..597192.48 rows=470471 width=20) (actual time=0.023..4040.614 rows=15 loops=1)
                               Output: t1.ctid, t1.pid, t2.ctid
                               Merge Cond: (t1."toPid" = t2.pid)
                               ->  Index Scan using "idx__posts__toPid" on public.posts t1  (cost=0.43..244922.86 rows=1541013 width=22) (actual time=0.015..1712.370 rows=837566 loops=1)
                                     Output: t1.ctid, t1."toPid", t1.pid
                                     Index Cond: (t1."toPid" IS NOT NULL)
                               ->  Index Scan using posts_pkey on public.posts t2  (cost=0.43..332166.06 rows=2218242 width=14) (actual time=0.006..1192.370 rows=1206552 loops=1)
                                     Output: t2.ctid, t2.pid
     Planning time: 0.719 ms
     Execution time: 6932.466 ms
    
    UPDATE "posts" t1
       SET "toPid" = NULL
     WHERE "toPid" IS NOT NULL
       AND NOT EXISTS (
    	SELECT t2."pid"
    	  FROM "posts" t2
    	 WHERE t1."toPid" = t2."pid"
           );
    
                                                                                                          QUERY PLAN                                                                                                       
    -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
     Update on public.posts t1  (cost=5101.45..592012.86 rows=470469 width=931) (actual time=84336.476..84336.476 rows=0 loops=1)
       ->  Merge Anti Join  (cost=5101.45..592012.86 rows=470469 width=931) (actual time=0.093..78922.606 rows=15 loops=1)
             Output: t1.pid, t1.uid, t1.tid, NULL::bigint, t1."timestamp", t1.ip, t1.content, t1.votes, t1.upvotes, t1.downvotes, t1.replies, t1.deleted, t1."deleterUid", t1.edited, t1.editor, t1.data, t1.ctid, t2.ctid
             Merge Cond: (t1."toPid" = t2.pid)
             ->  Index Scan using "idx__posts__toPid" on public.posts t1  (cost=0.43..239744.47 rows=1541006 width=925) (actual time=0.029..74118.155 rows=837566 loops=1)
                   Output: t1.pid, t1.uid, t1.tid, t1."timestamp", t1.ip, t1.content, t1.votes, t1.upvotes, t1.downvotes, t1.replies, t1.deleted, t1."deleterUid", t1.edited, t1.editor, t1.data, t1.ctid, t1."toPid"
                   Index Cond: (t1."toPid" IS NOT NULL)
             ->  Index Scan using posts_pkey on public.posts t2  (cost=0.43..332164.92 rows=2218233 width=14) (actual time=0.060..3487.159 rows=1206552 loops=1)
                   Output: t2.ctid, t2.pid
     Planning time: 61.094 ms
     Execution time: 84396.291 ms
    

    Edit: It turns out the most time was spent updating rows that already had null foreign keys. Updated the benchmarks to fix that.



  • 2017/08/21 12:59:18 starting transaction (foreignkeys)
    2017/08/21 12:59:18 clearing invalid foreign keys "categories"("parentCid") -> "categories"("cid")
    2017/08/21 12:59:19 0 missing keys set to null.
    2017/08/21 12:59:19 creating foreign key on "categories"("parentCid") -> "categories"("cid")
    2017/08/21 12:59:20 clearing invalid foreign keys "posts"("uid") -> "users"("uid")
    2017/08/21 12:59:28 0 missing keys set to null.
    2017/08/21 12:59:28 creating foreign key on "posts"("uid") -> "users"("uid")
    2017/08/21 12:59:32 deleting invalid foreign keys "posts"("tid") -> "topics"("tid")
    2017/08/21 12:59:33 0 missing keys deleted.
    2017/08/21 12:59:33 creating foreign key on "posts"("tid") -> "topics"("tid")
    2017/08/21 12:59:35 clearing invalid foreign keys "posts"("toPid") -> "posts"("pid")
    2017/08/21 12:59:51 15 missing keys set to null.
    2017/08/21 12:59:51 creating foreign key on "posts"("toPid") -> "posts"("pid")
    2017/08/21 12:59:52 clearing invalid foreign keys "posts"("deleterUid") -> "users"("uid")
    2017/08/21 12:59:54 0 missing keys set to null.
    2017/08/21 12:59:54 creating foreign key on "posts"("deleterUid") -> "users"("uid")
    2017/08/21 12:59:56 clearing invalid foreign keys "posts"("editor") -> "users"("uid")
    2017/08/21 12:59:57 0 missing keys set to null.
    2017/08/21 12:59:57 creating foreign key on "posts"("editor") -> "users"("uid")
    2017/08/21 12:59:58 deleting invalid foreign keys "topics"("cid") -> "categories"("cid")
    2017/08/21 12:59:59 0 missing keys deleted.
    2017/08/21 12:59:59 creating foreign key on "topics"("cid") -> "categories"("cid")
    2017/08/21 12:59:59 clearing invalid foreign keys "topics"("oldCid") -> "categories"("cid")
    2017/08/21 12:59:59 0 missing keys set to null.
    2017/08/21 12:59:59 creating foreign key on "topics"("oldCid") -> "categories"("cid")
    2017/08/21 12:59:59 clearing invalid foreign keys "topics"("uid") -> "users"("uid")
    2017/08/21 12:59:59 0 missing keys set to null.
    2017/08/21 12:59:59 creating foreign key on "topics"("uid") -> "users"("uid")
    2017/08/21 12:59:59 clearing invalid foreign keys "topics"("deleterUid") -> "users"("uid")
    2017/08/21 12:59:59 0 missing keys set to null.
    2017/08/21 12:59:59 creating foreign key on "topics"("deleterUid") -> "users"("uid")
    2017/08/21 12:59:59 clearing invalid foreign keys "topics"("mainPid") -> "posts"("pid")
    2017/08/21 13:00:00 0 missing keys set to null.
    2017/08/21 13:00:00 creating foreign key on "topics"("mainPid") -> "posts"("pid")
    2017/08/21 13:00:00 clearing invalid foreign keys "topics"("teaserPid") -> "posts"("pid")
    2017/08/21 13:00:01 0 missing keys set to null.
    2017/08/21 13:00:01 creating foreign key on "topics"("teaserPid") -> "posts"("pid")
    2017/08/21 13:00:01 committing transaction (foreignkeys)
    

Log in to reply
 

Looks like your connection to What the Daily WTF? was lost, please wait while we try to reconnect.