Scaling serverless aurora postgresql

dangeRuss

So we have an API that hits a serverless PostgreSQL v2 reader. It works fine and the reader scales up when more load is applied. The problem comes when we are load testing. We have a query that is fairly fast (it takes under 250 ms), but when put under extreme load (100 tps) the query takes over 30 seconds, which causes the API Gateway to time out and return 500s (API Gateway max timeout is 30s, I believe).

Once the cluster scales up we get far fewer to no 500s.

What's the best way to fix this? I think I'm close to squeezing the max performance out of the query. I can try doing some partitioning, but I doubt it will get much faster.

Any ideas on how to fix this without just using a larger min instance size for the RDS?

I was thinking some sort of queue that queues up requests not letting them overwhelm the RDS before it has a chance to scale. Or perhaps some sort of retry mechanism in front of API gateway that retries the 500s.
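To make the retry idea concrete, here is a minimal sketch of retrying on 5xx responses with exponential backoff and jitter. It is only an illustration: `call_with_retry` and the `send` callable are hypothetical names, and the attempt count and delays are assumed values, not anything taken from the actual setup.

```python
import random
import time


def call_with_retry(send, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call send() until it returns a non-5xx status or attempts run out.

    send is a hypothetical zero-argument callable that performs the
    request and returns the HTTP status code.
    """
    status = None
    for attempt in range(max_attempts):
        status = send()
        if status < 500:
            return status
        if attempt < max_attempts - 1:
            # Exponential backoff with jitter, so that retries from many
            # clients do not all arrive at the same moment.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))
    return status
```

Retries only help if the database has scaled up by the time they land; without the backoff they would just add more load on top of the original burst.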

BernieTheBernie

@dangeRuss What about always measuring the query time, and scaling up when it starts to become slow?

dangeRuss

@BernieTheBernie said in Scaling serverless aurora postgresql:

@dangeRuss What about always measuring the query time, and scaling up when it starts to become slow?

The problem is we don't handle the scaling, it's all automatic. And it seems to scale pretty well, it just doesn't scale when you hit a 1 ACU instance with 100 concurrent queries. All of a sudden the query time goes from 200ms to 30+s.

The question is what to do to stop idiots from hitting it with 100tps while it's in this vulnerable state. I was thinking something with max workers or something. Can't process 100 requests if you only have 10 workers. Queue the rest.
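As a rough sketch of the "10 workers, queue the rest" idea, a fixed-size thread pool behaves this way: at most `MAX_WORKERS` calls run at once and everything else waits in the pool's internal queue. The names `run_query`, `handle_burst` and `InFlightCounter` are made up for illustration, the cap of 10 is an assumption, and the sleep stands in for the real query.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 10  # assumed cap; tune to what the smallest instance can handle


class InFlightCounter:
    """Tracks how many fake queries run at once, only to demonstrate the cap."""

    def __init__(self):
        self._lock = threading.Lock()
        self.current = 0
        self.peak = 0

    def __enter__(self):
        with self._lock:
            self.current += 1
            self.peak = max(self.peak, self.current)
        return self

    def __exit__(self, *exc_info):
        with self._lock:
            self.current -= 1
        return False


def run_query(request_id, counter):
    """Hypothetical stand-in for the real database call."""
    with counter:
        time.sleep(0.01)  # pretend the query takes a little while
    return request_id


def handle_burst(n_requests, max_workers=MAX_WORKERS):
    """Run n_requests fake queries with at most max_workers in flight."""
    counter = InFlightCounter()
    # The executor runs at most max_workers calls at once;
    # the remaining submissions wait in its internal queue.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(run_query, i, counter) for i in range(n_requests)]
        results = [f.result() for f in futures]
    return results, counter.peak
```

Calling `handle_burst(100)` returns all 100 results plus the peak number of simultaneous queries, which stays at or below 10. Note that this only caps a single process: if the API runs as many separate instances, each one would get its own cap, so the limit would have to live somewhere shared.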

dangeRuss

@BernieTheBernie also forgot to mention that these are synthetic tests and it will probably never be a problem in practice, but these guys want to prove that we can handle 100tps, so they're doing 100 tps with 1s ramp-up time in JMeter.

BernieTheBernie

@dangeRuss said in Scaling serverless aurora postgresql:

100 concurrent queries. All of a sudden the query time goes from 200ms to 30+s.

Since 100 times 200 ms is only 20 s, which is less than 30-something s, a "queue" might really work here. Or some other mechanism (semaphore/hemisemaphore, C# Monitor class (lock{x}), ...) which reduces it to one query at a time. But make sure that this new queue does not take unreasonably many system resources.
That looks like a cloudy environment, so message queues might be available anyway already, and all that asynchronous overhead...
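The arithmetic above can be written out as a small calculation. This is a back-of-the-envelope sketch that assumes the query keeps taking 200 ms as long as concurrency is capped; the function names are made up for illustration.

```python
import math


def burst_drain_ms(n_requests, query_ms, concurrency):
    """Time until the last request of a single burst finishes.

    Assumes every query takes query_ms regardless of the cap.
    """
    batches = math.ceil(n_requests / concurrency)
    return batches * query_ms


def max_throughput_per_second(query_ms, concurrency):
    """Upper bound on queries per second at a given concurrency cap."""
    return concurrency * 1000 / query_ms


print(burst_drain_ms(100, 200, 1))        # 20000 ms, i.e. the 20 s above
print(burst_drain_ms(100, 200, 10))       # 2000 ms with 10 at a time
print(max_throughput_per_second(200, 1))  # 5.0 queries per second
```

One caveat: this covers a single burst of 100 requests. At one query at a time the throughput tops out at 5 per second, so with 100 new requests arriving every second the backlog would keep growing; under the same 200 ms assumption, keeping up with a sustained 100 tps takes about 20 queries in flight.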

dangeRuss

@BernieTheBernie said in Scaling serverless aurora postgresql:

@dangeRuss said in Scaling serverless aurora postgresql:

100 concurrent queries. All of a sudden the query time goes from 200ms to 30+s.

Since 100 times 200 ms is only 20 s, which is less than 30-something s, a "queue" might really work here. Or some other mechanism (semaphore/hemisemaphore, C# Monitor class (lock{x}), ...) which reduces it to one query at a time. But make sure that this new queue does not take unreasonably many system resources.
That looks like a cloudy environment, so message queues might be available anyway already, and all that asynchronous overhead...

Right, but I need that queue to be on the postgresql side.