Re: API-Call Hell (from Status Thread)



  • @Zenith said in The Official Status Thread:

    @Benjamin-Hall Connection pooling was a good place to start. I used to support an application that used the built-in IIS/ASP.NET pooler. For years it would run fine for a while and then slow down horrifically, causing tons of timeouts. Turned out they weren't releasing connections, so it was basically filling up all of the slots and relying on garbage collection to free them...eventually.

    Yeah. I mean...it was a very likely guess. Especially since the slowdown was abrupt--two days ago, it was ok. Yesterday, it went to pot. And then today, for a few hours, it got horrible. With plenty of resources available. And no code changes at those times.

    @Tsaukpaetra I'm not sure how to do either of those. It's an ancient dialect of mysql (5.5-ish). And it's not a single query--the calls that are slow make 2 + 5N (where N is related to the number of high-level objects returned, each of which is relatively complex) database queries.

    Background on the :wtf: setup---
    Browser JS --> web backend (PHP) --> service container (Node), all HTTP(s) calls. The internal calls are within a single Docker swarm, so those are really fast. The DB lives on a different instance. The service container has way more resources than it needs--there's barely any traffic to that (in this particular case); the web backend is multiply replicated and doesn't show signs of overloading.

    In production, the Docker swarm and the DB are now (as of a few days ago) hosted separately--the Docker swarm is co-hosted "locally" on our own servers, while the DB has migrated to EC2 instances in AWS. All incoming connections to the production DB pass through a proxysql instance, which is operating normally (no excessive load, not hitting connection limits, etc).

    Both the web backend and the service container make DB calls--the PHP makes (usually) 1 call to translate some data (basically translate one id into a different id), while the service container makes 1 + 5N database queries per API call. All selects (for this code path), no writes. And the service container's DB is a different database (same physical database node, different logical db) from the one the web backend queries.

    Most of the queries are "select X,Y,Z from A.b where id=?", where the id is either the primary key or an indexed column. None of these queries have significant cross-table joins. Instead, the "joining" is done "in code" via JS (really typescript, but that shouldn't matter at runtime). Mostly just stitching these "complex" objects together and returning them.

    The web backend acts mostly as just a dumb transfer layer, but isn't anywhere near RESTful or pretty.

    Oh, and a complicating factor--when I make the (equivalent) calls directly against the service-container API (bypassing the web backend entirely), they consistently return within ~700 ms to 1.5s (usually one call at ~1.5s, then the rest at ~750 ms). Even when the front-end calls are varying between 2s and 20s. Which is why I'd abandoned any theories that the service layer was primarily at fault.



  • And....that's screwy. Like seriously screwy.

    I just instrumented my local instance to report the timing in two places for these calls:

    1. At the PHP layer, timing two different pieces: the initial database lookup, then the service container call.
    2. At the service container, determining how long it thinks it's taking.

    A third time comes from the browser web inspector. Note--these instances (other than the DB) are running locally. 0 networking latency here.

    | Call # | Browser time | PHP time (database) | PHP time (service) | Service time |
    |--------|--------------|---------------------|--------------------|--------------|
    | 1      | 2.6s         | 0.02s               | 0.46s              | ~0.4s        |
    | 2      | 2.1s         | 0.02s               | 0.46s              | ~0.4s        |
    | 3      | 3.1s         | 0.02s               | 0.46s              | ~0.4s        |
    | 4      | 1.0s         | 0.02s               | 0.5s               | ~0.4s        |
    | 5      | 1.3s         | 0.02s               | 0.45s              | ~0.4s        |

    Dur....:wtf: ?
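
    (For reference, the PHP-side numbers come from a wrapper along these lines -- just a sketch, with stand-ins where the real DB lookup and service call go:)

        <?php
        // Minimal sketch of the timing wrapper at the PHP layer -- names and the
        // stand-in callables are illustrative, not the real code.
        function timed(string $label, callable $fn)
        {
            $start = microtime(true);
            $result = $fn();
            error_log(sprintf('%s took %.3fs', $label, microtime(true) - $start));
            return $result;
        }

        // Hypothetical usage: wrap the initial DB lookup and the service-container call.
        $translatedId = timed('php db lookup', fn () => 42);        // stand-in for the id-translation query
        $payload      = timed('service call',  fn () => ['stub']);  // stand-in for the HTTP call to the Node service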



  • @Benjamin-Hall Found it. It's not that piece of code at all. It's the brain-dead, moronic, absolutely :wtf: "authentication check" that happens on every single API call. And that's taking between 60 and 80% of the total time for the calls. And likely running into locking contention pretty badly, since that means hitting the same columns over and over again. And I have a sneaking suspicion that at least some of the columns it's referencing are not indexed.
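
    For the record, checking that suspicion is straightforward -- something like this, where the table and column names are stand-ins for the real schema, and the credentials are hypothetical:

        <?php
        // Quick check of the "not indexed" suspicion -- 'logins' and 'token' are stand-ins.
        $db = mysqli_connect('db-host', 'user', 'pass', 'service_db');  // hypothetical credentials

        // What indexes exist on the table the auth check hits?
        $res = mysqli_query($db, 'SHOW INDEX FROM logins');
        while ($row = mysqli_fetch_assoc($res)) {
            echo $row['Key_name'], ' covers ', $row['Column_name'], PHP_EOL;
        }

        // Does the auth SELECT actually use one? (EXPLAIN on SELECTs works fine on MySQL 5.5.)
        $res = mysqli_query($db, "EXPLAIN SELECT * FROM logins WHERE token = 'abc'");
        print_r(mysqli_fetch_assoc($res));   // look at the 'key' and 'rows' columns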


  • Notification Spam Recipient

    @Benjamin-Hall said in Re: API-Call Hell (from Status Thread):

    brain-dead, moronic, absolutely

    Ohboi! Are they at least storing the passwords in plain text? 🙃


  • Discourse touched me in a no-no place

    @Benjamin-Hall said in Re: API-Call Hell (from Status Thread):

    It's the brain-dead, moronic, absolutely :wtf: "authentication check" that happens on every single API call. And that's taking between 60 and 80% of the total time for the calls.

    I remember having that class of problem long ago. Service calls were taking ages (which mattered as there were a lot of them). It was the auth check. In my case, the database it was checking against was fast enough… but bcrypt isn't a fast algorithm and I wasn't using sessions…



  • @dkf said in Re: API-Call Hell (from Status Thread):

    @Benjamin-Hall said in Re: API-Call Hell (from Status Thread):

    It's the brain-dead, moronic, absolutely :wtf: "authentication check" that happens on every single API call. And that's taking between 60 and 80% of the total time for the calls.

    I remember having that class of problem long ago. Service calls were taking ages (which mattered as there were a lot of them). It was the auth check. In my case, the database it was checking against was fast enough… but bcrypt isn't a fast algorithm and I wasn't using sessions…

    In this case, I've tracked it down further. The (multitude of) SELECT queries? Super fast and stable. 20 ms, most of which is probably network latency to AWS and back. The UPDATE on the cookie expiration? Slow, and variably so. 99% of the auth time is spent there, which means ~70% of the total time per query is spent there. And it's a pretty simple query with indexes, keying off of a primary key.

    Hypothesis--since each auth call (ie every API call) does a SELECT from the logins (ie cookie) table, they're locking the rows. So if you make a single call, all is well. The SELECT has released the lock well before the UPDATE tries to change things. But if you make a bunch in short succession, or you have a bunch of API calls that all get fired in parallel, query 4 locks the row just as query 1 gets around to updating that same row. Or whatever. And thus, wildly inconsistent and variable timings happen.

    Except...mysql doesn't seem (by default) to do read locks on select statements. So no.

    The query in question ($login_index was read earlier in the process and is a primary key to that table):

    UPDATE logins SET expires=DATE_ADD( NOW(), INTERVAL 20 MINUTE ) WHERE expires < DATE_ADD( NOW(), INTERVAL 20 MINUTE ) AND id = $login_index LIMIT 1
    

    And yes, this is being done as mysqli_query($db, $query);.
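
    And if it really is lock contention, MySQL 5.5's InnoDB tables in information_schema should show it while a burst of these calls is in flight. Roughly (hypothetical connection details, assumes InnoDB and the standard information_schema lock tables):

        <?php
        // Sketch: peek at InnoDB lock waits while the parallel API calls are running.
        $db = mysqli_connect('db-host', 'user', 'pass');  // hypothetical credentials

        $sql = "SELECT r.trx_id AS waiting_trx, r.trx_query AS waiting_query,
                       b.trx_id AS blocking_trx, b.trx_query AS blocking_query
                FROM information_schema.INNODB_LOCK_WAITS w
                JOIN information_schema.INNODB_TRX r ON r.trx_id = w.requesting_trx_id
                JOIN information_schema.INNODB_TRX b ON b.trx_id = w.blocking_trx_id";

        $res = mysqli_query($db, $sql);
        while ($row = mysqli_fetch_assoc($res)) {
            print_r($row);  // each row is one transaction stuck waiting on another's row lock
        }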


  • Discourse touched me in a no-no place

    @Benjamin-Hall Is it possible to move the UPDATE to a background worker thread? Having some extra delay of up to 1% of the timeout won't matter too much.



  • @dkf Does PHP have background worker threads? Or async anything? If so, just fire it async and forget about it.

    What I did as a palliative was make it only try to do the update when 5 minutes or less remained before expiration. That way, you'd only see the hit every 20 min or so, and only if you happened to be midway through loading a page right then. We're likely to completely overhaul our auth process relatively soon, so...
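
    Roughly, the palliative looks like this -- an approximation with stand-in names and credentials, not the exact production code:

        <?php
        // Sketch of the palliative: only refresh the cookie expiry when it's within
        // 5 minutes of running out, so most API calls skip the contended UPDATE entirely.
        $db = mysqli_connect('db-host', 'user', 'pass', 'service_db');  // hypothetical credentials
        $login_index = 12345;  // stand-in; the real value is read earlier in the auth check

        $row = mysqli_fetch_assoc(mysqli_query($db,
            "SELECT expires FROM logins WHERE id = $login_index LIMIT 1"));

        $remaining = strtotime($row['expires']) - time();
        if ($remaining <= 5 * 60) {
            // Within 5 minutes of expiring: push it out to now + 20 minutes again.
            mysqli_query($db,
                "UPDATE logins SET expires = DATE_ADD(NOW(), INTERVAL 20 MINUTE)
                 WHERE expires < DATE_ADD(NOW(), INTERVAL 20 MINUTE) AND id = $login_index LIMIT 1");
        }
        // Otherwise: skip the UPDATE; the session still has plenty of time left.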



  • I see I didn't post what I discovered was the cause of the inconsistency. Not row locking on the SELECT--row locking on the UPDATE itself.

    Before a couple days ago, our database and our web server were co-located in the same datacenter. Same rack even. So network latency was effectively 0. So if you'd fire off 4 requests, you'd rarely have contention for the update command--even small differences in timing due to being fired in a loop mean that the first ones would have released the lock before the next one came in.

    Now, we've got the DB in AWS. Sure, it's in the same region (the Oregon datacenter) as the web server, but the added ~20ms of just pure network latency means that you've got a wider window for the lock to be held when another one gets there. Which means that the first customer blocks (taking only a few milliseconds, but holding the lock for what appears to be a bit longer), which blocks the next one, which blocks the next one, and exponential backoff means things go to pot real fast.

    This wasn't a problem for the other pieces of the website--they mostly made 3-5 requests, all in serial (not firing the next one until the first returned to the browser). So the lock was never contentious, and all was happy. Our team's piece fired requests in parallel....
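
    The serialization itself is easy to reproduce in isolation, by the way -- a sketch with hypothetical credentials and row id:

        <?php
        // Demonstration: two connections issuing the same UPDATE serialize on InnoDB's
        // row lock, so the second one just sits there until the first transaction commits.
        $a = mysqli_connect('db-host', 'user', 'pass', 'service_db');
        $b = mysqli_connect('db-host', 'user', 'pass', 'service_db');

        mysqli_query($a, "START TRANSACTION");
        mysqli_query($a, "UPDATE logins SET expires = DATE_ADD(NOW(), INTERVAL 20 MINUTE) WHERE id = 1");
        // Row 1 is now locked by connection $a until it commits.

        mysqli_query($b, "SET innodb_lock_wait_timeout = 2");   // keep the demo short
        $t = microtime(true);
        mysqli_query($b, "UPDATE logins SET expires = DATE_ADD(NOW(), INTERVAL 20 MINUTE) WHERE id = 1");
        printf("second UPDATE waited %.3fs (%s)\n", microtime(true) - $t, mysqli_error($b));
        // Prints roughly 2s plus a lock-wait-timeout error: it's the row lock, not the
        // query itself, that the second caller was waiting on. Add ~20ms of network
        // latency per round trip and a burst of parallel requests, and the waits pile up.

        mysqli_commit($a);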


  • Considered Harmful

    @Benjamin-Hall said in Re: API-Call Hell (from Status Thread):

    I see I didn't post what I discovered was the cause of the inconsistency. Not row locking on the SELECT--row locking on the UPDATE itself.

    Before a couple days ago, our database and our web server were co-located in the same datacenter. Same rack even. So network latency was effectively 0. So if you'd fire off 4 requests, you'd rarely have contention for the update command--even small differences in timing due to being fired in a loop mean that the first ones would have released the lock before the next one came in.

    Now, we've got the DB in AWS. Sure, it's in the same region (the Oregon datacenter) as the web server, but the added ~20ms of just pure network latency means that you've got a wider window for the lock to be held when another one gets there. Which means that the first customer blocks (taking only a few milliseconds, but holding the lock for what appears to be a bit longer), which blocks the next one, which blocks the next one, and exponential backoff means things go to pot real fast.

    This wasn't a problem for the other pieces of the website--they mostly made 3-5 requests, all in serial (not firing the next one until the first returned to the browser). So the lock was never contentious, and all was happy. Our team's piece fired requests in parallel....

    Well gol-ly sounds like you gotcher self a dee-stributed system there.


  • Discourse touched me in a no-no place

    @Benjamin-Hall said in Re: API-Call Hell (from Status Thread):

    Does PHP have background worker threads? Or async anything? If so, just fire it async and forget about it.

    The answer probably seems to be no. Some of the workarounds sound like they are what happens when Rube Goldberg travels to Innsmouth.



  • @dkf said in Re: API-Call Hell (from Status Thread):

    @Benjamin-Hall said in Re: API-Call Hell (from Status Thread):

    Does PHP have background worker threads? Or async anything? If so, just fire it async and forget about it.

    The answer probably seems to be no. Some of the workarounds sound like they are what happens when Rube Goldberg travels to Innsmouth.

    Figured as much. And yes, codethullu is already present in the codebase, no need to summon it more.


  • Considered Harmful

    @dkf he has to go there to confer with the PHP maintainers


  • Java Dev

    @dkf Yeah, AFAIK no multi-threading or anything like it in PHP. We use two workarounds in limited places: For dashboarding, each widget is fetched in a separate request from the browser. And graph rendering is done by shellouts to librsvg, which can be parallelised. No, we do not support browser-side svg rasterising yet.

