
Related to HTTP timeouts, I’ve run into database clients without default timeouts. This meant that even though the HTTP request was cut off after the timeout, the database request kept running, leaving tons of slow database queries executing on the server.

With Postgres you can use roles to set timeouts: maybe you want a longer timeout for cron jobs and a shorter one for HTTP endpoints.
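A sketch of that role-based approach, using Postgres's `statement_timeout` setting (the role names `web_user` and `cron_user` are hypothetical):

```sql
-- Shorter timeout for the role HTTP endpoints connect as
ALTER ROLE web_user SET statement_timeout = '5s';

-- Longer timeout for the role cron jobs connect as
ALTER ROLE cron_user SET statement_timeout = '10min';
```

Any session that logs in as the given role picks up its timeout automatically, so application code doesn't have to remember to set one.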

Sadly we were using Mongo, which doesn’t have equivalent functionality. We ended up monkey-patching the client library to define a reasonable default timeout.



That isn't due to a missing timeout; it's due to not properly communicating aborted requests down the stack, which, admittedly, isn't always easy, and some clients/languages/etc. are very bad at it. A hardcoded timeout, while a fine workaround in some applications, is not a good default and not the proper fix for that.

Default timeouts in the database layer are hidden time bombs: they turn operations that legitimately take a bit longer than some value the library author set (a value you didn't even know existed) into failures that get retried over and over, causing even more load than just doing the thing once. Don't get me wrong, there are lots of uses for strict timeouts, and being able to set them is very important, but as a default, no thanks.


You sometimes won't know a TCP connection has been closed until you try to write to it (there is a select/epoll-style way to test), so if you are using blocking I/O, you won't know that the HTTP client went away long ago.
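For reference, the select-style check looks roughly like this on POSIX systems: a peer close makes the socket readable, and a peeking `recv` then returns `b''` (the helper name `peer_closed` is made up):

```python
import select
import socket


def peer_closed(sock):
    """Best-effort check whether the remote end has closed the connection."""
    # Non-blocking poll: timeout of 0 means "just check, don't wait".
    readable, _, _ = select.select([sock], [], [], 0)
    if not readable:
        return False  # nothing to read, peer is (apparently) still alive
    try:
        # Peek so we don't consume real request data if the peer is alive.
        return sock.recv(1, socket.MSG_PEEK) == b''
    except (ConnectionResetError, OSError):
        return True
```

The catch remains: something has to call this periodically, which a thread blocked on database I/O never does.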


I highly advise turning on TCP keepalive to detect dropped connections.
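In Python that looks roughly like this; the Linux-specific knobs are guarded because they don't exist on every platform, and the specific values are illustrative:

```python
import socket


def enable_keepalive(sock, idle=60, interval=10, count=3):
    """Turn on TCP keepalive so dead peers are detected within minutes."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Linux-specific tuning: start probing after `idle` seconds of silence,
    # probe every `interval` seconds, give up after `count` failed probes.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)
```

Note the OS defaults are usually far too slow for this purpose (two hours of idle time on Linux before the first probe), so tuning the intervals matters.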


Sure. But the parent poster's point was that you still won't observe the error unless you interact with the socket again. If you have a blocking thread-per-request model and your thread is blocked on the database I/O, it won't look at the original request (and its source socket) during that time.

There is no great OS-level solution for handling this. You kind of need to run async I/O on the lowest layer, so you can at least receive the read-readiness and associated close/reset notification, and somehow forward that to the application stack (maybe in the form of a `CancellationToken`).
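A sketch of that idea with Python's asyncio: race the DB work against a coroutine that completes when the client goes away (in a real server, `disconnect_coro` would be e.g. an `await reader.read()` that returns when the peer closes; the helper name is made up):

```python
import asyncio


async def run_with_disconnect(db_coro, disconnect_coro):
    """Run DB work, but cancel it as soon as the client disconnects."""
    db_task = asyncio.ensure_future(db_coro)
    disconnect = asyncio.ensure_future(disconnect_coro)
    done, _ = await asyncio.wait(
        {db_task, disconnect}, return_when=asyncio.FIRST_COMPLETED)
    if db_task in done:
        disconnect.cancel()  # no longer need the close watcher
        return db_task.result()
    db_task.cancel()         # forward the cancellation down the stack
    try:
        await db_task
    except asyncio.CancelledError:
        pass
    return None              # request was abandoned by the client
```

The key property is that the server thread is never blocked on just the database: it is always also observing the client socket.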


Requests should be kept alive. For any system, if the requester goes away, eventually the system should stop doing work on their behalf. That seems like the root of the problem in the situation you're describing.


Yes, I'd agree, but a fair number of databases' wire protocols have no means of saying "this request is cancelled"!

As for "if the requester goes away", remember that the requester might be a few hops away. E.g., the HTTP connection from the mobile client drops; the web server and its connection to the DB is still alive and well. I can forcefully shut that connection, but that's somewhat of a drag (I'd rather keep it open, since it is perfectly good).

Beyond closing the connection, support for issuing some form of "cancel this query" request is spotty: HTTP/1 lacks it entirely, PostgreSQL requires opening a separate connection, Redis lacks it, and I think both Mongo and MySQL lack it entirely.

Even support for "time this request out" is spotty.


> PostgreSQL requires opening a separate connection, Redis lacks it entirely, and I think both Mongo and MySQL lack it entirely.

MySQL has a KILL command, which does need to be issued on another connection (and might also need extra permissions; it's been a while since I used a lot of MySQL). This was definitely a pain point when things went sideways.


MongoDB has killOp which will allow a user to kill any operation they have an operation ID for:

https://docs.mongodb.com/manual/reference/method/db.killOp/


This (and the content of the article) has been one of the (dull) recurring themes in my career, I feel. Finding and adding timeouts and trying to prevent databases from chasing their own tail.

A co-worker of mine actually added support for timeouts to a database we were using. (It is a smaller, less-well known DB.) I added it to the Python side.

Good cancellation support in the language is really critical here, I found. In Python, it was a breeze to add timeouts and get rid of long-running requests if, say, the network connection dropped: you cancel the future, and that cancellation propagates to all the sub-futures. It is even hookable, so that one can propagate that cancellation across the wire to other services if the network protocol supports it.
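A toy illustration of that propagation (all names here are made up): cancelling the outer task raises `CancelledError` inside the innermost await, and the except block is exactly the hook where you could send a cancel over the wire.

```python
import asyncio


async def db_query(log):
    try:
        await asyncio.sleep(30)            # stand-in for a slow query
    except asyncio.CancelledError:
        log.append("db query cancelled")   # hook: cancel server-side here
        raise                              # keep propagating the cancellation


async def handle_request(log):
    await db_query(log)                    # cancellation flows through here


async def main():
    log = []
    task = asyncio.ensure_future(handle_request(log))
    await asyncio.sleep(0)                 # let the request start running
    task.cancel()                          # e.g. the connection dropped
    try:
        await task
    except asyncio.CancelledError:
        pass
    return log
```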

The DB in our case was written in Go, however, so that was tougher. Golang's best method (that we learned of at the time) is to thread a "Context" object through your code paths. We were working with existing code, of course, and it lacked this, and it's harder to add in hindsight.

Of course, once we got the server to stop hanging on queries of doom and return a more appropriate "that's a query of doom, and would hang the server" error, the complaint was that the server wasn't executing those queries anymore…


The asyncio cancel is convenient but can be really deceiving. If you want to reliably propagate the cancel across the wire, you likely have to make new await calls inside your CancelledError block. But doing this requires you to await the cancelled task again!

    import asyncio
    from asyncio import CancelledError

    async def process():
        try:
            await db.slow_operation()
        except CancelledError:
            synchronous_functions_work()
            await db.cancel()   # This future will not complete on timeout
 
    p = process()
    try:
        await asyncio.wait_for(p, timeout=1)
    except asyncio.TimeoutError:
        await p   # Required for db.cancel() to run!!!
Now, fire-and-forget on a timeout is perhaps the most reasonable approach (otherwise you'd get timeouts on your timeouts), so a better implementation would restart p without awaiting it, or put it on a background cancel-list. But it can be really confusing when you are not aware of this behavior.

Edit: Seems they actually fixed/changed this in 3.7: https://bugs.python.org/issue32751. So instead you have to write robust except-blocks that must never time out themselves.
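One way to keep the except-block robust under the 3.7+ behavior is to bound the cleanup itself with its own timeout. A sketch (everything here is made up: `fake_db_cancel` stands in for sending a cancel request on the wire):

```python
import asyncio


async def fake_db_cancel():
    await asyncio.sleep(0)        # stand-in for a quick "cancel" round trip


async def process(log):
    try:
        await asyncio.sleep(30)   # stand-in for db.slow_operation()
    except asyncio.CancelledError:
        # Since 3.7, wait_for() awaits this handler before raising
        # TimeoutError, so bound the cleanup to avoid hanging the caller.
        try:
            await asyncio.wait_for(fake_db_cancel(), timeout=1)
            log.append("cleanup ok")
        except asyncio.TimeoutError:
            log.append("cleanup timed out")
        raise                     # let the cancellation complete normally


async def main():
    log = []
    try:
        await asyncio.wait_for(process(log), timeout=0.01)
    except asyncio.TimeoutError:
        pass
    return log
```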



