Opened 16 months ago

Last modified 16 months ago

#30398 assigned New feature

Check database connection's health before it's reused

Reported by: Przemysław Suliga
Owned by: Przemysław Suliga
Component: Database layer (models, ORM)
Version: master
Severity: Normal
Keywords:
Cc:
Triage Stage: Accepted
Has patch: yes
Needs documentation: yes
Needs tests: yes
Patch needs improvement: no
Easy pickings: no
UI/UX: no

Description

With persistent database connections, when the DB server closes the connection (for example during a restart), the next request in Django that tries to use that persistent DB connection will fail with an error like this one, even after the DB server is ready to process queries again (example for PostgreSQL):

...
  File ".../django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
django.db.utils.OperationalError: server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.

After that, Django will close the connection in question. While handling the next request, that Django process will create a new DB connection and succeed. So with persistent connections (assuming all processes/workers for a Django app/project already have a DB connection open), every Django process will fail to handle one DB-accessing request after its connection is closed by the DB server.
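The fail-once-then-recover sequence described above can be sketched with a toy model. All names below are illustrative stand-ins, not Django's actual backend API (in real Django, the cleanup happens via the request-start/request-end signal handlers calling `close_if_unusable_or_obsolete()` on each connection):

```python
class ToyPersistentConnection:
    """Toy model of one worker's persistent DB connection (illustrative only)."""

    def __init__(self):
        self.generation = 0    # bumped each time the socket is (re)opened
        self.server_alive = True

    def execute(self, sql):
        if not self.server_alive:
            raise RuntimeError("server closed the connection unexpectedly")
        return "ok"

    def reopen(self):
        """Open a fresh connection, as Django does on the next request."""
        self.generation += 1
        self.server_alive = True


conn = ToyPersistentConnection()

# The DB server restarts while the worker holds a persistent connection.
conn.server_alive = False

# Request 1: the first query on the stale connection fails once ...
try:
    conn.execute("SELECT 1")
except RuntimeError:
    conn.reopen()  # ... and Django discards the dead connection afterwards

# Request 2: the worker has a fresh connection and succeeds.
assert conn.execute("SELECT 1") == "ok"
assert conn.generation == 1
```

This is exactly the "at most one 500 error per worker" behavior quoted from #24810 below.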

This isn't news. Quoting from #24810:

When the database closes the connection, for instance because the database server was restarted, an exception will be raised the next time user code attempts to use the database. Django doesn't attempt to do anything about it.

This is expected to result in at most one 500 error for each thread that serves requests, since database connections are thread-local. This doesn't sound horrific. After all the database was restarted while the website was serving requests.

We could try to prevent these errors by health-checking the connection before every reuse attempt (currently the health check is only done after an error has already been detected on the connection).

This work-in-progress patch (no tests or docs yet) introduces a CONN_HEALTH_CHECK_DELAY DB setting to Django:

  • If it's set to None (default), the behavior is the same as before.
  • Otherwise, it expects an integer number of seconds after which the connection should be health-checked before it's reused (0 meaning "before every reuse").
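With the proposed setting, a configuration might look like the following. CONN_HEALTH_CHECK_DELAY is this patch's proposed key, not an existing Django setting, and the database name is a placeholder:

```python
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "mydb",       # placeholder database name
        "CONN_MAX_AGE": 600,  # keep connections open between requests
        # Proposed by this patch (not part of released Django):
        # health-check a reused connection at most once every 10 seconds.
        "CONN_HEALTH_CHECK_DELAY": 10,
    }
}
```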

When checks are enabled, it works as follows (assuming an already existing "persistent" DB connection):

  • Before each connection reuse, perform a connection test (a SELECT 1 query in the case of PostgreSQL).
  • This only happens if the last check was performed more than CONN_HEALTH_CHECK_DELAY seconds ago.
  • This only happens once per request (or async task; hi, Celery) regardless of how many transactions the "request" performs.
  • This only happens for "requests" that actually use the database.
  • If the check fails, close the connection and create a new one.
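The steps above could be modeled roughly as follows. `HealthCheckedConnection`, `ensure_healthy`, and the module-level constant are hypothetical names for illustration only, not the patch's actual implementation, and a boolean flag stands in for the real socket probe:

```python
import time

CONN_HEALTH_CHECK_DELAY = 10  # seconds; 0 would mean "check on every reuse"


class HealthCheckedConnection:
    """Illustrative model of the proposed throttled health check.

    Hypothetical class: the real patch hooks into Django's backend
    wrapper; here `server_alive` simulates the DB server's state.
    """

    def __init__(self):
        self.server_alive = True
        self.last_health_check = 0.0  # timestamp of the last probe
        self.reconnects = 0

    def is_usable(self):
        # Real backends run a cheap probe query (SELECT 1 on PostgreSQL).
        return self.server_alive

    def ensure_healthy(self, now=None):
        """Probe at most once per CONN_HEALTH_CHECK_DELAY window."""
        now = time.monotonic() if now is None else now
        if now - self.last_health_check < CONN_HEALTH_CHECK_DELAY:
            return  # checked recently; skip the probe entirely
        self.last_health_check = now
        if not self.is_usable():
            # Check failed: drop the dead connection, open a fresh one.
            self.server_alive = True
            self.reconnects += 1
```

For example, a failed probe triggers one reconnect, and a second failure within the delay window is deliberately not probed; calling `ensure_healthy()` once at the start of each DB-using request gives the "once per request" behavior listed above.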

I feel like I'm probably missing something important here...
Anyway, on the other hand, what gives me hints that this might not be the worst idea in the world are examples from other places:

Granted, the above are _separate_ components (doing connection pooling and, in Active Record's case, some other things), and very likely a proper fix for #24810 might also fix the issue this patch addresses. On the other hand, the approach here addresses a different issue and might be simpler and more feasible (at least as far as the request/response cycle code goes)? Happy to see this through if something like this makes sense for Django.

---

My main motivation for researching this was trying to create a solution for Celery. With the default prefork concurrency, Celery tasks running Django ORM code work very nicely with a non-zero CONN_MAX_AGE. But the problem is that I don't want to end up losing messages/tasks because of these "DB server closes connections" events (however rare they might be; with lots of connections and threads/processes the number of errors grows, and lowering CONN_MAX_AGE is a trade-off that doesn't guarantee good enough safety). Celery tasks are by default ACK-ed before they're executed, so if they fail because of this, they're lost. But it seems that Django itself could also benefit from this. And so here we are.

Change History (6)

comment:1 Changed 16 months ago by Carlton Gibson

Triage Stage: Unreviewed → Accepted

OK, yes, this seems reasonable to at least look at. (Assuming there's no massive performance hit or such...)

Aside: one of the main benefits of a move to async is that it allows easy creation of background tasks, for which this kind of health check might be suitable. (Not Celery's model, but related...)

comment:2 Changed 16 months ago by felixxm

Has patch: unset

comment:3 Changed 16 months ago by Przemysław Suliga

Owner: changed from nobody to Przemysław Suliga
Status: new → assigned

comment:4 Changed 16 months ago by Przemysław Suliga

Has patch: set

comment:5 Changed 16 months ago by felixxm

Needs documentation: set
Needs tests: set

comment:6 Changed 16 months ago by Ran Benita

At the risk of sidetracking the discussion a bit: given Django's async future, wouldn't this be the time to revisit the decision to use persistent connections rather than a connection pool? For most async frameworks (we use gevent, but it will be the same with ASGI), persistent connections don't work, since async tasks are not reused like threads/processes; instead, a task is spawned per request.
