#30398 closed New feature (fixed)
Check database connection's health before it is reused
Reported by: | Przemysław Suliga | Owned by: | Przemysław Suliga |
---|---|---|---|
Component: | Database layer (models, ORM) | Version: | dev |
Severity: | Normal | Keywords: | |
Cc: | Triage Stage: | Ready for checkin | |
Has patch: | yes | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
With persistent database connections, when the DB server closes the connection (for example during its restart), the next request that tries to use that persistent DB connection in Django will fail with an error (even after the DB server is ready to process queries) like this one (example for PostgreSQL):
... File ".../django/db/backends/utils.py", line 84, in _execute return self.cursor.execute(sql, params) django.db.utils.OperationalError: server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request.
After that, Django will close the connection in question. During the handling of the next request, that Django process will create a new DB connection and it will succeed. So with persistent connections (assuming all processes/workers for a Django app/project already have a DB connection open), every Django process will fail to handle one DB accessing request after its connection is closed by the DB server.
This isn't news. Quoting from #24810:
When the database closes the connection, for instance because the database server was restarted, an exception will be raised the next time user code attempts to use the database. Django doesn't attempt to do anything about it.
This is expected to result in at most one 500 error for each thread that serves requests, since database connections are thread-local. This doesn't sound horrific. After all the database was restarted while the website was serving requests.
We could try to prevent these errors by health checking the connection before every re-use attempt (currently the health check is only done after an error has already been detected for the connection).
This Work In Progress patch (no tests and docs yet) introduces a CONN_HEALTH_CHECK_DELAY
DB setting to Django:
- If it's set to
None
(default), the behavior is the same as before. - Otherwise, it expects an integer with a number of seconds after which the connection should be health checked before it's reused (
0
for "before every reuse").
When checks are enabled, it works as follows (assuming already existing "persistent" DB connection):
- Before each connection reuse, perform a connection test (
SELECT 1
query in case of PostgreSQL). - This only happens if last check was performed more than
CONN_HEALTH_CHECK_DELAY
seconds ago. - This only happens once per request (or async task - Hi Celery)) regardless of how many transactions the "request" will perform.
- This only happens for "requests" that do use the database.
- If the check fails, close the connection and create a new one.
I feel like I'm probably missing something important here...
Anyway, on the other hand, what gives me hints that this might not be the worst idea in the world are examples from other places:
django-db-geventpool
(used with success in some gunicorn/gevent/PostgreSQL setups) doesSELECT 1
every time it leases out an existing connection.- PgBouncer does the check query too (configurable by the server_check_delay setting - default 30 seconds).
- HikariCP (a popular Java JDBC connection pool) also does connection health checks before leasing out connections.
- Rails' Active Record seems like it also does something like that.
Granted, the above are _separate_ components (doing connection pooling and some other things in case of Active Record) and very likely a proper fix for #24810 might fix the issue this PR fixes. On the other hand this approach here is addressing a different issue and might be simpler and more feasible (at least as the request/response cycle code part goes)? Happy to see this through if something like this makes sense for Django.
---
My main motivation for researching this was trying to create a solution for Celery. With default prefork
concurrency, celery tasks running Django ORM code would work very nicely with non-zero CONN_MAX_AGE
. But the problem is I don't want to end up loosing messages/tasks because of these "db server closes connections" events (however rare they might be - with lots connections and threads/processes the number of errors gets bigger - Lowering the CONN_MAX_AGE
is a trade-off which doesn't guarantee good enough safety). And Celery tasks by default are ACK-ed before they're executed (so if they fail because of that, they're lost). But it seems that Django itself could also benefit from this. And so here we are.
Change History (13)
comment:1 by , 6 years ago
Triage Stage: | Unreviewed → Accepted |
---|
comment:2 by , 6 years ago
Has patch: | unset |
---|
comment:3 by , 6 years ago
Owner: | changed from | to
---|---|
Status: | new → assigned |
comment:5 by , 6 years ago
Needs documentation: | set |
---|---|
Needs tests: | set |
comment:6 by , 6 years ago
Taking a risk of detracting the discussion a bit, but given Djagno's async future, wouldn't this be the time to revisit the decision to use persistent connections over a connection pool? For most async frameworks (we use gevent, but will be the same with ASGI), persistent connections don't work since async tasks are not reused like threads/processes, instead a task is spawned per request.
comment:7 by , 3 years ago
Summary: | Check database connection's health before its reused → Check database connection's health before it is reused |
---|
comment:10 by , 3 years ago
Patch needs improvement: | unset |
---|
comment:11 by , 3 years ago
Triage Stage: | Accepted → Ready for checkin |
---|
OK, yes, this seems reasonable to at least look at. (Assuming there's no massive performance hit or such...)
Aside: once of the main benefits of a move to async is that it allows easy creation of background tasks, for which this kind of health check might be suitable. (Not celery's model, but related...)