id,summary,reporter,owner,description,type,status,component,version,severity,resolution,keywords,cc,stage,has_patch,needs_docs,needs_tests,needs_better_patch,easy,ui_ux 30398,Check database connection's health before it is reused,Przemysław Suliga,Przemysław Suliga,"With [https://docs.djangoproject.com/en/2.2/ref/databases/#persistent-connections persistent database connections], when the DB server closes the connection (for example during its restart), the next request that tries to use that persistent DB connection in Django will fail with an error (even after the DB server is ready to process queries) like this one (example for PostgreSQL): {{{ ... File "".../django/db/backends/utils.py"", line 84, in _execute return self.cursor.execute(sql, params) django.db.utils.OperationalError: server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. }}} After that, Django will close the connection in question. During the handling of the next request, that Django process will create a new DB connection and it will succeed. So with persistent connections (assuming all processes/workers for a Django app/project already have a DB connection open), every Django process will fail to handle one DB accessing request after its connection is closed by the DB server. This isn't news. Quoting from #24810: > When the database closes the connection, for instance because the database server was restarted, an exception will be raised the next time user code attempts to use the database. Django doesn't attempt to do anything about it. > > This is expected to result in at most one 500 error for each thread that serves requests, since database connections are thread-local. This doesn't sound horrific. After all the database was restarted while the website was serving requests. We could try to prevent these errors by [https://github.com/django/django/blob/80482e924953ec0e2484a9cb0f44bd5eeea93856/django/db/backends/base/base.py#L480-L490 health checking the connection] before every re-use attempt (currently the health check is only done after an error has already been detected for the connection). This [https://github.com/django/django/compare/master...suligap:db-conn-health-checks Work In Progress patch] (no tests and docs yet) introduces a `CONN_HEALTH_CHECK_DELAY` DB setting to Django: * If it's set to `None` (default), the behavior is the same as before. * Otherwise, it expects an integer with a number of seconds after which the connection should be health checked before it's reused (`0` for ""before every reuse""). When checks are enabled, it works as follows (assuming already existing ""persistent"" DB connection): * Before each connection reuse, perform a connection test (`SELECT 1` query in case of PostgreSQL). * This only happens if last check was performed more than `CONN_HEALTH_CHECK_DELAY` seconds ago. * This only happens once per request (or async task - [https://github.com/celery/celery/blob/3d387704d4a18db5aea758e9a26c1ee4d71659df/celery/fixups/django.py#L163-L192 Hi Celery])) regardless of how many transactions the ""request"" will perform. * This only happens for ""requests"" that do use the database. * If the check fails, close the connection and create a new one. I feel like I'm probably missing something important here... Anyway, on the other hand, what gives me hints that this might not be the worst idea in the world are examples from other places: - `django-db-geventpool` (used with success in some gunicorn/gevent/PostgreSQL setups) [https://github.com/jneight/django-db-geventpool/blob/a1a691d9cf031a7fa0c377f87a69e87bb54e7ea6/django_db_geventpool/backends/postgresql_psycopg2/psycopg2_pool.py#L44 does] `SELECT 1` every time it leases out an existing connection. - PgBouncer does the check query too (configurable by the [https://github.com/pgbouncer/pgbouncer/blob/cb9a1511a48f7f5fb3ddf258055c698c8004f00f/doc/config.md#server_check_delay server_check_delay] setting - default 30 seconds). - HikariCP (a popular Java JDBC connection pool) also does [https://github.com/brettwooldridge/HikariCP/blob/e8e9cdd00ab886b26340c05706485ddb092f5116/src/main/java/com/zaxxer/hikari/pool/HikariPool.java#L185 connection health checks] before leasing out connections. - Rails' Active Record [https://github.com/rails/rails/blob/b1fc1319dfbc450451bb0b410113dd6b7ab1d412/activerecord/lib/active_record/connection_adapters/abstract/connection_pool.rb#L859-L869 seems] like [https://github.com/rails/rails/blob/dc2d19b8bf693886bec520c1787166eb105c8567/activerecord/lib/active_record/connection_adapters/abstract_adapter.rb#L512-L517 it also] does [https://github.com/rails/rails/blob/dc2d19b8bf693886bec520c1787166eb105c8567/activerecord/lib/active_record/connection_adapters/mysql2_adapter.rb#L93-L95 something like that]. Granted, the above are _separate_ components (doing connection pooling and some other things in case of Active Record) and very likely a proper fix for #24810 might fix the issue this PR fixes. On the other hand [https://github.com/django/django/compare/master...suligap:db-conn-health-checks this approach here] is addressing a different issue and might be simpler and more feasible (at least as the request/response cycle code part goes)? Happy to see this through if something like this makes sense for Django. --- My main motivation for researching this was trying to create a solution for Celery. With default `prefork` concurrency, celery tasks running Django ORM code would work very nicely with non-zero `CONN_MAX_AGE`. But the problem is I don't want to end up loosing messages/tasks because of these ""db server closes connections"" events (however rare they might be - with lots connections and threads/processes the number of errors gets bigger - Lowering the `CONN_MAX_AGE` is a trade-off which doesn't guarantee good enough safety). And Celery tasks by default are [http://docs.celeryproject.org/en/v4.3.0/reference/celery.app.task.html#celery.app.task.Task.acks_late ACK-ed] before they're executed (so if they fail because of that, they're lost). But it seems that Django itself could also benefit from this. And so here we are. ",New feature,closed,"Database layer (models, ORM)",dev,Normal,fixed,,,Ready for checkin,1,0,0,0,0,0