#31373 closed Bug (invalid)
Django 3.x / uwsgi is leaking connections.
Reported by: | Marcin Nowak | Owned by: | nobody |
---|---|---|---|
Component: | Database layer (models, ORM) | Version: | 3.0 |
Severity: | Release blocker | Keywords: | |
Cc: | Andrew Godwin, Simon Charette | Triage Stage: | Accepted |
Has patch: | no | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description (last modified by )
Project upgraded from 2.x to 3.0.4 is leaking PostgreSQL database connections.
ATOMIC_REQUESTS=False, AUTOCOMMIT=True, CONN_MAX_AGE=0
UWSGi threads are enabled (--enable-threads
)
Python 3.8.1 and 3.7.3
After few hours all posgtress connections are used (too many connections error).
pg_stat_activity is reporting idle connections with part of queries visible.
Facts:
- setting up CONN_MAX_AGE > 0 does not help
- CONN_MAX_AGE = 30, starting
runserver
and making many calls withab -c 10 -n 100 http://127.0.0.1:8000/
uses all available connections and ends with error 500 (due to no available connections) - CONN_MAX_AGE = 30, starting
runserver --nothreading
and making calls withab -c 100 -n 100 http://127.0.0.1:8000/
works fine - CONN_MAX_AGE=30 + runserver + ab scenario is reproducible on clean install. Called view must access database.
- CONN_MAX_AGE=0 + runserver + ab -> cannot reproduce.
- Django 2.x has similar issue for CONN_MAX_AGE=30, but can reuse connections (still there are several errors on high traffic, but app server can "recover" and handle bunch of incoming requests)
- Looks like Django 3.x cannot reuse connections
- uwsgi + threads + CONN_MAX_AGE=0 on localhost -> cannot reproduce
Runserver (with threads) test results:
ab -c 10 -n 200
Django 3.0.4:
Complete requests: 200
Failed requests: 108
ab -c 10 -n 1000
Django 2.2.11:
Complete requests: 1000
Failed requests: 37
There may be two separate cases:
- not reusing connection when CONN_MAX_AGE is greater than 0
- leaking (not closing?) connections on uwsgi (?) with enabled threads (?) and CONN_MAX_AGE set to 0
I'm not sure about the 2nd case.
Attachments (3)
Change History (37)
by , 5 years ago
Attachment: | Firefox_Screenshot_2020-03-17T15-05-35.104Z.png added |
---|
comment:1 by , 5 years ago
Description: | modified (diff) |
---|
by , 5 years ago
Attachment: | Firefox_Screenshot_2020-03-17T19-42-34.977Z.png added |
---|
comment:2 by , 5 years ago
Description: | modified (diff) |
---|
comment:3 by , 5 years ago
Description: | modified (diff) |
---|
comment:4 by , 5 years ago
Description: | modified (diff) |
---|
comment:5 by , 5 years ago
Description: | modified (diff) |
---|
comment:6 by , 5 years ago
Description: | modified (diff) |
---|
comment:7 by , 5 years ago
Description: | modified (diff) |
---|
comment:8 by , 5 years ago
Description: | modified (diff) |
---|
comment:9 by , 5 years ago
Description: | modified (diff) |
---|
follow-up: 11 comment:10 by , 5 years ago
Resolution: | → needsinfo |
---|---|
Status: | new → closed |
comment:11 by , 5 years ago
Replying to Simon Charette:
I understand your desire to see this addressed but without a simplified test project that reproduces your issue it's impossible to determine whether or not Django is at fault and its not one of your projects third-party app that is.
Please reopen this issue if you can provide a reproduction case else it's simply not actionable by triagers that don't have access to the project you're trying to upgrade. Thanks!
The case about not reusing db connections is reproducible.
- Just create a new project
- Setup postgresql_psycopg2 db, set CONN_MAX_AGE=30,
- create view hitting the database (User.objects.count() for example), bind it to "/" route
- make sure your postgres has connection limit set to quite low value (mine is 100)
- manage.py runserver
- run
ab -c 10 -n 200 https://127.0.0.1:8000/
My case from production server is quite different, maybe related somehow. There is a chance that my project is causing the issue, but it looks that Django 3.x has own issue too, which maybe increases my problem. The most significant change in my project was upgrade of Django, from 2.x to 3.x
Sorry for chaotic description. It's been 16 hour of my today's work, for last few hours I was trying to solve the issue, find the cause (I gave up, sorry). Now I'm working on getting Django 2.x back, backporting some 3.x features to 2.x, so I cannot focus anymore on reproducing my issue in isolated environment (I know how important it is).
comment:12 by , 5 years ago
createdb connmaxage python -m venv /tmp/env . /tmp/env/bin/activate pip install django==3.0.4 pip install psycopg2-binary cd /tmp/ unzip connmaxage.zip cd connmaxage ./manage.py runserver
then (on the 2nd terminal)
ab -c 10 -n 1000 http://127.0.0.1:8000/
Result:
[17/Mar/2020 23:20:23] "GET / HTTP/1.0" 200 1 [17/Mar/2020 23:20:23] "GET / HTTP/1.0" 200 1 [17/Mar/2020 23:20:23] "GET / HTTP/1.0" 200 1 [17/Mar/2020 23:20:23] "GET / HTTP/1.0" 200 1 [17/Mar/2020 23:20:23] "GET / HTTP/1.0" 200 1 [17/Mar/2020 23:20:23] "GET / HTTP/1.0" 200 1 [17/Mar/2020 23:20:23] "GET / HTTP/1.0" 200 1 [17/Mar/2020 23:20:23] "GET / HTTP/1.0" 200 1 [17/Mar/2020 23:20:23] "GET / HTTP/1.0" 200 1 [17/Mar/2020 23:20:23] "GET / HTTP/1.0" 200 1 [17/Mar/2020 23:20:23] "GET / HTTP/1.0" 200 1 Internal Server Error: / Traceback (most recent call last): File "/tmp/env/lib/python3.8/site-packages/django/db/backends/base/base.py", line 220, in ensure_connection self.connect() File "/tmp/env/lib/python3.8/site-packages/django/utils/asyncio.py", line 26, in inner return func(*args, **kwargs) File "/tmp/env/lib/python3.8/site-packages/django/db/backends/base/base.py", line 197, in connect self.connection = self.get_new_connection(conn_params) File "/tmp/env/lib/python3.8/site-packages/django/utils/asyncio.py", line 26, in inner return func(*args, **kwargs) File "/tmp/env/lib/python3.8/site-packages/django/db/backends/postgresql/base.py", line 185, in get_new_connection connection = Database.connect(**conn_params) File "/tmp/env/lib/python3.8/site-packages/psycopg2/__init__.py", line 126, in connect conn = _connect(dsn, connection_factory=connection_factory, **kwasync) psycopg2.OperationalError: FATAL: sorry, too many clients already The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/tmp/env/lib/python3.8/site-packages/django/core/handlers/exception.py", line 34, in inner response = get_response(request) File "/tmp/env/lib/python3.8/site-packages/django/core/handlers/base.py", line 115, in _get_response response = self.process_exception_by_middleware(e, request) File "/tmp/env/lib/python3.8/site-packages/django/core/handlers/base.py", line 113, in _get_response response = wrapped_callback(request, *callback_args, **callback_kwargs) File "/tmp/connmaxage/connmaxage/main/views.py", line 7, in main return HttpResponse(str(User.objects.count())) File "/tmp/env/lib/python3.8/site-packages/django/db/models/manager.py", line 82, in manager_method return getattr(self.get_queryset(), name)(*args, **kwargs) File "/tmp/env/lib/python3.8/site-packages/django/db/models/query.py", line 397, in count return self.query.get_count(using=self.db) File "/tmp/env/lib/python3.8/site-packages/django/db/models/sql/query.py", line 517, in get_count number = obj.get_aggregation(using, ['__count'])['__count'] File "/tmp/env/lib/python3.8/site-packages/django/db/models/sql/query.py", line 502, in get_aggregation result = compiler.execute_sql(SINGLE) File "/tmp/env/lib/python3.8/site-packages/django/db/models/sql/compiler.py", line 1149, in execute_sql cursor = self.connection.cursor() File "/tmp/env/lib/python3.8/site-packages/django/utils/asyncio.py", line 26, in inner return func(*args, **kwargs) File "/tmp/env/lib/python3.8/site-packages/django/db/backends/base/base.py", line 260, in cursor return self._cursor() File "/tmp/env/lib/python3.8/site-packages/django/db/backends/base/base.py", line 236, in _cursor self.ensure_connection() File "/tmp/env/lib/python3.8/site-packages/django/utils/asyncio.py", line 26, in inner return func(*args, **kwargs) File "/tmp/env/lib/python3.8/site-packages/django/db/backends/base/base.py", line 220, in ensure_connection self.connect() File "/tmp/env/lib/python3.8/site-packages/django/db/utils.py", line 90, in __exit__ raise dj_exc_value.with_traceback(traceback) from exc_value File "/tmp/env/lib/python3.8/site-packages/django/db/backends/base/base.py", line 220, in ensure_connection self.connect() File "/tmp/env/lib/python3.8/site-packages/django/utils/asyncio.py", line 26, in inner return func(*args, **kwargs) File "/tmp/env/lib/python3.8/site-packages/django/db/backends/base/base.py", line 197, in connect self.connection = self.get_new_connection(conn_params) File "/tmp/env/lib/python3.8/site-packages/django/utils/asyncio.py", line 26, in inner return func(*args, **kwargs) File "/tmp/env/lib/python3.8/site-packages/django/db/backends/postgresql/base.py", line 185, in get_new_connection connection = Database.connect(**conn_params) File "/tmp/env/lib/python3.8/site-packages/psycopg2/__init__.py", line 126, in connect conn = _connect(dsn, connection_factory=connection_factory, **kwasync) django.db.utils.OperationalError: FATAL: sorry, too many clients already [17/Mar/2020 23:20:23] "GET / HTTP/1.0" 500 166202
comment:13 by , 5 years ago
Ab results for Django 3:
Complete requests: 1000 Failed requests: 901
Re-test with Django 2:
- in the same env
pip install django==2.2.11
- do same
ab
test:
Complete requests: 1000 Failed requests: 19
comment:14 by , 5 years ago
Cc: | added |
---|---|
Component: | Uncategorized → Database layer (models, ORM) |
Resolution: | needsinfo |
Severity: | Normal → Release blocker |
Status: | closed → new |
Triage Stage: | Unreviewed → Accepted |
Type: | Uncategorized → Bug |
Thanks for the sample project.
I managed to bisect the regression at a415ce70bef6d91036b00dd2c8544aed7aeeaaed which introduced async support in Django. Running ab
against cce47ff65a4dd3786c049ec14ee889e128ca7de9 didn't reproduce.
I didn't investigate why connections kept piling up but I suspect these lines are to blame.
Note that I only managed to reproduce against runserver
and just like Marcin I couldn't reproduce with uwsgi --enable-threads
locally. I also tried gunicorn --threads
and couldn't reproduce.
comment:15 by , 5 years ago
That would not surprise me - next things to investigate are:
- Of course, reverting that one change and seeing if it fixes it.
- Does uwsgi override any part of the threading module? This happens with gevent and, I think, mod_wsgi, and it causes havoc as they ruin thread IDs.
- Checking a couple of recent major versions of asgiref within Django's compatibility range, to see if the change was there (doubtful)
Likely cause is that the close_old_connections() task is never being run in what Local thinks is the same thread as before. I have a potential workaround we can do if it _is_ overriding thread IDs, but if not, we'll have to dive deeper and see what we can fix about it.
comment:16 by , 5 years ago
Cc: | added |
---|
Andrew what about runserver
?
I understand it's not meant to be used as a production http server but the fact the issue can be reproduced pretty easily makes me believe something might be wrong with this commit or asgiref.local.Local
.
comment:17 by , 5 years ago
Oh, I missed the runserver
comment - that would indeed indicate a severe problem, as it uses bog-standard WSGI threading.
I suspect the problem will be with Local, as it is likely not precisely replicating threading.local enough - working out how would be my line of investigation. On the plus side, if it is that, we can ship a fix to asgiref quickly out-of-cycle with Django and everyone will benefit.
I'd volunteer to look, but starting a new job and the current world situation means I won't have time to do the investigation for a little bit, but if we can trace it down to a direct behavioural difference I can likely write the fix.
comment:18 by , 5 years ago
Same results.
Running on stable/2.2.x
, just with runserver
, I see some psycopg2.OperationalError: FATAL: sorry, too many clients already
errors, but it does OK — it recovers for a while, before an error again, and so on. (It's only runserver...)
Upping to stable/3.0.x
, the limit is reached and then it never recovers.
On stable/3.0.x
, (and master
), reverting just the use of asgiref.local.Local
to threading.local
in the lines Simon identified in django/db/utils.py
leads to a better-than-stable/2.2.x behaviour. (It's essentially the same, but errors slightly less often, so presumably something else changed to account for that.)
(Disabling the auto-reloader makes no apparent difference.)
I suspect the problem will be with Local, as it is likely not precisely replicating threading.local enough
This looks favourite then. I shall try and dig-in tomorrow.
comment:19 by , 5 years ago
The root cause is probably different, but I think it's worth pointing out that asgiref.local.Local also has problems with Gunicorn + gevent: https://github.com/django/asgiref/issues/144
comment:20 by , 5 years ago
OK, I have attempted to rewrite Local to more closely match how threading
does it - I've pushed a new branch to asgiref
called new_local
. Could you test against that? https://github.com/django/asgiref/tree/new_local
comment:21 by , 5 years ago
I gave the asgiref new_local branch a run against runserver (with Django stable/3.0.x).
This again exhibits some errors but then recovers (similar to 2.2)
@Marcin are you able to try in your project?
comment:22 by , 5 years ago
Resolution: | → invalid |
---|---|
Status: | new → closed |
Summary: | Django 3.x / uwsgi is leaking connections → Django 3.x / uwsgi is leaking connections. |
We confirmed that it was an issue in asgiref
that was fixed in asgiref==3.2.7
(Thanks Andrew!).
With asgiref==3.2.5
:
Time taken for tests: 47.321 seconds Complete requests: 500 Failed requests: 400
With asgiref==3.2.7
:
Time taken for tests: 4.413 seconds Complete requests: 500 Failed requests: 0
Note: asgiref==3.2.6
introduced a regression and shouldn't be used.
comment:23 by , 5 years ago
@Marcin are you able to try in your project?
I think I will, but give me a some time.
comment:24 by , 5 years ago
Hi Marcin, thanks.
Since it’s been addressed in asgiref
, a pip install -U asgiref
(or equivalent, given that you probably don’t deploy by hand) should do it.
Do report back! :)
comment:25 by , 5 years ago
Should we bump asgi
to >=3.2.7
in the next 3.0 release and on master?
https://github.com/django/django/blob/ab5720ad80b48fac746021e826fe76d0869f9c02/setup.cfg#L42
follow-up: 29 comment:26 by , 5 years ago
I would rather put it as ~=3.2
as that will ensure that installations auto-upgrade to an asgiref 3.3 and beyond if they are released, but not a potentially-API-incompatible asgiref 4. If we use a simple greater-than, we'd have to also put a < 4
.
comment:27 by , 5 years ago
Andrew, agreed. As long as it reduces the amount of reports related to this hard to diagnose issue I think its worth doing.
comment:28 by , 5 years ago
Right. I don't think there's a situation other than an explicit version pin where someone will pull in an old version of asgiref (the only one I can think of is an out-of-date internal mirror).
comment:29 by , 5 years ago
Replying to Andrew Godwin:
I would rather put it as
~=3.2
as that will ensure that installations auto-upgrade to an asgiref 3.3 and beyond if they are released, but not a potentially-API-incompatible asgiref 4. If we use a simple greater-than, we'd have to also put a< 4
.
It is set that way in Django 3.0, see setup.py#L86. We can decide to pin a specific version of asgiref
to Django 3.1. after a feature freeze ~ May, 2020.
comment:30 by , 5 years ago
Hi Carlton.
I'll report back, but this is more complicated than just pip install or something like that.
I've backported few things and I'm monkey patching Django 2.x. I must revert these changes, switch to 3.x, and deploy a new release.
And because there is still a little risk, I have to be careful and deploy a new release at the time of low traffic, then I must have some time to observe metrics.
But I'll do this and give you a feedback, for sure.
comment:33 by , 5 years ago
I've updated to Django 3.0.4 and asgiref 3.2.7 - PosgreSQL connections graph looks good.
I think the issue was fixed. Thank you for fast reaction and bugfix.
comment:34 by , 5 years ago
After few days I can finally confirm that new asgiref version solves the issue. Thanks.
I understand your desire to see this addressed but without a simplified test project that reproduces your issue it's impossible to determine whether or not Django is at fault and its not one of your projects third-party app that is.
Please reopen this issue if you can provide a reproduction case else it's simply not actionable by triagers that don't have access to the project you're trying to upgrade. Thanks!