Opened 4 years ago

Closed 4 years ago

Last modified 4 years ago

#31373 closed Bug (invalid)

Django 3.x / uwsgi is leaking connections.

Reported by: Marcin Nowak Owned by: nobody
Component: Database layer (models, ORM) Version: 3.0
Severity: Release blocker Keywords:
Cc: Andrew Godwin, Simon Charette Triage Stage: Accepted
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description (last modified by Marcin Nowak)

Project upgraded from 2.x to 3.0.4 is leaking PostgreSQL database connections.
ATOMIC_REQUESTS=False, AUTOCOMMIT=True, CONN_MAX_AGE=0
UWSGi threads are enabled (--enable-threads)
Python 3.8.1 and 3.7.3

After few hours all posgtress connections are used (too many connections error).
pg_stat_activity is reporting idle connections with part of queries visible.



Facts:

  • setting up CONN_MAX_AGE > 0 does not help
  • CONN_MAX_AGE = 30, starting runserver and making many calls with ab -c 10 -n 100 http://127.0.0.1:8000/ uses all available connections and ends with error 500 (due to no available connections)
  • CONN_MAX_AGE = 30, starting runserver --nothreading and making calls with ab -c 100 -n 100 http://127.0.0.1:8000/ works fine
  • CONN_MAX_AGE=30 + runserver + ab scenario is reproducible on clean install. Called view must access database.
  • CONN_MAX_AGE=0 + runserver + ab -> cannot reproduce.
  • Django 2.x has similar issue for CONN_MAX_AGE=30, but can reuse connections (still there are several errors on high traffic, but app server can "recover" and handle bunch of incoming requests)
  • Looks like Django 3.x cannot reuse connections
  • uwsgi + threads + CONN_MAX_AGE=0 on localhost -> cannot reproduce

Runserver (with threads) test results:

ab -c 10 -n 200 Django 3.0.4:
Complete requests: 200
Failed requests: 108

ab -c 10 -n 1000 Django 2.2.11:
Complete requests: 1000
Failed requests: 37


There may be two separate cases:

  1. not reusing connection when CONN_MAX_AGE is greater than 0
  2. leaking (not closing?) connections on uwsgi (?) with enabled threads (?) and CONN_MAX_AGE set to 0

I'm not sure about the 2nd case.

Attachments (3)

Firefox_Screenshot_2020-03-17T15-05-35.104Z.png (63.9 KB ) - added by Marcin Nowak 4 years ago.
Firefox_Screenshot_2020-03-17T19-42-34.977Z.png (53.7 KB ) - added by Marcin Nowak 4 years ago.
connmaxage.zip (5.8 KB ) - added by Marcin Nowak 4 years ago.
Example project with CONN_MAX_AGE issue

Download all attachments as: .zip

Change History (37)

comment:1 by Marcin Nowak, 4 years ago

Description: modified (diff)

comment:2 by Marcin Nowak, 4 years ago

Description: modified (diff)

comment:3 by Marcin Nowak, 4 years ago

Description: modified (diff)

comment:4 by Marcin Nowak, 4 years ago

Description: modified (diff)

comment:5 by Marcin Nowak, 4 years ago

Description: modified (diff)

comment:6 by Marcin Nowak, 4 years ago

Description: modified (diff)

comment:7 by Marcin Nowak, 4 years ago

Description: modified (diff)

comment:8 by Marcin Nowak, 4 years ago

Description: modified (diff)

comment:9 by Marcin Nowak, 4 years ago

Description: modified (diff)

comment:10 by Simon Charette, 4 years ago

Resolution: needsinfo
Status: newclosed

I understand your desire to see this addressed but without a simplified test project that reproduces your issue it's impossible to determine whether or not Django is at fault and its not one of your projects third-party app that is.

Please reopen this issue if you can provide a reproduction case else it's simply not actionable by triagers that don't have access to the project you're trying to upgrade. Thanks!

Last edited 4 years ago by Simon Charette (previous) (diff)

in reply to:  10 comment:11 by Marcin Nowak, 4 years ago

Replying to Simon Charette:

I understand your desire to see this addressed but without a simplified test project that reproduces your issue it's impossible to determine whether or not Django is at fault and its not one of your projects third-party app that is.

Please reopen this issue if you can provide a reproduction case else it's simply not actionable by triagers that don't have access to the project you're trying to upgrade. Thanks!

The case about not reusing db connections is reproducible.

  1. Just create a new project
  2. Setup postgresql_psycopg2 db, set CONN_MAX_AGE=30,
  3. create view hitting the database (User.objects.count() for example), bind it to "/" route
  4. make sure your postgres has connection limit set to quite low value (mine is 100)
  5. manage.py runserver
  6. run ab -c 10 -n 200 https://127.0.0.1:8000/

My case from production server is quite different, maybe related somehow. There is a chance that my project is causing the issue, but it looks that Django 3.x has own issue too, which maybe increases my problem. The most significant change in my project was upgrade of Django, from 2.x to 3.x

Sorry for chaotic description. It's been 16 hour of my today's work, for last few hours I was trying to solve the issue, find the cause (I gave up, sorry). Now I'm working on getting Django 2.x back, backporting some 3.x features to 2.x, so I cannot focus anymore on reproducing my issue in isolated environment (I know how important it is).

by Marcin Nowak, 4 years ago

Attachment: connmaxage.zip added

Example project with CONN_MAX_AGE issue

comment:12 by Marcin Nowak, 4 years ago

createdb connmaxage
python -m venv /tmp/env
. /tmp/env/bin/activate
pip install django==3.0.4
pip install psycopg2-binary
cd /tmp/
unzip connmaxage.zip
cd connmaxage
./manage.py runserver

then (on the 2nd terminal)

ab -c 10 -n 1000 http://127.0.0.1:8000/

Result:

[17/Mar/2020 23:20:23] "GET / HTTP/1.0" 200 1
[17/Mar/2020 23:20:23] "GET / HTTP/1.0" 200 1
[17/Mar/2020 23:20:23] "GET / HTTP/1.0" 200 1
[17/Mar/2020 23:20:23] "GET / HTTP/1.0" 200 1
[17/Mar/2020 23:20:23] "GET / HTTP/1.0" 200 1
[17/Mar/2020 23:20:23] "GET / HTTP/1.0" 200 1
[17/Mar/2020 23:20:23] "GET / HTTP/1.0" 200 1
[17/Mar/2020 23:20:23] "GET / HTTP/1.0" 200 1
[17/Mar/2020 23:20:23] "GET / HTTP/1.0" 200 1
[17/Mar/2020 23:20:23] "GET / HTTP/1.0" 200 1
[17/Mar/2020 23:20:23] "GET / HTTP/1.0" 200 1
Internal Server Error: /
Traceback (most recent call last):
  File "/tmp/env/lib/python3.8/site-packages/django/db/backends/base/base.py", line 220, in ensure_connection
    self.connect()
  File "/tmp/env/lib/python3.8/site-packages/django/utils/asyncio.py", line 26, in inner
    return func(*args, **kwargs)
  File "/tmp/env/lib/python3.8/site-packages/django/db/backends/base/base.py", line 197, in connect
    self.connection = self.get_new_connection(conn_params)
  File "/tmp/env/lib/python3.8/site-packages/django/utils/asyncio.py", line 26, in inner
    return func(*args, **kwargs)
  File "/tmp/env/lib/python3.8/site-packages/django/db/backends/postgresql/base.py", line 185, in get_new_connection
    connection = Database.connect(**conn_params)
  File "/tmp/env/lib/python3.8/site-packages/psycopg2/__init__.py", line 126, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: FATAL:  sorry, too many clients already


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/tmp/env/lib/python3.8/site-packages/django/core/handlers/exception.py", line 34, in inner
    response = get_response(request)
  File "/tmp/env/lib/python3.8/site-packages/django/core/handlers/base.py", line 115, in _get_response
    response = self.process_exception_by_middleware(e, request)
  File "/tmp/env/lib/python3.8/site-packages/django/core/handlers/base.py", line 113, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/tmp/connmaxage/connmaxage/main/views.py", line 7, in main
    return HttpResponse(str(User.objects.count()))
  File "/tmp/env/lib/python3.8/site-packages/django/db/models/manager.py", line 82, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "/tmp/env/lib/python3.8/site-packages/django/db/models/query.py", line 397, in count
    return self.query.get_count(using=self.db)
  File "/tmp/env/lib/python3.8/site-packages/django/db/models/sql/query.py", line 517, in get_count
    number = obj.get_aggregation(using, ['__count'])['__count']
  File "/tmp/env/lib/python3.8/site-packages/django/db/models/sql/query.py", line 502, in get_aggregation
    result = compiler.execute_sql(SINGLE)
  File "/tmp/env/lib/python3.8/site-packages/django/db/models/sql/compiler.py", line 1149, in execute_sql
    cursor = self.connection.cursor()
  File "/tmp/env/lib/python3.8/site-packages/django/utils/asyncio.py", line 26, in inner
    return func(*args, **kwargs)
  File "/tmp/env/lib/python3.8/site-packages/django/db/backends/base/base.py", line 260, in cursor
    return self._cursor()
  File "/tmp/env/lib/python3.8/site-packages/django/db/backends/base/base.py", line 236, in _cursor
    self.ensure_connection()
  File "/tmp/env/lib/python3.8/site-packages/django/utils/asyncio.py", line 26, in inner
    return func(*args, **kwargs)
  File "/tmp/env/lib/python3.8/site-packages/django/db/backends/base/base.py", line 220, in ensure_connection
    self.connect()
  File "/tmp/env/lib/python3.8/site-packages/django/db/utils.py", line 90, in __exit__
    raise dj_exc_value.with_traceback(traceback) from exc_value
  File "/tmp/env/lib/python3.8/site-packages/django/db/backends/base/base.py", line 220, in ensure_connection
    self.connect()
  File "/tmp/env/lib/python3.8/site-packages/django/utils/asyncio.py", line 26, in inner
    return func(*args, **kwargs)
  File "/tmp/env/lib/python3.8/site-packages/django/db/backends/base/base.py", line 197, in connect
    self.connection = self.get_new_connection(conn_params)
  File "/tmp/env/lib/python3.8/site-packages/django/utils/asyncio.py", line 26, in inner
    return func(*args, **kwargs)
  File "/tmp/env/lib/python3.8/site-packages/django/db/backends/postgresql/base.py", line 185, in get_new_connection
    connection = Database.connect(**conn_params)
  File "/tmp/env/lib/python3.8/site-packages/psycopg2/__init__.py", line 126, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
django.db.utils.OperationalError: FATAL:  sorry, too many clients already

[17/Mar/2020 23:20:23] "GET / HTTP/1.0" 500 166202

comment:13 by Marcin Nowak, 4 years ago

Ab results for Django 3:

Complete requests:      1000
Failed requests:        901

Re-test with Django 2:

  1. in the same env pip install django==2.2.11
  2. do same ab test:
Complete requests:      1000
Failed requests:        19

comment:14 by Simon Charette, 4 years ago

Cc: Andrew Godwin added
Component: UncategorizedDatabase layer (models, ORM)
Resolution: needsinfo
Severity: NormalRelease blocker
Status: closednew
Triage Stage: UnreviewedAccepted
Type: UncategorizedBug

Thanks for the sample project.

I managed to bisect the regression at a415ce70bef6d91036b00dd2c8544aed7aeeaaed which introduced async support in Django. Running ab against cce47ff65a4dd3786c049ec14ee889e128ca7de9 didn't reproduce.

I didn't investigate why connections kept piling up but I suspect these lines are to blame.

Note that I only managed to reproduce against runserver and just like Marcin I couldn't reproduce with uwsgi --enable-threads locally. I also tried gunicorn --threads and couldn't reproduce.

Last edited 4 years ago by Simon Charette (previous) (diff)

comment:15 by Andrew Godwin, 4 years ago

That would not surprise me - next things to investigate are:

  • Of course, reverting that one change and seeing if it fixes it.
  • Does uwsgi override any part of the threading module? This happens with gevent and, I think, mod_wsgi, and it causes havoc as they ruin thread IDs.
  • Checking a couple of recent major versions of asgiref within Django's compatibility range, to see if the change was there (doubtful)

Likely cause is that the close_old_connections() task is never being run in what Local thinks is the same thread as before. I have a potential workaround we can do if it _is_ overriding thread IDs, but if not, we'll have to dive deeper and see what we can fix about it.

comment:16 by Simon Charette, 4 years ago

Cc: Simon Charette added

Andrew what about runserver?

I understand it's not meant to be used as a production http server but the fact the issue can be reproduced pretty easily makes me believe something might be wrong with this commit or asgiref.local.Local.

comment:17 by Andrew Godwin, 4 years ago

Oh, I missed the runserver comment - that would indeed indicate a severe problem, as it uses bog-standard WSGI threading.

I suspect the problem will be with Local, as it is likely not precisely replicating threading.local enough - working out how would be my line of investigation. On the plus side, if it is that, we can ship a fix to asgiref quickly out-of-cycle with Django and everyone will benefit.

I'd volunteer to look, but starting a new job and the current world situation means I won't have time to do the investigation for a little bit, but if we can trace it down to a direct behavioural difference I can likely write the fix.

comment:18 by Carlton Gibson, 4 years ago

Same results.

Running on stable/2.2.x, just with runserver, I see some psycopg2.OperationalError: FATAL: sorry, too many clients already errors, but it does OK — it recovers for a while, before an error again, and so on. (It's only runserver...)

Upping to stable/3.0.x, the limit is reached and then it never recovers.

On stable/3.0.x, (and master), reverting just the use of asgiref.local.Local to threading.local in the lines Simon identified in django/db/utils.py leads to a better-than-stable/2.2.x behaviour. (It's essentially the same, but errors slightly less often, so presumably something else changed to account for that.)

(Disabling the auto-reloader makes no apparent difference.)

I suspect the problem will be with Local, as it is likely not precisely replicating threading.local enough

This looks favourite then. I shall try and dig-in tomorrow.

Last edited 4 years ago by Carlton Gibson (previous) (diff)

comment:19 by Reupen Shah, 4 years ago

The root cause is probably different, but I think it's worth pointing out that asgiref.local.Local also has problems with Gunicorn + gevent: https://github.com/django/asgiref/issues/144

comment:20 by Andrew Godwin, 4 years ago

OK, I have attempted to rewrite Local to more closely match how threading does it - I've pushed a new branch to asgiref called new_local. Could you test against that? https://github.com/django/asgiref/tree/new_local

comment:21 by Carlton Gibson, 4 years ago

I gave the asgiref new_local branch a run against runserver (with Django stable/3.0.x).
This again exhibits some errors but then recovers (similar to 2.2)

@Marcin are you able to try in your project?

comment:22 by Mariusz Felisiak, 4 years ago

Resolution: invalid
Status: newclosed
Summary: Django 3.x / uwsgi is leaking connectionsDjango 3.x / uwsgi is leaking connections.

We confirmed that it was an issue in asgiref that was fixed in asgiref==3.2.7 (Thanks Andrew!).

With asgiref==3.2.5:

Time taken for tests:   47.321 seconds
Complete requests:      500
Failed requests:        400

With asgiref==3.2.7:

Time taken for tests:   4.413 seconds                                                                                                                  
Complete requests:      500                                                                                                                            
Failed requests:        0                                                                                                                              

Note: asgiref==3.2.6 introduced a regression and shouldn't be used.

comment:23 by Marcin Nowak, 4 years ago

@Marcin are you able to try in your project?

I think I will, but give me a some time.

comment:24 by Carlton Gibson, 4 years ago

Hi Marcin, thanks.

Since it’s been addressed in asgiref, a pip install -U asgiref (or equivalent, given that you probably don’t deploy by hand) should do it.

Do report back! :)

comment:25 by Simon Charette, 4 years ago

Should we bump asgi to >=3.2.7 in the next 3.0 release and on master?

https://github.com/django/django/blob/ab5720ad80b48fac746021e826fe76d0869f9c02/setup.cfg#L42

comment:26 by Andrew Godwin, 4 years ago

I would rather put it as ~=3.2 as that will ensure that installations auto-upgrade to an asgiref 3.3 and beyond if they are released, but not a potentially-API-incompatible asgiref 4. If we use a simple greater-than, we'd have to also put a < 4.

comment:27 by Simon Charette, 4 years ago

Andrew, agreed. As long as it reduces the amount of reports related to this hard to diagnose issue I think its worth doing.

comment:28 by Andrew Godwin, 4 years ago

Right. I don't think there's a situation other than an explicit version pin where someone will pull in an old version of asgiref (the only one I can think of is an out-of-date internal mirror).

in reply to:  26 comment:29 by Mariusz Felisiak, 4 years ago

Replying to Andrew Godwin:

I would rather put it as ~=3.2 as that will ensure that installations auto-upgrade to an asgiref 3.3 and beyond if they are released, but not a potentially-API-incompatible asgiref 4. If we use a simple greater-than, we'd have to also put a < 4.

It is set that way in Django 3.0, see setup.py#L86. We can decide to pin a specific version of asgiref to Django 3.1. after a feature freeze ~ May, 2020.

comment:30 by Marcin Nowak, 4 years ago

Hi Carlton.

I'll report back, but this is more complicated than just pip install or something like that.

I've backported few things and I'm monkey patching Django 2.x. I must revert these changes, switch to 3.x, and deploy a new release.
And because there is still a little risk, I have to be careful and deploy a new release at the time of low traffic, then I must have some time to observe metrics.

But I'll do this and give you a feedback, for sure.

comment:31 by Carlton Gibson, 4 years ago

Hi Marcin. Thanks! (Sounds fun... :)

comment:32 by Marcin Nowak, 4 years ago

Sounds fun... :)

Yes, I'm not strong in English.

comment:33 by Marcin Nowak, 4 years ago

I've updated to Django 3.0.4 and asgiref 3.2.7 - PosgreSQL connections graph looks good.
I think the issue was fixed. Thank you for fast reaction and bugfix.

comment:34 by Marcin Nowak, 4 years ago

After few days I can finally confirm that new asgiref version solves the issue. Thanks.

Note: See TracTickets for help on using tickets.
Back to Top