Opened 5 years ago
Closed 5 years ago
#31515 closed Bug (fixed)
Django tries to reuse MySQL connections that have gone away with ASGI.
Reported by: | Braunson | Owned by: | Carlton Gibson |
---|---|---|---|
Component: | Documentation | Version: | dev |
Severity: | Release blocker | Keywords: | |
Cc: | Andrew Godwin, Carlton Gibson | Triage Stage: | Accepted |
Has patch: | yes | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
This bug only appears to happen when using ASGI. No matter the CONN_MAX_AGE
it seems to be ignored. I have tried both manually setting to 0 so they are never reused and setting them to an hour without any effect.
If it makes a difference I am using multiple databases. Again this issue does NOT occur with WSGI, only with ASGI.
Is there any work around I can do to solve this in the meantime? It results in almost every page 500 erroring once or twice and a refresh fixes it temporarily for the client.
Attachments (1)
Change History (33)
follow-up: 2 comment:1 by , 5 years ago
comment:2 by , 5 years ago
Replying to Simon Charette:
This could be related #31373, which version of Django and asgiref are you using?
3.0.5, and I'm not explicitly using asgiref. I'm using uvicorn 0.11.3 to serve the requests.
follow-up: 4 comment:3 by , 5 years ago
You don't have to be explicitly using asgiref to be affected by the issue; Django uses it internally to keep reference to database connections and other globals.
Could you try upgrading to asgiref>=3.2.7
and see if it helps? Prior version had a bug which showed similar symptoms in #31373.
comment:4 by , 5 years ago
Replying to Simon Charette:
You don't have to be explicitly using asgiref to be affected by the issue; Django uses it internally to keep reference to database connections and other globals.
Could you try upgrading to
asgiref>=3.2.7
and see if it helps? Prior version had a bug which showed similar symptoms in #31373.
Unfortunately that requirement did not help. After a few hours of operation the error continues to appear again. So even with the max age set to 0 it keeps trying to reuse those old database connections long after the server terminates them.
follow-ups: 6 7 comment:5 by , 5 years ago
Thanks for trying it out. It also has bit of similarity with #30398 but the fact it only happens with ASGI is puzzling. Could you try setting up a minimal project that reproduces your issue to make it easier to identify its origin?
comment:6 by , 5 years ago
Replying to Simon Charette:
Thanks for trying it out. It also has bit of similarity with #30398 but the fact it only happens with ASGI is puzzling. Could you try setting up a minimal project that reproduces your issue to make it easier to identify its origin?
Been working hard to try and reproduce this in a development environment but I'm having a hard time. I can see either the timeout must be getting reached or there is a packet issue because the logs of our SQL server is filled with communication errors on initial packet whenever we hit a 500 error on the website.
Additionally (not sure if related) but I am able to reproduce issues with unclosed connections when running ASGI:
INFO: 127.0.0.1:58148 - "GET / HTTP/1.0" 200 OK INFO: logging in using static token ERROR: Unclosed client session client_session: <aiohttp.client.ClientSession object at 0x10cf64eb0> ERROR: Unclosed client session client_session: <aiohttp.client.ClientSession object at 0x10cf645e0> ERROR: Unclosed connector connections: ['[(<aiohttp.client_proto.ResponseHandler object at 0x10cedb2e0>, 20427.378)]'] connector: <aiohttp.connector.TCPConnector object at 0x10cf64580> INFO: 127.0.0.1:58150 - "GET / HTTP/1.0" 200 OK INFO: logging in using static token ERROR: Unclosed client session client_session: <aiohttp.client.ClientSession object at 0x10cf66b20> ERROR: Unclosed connector connections: ['[(<aiohttp.client_proto.ResponseHandler object at 0x10cf37ac0>, 20427.611)]'] connector: <aiohttp.connector.TCPConnector object at 0x10cf66880> INFO: 127.0.0.1:58152 - "GET / HTTP/1.0" 200 OK
However, again I'm having a really hard time reproducing the SQL connection bug that I've been experiencing. The ticket you mentioned above would surely solve this issue I'm experiencing but I would rather find the root cause as I'm sure you would.
In order to demonstrate the above connection errors I simply logged in and out of an async HTTP library for Discord, so that could be their own fault and not related to Django. What still seems to be related to Django is reuse of old DB connections with the MAX AGE set to 0.
comment:7 by , 5 years ago
Replying to Simon Charette:
Thanks for trying it out. It also has bit of similarity with #30398 but the fact it only happens with ASGI is puzzling. Could you try setting up a minimal project that reproduces your issue to make it easier to identify its origin?
Actually @Simon I'm not able to reproduce the specific "MySQL has gone away" issue on Django, but I AM able to reproduce the fact that Django seems to not be closing the database connections properly which could be causing it. My development machine when running a benchmark on this test code shows the following in the SQL server:
2020-04-26 1:10:37 154 [Warning] Aborted connection 154 to db: 'test' user: 'root' host: '172.17.0.1' (Got an error reading communication packets) 2020-04-26 1:12:28 168 [Warning] Aborted connection 168 to db: 'test' user: 'root' host: '172.17.0.1' (Got an error reading communication packets) 2020-04-26 1:13:13 171 [Warning] Aborted connection 171 to db: 'test' user: 'root' host: '172.17.0.1' (Got an error reading communication packets) 2020-04-26 1:17:32 178 [Warning] Aborted connection 178 to db: 'test' user: 'root' host: '172.17.0.1' (Got an error reading communication packets) 2020-04-26 1:21:33 280 [Warning] Aborted connection 280 to db: 'test' user: 'root' host: '172.17.0.1' (Got an error reading communication packets) 2020-04-26 1:21:33 282 [Warning] Aborted connection 282 to db: 'test' user: 'root' host: '172.17.0.1' (Got an error reading communication packets) 2020-04-26 1:21:33 187 [Warning] Aborted connection 187 to db: 'test' user: 'root' host: '172.17.0.1' (Got an error reading communication packets)
Additionally, with show processlist
I can see that Django is keeping connections open when doing nothing (they all disappear along with those errors in the SQL console when I shut down the asgi server), is this proper behaviour when the CONN_AGE is set to 0?
I packaged up a small test project. Run pip install -r requirements.txt && python manage.py migrate
, then startup the ASGI with uvicorn djangobug_repro.asgi:application --reload
Run a benchmark against the homepage and run show processlist;
in your SQL server to see the IDs not change, ie the connection is being reused. You can run a benchmark with ab: ab -c 20 -n 300
to see the connections reused. If I just refresh a couple times on my own I don't see the issue, it looks like everything gets closed off. But if you run the benchmark you see that Django starts to run behind on closing the connections?
I can also confirm this behaviour does not exist using WSGI with the same project (I used gunicorn with the benchmark, I commented out the async portion and just saved a random number to the db. Only 1 process existed in the SQL server)
by , 5 years ago
Attachment: | djangobug_repro.zip added |
---|
comment:8 by , 5 years ago
Cc: | added |
---|---|
Summary: | Django tries to reuse MySQL connections that have gone away with ASGI β Django tries to reuse MySQL connections that have gone away with ASGI. |
comment:9 by , 5 years ago
Using cache and sync method are unrelated, I can reproduce the same behavior with:
def home(request): users = UserProxy.objects.all() num = 1 users[0].discord = num users[0].save() return render(request, 'home.html', {'users': users[0], 'num': num})
comment:10 by , 5 years ago
Also, I cannot reproduce this issue on the current master, connections are closed properly at the end of each request.
comment:11 by , 5 years ago
This really should have been handled by that most recent asgiref release as you discussed above - unless gevent or something else is in there screwing up the threading module, I'm not sure what's happening here.
follow-up: 13 comment:12 by , 5 years ago
Andrew, it works properly with asgiref==3.2.7
on master. You can reproduce this issue on Django 3.0 with the same version of asgiref
π€.
comment:13 by , 5 years ago
Replying to felixxm:
Andrew, it works properly with
asgiref==3.2.7
on master. You can reproduce this issue on Django 3.0 with the same version ofasgiref
π€.
So it's broken in 3.0.5 with the latest asgiref
but is working in master somehow? Which would mean the next release fixed it inadvertently?
follow-up: 15 comment:14 by , 5 years ago
Well, the current master
has a whole swathe of different async fixes due to the work done around the async views work - I would not be surprised if this is what fixed it if you are comparing to 3.0. The signal handling, especially, was fixed, and not running the start/end request signals properly is one of the main causes of this kind of bug.
comment:15 by , 5 years ago
Replying to Andrew Godwin:
Well, the current
master
has a whole swathe of different async fixes due to the work done around the async views work - I would not be surprised if this is what fixed it if you are comparing to 3.0. The signal handling, especially, was fixed, and not running the start/end request signals properly is one of the main causes of this kind of bug.
Awesome. Glad to hear it's supposedly been fixed then. Is there a roadmap/schedule I can look at for when the master changes will make it into a release?
comment:17 by , 5 years ago
Replying to felixxm:
Braunson, see Version3.1Roadmap.
Ouch August for this fix? Is there any plan to backport it? Results in a lot of 500 errors from timed out connections. Anything I can do in the meantime?
follow-up: 19 comment:18 by , 5 years ago
Component: | Database layer (models, ORM) β Documentation |
---|---|
Severity: | Normal β Release blocker |
Triage Stage: | Unreviewed β Accepted |
Django 3.0 supports ASGI, however it can be really tricky to backport a fix from master. There is a big risk of introducing new regressions, so IMO we should document this limitation in Django 3.0.
I will try to add a proper regression test.
comment:19 by , 5 years ago
Replying to felixxm:
Django 3.0 supports ASGI, however it can be really tricky to backport a fix from master. There is a big risk of introducing new regressions, so IMO we should document this limitation in Django 3.0.
I will try to add a proper regression test.
There's no possible way for an end user to solve this in 3.0.5 then? Even if I set a really high timeout value for the SQL server, they still would have to expire eventually and Django will try to use them. It makes any database operations on 3.0.5 sort of impossible without having a timeout eventually no?
follow-up: 21 comment:20 by , 5 years ago
Well, we would have to find exactly where the bug is first before fixing it. Nothing I know of in my quick glance through the code made it obvious where it could be; we seem to correctly handle the close_old_connections
call even in Django 3.0 ASGI mode (not calling close_old_connections
is often the cause of this bug).
I never figured out why during Channels development, but the MySQL backend seems specifically prone to these sorts of issues even when it should be fine. I think if you set a non-zero CONN_MAX_AGE it should clear itself up, if you can take that operationally, otherwise I think you may have to run on WSGI rather than ASGI for now (unless we track down exactly what is doing this)
comment:21 by , 5 years ago
Replying to Andrew Godwin:
Well, we would have to find exactly where the bug is first before fixing it. Nothing I know of in my quick glance through the code made it obvious where it could be; we seem to correctly handle the
close_old_connections
call even in Django 3.0 ASGI mode (not callingclose_old_connections
is often the cause of this bug).
I never figured out why during Channels development, but the MySQL backend seems specifically prone to these sorts of issues even when it should be fine. I think if you set a non-zero CONN_MAX_AGE it should clear itself up, if you can take that operationally, otherwise I think you may have to run on WSGI rather than ASGI for now (unless we track down exactly what is doing this)
I did actually try setting a non-zero CONN_MAX_AGE
but the issue persists. And unfortunately we developed some features that rely on ASGI due to using async third party libraries.
follow-up: 23 comment:22 by , 5 years ago
Wait, how are you able to use async third party libraries in Django 3.0? While it has ASGI support, that's only to run in synchronous mode; if you're doing your own async stuff from Django views you want to be careful or you can cause issues like this.
comment:23 by , 5 years ago
Replying to Andrew Godwin:
Wait, how are you able to use async third party libraries in Django 3.0? While it has ASGI support, that's only to run in synchronous mode; if you're doing your own async stuff from Django views you want to be careful or you can cause issues like this.
Using the @async_to_sync decorators for some functions where awaits might be required (not async_to_sync directly on views). An example was included in the zip although I commented it out as it wasn't actually related to this database issue.
follow-up: 25 comment:24 by , 5 years ago
Oh, OK. Note that you can use those under WSGI Django as well and they'll perform very similarly - Django 3.0 does not have a true-asynchronous request path, so it's using one thread per request anyway.
comment:25 by , 5 years ago
Replying to Andrew Godwin:
Oh, OK. Note that you can use those under WSGI Django as well and they'll perform very similarly - Django 3.0 does not have a true-asynchronous request path, so it's using one thread per request anyway.
Thanks, we'll just not use asgi for now then :)
comment:26 by , 5 years ago
OK, I can't yet pin the exact difference between stable/3.0.x and master, but the issue looks at least partly to do with uvicorn.
I see a stable 20 connections opened. This ties in with what we saw on #31134:
I finally understood DB connection lifecycle and uvicorn workers. First of all uvicorn for every worker starts 10 threads so for 2 workers there will be 20 opened connections, for 10 workers it will be 100 connections.
Django requires to have non-transactional requests to close the connection after CONN_MAX_AGE and because when I set ATOMIC_REQUESTS to True, Django will always leave the open connection but firstly when the connection was open longer than CONN_MAX_AGE, Django will reestablish that connection once again.
comment:7
Running in WSGI mode, uvicorn still opens multiple DB connections, 10 it seems βΒ need to look at the number of workers there.
$ uvicorn --interface wsgi djangobug_repro.wsgi:application
VS the same thing with gunicorn, where we just have the single connection.
$ gunicorn djangobug_repro.wsgi:application
So even with WSGI uvicorn is behaving differently.
The reporter on #31134 concluded this was uvicorn's expected behaviour. (I'd need to look at uvicorn's source code to determine that for myself.)
We need to investigate further the difference with master. We're calling response.close()
inside sync_to_async
there. Adding that to stable/3.0.x doesn't resolve this issue. We're looking at if, on master, we need to use thread_sensitive
there.
comment:27 by , 5 years ago
After some investigation, I think we should close requests with thread_sensitive=True
(that's an issue on master). Without this we do not hit connections in the main thread and CONN_MAG_AGE=0
is not respected, Django reuses the same connection (from the same thread). I confirmed this checking performance_schema.threads.THREAD_OS_ID
.
With thread_sensitve=True
, connections are closed after each request as expected:
diff --git a/django/core/handlers/asgi.py b/django/core/handlers/asgi.py index 82d2e1ab9d..3bc275bc44 100644 --- a/django/core/handlers/asgi.py +++ b/django/core/handlers/asgi.py @@ -259,7 +259,7 @@ class ASGIHandler(base.BaseHandler): 'body': chunk, 'more_body': not last, }) - await sync_to_async(response.close)() + await sync_to_async(response.close, thread_sensitive=True)() @classmethod def chunk_bytes(cls, data):
I'm not sure why it doesn't help in Django 3.0, there is probably an extra issue.
comment:28 by , 5 years ago
I definitely endorse thread_sensitive on the close call - not sure why I didn't do it on master, but let's add it there. Database libraries are why that flag was added in the first place!
comment:29 by , 5 years ago
Version: | 3.0 β master |
---|
Let's treat this as a release blocker for Django 3.1.
comment:30 by , 5 years ago
Owner: | changed from | to
---|---|
Status: | new β assigned |
This could be related #31373, which version of Django and asgiref are you using?