Opened 5 years ago

Closed 5 years ago

#31515 closed Bug (fixed)

Django tries to reuse MySQL connections that have gone away with ASGI.

Reported by: Braunson Owned by: Carlton Gibson
Component: Documentation Version: dev
Severity: Release blocker Keywords:
Cc: Andrew Godwin, Carlton Gibson Triage Stage: Accepted
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

This bug only appears to happen when using ASGI. No matter the CONN_MAX_AGE it seems to be ignored. I have tried both manually setting to 0 so they are never reused and setting them to an hour without any effect.

If it makes a difference I am using multiple databases. Again this issue does NOT occur with WSGI, only with ASGI.

Is there any work around I can do to solve this in the meantime? It results in almost every page 500 erroring once or twice and a refresh fixes it temporarily for the client.

Attachments (1)

djangobug_repro.zip (23.7 KB ) - added by Braunson 5 years ago.

Download all attachments as: .zip

Change History (33)

comment:1 by Simon Charette, 5 years ago

This could be related #31373, which version of Django and asgiref are you using?

in reply to:  1 comment:2 by Braunson, 5 years ago

Replying to Simon Charette:

This could be related #31373, which version of Django and asgiref are you using?

3.0.5, and I'm not explicitly using asgiref. I'm using uvicorn 0.11.3 to serve the requests.

comment:3 by Simon Charette, 5 years ago

You don't have to be explicitly using asgiref to be affected by the issue; Django uses it internally to keep reference to database connections and other globals.

Could you try upgrading to asgiref>=3.2.7 and see if it helps? Prior version had a bug which showed similar symptoms in #31373.

in reply to:  3 comment:4 by Braunson, 5 years ago

Replying to Simon Charette:

You don't have to be explicitly using asgiref to be affected by the issue; Django uses it internally to keep reference to database connections and other globals.

Could you try upgrading to asgiref>=3.2.7 and see if it helps? Prior version had a bug which showed similar symptoms in #31373.

Unfortunately that requirement did not help. After a few hours of operation the error continues to appear again. So even with the max age set to 0 it keeps trying to reuse those old database connections long after the server terminates them.

comment:5 by Simon Charette, 5 years ago

Thanks for trying it out. It also has bit of similarity with #30398 but the fact it only happens with ASGI is puzzling. Could you try setting up a minimal project that reproduces your issue to make it easier to identify its origin?

in reply to:  5 comment:6 by Braunson, 5 years ago

Replying to Simon Charette:

Thanks for trying it out. It also has bit of similarity with #30398 but the fact it only happens with ASGI is puzzling. Could you try setting up a minimal project that reproduces your issue to make it easier to identify its origin?

Been working hard to try and reproduce this in a development environment but I'm having a hard time. I can see either the timeout must be getting reached or there is a packet issue because the logs of our SQL server is filled with communication errors on initial packet whenever we hit a 500 error on the website.

Additionally (not sure if related) but I am able to reproduce issues with unclosed connections when running ASGI:

INFO:     127.0.0.1:58148 - "GET / HTTP/1.0" 200 OK
INFO:     logging in using static token
ERROR:    Unclosed client session
client_session: <aiohttp.client.ClientSession object at 0x10cf64eb0>
ERROR:    Unclosed client session
client_session: <aiohttp.client.ClientSession object at 0x10cf645e0>
ERROR:    Unclosed connector
connections: ['[(<aiohttp.client_proto.ResponseHandler object at 0x10cedb2e0>, 20427.378)]']
connector: <aiohttp.connector.TCPConnector object at 0x10cf64580>
INFO:     127.0.0.1:58150 - "GET / HTTP/1.0" 200 OK
INFO:     logging in using static token
ERROR:    Unclosed client session
client_session: <aiohttp.client.ClientSession object at 0x10cf66b20>
ERROR:    Unclosed connector
connections: ['[(<aiohttp.client_proto.ResponseHandler object at 0x10cf37ac0>, 20427.611)]']
connector: <aiohttp.connector.TCPConnector object at 0x10cf66880>
INFO:     127.0.0.1:58152 - "GET / HTTP/1.0" 200 OK

However, again I'm having a really hard time reproducing the SQL connection bug that I've been experiencing. The ticket you mentioned above would surely solve this issue I'm experiencing but I would rather find the root cause as I'm sure you would.

In order to demonstrate the above connection errors I simply logged in and out of an async HTTP library for Discord, so that could be their own fault and not related to Django. What still seems to be related to Django is reuse of old DB connections with the MAX AGE set to 0.

in reply to:  5 comment:7 by Braunson, 5 years ago

Replying to Simon Charette:

Thanks for trying it out. It also has bit of similarity with #30398 but the fact it only happens with ASGI is puzzling. Could you try setting up a minimal project that reproduces your issue to make it easier to identify its origin?

Actually @Simon I'm not able to reproduce the specific "MySQL has gone away" issue on Django, but I AM able to reproduce the fact that Django seems to not be closing the database connections properly which could be causing it. My development machine when running a benchmark on this test code shows the following in the SQL server:

2020-04-26  1:10:37 154 [Warning] Aborted connection 154 to db: 'test' user: 'root' host: '172.17.0.1' (Got an error reading communication packets)
2020-04-26  1:12:28 168 [Warning] Aborted connection 168 to db: 'test' user: 'root' host: '172.17.0.1' (Got an error reading communication packets)
2020-04-26  1:13:13 171 [Warning] Aborted connection 171 to db: 'test' user: 'root' host: '172.17.0.1' (Got an error reading communication packets)
2020-04-26  1:17:32 178 [Warning] Aborted connection 178 to db: 'test' user: 'root' host: '172.17.0.1' (Got an error reading communication packets)
2020-04-26  1:21:33 280 [Warning] Aborted connection 280 to db: 'test' user: 'root' host: '172.17.0.1' (Got an error reading communication packets)
2020-04-26  1:21:33 282 [Warning] Aborted connection 282 to db: 'test' user: 'root' host: '172.17.0.1' (Got an error reading communication packets)
2020-04-26  1:21:33 187 [Warning] Aborted connection 187 to db: 'test' user: 'root' host: '172.17.0.1' (Got an error reading communication packets)

Additionally, with show processlist I can see that Django is keeping connections open when doing nothing (they all disappear along with those errors in the SQL console when I shut down the asgi server), is this proper behaviour when the CONN_AGE is set to 0?

https://i.imgur.com/LRdVrAL.png

I packaged up a small test project. Run pip install -r requirements.txt && python manage.py migrate, then startup the ASGI with uvicorn djangobug_repro.asgi:application --reload

Run a benchmark against the homepage and run show processlist; in your SQL server to see the IDs not change, ie the connection is being reused. You can run a benchmark with ab: ab -c 20 -n 300 to see the connections reused. If I just refresh a couple times on my own I don't see the issue, it looks like everything gets closed off. But if you run the benchmark you see that Django starts to run behind on closing the connections?

I can also confirm this behaviour does not exist using WSGI with the same project (I used gunicorn with the benchmark, I commented out the async portion and just saved a random number to the db. Only 1 process existed in the SQL server)

Last edited 5 years ago by Braunson (previous) (diff)

by Braunson, 5 years ago

Attachment: djangobug_repro.zip added

comment:8 by Mariusz Felisiak, 5 years ago

Cc: Andrew Godwin Carlton Gibson added
Summary: Django tries to reuse MySQL connections that have gone away with ASGI β†’ Django tries to reuse MySQL connections that have gone away with ASGI.

comment:9 by Mariusz Felisiak, 5 years ago

Using cache and sync method are unrelated, I can reproduce the same behavior with:

def home(request):
    users = UserProxy.objects.all()
    num = 1
    users[0].discord = num
    users[0].save()
    return render(request, 'home.html', {'users': users[0], 'num': num})

comment:10 by Mariusz Felisiak, 5 years ago

Also, I cannot reproduce this issue on the current master, connections are closed properly at the end of each request.

comment:11 by Andrew Godwin, 5 years ago

This really should have been handled by that most recent asgiref release as you discussed above - unless gevent or something else is in there screwing up the threading module, I'm not sure what's happening here.

comment:12 by Mariusz Felisiak, 5 years ago

Andrew, it works properly with asgiref==3.2.7 on master. You can reproduce this issue on Django 3.0 with the same version of asgiref πŸ€”.

in reply to:  12 comment:13 by Braunson, 5 years ago

Replying to felixxm:

Andrew, it works properly with asgiref==3.2.7 on master. You can reproduce this issue on Django 3.0 with the same version of asgiref πŸ€”.

So it's broken in 3.0.5 with the latest asgiref but is working in master somehow? Which would mean the next release fixed it inadvertently?

comment:14 by Andrew Godwin, 5 years ago

Well, the current master has a whole swathe of different async fixes due to the work done around the async views work - I would not be surprised if this is what fixed it if you are comparing to 3.0. The signal handling, especially, was fixed, and not running the start/end request signals properly is one of the main causes of this kind of bug.

in reply to:  14 comment:15 by Braunson, 5 years ago

Replying to Andrew Godwin:

Well, the current master has a whole swathe of different async fixes due to the work done around the async views work - I would not be surprised if this is what fixed it if you are comparing to 3.0. The signal handling, especially, was fixed, and not running the start/end request signals properly is one of the main causes of this kind of bug.

Awesome. Glad to hear it's supposedly been fixed then. Is there a roadmap/schedule I can look at for when the master changes will make it into a release?

comment:16 by Mariusz Felisiak, 5 years ago

Braunson, see Version3.1Roadmap.

in reply to:  16 comment:17 by Braunson, 5 years ago

Replying to felixxm:

Braunson, see Version3.1Roadmap.

Ouch August for this fix? Is there any plan to backport it? Results in a lot of 500 errors from timed out connections. Anything I can do in the meantime?

Last edited 5 years ago by Braunson (previous) (diff)

comment:18 by Mariusz Felisiak, 5 years ago

Component: Database layer (models, ORM) β†’ Documentation
Severity: Normal β†’ Release blocker
Triage Stage: Unreviewed β†’ Accepted

Django 3.0 supports ASGI, however it can be really tricky to backport a fix from master. There is a big risk of introducing new regressions, so IMO we should document this limitation in Django 3.0.

I will try to add a proper regression test.

in reply to:  18 comment:19 by Braunson, 5 years ago

Replying to felixxm:

Django 3.0 supports ASGI, however it can be really tricky to backport a fix from master. There is a big risk of introducing new regressions, so IMO we should document this limitation in Django 3.0.

I will try to add a proper regression test.

There's no possible way for an end user to solve this in 3.0.5 then? Even if I set a really high timeout value for the SQL server, they still would have to expire eventually and Django will try to use them. It makes any database operations on 3.0.5 sort of impossible without having a timeout eventually no?

comment:20 by Andrew Godwin, 5 years ago

Well, we would have to find exactly where the bug is first before fixing it. Nothing I know of in my quick glance through the code made it obvious where it could be; we seem to correctly handle the close_old_connections call even in Django 3.0 ASGI mode (not calling close_old_connections is often the cause of this bug).

I never figured out why during Channels development, but the MySQL backend seems specifically prone to these sorts of issues even when it should be fine. I think if you set a non-zero CONN_MAX_AGE it should clear itself up, if you can take that operationally, otherwise I think you may have to run on WSGI rather than ASGI for now (unless we track down exactly what is doing this)

Last edited 5 years ago by Andrew Godwin (previous) (diff)

in reply to:  20 comment:21 by Braunson, 5 years ago

Replying to Andrew Godwin:

Well, we would have to find exactly where the bug is first before fixing it. Nothing I know of in my quick glance through the code made it obvious where it could be; we seem to correctly handle the close_old_connections call even in Django 3.0 ASGI mode (not calling close_old_connections is often the cause of this bug).

I never figured out why during Channels development, but the MySQL backend seems specifically prone to these sorts of issues even when it should be fine. I think if you set a non-zero CONN_MAX_AGE it should clear itself up, if you can take that operationally, otherwise I think you may have to run on WSGI rather than ASGI for now (unless we track down exactly what is doing this)

I did actually try setting a non-zero CONN_MAX_AGE but the issue persists. And unfortunately we developed some features that rely on ASGI due to using async third party libraries.

comment:22 by Andrew Godwin, 5 years ago

Wait, how are you able to use async third party libraries in Django 3.0? While it has ASGI support, that's only to run in synchronous mode; if you're doing your own async stuff from Django views you want to be careful or you can cause issues like this.

in reply to:  22 comment:23 by Braunson, 5 years ago

Replying to Andrew Godwin:

Wait, how are you able to use async third party libraries in Django 3.0? While it has ASGI support, that's only to run in synchronous mode; if you're doing your own async stuff from Django views you want to be careful or you can cause issues like this.

Using the @async_to_sync decorators for some functions where awaits might be required (not async_to_sync directly on views). An example was included in the zip although I commented it out as it wasn't actually related to this database issue.

Last edited 5 years ago by Braunson (previous) (diff)

comment:24 by Andrew Godwin, 5 years ago

Oh, OK. Note that you can use those under WSGI Django as well and they'll perform very similarly - Django 3.0 does not have a true-asynchronous request path, so it's using one thread per request anyway.

in reply to:  24 comment:25 by Braunson, 5 years ago

Replying to Andrew Godwin:

Oh, OK. Note that you can use those under WSGI Django as well and they'll perform very similarly - Django 3.0 does not have a true-asynchronous request path, so it's using one thread per request anyway.

Thanks, we'll just not use asgi for now then :)

comment:26 by Carlton Gibson, 5 years ago

OK, I can't yet pin the exact difference between stable/3.0.x and master, but the issue looks at least partly to do with uvicorn.

I see a stable 20 connections opened. This ties in with what we saw on #31134:

I finally understood DB connection lifecycle and uvicorn workers. First of all uvicorn for every worker starts 10 threads so for 2 workers there will be 20 opened connections, for 10 workers it will be 100 connections.

Django requires to have non-transactional requests to close the connection after CONN_MAX_AGE and because when I set ATOMIC_REQUESTS to True, Django will always leave the open connection but firstly when the connection was open longer than CONN_MAX_AGE, Django will reestablish that connection once again.
comment:7

Running in WSGI mode, uvicorn still opens multiple DB connections, 10 it seems β€”Β need to look at the number of workers there.

$ uvicorn --interface wsgi  djangobug_repro.wsgi:application

VS the same thing with gunicorn, where we just have the single connection.

 $ gunicorn  djangobug_repro.wsgi:application

So even with WSGI uvicorn is behaving differently.

The reporter on #31134 concluded this was uvicorn's expected behaviour. (I'd need to look at uvicorn's source code to determine that for myself.)

We need to investigate further the difference with master. We're calling response.close() inside sync_to_async there. Adding that to stable/3.0.x doesn't resolve this issue. We're looking at if, on master, we need to use tread_sensitive there.

Version 0, edited 5 years ago by Carlton Gibson (next)

comment:27 by Mariusz Felisiak, 5 years ago

After some investigation, I think we should close requests with thread_sensitive=True (that's an issue on master). Without this we do not hit connections in the main thread and CONN_MAG_AGE=0 is not respected, Django reuses the same connection (from the same thread). I confirmed this checking performance_schema.threads.THREAD_OS_ID.

With thread_sensitve=True, connections are closed after each request as expected:

diff --git a/django/core/handlers/asgi.py b/django/core/handlers/asgi.py
index 82d2e1ab9d..3bc275bc44 100644
--- a/django/core/handlers/asgi.py
+++ b/django/core/handlers/asgi.py
@@ -259,7 +259,7 @@ class ASGIHandler(base.BaseHandler):
                     'body': chunk,
                     'more_body': not last,
                 })
-        await sync_to_async(response.close)()
+        await sync_to_async(response.close, thread_sensitive=True)()

     @classmethod
     def chunk_bytes(cls, data):

I'm not sure why it doesn't help in Django 3.0, there is probably an extra issue.

comment:28 by Andrew Godwin, 5 years ago

I definitely endorse thread_sensitive on the close call - not sure why I didn't do it on master, but let's add it there. Database libraries are why that flag was added in the first place!

comment:29 by Mariusz Felisiak, 5 years ago

Version: 3.0 β†’ master

Let's treat this as a release blocker for Django 3.1.

comment:30 by Carlton Gibson, 5 years ago

Owner: changed from nobody to Carlton Gibson
Status: new β†’ assigned

comment:31 by Carlton Gibson, 5 years ago

Has patch: set

comment:32 by Carlton Gibson <carlton@…>, 5 years ago

Resolution: β†’ fixed
Status: assigned β†’ closed

In 92507bf3:

Fixed #31515 -- Made ASGIHandler dispatch lifecycle signals with thread sensitive.

Note: See TracTickets for help on using tickets.
Back to Top