Opened 11 months ago

Closed 10 months ago

Last modified 5 months ago

#35059 closed Bug (fixed)

ASGI server leaves stale DB connections

Reported by: James Thorniley Owned by: James Thorniley
Component: HTTP handling Version: 5.0
Severity: Release blocker Keywords:
Cc: Carlton Gibson, Andrew Godwin Triage Stage: Ready for checkin
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

This seems to be a regression from 4.2 to 5.0.

We are using Django ASGI server via uvicorn, with a mariadb database. When updating our app servers from Django 4.2.8 to 5.0, we experienced a spike in the number of database connections, eventually going over the limit set for the DB and causing 500 errors due to

django.db.utils.OperationalError: (1040, 'Too many connections')

I've created a minimal project to reproduce what I think is the phenomenon here using local docker container for the DB:

https://github.com/jthorniley/django-asgi

I've put complete reproduction steps in that github link, but I will summarise here:

  • When running a Django 5.0 ASGI server, some connections are not closed when requests complete, leaving stale connections in the DB. This can be confirmed by running curl against a local server and then show processlist in the mysql CLI when the requests are complete.
  • As a result, if you send a moderate number of parallel requests to the server, it will go past the limit of DB connections (I think the out-of-the-box limit for the mariadb docker container is around 150, but it is configurable). Then new connections will fail with the Too many connections error. Obviously its theoretically possible to go over this limit anyway but it happens sooner due to server processes not cleaning up connections after requests complete.
  • You can downgrade to 4.2.8 and run the same reproduction project and see that there is no issue there. The DB connections are closed after every individual request and hence provided there aren't more parallel requests than allowed connections to the DB you are safe.

Change History (22)

comment:1 by Bryce Durham, 11 months ago

I also experienced this with Postgres. Downgraded to Django 4.2.8

comment:2 by Aryan Gholizadeh Mojaveri, 11 months ago

Triage Stage: UnreviewedAccepted

I also could reproduce this bug using James's repo.

comment:3 by Mariusz Felisiak, 11 months ago

Cc: Carlton Gibson Andrew Godwin added
Resolution: needsinfo
Status: newclosed
Summary: "Too many connections" regression from 4.2->5.0 - asgi server leaves stale DB connectionsASGI server leaves stale DB connections
Triage Stage: AcceptedUnreviewed
Type: UncategorizedBug

I don't see it as a regression. As for me it behaves exactly the same on Django 4.2. If you can bisect to determine the commit where the behavior changed, that would be useful.

This may be a duplicate of #33297 or #33497.

comment:4 by James Thorniley, 11 months ago

git bisect found commit 7cd187a5ba58d7769039f487faeb9a5a2ff05540 however I think this is a red herring, not sure why it came to that.

Reverting the changes from 64cea1e48f285ea2162c669208d95188b32bbc82 on top of 5.0 fix the problems, so I think that is actually the problematic commit.

This commit moved the send_response function into a task which can be cancelled if and when http.disconnect is received. However the send_response function was also responsible for calling response.close() which is a sync function called asynchronously using sync_to_async. This introduced a race condition as if the http.disconnect is received during the call to the sync_to_async wrapped code then the response.close() can effectively be cancelled and the associated clean up (triggered by the request_finished signal from inside response.close() never gets called.

If created a PR here for review, it seems to solve the issue for me: https://github.com/django/django/pull/17675

comment:5 by Carlton Gibson, 11 months ago

Component: UncategorizedHTTP handling
Patch needs improvement: set
Resolution: needsinfo
Severity: NormalRelease blocker
Status: closednew
Triage Stage: UnreviewedAccepted

With the test cases, I think we can accept this as a bug in 64cea1e48f285ea2162c669208d95188b32bbc82. Thanks for the investigation James.

The request_finished signal should be dispatched in the disconnect case.

I left a few initial comments on the PR. Happy to take another look later on.

comment:6 by Natalia Bidart, 11 months ago

Has patch: set
Owner: changed from nobody to James Thorniley
Patch needs improvement: unset
Status: newassigned

Assigning to James following the proposed PR.

comment:7 by Natalia Bidart, 11 months ago

Patch needs improvement: set

Setting patch needs improvements following the latest batch of review comments made by Carlton.

comment:8 by Natalia Bidart, 10 months ago

Patch needs improvement: unset

comment:9 by Natalia Bidart, 10 months ago

Patch needs improvement: set

Setting as patch needs improvement while I wait for the author to reply to my comments.

comment:10 by James Thorniley, 10 months ago

Patch needs improvement: unset

comment:11 by Josh Orr, 10 months ago

I have this same problem, it caused a massive spike in database connections. I had to revert back to Django 4.2 for now until this is fixed. I came here to report but found your all already on top of it, thank you! Can't wait until this is fixed in 5.0.

in reply to:  11 comment:12 by Mariusz Felisiak, 10 months ago

Replying to Josh Orr:

I have this same problem, it caused a massive spike in database connections. I had to revert back to Django 4.2 for now until this is fixed. I came here to report but found your all already on top of it, thank you! Can't wait until this is fixed in 5.0.

Hi Josh, Does that the proposed patch work for you? Testing would be a huge help.

comment:13 by Natalia Bidart, 10 months ago

#35148 was a duplicate.

comment:14 by Natalia Bidart, 10 months ago

This is pretty much ready for checkin pending two question and one optional test change. I will wait for feedback from either the author or Carlton to merge ASAP (release is planned for Feb 6th). Thanks everyone who helped so far, by either building the fix, reviewing, debugging and/or testing the patch!

comment:15 by Natalia Bidart, 10 months ago

Triage Stage: AcceptedReady for checkin

comment:16 by Natalia <124304+nessita@…>, 10 months ago

In a43d75e:

Refs #35059 -- Made asgi tests' SignalHandler helper class re-usable by other tests.

comment:17 by Natalia <124304+nessita@…>, 10 months ago

Resolution: fixed
Status: assignedclosed

In 11393ab:

Fixed #35059 -- Ensured that ASGIHandler always sends the request_finished signal.

Prior to this work, when async tasks that process the request are cancelled due
to receiving an early "http.disconnect" ASGI message, the request_finished
signal was not being sent, potentially leading to resource leaks (such as
database connections).

This branch ensures that the request_finished signal is sent even in the case
of early termination of the response.

Regression in 64cea1e48f285ea2162c669208d95188b32bbc82.

Co-authored-by: Natalia <124304+nessita@…>
Co-authored-by: Carlton Gibson <carlton.gibson@…>

comment:18 by Natalia <124304+nessita@…>, 10 months ago

In bbb9ef3:

[5.0.x] Refs #35059 -- Made asgi tests' SignalHandler helper class re-usable by other tests.

Backport of a43d75e81da783fda08bf8d3493252e3676d11ea from main

comment:19 by Natalia <124304+nessita@…>, 10 months ago

In f1fbd061:

[5.0.x] Fixed #35059 -- Ensured that ASGIHandler always sends the request_finished signal.

Prior to this work, when async tasks that process the request are cancelled due
to receiving an early "http.disconnect" ASGI message, the request_finished
signal was not being sent, potentially leading to resource leaks (such as
database connections).

This branch ensures that the request_finished signal is sent even in the case
of early termination of the response.

Regression in 64cea1e48f285ea2162c669208d95188b32bbc82.

Co-authored-by: Natalia <124304+nessita@…>
Co-authored-by: Carlton Gibson <carlton.gibson@…>

Backport of 11393ab1316f973c5fbb534305750740d909b4e4 from main

comment:20 by GitHub <noreply@…>, 6 months ago

In f4a08b6d:

Refs #35059 -- Used asyncio.Event in ASGITest.test_asyncio_cancel_error to enforce specific interleaving.

Sleep call leads to a hard to trace error in CI. Using an Event is
more deterministic, and should be less prone to environment
variations.

Bug in 11393ab1316f973c5fbb534305750740d909b4e4.

comment:21 by Natalia <124304+nessita@…>, 5 months ago

In 3824858:

[5.1.x] Refs #35059 -- Used asyncio.Event in ASGITest.test_asyncio_cancel_error to enforce specific interleaving.

Sleep call leads to a hard to trace error in CI. Using an Event is
more deterministic, and should be less prone to environment
variations.

Bug in 11393ab1316f973c5fbb534305750740d909b4e4.

Backport of f4a08b6ddfcacadfe9ff8364bf1c6c54f5dd370f from main.

comment:22 by Natalia <124304+nessita@…>, 5 months ago

In dd3d0483:

[5.0.x] Refs #35059 -- Used asyncio.Event in ASGITest.test_asyncio_cancel_error to enforce specific interleaving.

Sleep call leads to a hard to trace error in CI. Using an Event is
more deterministic, and should be less prone to environment
variations.

Bug in 11393ab1316f973c5fbb534305750740d909b4e4.

Backport of f4a08b6ddfcacadfe9ff8364bf1c6c54f5dd370f from main.

Note: See TracTickets for help on using tickets.
Back to Top