Opened 17 months ago

Last modified 29 hours ago

#34865 assigned Bug

DatabaseWrapper is not garbage collected and connections are not closed

Reported by: Fábio Domingues
Owned by: Priyank Panchal
Component: Database layer (models, ORM)
Version: 4.2
Severity: Normal
Keywords:
Cc: Carlton Gibson, Andrew Godwin, David Wobrock, Florian Apolloner
Triage Stage: Accepted
Has patch: yes
Needs documentation: no
Needs tests: yes
Patch needs improvement: yes
Easy pickings: no
UI/UX: no

Description

I have found a possible memory leak while investigating why database connections were not closed after the thread/coroutine ended (https://code.djangoproject.com/ticket/33497 and https://github.com/django/django/pull/17275): the DatabaseWrapper ends up in a circular reference with DatabaseFeatures/DatabaseCreation/DatabaseClient/..., and the GC could not clean up the objects.
I tried using weak references from the "child" objects to the wrapper, or deleting the __dict__, and the wrapper was collected and the connections were closed.
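
A minimal sketch of the shape of the problem and of the weakref approach (simplified stand-ins, not the real backend classes and not the attached patch):

import gc
import weakref

class Features:
    def __init__(self, connection):
        # A hard reference back to the wrapper creates the cycle
        # Wrapper -> Features -> Wrapper; a weakref.proxy breaks it so
        # plain reference counting can free both objects.
        self.connection = weakref.proxy(connection)

class Wrapper:
    def __init__(self):
        self.features = Features(self)

w = Wrapper()
del w
# With a hard back-reference the pair would linger until gc.collect();
# with the proxy it is freed as soon as the last reference goes away.
print(any(isinstance(o, Wrapper) for o in gc.get_objects()))  # False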

Attachments (2)

databasewrapper.png (96.7 KB ) - added by Fábio Domingues 17 months ago.
fix_34865.patch (2.1 KB ) - added by Fábio Domingues 17 months ago.
Possible approach.


Change History (22)

by Fábio Domingues, 17 months ago

Attachment: databasewrapper.png added

comment:1 by Mariusz Felisiak, 17 months ago

Resolution: duplicate
Status: new → closed

Fabio, thanks for the report. Is there a reason to open a new ticket instead of adding your findings to #33497? As far as I'm aware, this is a duplicate.

in reply to:  1 comment:2 by Fábio Domingues, 17 months ago

Fixing this memory leak will not solve the problem of the persistent connections not being reused, only the fact that they will not pile up (and starve the DB) if inaccessible.
But I'm new here, still learning. Should I move the findings to #33497?

Last edited 17 months ago by Fábio Domingues (previous) (diff)

comment:3 by Mariusz Felisiak, 17 months ago

Cc: Carlton Gibson Andrew Godwin Andreas Pelme added
Resolution: duplicate
Status: closed → new
Triage Stage: Unreviewed → Accepted

We can keep this separately.

Tentatively accepted, I'd like to check out the patch proposal.

comment:4 by Carlton Gibson, 17 months ago

...I'd like to check out the patch proposal.

Yes, please.

Any chance you could give a minimal example showing the behaviour here, to look at?
Various folks have reported leaky behaviour on Channels/Daphne, but we've not been able to pin it down.

… ends up in a circular reference ... and the GC could not cleanup the objects.

Just out of interest, could you try throwing in a gc.collect() — that should get it — unless we never let the DatabaseWrapper go, which is why it would be nice to look at the example — but WeakRef is hopefully right in this case.

Last edited 17 months ago by Carlton Gibson (previous) (diff)

comment:5 by Fábio Domingues, 17 months ago

Tested using Uvicorn and Daphne; the database engine should not matter (I tested with PostgreSQL), CONN_MAX_AGE can be any value, and the effect is the same with async or sync views.

import gc

import objgraph
from django.db.backends.base.base import BaseDatabaseWrapper
from django.http import HttpResponse

from myapp.models import MyModel  # placeholder: any model in the project

get_refs = lambda: [o for o in gc.get_objects() if isinstance(o, BaseDatabaseWrapper)]
plot_refs = lambda: objgraph.show_backrefs(get_refs(), filename="sample-graph.png")

def sync_view(request):
    MyModel.objects.count()
    # gc.collect()
    # plot_refs()
    return HttpResponse(f"Number of BaseDatabaseWrapper instances: {len(get_refs())}")

async def async_view(request):
    await MyModel.objects.acount()
    # gc.collect()
    # plot_refs()
    return HttpResponse(f"Number of BaseDatabaseWrapper instances: {len(get_refs())}")

On each refresh you should see the number of instances increase. If you plot the references you will see many diagrams like the one in the attachment. Throwing in gc.collect() actually gets it, but it takes some milliseconds, even if the underlying database connection is already closed.

comment:6 by Fábio Domingues, 17 months ago

One more detail. Using objgraph.show_growth without doing gc.collect gives us this on each new request:

list                       1990       +10
dict                       6798        +3
deque                        20        +2
lock                         43        +2
OrderedDict                  11        +2
partial                      24        +2
ReferenceType              3848        +1
DatabaseWrapper               6        +1
DatabaseClient                6        +1
DatabaseCreation              6        +1
DatabaseFeatures              6        +1
DatabaseIntrospection         6        +1
DatabaseOperations            6        +1
BaseDatabaseValidation        6        +1
DatabaseErrorWrapper          5        +1
PGconn                        5        +1
Connection                    5        +1
PrepareManager                5        +1
AdaptersMap                   7        +1
TypesRegistry                 7        +1

With the patch, the output of show_growth is empty.
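
For anyone who wants to reproduce the numbers above, this is roughly how the per-request growth check can be wired up (the middleware below is just an illustration; objgraph.show_growth is the only real API used):

import objgraph

class ShowGrowthMiddleware:
    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        response = self.get_response(request)
        # Prints the types whose instance count grew since the previous call;
        # DatabaseWrapper & co. showing +1 per request points at the leak.
        objgraph.show_growth()
        return response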

Last edited 17 months ago by Fábio Domingues (previous) (diff)

by Fábio Domingues, 17 months ago

Attachment: fix_34865.patch added

Possible approach.

comment:7 by David Wobrock, 17 months ago

Cc: David Wobrock added

comment:8 by Priyank Panchal, 17 months ago

Owner: changed from nobody to Priyank Panchal
Status: new → assigned

comment:9 by Florian Apolloner, 17 months ago

Throwing in gc.collect() actually gets it, but it takes some milliseconds, even if the underlying database connection is already closed.

with "actually get it" do you mean that it cleans it up properly? If yes there does not seem to be real leak here since garbage collection usually runs every now and then (or do I miss something). It might still be beneficial to improve the performance here but that might require a test or two :D

comment:10 by Sarah Boyce, 16 months ago

Has patch: set

comment:11 by Mariusz Felisiak, 16 months ago

Needs tests: set
Patch needs improvement: set

Marking as needs improvement per Florian's comment.

comment:12 by Florian Apolloner, 15 months ago

Cc: Florian Apolloner added

comment:13 by Patryk Zawadzki, 7 days ago

Florian, it does clean up eventually thanks to the garbage collection, but garbage collection is a last resort method of freeing unused memory.

There are two important problems with relying on GC:

  1. It happens very rarely: the default thresholds are (700, 10, 10), which means the deepest collection cycle (generation 2) only runs roughly every 70,000 net object allocations. That's very rare; in a simple Django application, that can mean a hundred requests (see the sketch after this list).
  2. It also holds the global interpreter lock and effectively freezes the entire Python process (and all its threads) until it goes through all uncollectable objects. The more objects it finds (not just the ones with reference cycles but also everything else they hold a reference to), the longer the freeze. In extreme cases, it can randomly add seconds to your response times.
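
As a quick illustration of those thresholds (standard library only, the numbers are CPython defaults):

import gc

print(gc.get_threshold())  # (700, 10, 10) by default
# Generation 0 is collected after ~700 net new tracked objects; generation 1
# after 10 gen-0 collections; generation 2 only after 10 gen-1 collections,
# i.e. roughly every 700 * 10 * 10 = 70,000 allocations.
# gc.set_threshold(700, 10, 5) would make full collections twice as frequent,
# trading CPU time for earlier reclamation of cyclic garbage.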

Accumulation of garbage leads to memory fragmentation, as unused memory cannot be reused until the garbage collection thresholds are met, which leads to the overall memory use slowly creeping up.

What makes it worse is that in this case, it holds a reference to a psycopg connection object, holding all of the psycopg internals hostage and preventing proper finalization of the connection (and, I assume, in Django 5.x also preventing the connection from being reclaimed by the connection pool? I did not verify that).

in reply to:  13 comment:14 by Florian Apolloner, 7 days ago

Hi Patryk, I agree, which is why this ticket is accepted. It just needs a PR with tests (or an explanation of why tests make no sense or are not possible) and then we can see what we can do about that.

What makes it worse is that in this case, it holds a reference to a psycopg connection object, holding all of the psycopg internals hostage and preventing proper finalization of the connection (and, I assume, in Django 5.x also preventing the connection from being reclaimed by the connection pool? I did not verify that).

I don't think so; the connection is not returned to the pool on GC but rather when .close() is called on it at the end of the request: https://github.com/django/django/blob/5da3ad7bf90fba7321f4c2834db44aa920c70bc7/django/db/backends/postgresql/base.py#L391

comment:15 by Filip Owczarek, 4 days ago

Hi,
I just submitted my proposed solution.

https://github.com/django/django/pull/19153

comment:16 by Andreas Pelme, 4 days ago

Cc: Andreas Pelme removed

comment:17 by Florian Apolloner, 4 days ago

Thinking more about this ticket (please note that I am thinking out loud here, so let me know if I take a wrong turn somewhere): DatabaseWrapper is, in a standard setup, *not* supposed to get garbage collected at all. Each thread will have one wrapper per defined database, stored in a thread local (ConnectionHandler/BaseConnectionHandler). So unless the webserver is spawning a new thread per connection this should not be a problem (I could see it being a problem for runserver, but that is for dev only anyway).
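
To illustrate what I mean, a rough sketch (assumes a configured Django project with a "default" database; not meant as a test):

import threading

from django.db import connections

wrapper_ids = []

def grab_wrapper():
    # Touching connections["default"] from a fresh thread creates a new
    # DatabaseWrapper in that thread's local storage.
    wrapper_ids.append(id(connections["default"]))

threads = [threading.Thread(target=grab_wrapper) for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Three different ids: with one short-lived thread per request, the old
# wrappers become unreachable, and that is where the reference cycle bites.
print(wrapper_ids)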

So to the people having issues here: where exactly are you seeing this, in job queues (Celery etc.) or in your webservers (which ones exactly)? Fábio mentioned seeing it on Uvicorn/Daphne; is this also a problem for Gunicorn (I know that one usually has a fixed worker & thread count)?

Please note that I am not saying that there is no problem hidden here; the cyclic reference is certainly ugly. But if the object isn't meant to get garbage collected, it is not the end of the world. That said, I would certainly prefer a solution that doesn't use a weakref proxy (we would need to do some profiling on the overhead).

Last edited 4 days ago by Florian Apolloner (previous) (diff)

comment:18 by Patryk Zawadzki, 33 hours ago

Florian, we're using ASGI, so I believe it's very much spawning and collecting threads left and right. The database wrapper-related objects come up in garbage lists every few requests with uvicorn in production mode.

comment:19 by Florian Apolloner, 32 hours ago

Mhm, yeah, I recommended switching to a fixed thread pool at some point, IIRC :D Though even with a fixed thread pool, I think the "thread/asgi" local might be per request then (i.e. per async context id), which would show this problem. Thank you for confirming that you are using ASGI.

comment:20 by Patryk Zawadzki, 29 hours ago

Yeah, using a (cappable!) thread pool would be amazing, especially if the async ORM API were also able to delegate to a request-specific DB thread instead of mixing everything in a single shared thread.
