Opened 13 years ago
Last modified 2 years ago
#16022 assigned Cleanup/optimization
Cyclic reference in FieldFile causes memory usage to grow considerably
Reported by: | Gustavo Narea | Owned by: | Joshua Massover |
---|---|---|---|
Component: | Database layer (models, ORM) | Version: | 4.0 |
Severity: | Normal | Keywords: | memory leak |
Cc: | Triage Stage: | Accepted | |
Has patch: | yes | Needs documentation: | yes |
Needs tests: | yes | Patch needs improvement: | yes |
Easy pickings: | no | UI/UX: | no |
Description
django.db.models.fields.files:FieldFile creates a cyclic reference to the model instance it's attached to via django.db.models.fields.files:FileDescriptor.
The effects can be considerable. In our Web site, for example, it causes the process running Django to increase the memory used by over 1MB after every request.
The patch is generated against Django trunk, even though it's only been tested under Django 1.1.4 (but the cyclic reference doesn't seem to have been fixed).
Attachments (1)
Change History (15)
by , 13 years ago
Attachment: | fieldfile-weakref.patch added |
---|
comment:1 by , 13 years ago
Please disregard the patch, it breaks file uploads. I'm trying to improve it at the moment (and suggestions are much appreciated).
comment:2 by , 13 years ago
I've tried replacing weakref.ref() with weakref.proxy(), which does work but then memory keeps growing as usual. I think a proper way to fix it is to break the cyclic reference, but that'd be backwards incompatible.
I cannot afford spending more time on this, so I'll leave it to someone else. I'm happy to test any eventual fix though. In the meantime, we'll be using mod_wsgi's maximum-requests to mitigate the consequences of this problem.
A hint to whoever will have a look at this: Try setting self.instance to None in FieldFile.__init__ and you won't notice any memory growth when you query many records. This breaks creation, modification and deletion of files, but helps demonstrate that this is the problematic reference.
comment:3 by , 13 years ago
Has patch: | unset |
---|---|
Keywords: | memory leak added |
comment:4 by , 13 years ago
Summary: | Cyclic reference in FieldFile → Cyclic reference in FieldFile causes memory usage to grow considerably |
---|
comment:5 by , 13 years ago
Python's gc is supposed to be able to collect reference cycles these days, unless there are __del__
methods. I didn't see any that could by linked with this bug.
comment:6 by , 13 years ago
Resolution: | → worksforme |
---|---|
Status: | new → closed |
OK, I tried to reproduce this seriously. I created a project called madupload
and an app called receiver
inside.
madupload ├── __init__.py ├── manage.py ├── receiver │ ├── __init__.py │ ├── models.py │ ├── views.py ├── settings.py └── urls.py
receiver/models.py
:
from django import forms from django.db import models class UploadModel(models.Model): upload = models.FileField() class UploadForm(forms.ModelForm): class Meta: model = UploadModel
receiver/views.py
:
from django.http import HttpResponse from django.views.decorators.csrf import csrf_exempt from .models import UploadForm @csrf_exempt def upload(request): form = UploadForm(request.POST, request.FILES) try: model = form.save(commit=False) return HttpResponse("Saved %i bytes.\n" % model.upload.size, status=201, content_type='text/plain') except ValueError: return HttpResponse(repr(form.errors) + "\n", status=400, content_type='text/plain')
urls.py
:
from django.conf.urls.defaults import patterns, url urlpatterns = patterns('', url(r'^$', 'receiver.views.upload'), )
All other files are unchanged from Django's template.
Then I launched runserver
and started uploading:
myk@mYk madupload % dd if=/dev/random of=1mb.bin bs=1024 count=1024 1024+0 records in 1024+0 records out 1048576 bytes transferred in 0.146655 secs (7149958 bytes/sec) myk@mYk madupload % while true; do curl -F "upload=@1mb.bin" http://localhost:8000; done Saved 1048576 bytes. Saved 1048576 bytes. Saved 1048576 bytes. ...
I've let it run more than one hundred of uploads, and the memory footprint of the Python process handling runserver
is flat.
At this point, I have proven that the cyclic reference in FileField is not a problem in itself.
I still don't know what's causing your bug. You may have a global variable referencing your FileField objects somewhere.
comment:7 by , 13 years ago
Resolution: | worksforme |
---|---|
Status: | closed → reopened |
Hello, aaugustin.
Thank you very much for looking into this!
I forgot to add a useful piece of information here, which I mentioned on Google Groups: Memory increases considerably when you query a model that has a FileField. I'm sorry about that.
To reproduce this, you could create, say, 2000 records and iterate over them like this inside a Django view:
for record in MyModel.objects.all(): echo record.upload
As you refresh the page, you'll notice that memory increases.
Cheers.
comment:8 by , 13 years ago
Resolution: | → needsinfo |
---|---|
Status: | reopened → closed |
I read your email to django-users which had more info and I updated my test code above as follows.
receiver/models.py
:
from django import forms from django.db import models class UploadModel(models.Model): upload = models.FileField(upload_to='receiver') class UploadForm(forms.ModelForm): class Meta: model = UploadModel
receiver/views.py
:
from django.http import HttpResponse from django.views.decorators.csrf import csrf_exempt from .models import UploadForm, UploadModel @csrf_exempt def upload(request, commit=False): form = UploadForm(request.POST, request.FILES) try: model = form.save(commit=commit) action = "Saved" if commit else "Received" return HttpResponse("%s %i bytes.\n" % (action, model.upload.size), status=201, content_type='text/plain') except ValueError: return HttpResponse(repr(form.errors) + "\n", status=400, content_type='text/plain') def names(request): upload_names = [model.upload.name for model in UploadModel.objects.all()] return HttpResponse("%i objects \n" % len(upload_names), status=200, content_type='text/plain')
urls.py
:
from django.conf.urls.defaults import patterns, url urlpatterns = patterns('', url(r'^fake/$', 'receiver.views.upload', {'commit': False}), url(r'^save/$', 'receiver.views.upload', {'commit': True}), url(r'^read/$', 'receiver.views.names'), )
I added a database, etc. to the settings and ran syncdb
.
I uploaded 100 1kb files:
myk@mYk madupload % dd if=/dev/random of=1kb.bin bs=1024 count=1 1+0 records in 1+0 records out 1024 bytes transferred in 0.000232 secs (4414149 bytes/sec) myk@mYk madupload % for i in {1..100}; do curl -F "upload=@1kb.bin" http://localhost:8000/save/; done Saved 1024 bytes. Saved 1024 bytes. Saved 1024 bytes. ...
And then I read them over and over:
myk@mYk madupload % while true; do curl http://localhost:8000/read/; done 100 objects 100 objects 100 objects ...
I've been heating the planet like this for a few minutes and the memory curve climbed a little bit at the beginning, then stabilized.
Initially, I was using Python 2.6; I switched to Python 2.5 but the result was the same.
So, unfortunately, we still don't have a proof that the bug is in Django itself.
At this point, I am afraid you didn't provide sufficient information for me to reproduce the bug. You could:
- try to play with
gc.garbage
to see what happens in your app, see http://docs.python.org/library/gc.html - modify my example until it exhibits the memory leak
- come up with your own reproducible example, and provide complete instructions of how to run it
Thanks!
comment:9 by , 13 years ago
PS : closing the ticket as "needsinfo" is part of the standard triaging procedure when we can't reproduce a problem. It doesn't mean the problem doesn't exist.
comment:10 by , 3 years ago
Resolution: | needsinfo |
---|---|
Status: | closed → new |
UI/UX: | unset |
I just got hit with this. I created a repo with more info and tests that can reproduce the issue. The code in production was a logging statement that was accessing the file field. I only wanted to log the file name, not the file itself. By removing the log of the file field entirely, our memory utilization is 1/4 what it used to be. The leaking model / file field is pretty crucial and is used in a majority our endpoints.
# before extra = { "file_field": obj.file_field "id": obj.id } logger.info("hello world!", extra=extra) # after with 1/4 less memory usage and no other changes extra = { "id": obj.id } logger.info("hello world!", extra=extra)
comment:11 by , 3 years ago
Triage Stage: | Unreviewed → Accepted |
---|---|
Type: | Bug → Cleanup/optimization |
Hi Joshua.
Thanks for the reproduce project — very nice.
I'm going to accept this for review as a potential optimisation, since memory use is lower in your example using the weakref
there, and, as you say, it doesn't break any tests.
I don't think it's an actual memory leak per se — rather a question of the (somewhat high) level at which Python's garbage collector chooses to kick in.
You have a loop in your test-cases:
for i in range(100): # print(process.memory_full_info().rss / 1024 ** 2) new = Leak.objects.get(id=leak.id) extra = {"f": new.f}
If you uncomment the print()
we see the memory usage rising, in the first unpatched case:
... 653.0 663.015625 673.03125 683.046875 693.0625 703.09375 713.109375 723.125 733.140625 743.15625 753.171875 763.1875 773.203125 783.21875 793.234375 ...
If you then though import gc
and force the garbage collector to run...
for i in range(100): print(process.memory_full_info().rss / 1024 ** 2) new = Leak.objects.get(id=leak.id) extra = {"f": new.f} gc.collect()
... you'll see that the memory use remains steady:
... 152.453125 152.453125 152.453125 152.453125 152.453125 152.453125 152.453125 152.453125 152.453125 152.453125 152.453125 152.453125 152.453125 152.453125 152.453125 152.453125 152.453125 ...
As such it's not a leak — it's perfectly collectable — which is inline with Aymeric's previous observation:
... the memory curve climbed a little bit at the beginning, then stabilized.
The GC does eventually kick-in — but yes no reason why we can't see about bringing that down by removing the reference loop.
Do you fancy making a PR with your suggested change, so we can get some eyes on it?
Thanks.
comment:12 by , 3 years ago
Has patch: | set |
---|---|
Owner: | changed from | to
Status: | new → assigned |
Version: | 1.1 → 4.0 |
PR up! Added much more information there.
comment:13 by , 3 years ago
Super, thanks — great work! Let's see what folks say.
Slight aside (not meant to distract from the proposed change): As part of this, I'm wondering if there's some basic advice on using gc
that we might add to the docs. 🤔 For example a gc.collect()
in a file upload view might go a long way...
comment:14 by , 2 years ago
Needs documentation: | set |
---|---|
Needs tests: | set |
Patch needs improvement: | set |
From the PR it looks like we can push forwards with adding a weak
option to FileField
that would have FieldFile
use a weakref
, but that we can't just bring that in as the default since there's a behaviour change with save()
and delete()
. (Maybe the case for the stronger change can be made via discussion on the DevelopersMailingList.)
Fixed cyclic reference by making it a weak one