Opened 6 years ago

Closed 22 months ago

Last modified 5 months ago

#29510 closed Bug (invalid)

QueryDict.copy() returns closed files when the type of file is TemporaryUploadedFile

Reported by: Liquid Scorpio Owned by: Dan Madere
Component: File uploads/storage Version: 1.11
Severity: Normal Keywords: QueryDict, upload, file
Cc: Jeff, Herbert Fortes, Dmytro Litvinov Triage Stage: Accepted
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description (last modified by Liquid Scorpio)

When uploaded file size is greater than FILE_UPLOAD_MAX_MEMORY_SIZE, Django uses TemporaryUploadedFile to represent the file object. However, when executing .copy() on a QueryDict containing such a file, the returned object has the file but it is in closed state (seekable() is False).

Expected: File should be present in open state (seekable() should be True)

Below is a reproducible example and also contains version details:

Python 2.7.12 (default, Dec  4 2017, 14:50:18) 
In [18]: import django

In [19]: django.VERSION
Out[19]: (1, 11, 11, u'final', 0)

In [20]: from django.http.request import QueryDict

In [21]: from django.core.files.uploadedfile import TemporaryUploadedFile

In [22]: d = QueryDict(mutable=True)

In [23]: f = TemporaryUploadedFile('test.jpg', 'image/jpeg', 100, 'utf-8')

In [24]: f.seekable()
Out[24]: True

In [25]: d.appendlist('image', f)

In [26]: d['image'].seekable()
Out[25]: True

In [27]: c = d.copy()

In [28]: c['image'].seekable()
Out[28]: False

In [30]: 

Change History (21)

comment:1 by Tim Graham, 6 years ago

Component: UncategorizedFile uploads/storage
Triage Stage: UnreviewedAccepted

I can reproduce with Python 2, even if I'm unsure that the behavior is incorrect:

>>> from django.core.files.uploadedfile import TemporaryUploadedFile
>>> from copy import deepcopy
>>> f = TemporaryUploadedFile('test.jpg', 'image/jpeg', 100, 'utf-8')
f.seekable()
True
a = deepcopy(f)
>>> a.seekable()
False

However, the deepcopy() crashes on Python 3 with:
TypeError: cannot serialize '_io.BufferedRandom' object

Accepting for further investigation.

comment:2 by Liquid Scorpio, 6 years ago

Description: modified (diff)

comment:3 by Jeff, 6 years ago

Cc: Jeff added
Owner: changed from nobody to Jeff
Status: newassigned

comment:4 by Jeff, 6 years ago

This is not the reported issue, however:

In python 3, the copy.deepcopy function is not able to copy TemporaryUploadedFiles because python's tempfile.TemporaryFile is not serializable. I'm not certain how serious the issue is, or if we want to fix it, but I believe this makes QueryDicts unable to be copied if it contains a TemporaryUploadedFile obj.

comment:5 by Jeff, 6 years ago

Owner: Jeff removed
Status: assignednew

comment:6 by Herbert Fortes, 6 years ago

Cc: Herbert Fortes added

comment:7 by Paulo, 5 years ago

Owner: set to Paulo
Status: newassigned

comment:8 by Paulo, 5 years ago

The underlying issue here is a change in Python attempting to pickle the file.
On python 2.7, I can reproduce the reported issue:

In [19]: file = tempfile.NamedTemporaryFile(suffix='.upload')

In [20]: copy.deepcopy(file)
Out[20]: <closed file '<uninitialized file>', mode '<uninitialized file>' at 0x7f45f2d0b420>

On Python 3.6 I can't reproduce the original issue, instead I get Tim's issue:

In [17]: file = tempfile.NamedTemporaryFile(suffix='.upload')

In [18]: copy.deepcopy(file)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-18-36f4ecfb14c1> in <module>
----> 1 copy.deepcopy(file)

/usr/local/lib/python3.6/copy.py in deepcopy(x, memo, _nil)
    167                     reductor = getattr(x, "__reduce_ex__", None)
    168                     if reductor:
--> 169                         rv = reductor(4)
    170                     else:
    171                         reductor = getattr(x, "__reduce__", None)

/usr/local/lib/python3.6/tempfile.py in func_wrapper(*args, **kwargs)
    483             @_functools.wraps(func)
    484             def func_wrapper(*args, **kwargs):
--> 485                 return func(*args, **kwargs)
    486             # Avoid closing the file as long as the wrapper is alive,
    487             # see issue #18879.

TypeError: cannot serialize '_io.BufferedRandom' object

Personally I think the 2.7 issue is not as bad as the 3.x since it at least allows for the copy to proceed (while silently closing the file).
My suggestion would be to preemptively check that the objects being copied are serializable or to catch the TypeError and raise a different error indicating the user that the query dict contains a non-seralizable value (like file, etc..)

comment:9 by Claude Paroz, 5 years ago

Did you explore possible handling of such file objects in QueryDict.__deepcopy__?

in reply to:  9 ; comment:10 by Paulo, 5 years ago

Replying to Claude Paroz:

Did you explore possible handling of such file objects in QueryDict.__deepcopy__?

To keep the file on copy?
If so then yes, we can check value for UploadedFile and create a new instance that points to the same underlying file (although I guess that's more shallow than deep copy).
To have fully deep copy we'd have to create new tmp if is on disk or somehow copy the file stream.
That said, I think "soft" copy should be enough and is more expected behavior for the file itself.
My concern with this approach was if we have any other non file values that are not serializable, would this open the door to having to handle those too?
Guess not since there's only so many types that are handled by QueryDict.

comment:11 by Dan Madere, 2 years ago

Owner: changed from Paulo to Dan Madere

in reply to:  10 ; comment:12 by Dan Madere, 2 years ago

Replying to Paulo:

Replying to Claude Paroz:

Did you explore possible handling of such file objects in QueryDict.__deepcopy__?

To keep the file on copy?
If so then yes, we can check value for UploadedFile and create a new instance that points to the same underlying file (although I guess that's more shallow than deep copy).
To have fully deep copy we'd have to create new tmp if is on disk or somehow copy the file stream.

After trying it, I don't think either of these suggestions is workable, considering the nature of TemporaryUploadedFile. It has logic where the file is deleted as soon as it's closed. When copying/streaming the old file's content to the new one, we have a problem where the new file deletes itself immediately.

I stepped back, and pondered why are people copying a QueryDict anyway, and what do they expect to happen to a TemporaryUploadedFile inside? https://docs.djangoproject.com/en/dev/ref/request-response/#querydict-objects:

class QueryDict¶

In an HttpRequest object, the GET and POST attributes are instances of django.http.QueryDict, a dictionary-like class customized to deal with multiple values for the same key. This is necessary because some HTML form elements, notably <select multiple>, pass multiple values for the same key.

The QueryDicts at request.POST and request.GET will be immutable when accessed in a normal request/response cycle. To get a mutable version you need to use QueryDict.copy().

That leads me to think that people are copying a QueryDict because it's a nice starting point to have all the query params organized, but they want to modify it, and pass to the template. Doing so would result in a This QueryDict instance is immutable exception, and the advice above, so they try to copy the QueryDict before modifying it.

I propose we omit TemporaryUploadedFile values on deep copy of a QueryDict, and will open a PR for feedback.

comment:13 by Dan Madere, 2 years ago

Has patch: set

comment:14 by Carlton Gibson, 2 years ago

I stepped back, and pondered why are people copying a QueryDict anyway...

This seems like the right thought to me. 🤔

I have to admit I'm struggling to see the use-case — Folks would have to be making a copy of request.FILES for this to come up no? Why would you want a mutable copy of that? (Maybe more ☕️ is needed :)

Copying request.GET in order to edit a query string makes sense. Uploaded files though... 🤔

(Sorry if I'm being dense)

in reply to:  14 comment:15 by Dan Madere, 2 years ago

Replying to Carlton Gibson:

I stepped back, and pondered why are people copying a QueryDict anyway...

This seems like the right thought to me. 🤔

I have to admit I'm struggling to see the use-case — Folks would have to be making a copy of request.FILES for this to come up no? Why would you want a mutable copy of that? (Maybe more ☕️ is needed :)

Thanks Carlton!

Not necessarily, it could also be making a copy of request.POST, which surprisingly includes the files too. My theory is that most people believe that copying it only includes the query params, because that's the useful part to append more data to, and pass to the template.

Regardless of the use case though, it seems impossible to me to copy a TemporaryUploadedFile. I don't see any reason people would expect it to actually copy the file.

comment:16 by Carlton Gibson, 2 years ago

Hey Dan — just to let you know, I'm looking at the request object now, with this half in mind. Will follow-up once I've done that. Thanks

in reply to:  12 comment:17 by anton-kazlouski, 23 months ago

This is exactly what happened in my case. I have a code that modifies some attributes of request.data and then passes it to a serializer. At some point we started to notice This QueryDict instance is immutable exceptions in our logs. Thus, decided to use built-in method which you are referenced to. Once, replaced direct modifications request.data with request.data.copy(), TypeError: cannot pickle '_io.BufferedRandom' object on... started to araise.

Replying to Dan Madere:

After trying it, I don't think either of these suggestions is workable, considering the nature of TemporaryUploadedFile. It has logic where the file is deleted as soon as it's closed. When copying/streaming the old file's content to the new one, we have a problem where the new file deletes itself immediately.

I stepped back, and pondered why are people copying a QueryDict anyway, and what do they expect to happen to a TemporaryUploadedFile inside? https://docs.djangoproject.com/en/dev/ref/request-response/#querydict-objects:

class QueryDict¶

In an HttpRequest object, the GET and POST attributes are instances of django.http.QueryDict, a dictionary-like class customized to deal with multiple values for the same key. This is necessary because some HTML form elements, notably <select multiple>, pass multiple values for the same key.

The QueryDicts at request.POST and request.GET will be immutable when accessed in a normal request/response cycle. To get a mutable version you need to use QueryDict.copy().

That leads me to think that people are copying a QueryDict because it's a nice starting point to have all the query params organized, but they want to modify it, and pass to the template. Doing so would result in a This QueryDict instance is immutable exception, and the advice above, so they try to copy the QueryDict before modifying it.

I propose we omit TemporaryUploadedFile values on deep copy of a QueryDict, and will open a PR for feedback.

comment:18 by Carlton Gibson, 22 months ago

Hey Dan. Sorry for the pause. Getting there now.

Can I ask you to clarify this:

Not necessarily, it could also be making a copy of request.POST, which surprisingly includes the files too.

I'm looking at % ./runtests.py file_uploads -k test_simple_upload --parallel=1 and % ./runtests.py file_uploads -k test_large_upload --parallel=1 and inspecting the request for the views:

(django) carlton@Carltons-MacBook-Pro tests % ./runtests.py file_uploads -k test_simple_upload --parallel=1
Testing against Django installed in '/Users/carlton/Projects/Django/django/django'
Found 1 test(s).
Creating test database for alias 'default'...
System check identified no issues (0 silenced).
<WSGIRequest: POST '/upload/'>

POST:
<QueryDict: {'name': ['Ringo']}>

FILES:
<MultiValueDict: {'file_field': [<InMemoryUploadedFile: tests.py (text/x-python)>]}>
> /Users/carlton/Projects/Django/django/tests/file_uploads/views.py(31)file_upload_view()
-> form_data = request.POST.copy()
(Pdb) interact
*interactive*
>>> request.POST.copy()
<QueryDict: {'name': ['Ringo']}>
>>> request.FILES.copy()
<MultiValueDict: {'file_field': [<InMemoryUploadedFile: tests.py (text/x-python)>]}>
>>> request.FILES.copy()["file_field"]
<InMemoryUploadedFile: tests.py (text/x-python)>
>>> request.FILES.copy()["file_field"].seekable()
True

And the same here:

(django) carlton@Carltons-MacBook-Pro tests % ./runtests.py file_uploads -k test_large_upload --parallel=1
Testing against Django installed in '/Users/carlton/Projects/Django/django/django'
Found 1 test(s).
Creating test database for alias 'default'...
System check identified no issues (0 silenced).
<WSGIRequest: POST '/verify/'>

POST:
<QueryDict: {'name': ['Ringo'], 'name_hash': ['a6d6d397a391d876d5aae28cdca9f25071906022'], 'file_field1_hash': ['b84b7b775430232c2ee4e2216db1a591d97b2db0'], 'file_field2_hash': ['fbcf384f1c96e3d24fb25b749b00241a5cc4690f']}>

FILES:
<MultiValueDict: {'file_field1': [<TemporaryUploadedFile: tmpx56phiic.file1 (application/octet-stream)>], 'file_field2': [<TemporaryUploadedFile: tmp7ipmyal8.file2 (application/octet-stream)>]}>
> /Users/carlton/Projects/Django/django/tests/file_uploads/views.py(59)file_upload_view_verify()
-> form_data = request.POST.copy()
(Pdb) interact
*interactive*
>>> request.POST.copy()
<QueryDict: {'name': ['Ringo'], 'name_hash': ['a6d6d397a391d876d5aae28cdca9f25071906022'], 'file_field1_hash': ['b84b7b775430232c2ee4e2216db1a591d97b2db0'], 'file_field2_hash': ['fbcf384f1c96e3d24fb25b749b00241a5cc4690f']}>
>>> request.FILES.copy()
<MultiValueDict: {'file_field1': [<TemporaryUploadedFile: tmpx56phiic.file1 (application/octet-stream)>], 'file_field2': [<TemporaryUploadedFile: tmp7ipmyal8.file2 (application/octet-stream)>]}>
>>> request.FILES.copy()["file_field1"]
<TemporaryUploadedFile: tmpx56phiic.file1 (application/octet-stream)>
>>> request.FILES.copy()["file_field1"].seekable()
True

So, it's neither the case that request.POST includes the files data nor that copying request.FILES leads to the error. 🤔

I do get the error creating the TemporaryUploadedFile by-hand, so I need to see what's different about the upload flow.

If you were able to provide a (minimal but) full reproduce, it might save a cycle.
Thanks!

comment:19 by Carlton Gibson, 22 months ago

Presumably you're using DRF's request.data which "includes all parsed content, including file and non-file inputs."

request.POST is still available there that gives you the QueryDict without the file data.

(Still looking at how I'm able to copy the TemporaryUploadedFile in the test case.)

comment:20 by Carlton Gibson, 22 months ago

Resolution: invalid
Status: assignedclosed

So (ruling out the test client) I've experimented with a real-view, uploading a test file with some POST data as well:

from django.urls import path
from django.http import HttpResponse

import pprint

# With settings to ensure TemporaryUploadedFile: 
#   FILE_UPLOAD_MAX_MEMORY_SIZE=10
def file_upload_view(request):
    pprint.pprint(request)
    print("\nPOST:")
    pprint.pprint(request.POST)
    print("\nFILES:")
    pprint.pprint(request.FILES)

    f = request.FILES.copy()
    assert f["file1"].seekable()

    return HttpResponse("Got it!")


urlpatterns = [
    path("upload", file_upload_view),
]

This passes without issue:

<WSGIRequest: POST '/upload'>

POST:
<QueryDict: {'key1': ['value1'], 'key2': ['value2']}>

FILES:
{'file1': <TemporaryUploadedFile: Large Test File.txt (text/plain)>}

Note the...

    f = request.FILES.copy()
    assert f["file1"].seekable()

I've tested back to Python 3.9 and Django 3.2.

The reason this works is that FILES is a MultiValueDict not a QueryDict:

>>> request.FILES.copy()
<MultiValueDict: {'file_field1': [<TemporaryUploadedFile: tmpx56phiic.file1 (application/octet-stream)>], 'file_field2': [<TemporaryUploadedFile: tmp7ipmyal8.file2 (application/octet-stream)>]}>

...and so copy() uses copy.copy() rather than copy.deepcopy().

If you try and deepcopy() the TemporaryUploadedFile you indeed get the error, but that's not what Django does.

comment:21 by Dmytro Litvinov, 5 months ago

Cc: Dmytro Litvinov added
Note: See TracTickets for help on using tickets.
Back to Top