Opened 7 years ago

Closed 4 years ago

Last modified 4 years ago

#11739 closed Bug (fixed)

ContentFile() does not support unicode data

Reported by: Adam Nelson Owned by: nobody
Component: File uploads/storage Version: 1.1
Severity: Normal Keywords:
Cc: Triage Stage: Accepted
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

It's not clear to me where exactly this is a bug but it's certainly a problem. This example is for a model PageDisplay with a FileField named 'content':

>>> pager = PageDisplay.objects.get(id=4)
>>> pager
<PageDisplay: http://debtshed.com:8888/>
>>> pager.content
<FieldFile: None>
>>> from django.core.files.base import ContentFile
>>> pager.content.save('tester.html',ContentFile('test1'))
>>> pager.content.read()
'test1'
>>> pager.content.save('tester.html',ContentFile(u'test1'))
>>> pager.content.read()
't\x00e\x00s\x00t\x001\x00'

I'm not sure if this is working on the save and not the read or vice versa but I presume that the unicode translation should be symmetrical.

Attachments (4)

11739.diff (431 bytes) - added by codysoyland 5 years ago.
11739_with_test.diff (1.3 KB) - added by cody@… 5 years ago.
Same as last patch but with a test
11739-3.diff (1.7 KB) - added by Claude Paroz 5 years ago.
Wrap StringIO with utf-8 reader/writer
11739-4.diff (2.2 KB) - added by Claude Paroz 5 years ago.
Third alternative

Download all attachments as: .zip

Change History (23)

comment:1 Changed 7 years ago by Adam Nelson

Version: 1.01.1

Tried on 1.1 as well and also a problem.

comment:2 Changed 7 years ago by andrew@…

I was trying to reproduce this on current trunk on Mac Leopard, everything works as it should, my results are:

>>> pager.content.save('tester.html', ContentFile(u'test1'))
>>> pager.content.read()
'test1'

comment:3 Changed 7 years ago by Adam Nelson

It's an issue for me still, on Leopard. Can nobody else reproduce?

comment:4 Changed 7 years ago by Russell Keith-Magee

Resolution: worksforme
Status: newclosed

comment:5 Changed 5 years ago by anonymous

Easy pickings: unset
Severity: Normal
Type: Uncategorized
UI/UX: unset

I've met the same problem..
when I render a template_file to a static_file...and save in a FileField for further usage..

the result below is the same as model.file.save(..., ...)
and I have to encode it first...of course not a good solution...

>>> from django.core.files.base import ContentFile
>>> ContentFile(u'test').read()
't\x00e\x00s\x00t\x00'
>>> ContentFile(u'test'.encode('gb2312')).read()
'test'

Changed 5 years ago by codysoyland

Attachment: 11739.diff added

comment:6 Changed 5 years ago by cody@…

Resolution: worksforme
Status: closedreopened
Type: UncategorizedBug

I'm running into this as well. It stems from the fact that cStringIO does not properly support unicode. The following code demonstrates the difference between StringIO and cStringIO:

>>> import StringIO
>>> import cStringIO
>>> StringIO.StringIO(u'test').read()
u'test'
>>> cStringIO.StringIO(u'test').read()
't\x00e\x00s\x00t\x00'

This behavior has been discussed also in https://code.djangoproject.com/ticket/7990.

I've attached a patch that changes the import such that it doesn't import cStringIO.

comment:7 Changed 5 years ago by cody@…

As a temporary workaround, the unicode string can be encoded like so:

>>> print ContentFile(u'汉语'.encode('utf-8')).read().decode('utf-8')
汉语

comment:8 Changed 5 years ago by Adam Nelson

Has patch: set
Needs tests: set

Changed 5 years ago by cody@…

Attachment: 11739_with_test.diff added

Same as last patch but with a test

comment:9 Changed 5 years ago by Gabriel Hurley

Triage Stage: UnreviewedAccepted

comment:10 Changed 5 years ago by Claude Paroz

Needs tests: unset

My concern with cody's patch is that it also excludes cStringIO usage for the File class, which might slow down file handling on platforms supporting cStringIO.

Changed 5 years ago by Claude Paroz

Attachment: 11739-3.diff added

Wrap StringIO with utf-8 reader/writer

comment:11 Changed 5 years ago by Claude Paroz

Here's a possible alternative which only touches ContentFile. Of course, it would also be possible to force using StringIO.StringIO only for ContentFile. Some design decision to take...

comment:12 Changed 5 years ago by Julien Phalip

Could some benchmarking be done to assess how performance would be affected by either patch?

comment:13 Changed 5 years ago by codysoyland

I did some really basic benchmarks that show the first patch, which removes the cStringIO module completely, is much faster on unicode data, but slower on binary data than the second patch, which wraps the cStringIO module. I don't know how conclusive these tests are though, because they are with small datasets.

Test script (unicode):

    from django.core.files.base import ContentFile
    import timeit
    print timeit.timeit(lambda: ContentFile(u'test').read())

Test script (bytes):

    from django.core.files.base import ContentFile
    import timeit
    print timeit.timeit(lambda: ContentFile('test').read())

I ran each script 3 times on each patch. timeit runs the function 1,000,000 times and returns the number of seconds elapsed.

11739_with_test.diff (No cStringIO)

Unicode results:
7.600399017333984
7.397403955459595
7.361830949783325

Bytes results:
8.416419982910156
7.377918004989624
7.391348123550415

11739-3.diff (Wrapped cStringIO)

Unicode results:
14.868388175964355
14.914705991744995
14.742434024810791

Bytes results:
5.290870904922485
4.712155818939209
5.349420070648193

comment:14 Changed 5 years ago by Claude Paroz

Thanks for those tests!

What worries me in your patch, Cody, is that you are excluding cStringIO not only for ContentFile, but for the plain File class. I guess that ContentFile is mainly used in tests, where performance is less a concern (I see only one use in Django code, related to the collectstatic management command).

That's why I think the code which should also be tested performance-wise is the iter method of the File object.

Changed 5 years ago by Claude Paroz

Attachment: 11739-4.diff added

Third alternative

comment:15 Changed 5 years ago by Claude Paroz

While thinking about this, I had an idea of a third way to workaround this unicode problem, which should always use the fastest StringIO available. Naming might not be accurate, but at least you get the idea.

comment:16 Changed 5 years ago by Łukasz Rekucki

Did a little research on this: The docs show the use of unicode with ContentFile only once and it's with explicit encoding to UTF-8. Even after fixing the cStringIO bug (It has been fixed in 2.7.3, see http://bugs.python.org/issue1548891) passing unicode would still work only if the unicode string can be encoded to ASCII.

Note that in Python 3, there is no StringIO that accepts both bytes and unicode. One must rather explicitly choose between BytesIO and StringIO.

As the workaround for this bug is very simple and this has been for a while now, I propose to document ContentFile being bytes only and throwing a ValueError if given unicode. In Django 1.5, we can start using the new io and introduce TextContentFile that will use io.TextIO.

PS. This is also what py3k branch does: https://code.djangoproject.com/browser/django/branches/features/py3k/django/core/files/base.py#L8

Last edited 5 years ago by Łukasz Rekucki (previous) (diff)

comment:17 Changed 5 years ago by Adam Nelson

+1

Very progressive. That's a similar direction to requests 0.10.0 (http://pypi.python.org/pypi/requests) moving to 'text' for unicode and 'content' for bytes.

comment:18 Changed 4 years ago by Claude Paroz <claude@…>

Resolution: fixed
Status: reopenedclosed

In [361d6738f89f8443855d4378d3241566d9fca6e9]:

Fixed #11739 -- Made ContentFile support Unicode input

comment:19 Changed 4 years ago by Aymeric Augustin

I'm late to the party (I only noticed this ticket because it was fixed) but I liked lrekucki's proposal -- file content *is* binary.

Note: See TracTickets for help on using tickets.
Back to Top