Opened 15 years ago

Closed 12 years ago

Last modified 12 years ago

#11739 closed Bug (fixed)

ContentFile() does not support unicode data

Reported by: Adam Nelson Owned by: nobody
Component: File uploads/storage Version: 1.1
Severity: Normal Keywords:
Cc: Triage Stage: Accepted
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

It's not clear to me where exactly this is a bug but it's certainly a problem. This example is for a model PageDisplay with a FileField named 'content':

>>> pager = PageDisplay.objects.get(id=4)
>>> pager
<PageDisplay: http://debtshed.com:8888/>
>>> pager.content
<FieldFile: None>
>>> from django.core.files.base import ContentFile
>>> pager.content.save('tester.html',ContentFile('test1'))
>>> pager.content.read()
'test1'
>>> pager.content.save('tester.html',ContentFile(u'test1'))
>>> pager.content.read()
't\x00e\x00s\x00t\x001\x00'

I'm not sure if this is working on the save and not the read or vice versa but I presume that the unicode translation should be symmetrical.

Attachments (4)

11739.diff (431 bytes ) - added by codysoyland 13 years ago.
11739_with_test.diff (1.3 KB ) - added by cody@… 13 years ago.
Same as last patch but with a test
11739-3.diff (1.7 KB ) - added by Claude Paroz 13 years ago.
Wrap StringIO with utf-8 reader/writer
11739-4.diff (2.2 KB ) - added by Claude Paroz 13 years ago.
Third alternative

Download all attachments as: .zip

Change History (23)

comment:1 by Adam Nelson, 15 years ago

Version: 1.01.1

Tried on 1.1 as well and also a problem.

comment:2 by andrew@…, 15 years ago

I was trying to reproduce this on current trunk on Mac Leopard, everything works as it should, my results are:

>>> pager.content.save('tester.html', ContentFile(u'test1'))
>>> pager.content.read()
'test1'

comment:3 by Adam Nelson, 15 years ago

It's an issue for me still, on Leopard. Can nobody else reproduce?

comment:4 by Russell Keith-Magee, 15 years ago

Resolution: worksforme
Status: newclosed

comment:5 by anonymous, 13 years ago

Easy pickings: unset
Severity: Normal
Type: Uncategorized
UI/UX: unset

I've met the same problem..
when I render a template_file to a static_file...and save in a FileField for further usage..

the result below is the same as model.file.save(..., ...)
and I have to encode it first...of course not a good solution...

>>> from django.core.files.base import ContentFile
>>> ContentFile(u'test').read()
't\x00e\x00s\x00t\x00'
>>> ContentFile(u'test'.encode('gb2312')).read()
'test'

by codysoyland, 13 years ago

Attachment: 11739.diff added

comment:6 by cody@…, 13 years ago

Resolution: worksforme
Status: closedreopened
Type: UncategorizedBug

I'm running into this as well. It stems from the fact that cStringIO does not properly support unicode. The following code demonstrates the difference between StringIO and cStringIO:

>>> import StringIO
>>> import cStringIO
>>> StringIO.StringIO(u'test').read()
u'test'
>>> cStringIO.StringIO(u'test').read()
't\x00e\x00s\x00t\x00'

This behavior has been discussed also in https://code.djangoproject.com/ticket/7990.

I've attached a patch that changes the import such that it doesn't import cStringIO.

comment:7 by cody@…, 13 years ago

As a temporary workaround, the unicode string can be encoded like so:

>>> print ContentFile(u'汉语'.encode('utf-8')).read().decode('utf-8')
汉语

comment:8 by Adam Nelson, 13 years ago

Has patch: set
Needs tests: set

by cody@…, 13 years ago

Attachment: 11739_with_test.diff added

Same as last patch but with a test

comment:9 by Gabriel Hurley, 13 years ago

Triage Stage: UnreviewedAccepted

comment:10 by Claude Paroz, 13 years ago

Needs tests: unset

My concern with cody's patch is that it also excludes cStringIO usage for the File class, which might slow down file handling on platforms supporting cStringIO.

by Claude Paroz, 13 years ago

Attachment: 11739-3.diff added

Wrap StringIO with utf-8 reader/writer

comment:11 by Claude Paroz, 13 years ago

Here's a possible alternative which only touches ContentFile. Of course, it would also be possible to force using StringIO.StringIO only for ContentFile. Some design decision to take...

comment:12 by Julien Phalip, 13 years ago

Could some benchmarking be done to assess how performance would be affected by either patch?

comment:13 by codysoyland, 13 years ago

I did some really basic benchmarks that show the first patch, which removes the cStringIO module completely, is much faster on unicode data, but slower on binary data than the second patch, which wraps the cStringIO module. I don't know how conclusive these tests are though, because they are with small datasets.

Test script (unicode):

    from django.core.files.base import ContentFile
    import timeit
    print timeit.timeit(lambda: ContentFile(u'test').read())

Test script (bytes):

    from django.core.files.base import ContentFile
    import timeit
    print timeit.timeit(lambda: ContentFile('test').read())

I ran each script 3 times on each patch. timeit runs the function 1,000,000 times and returns the number of seconds elapsed.

11739_with_test.diff (No cStringIO)

Unicode results:
7.600399017333984
7.397403955459595
7.361830949783325

Bytes results:
8.416419982910156
7.377918004989624
7.391348123550415

11739-3.diff (Wrapped cStringIO)

Unicode results:
14.868388175964355
14.914705991744995
14.742434024810791

Bytes results:
5.290870904922485
4.712155818939209
5.349420070648193

comment:14 by Claude Paroz, 13 years ago

Thanks for those tests!

What worries me in your patch, Cody, is that you are excluding cStringIO not only for ContentFile, but for the plain File class. I guess that ContentFile is mainly used in tests, where performance is less a concern (I see only one use in Django code, related to the collectstatic management command).

That's why I think the code which should also be tested performance-wise is the iter method of the File object.

by Claude Paroz, 13 years ago

Attachment: 11739-4.diff added

Third alternative

comment:15 by Claude Paroz, 13 years ago

While thinking about this, I had an idea of a third way to workaround this unicode problem, which should always use the fastest StringIO available. Naming might not be accurate, but at least you get the idea.

comment:16 by Łukasz Rekucki, 13 years ago

Did a little research on this: The docs show the use of unicode with ContentFile only once and it's with explicit encoding to UTF-8. Even after fixing the cStringIO bug (It has been fixed in 2.7.3, see http://bugs.python.org/issue1548891) passing unicode would still work only if the unicode string can be encoded to ASCII.

Note that in Python 3, there is no StringIO that accepts both bytes and unicode. One must rather explicitly choose between BytesIO and StringIO.

As the workaround for this bug is very simple and this has been for a while now, I propose to document ContentFile being bytes only and throwing a ValueError if given unicode. In Django 1.5, we can start using the new io and introduce TextContentFile that will use io.TextIO.

Version 0, edited 13 years ago by Łukasz Rekucki (next)

comment:17 by Adam Nelson, 13 years ago

+1

Very progressive. That's a similar direction to requests 0.10.0 (http://pypi.python.org/pypi/requests) moving to 'text' for unicode and 'content' for bytes.

comment:18 by Claude Paroz <claude@…>, 12 years ago

Resolution: fixed
Status: reopenedclosed

In [361d6738f89f8443855d4378d3241566d9fca6e9]:

Fixed #11739 -- Made ContentFile support Unicode input

comment:19 by Aymeric Augustin, 12 years ago

I'm late to the party (I only noticed this ticket because it was fixed) but I liked lrekucki's proposal -- file content *is* binary.

Note: See TracTickets for help on using tickets.
Back to Top