#11739 closed Bug (fixed)
ContentFile() does not support unicode data
Reported by: | Adam Nelson | Owned by: | nobody |
---|---|---|---|
Component: | File uploads/storage | Version: | 1.1 |
Severity: | Normal | Keywords: | |
Cc: | Triage Stage: | Accepted | |
Has patch: | yes | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
It's not clear to me where exactly this is a bug but it's certainly a problem. This example is for a model PageDisplay with a FileField named 'content':
>>> pager = PageDisplay.objects.get(id=4) >>> pager <PageDisplay: http://debtshed.com:8888/> >>> pager.content <FieldFile: None> >>> from django.core.files.base import ContentFile >>> pager.content.save('tester.html',ContentFile('test1')) >>> pager.content.read() 'test1' >>> pager.content.save('tester.html',ContentFile(u'test1')) >>> pager.content.read() 't\x00e\x00s\x00t\x001\x00'
I'm not sure if this is working on the save and not the read or vice versa but I presume that the unicode translation should be symmetrical.
Attachments (4)
Change History (23)
comment:1 by , 15 years ago
Version: | 1.0 → 1.1 |
---|
comment:2 by , 15 years ago
I was trying to reproduce this on current trunk on Mac Leopard, everything works as it should, my results are:
>>> pager.content.save('tester.html', ContentFile(u'test1')) >>> pager.content.read() 'test1'
comment:4 by , 15 years ago
Resolution: | → worksforme |
---|---|
Status: | new → closed |
comment:5 by , 13 years ago
Easy pickings: | unset |
---|---|
Severity: | → Normal |
Type: | → Uncategorized |
UI/UX: | unset |
I've met the same problem..
when I render a template_file to a static_file...and save in a FileField for further usage..
the result below is the same as model.file.save(..., ...)
and I have to encode it first...of course not a good solution...
>>> from django.core.files.base import ContentFile >>> ContentFile(u'test').read() 't\x00e\x00s\x00t\x00' >>> ContentFile(u'test'.encode('gb2312')).read() 'test'
by , 13 years ago
Attachment: | 11739.diff added |
---|
comment:6 by , 13 years ago
Resolution: | worksforme |
---|---|
Status: | closed → reopened |
Type: | Uncategorized → Bug |
I'm running into this as well. It stems from the fact that cStringIO does not properly support unicode. The following code demonstrates the difference between StringIO and cStringIO:
>>> import StringIO >>> import cStringIO >>> StringIO.StringIO(u'test').read() u'test' >>> cStringIO.StringIO(u'test').read() 't\x00e\x00s\x00t\x00'
This behavior has been discussed also in https://code.djangoproject.com/ticket/7990.
I've attached a patch that changes the import such that it doesn't import cStringIO.
comment:7 by , 13 years ago
As a temporary workaround, the unicode string can be encoded like so:
>>> print ContentFile(u'汉语'.encode('utf-8')).read().decode('utf-8') 汉语
comment:8 by , 13 years ago
Has patch: | set |
---|---|
Needs tests: | set |
comment:9 by , 13 years ago
Triage Stage: | Unreviewed → Accepted |
---|
comment:10 by , 13 years ago
Needs tests: | unset |
---|
My concern with cody's patch is that it also excludes cStringIO usage for the File class, which might slow down file handling on platforms supporting cStringIO.
comment:11 by , 13 years ago
Here's a possible alternative which only touches ContentFile. Of course, it would also be possible to force using StringIO.StringIO only for ContentFile. Some design decision to take...
comment:12 by , 13 years ago
Could some benchmarking be done to assess how performance would be affected by either patch?
comment:13 by , 13 years ago
I did some really basic benchmarks that show the first patch, which removes the cStringIO module completely, is much faster on unicode data, but slower on binary data than the second patch, which wraps the cStringIO module. I don't know how conclusive these tests are though, because they are with small datasets.
Test script (unicode):
from django.core.files.base import ContentFile import timeit print timeit.timeit(lambda: ContentFile(u'test').read())
Test script (bytes):
from django.core.files.base import ContentFile import timeit print timeit.timeit(lambda: ContentFile('test').read())
I ran each script 3 times on each patch. timeit runs the function 1,000,000 times and returns the number of seconds elapsed.
11739_with_test.diff (No cStringIO)
Unicode results: 7.600399017333984 7.397403955459595 7.361830949783325 Bytes results: 8.416419982910156 7.377918004989624 7.391348123550415
11739-3.diff (Wrapped cStringIO)
Unicode results: 14.868388175964355 14.914705991744995 14.742434024810791 Bytes results: 5.290870904922485 4.712155818939209 5.349420070648193
comment:14 by , 13 years ago
Thanks for those tests!
What worries me in your patch, Cody, is that you are excluding cStringIO not only for ContentFile, but for the plain File class. I guess that ContentFile is mainly used in tests, where performance is less a concern (I see only one use in Django code, related to the collectstatic management command).
That's why I think the code which should also be tested performance-wise is the iter method of the File object.
comment:15 by , 13 years ago
While thinking about this, I had an idea of a third way to workaround this unicode problem, which should always use the fastest StringIO available. Naming might not be accurate, but at least you get the idea.
comment:16 by , 13 years ago
Did a little research on this: The docs show the use of unicode with ContentFile
only once and it's with explicit encoding to UTF-8. Even after fixing the cStringIO bug (It has been fixed in 2.7.3, see http://bugs.python.org/issue1548891) passing unicode would still work only if the unicode string can be encoded to ASCII.
Note that in Python 3, there is no StringIO
that accepts both bytes and unicode. One must rather explicitly choose between BytesIO and StringIO.
As the workaround for this bug is very simple and this has been for a while now, I propose to document ContentFile being bytes only and throwing a ValueError if given unicode. In Django 1.5, we can start using the new io
and introduce TextContentFile
that will use io.TextIO.
PS. This is also what py3k branch does: https://code.djangoproject.com/browser/django/branches/features/py3k/django/core/files/base.py#L8
comment:17 by , 13 years ago
+1
Very progressive. That's a similar direction to requests 0.10.0 (http://pypi.python.org/pypi/requests) moving to 'text' for unicode and 'content' for bytes.
comment:18 by , 12 years ago
Resolution: | → fixed |
---|---|
Status: | reopened → closed |
comment:19 by , 12 years ago
I'm late to the party (I only noticed this ticket because it was fixed) but I liked lrekucki's proposal -- file content *is* binary.
Tried on 1.1 as well and also a problem.