Opened 6 years ago

Closed 3 years ago

Last modified 3 years ago

#11739 closed Bug (fixed)

ContentFile() does not support unicode data

Reported by: adamnelson Owned by: nobody
Component: File uploads/storage Version: 1.1
Severity: Normal Keywords:
Cc: Triage Stage: Accepted
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

It's not clear to me where exactly this is a bug but it's certainly a problem. This example is for a model PageDisplay with a FileField named 'content':

>>> pager = PageDisplay.objects.get(id=4)
>>> pager
<PageDisplay: http://debtshed.com:8888/>
>>> pager.content
<FieldFile: None>
>>> from django.core.files.base import ContentFile
>>> pager.content.save('tester.html',ContentFile('test1'))
>>> pager.content.read()
'test1'
>>> pager.content.save('tester.html',ContentFile(u'test1'))
>>> pager.content.read()
't\x00e\x00s\x00t\x001\x00'

I'm not sure if this is working on the save and not the read or vice versa but I presume that the unicode translation should be symmetrical.

Attachments (4)

11739.diff (431 bytes) - added by codysoyland 3 years ago.
11739_with_test.diff (1.3 KB) - added by cody@… 3 years ago.
Same as last patch but with a test
11739-3.diff (1.7 KB) - added by claudep 3 years ago.
Wrap StringIO with utf-8 reader/writer
11739-4.diff (2.2 KB) - added by claudep 3 years ago.
Third alternative

Download all attachments as: .zip

Change History (23)

comment:1 Changed 6 years ago by adamnelson

  • Needs documentation unset
  • Needs tests unset
  • Patch needs improvement unset
  • Version changed from 1.0 to 1.1

Tried on 1.1 as well and also a problem.

comment:2 Changed 6 years ago by andrew@…

I was trying to reproduce this on current trunk on Mac Leopard, everything works as it should, my results are:

>>> pager.content.save('tester.html', ContentFile(u'test1'))
>>> pager.content.read()
'test1'

comment:3 Changed 6 years ago by adamnelson

It's an issue for me still, on Leopard. Can nobody else reproduce?

comment:4 Changed 5 years ago by russellm

  • Resolution set to worksforme
  • Status changed from new to closed

comment:5 Changed 4 years ago by anonymous

  • Easy pickings unset
  • Severity set to Normal
  • Type set to Uncategorized
  • UI/UX unset

I've met the same problem..
when I render a template_file to a static_file...and save in a FileField for further usage..

the result below is the same as model.file.save(..., ...)
and I have to encode it first...of course not a good solution...

>>> from django.core.files.base import ContentFile
>>> ContentFile(u'test').read()
't\x00e\x00s\x00t\x00'
>>> ContentFile(u'test'.encode('gb2312')).read()
'test'

Changed 3 years ago by codysoyland

comment:6 Changed 3 years ago by cody@…

  • Resolution worksforme deleted
  • Status changed from closed to reopened
  • Type changed from Uncategorized to Bug

I'm running into this as well. It stems from the fact that cStringIO does not properly support unicode. The following code demonstrates the difference between StringIO and cStringIO:

>>> import StringIO
>>> import cStringIO
>>> StringIO.StringIO(u'test').read()
u'test'
>>> cStringIO.StringIO(u'test').read()
't\x00e\x00s\x00t\x00'

This behavior has been discussed also in https://code.djangoproject.com/ticket/7990.

I've attached a patch that changes the import such that it doesn't import cStringIO.

comment:7 Changed 3 years ago by cody@…

As a temporary workaround, the unicode string can be encoded like so:

>>> print ContentFile(u'汉语'.encode('utf-8')).read().decode('utf-8')
汉语

comment:8 Changed 3 years ago by adamnelson

  • Has patch set
  • Needs tests set

Changed 3 years ago by cody@…

Same as last patch but with a test

comment:9 Changed 3 years ago by gabrielhurley

  • Triage Stage changed from Unreviewed to Accepted

comment:10 Changed 3 years ago by claudep

  • Needs tests unset

My concern with cody's patch is that it also excludes cStringIO usage for the File class, which might slow down file handling on platforms supporting cStringIO.

Changed 3 years ago by claudep

Wrap StringIO with utf-8 reader/writer

comment:11 Changed 3 years ago by claudep

Here's a possible alternative which only touches ContentFile. Of course, it would also be possible to force using StringIO.StringIO only for ContentFile. Some design decision to take...

comment:12 Changed 3 years ago by julien

Could some benchmarking be done to assess how performance would be affected by either patch?

comment:13 Changed 3 years ago by codysoyland

I did some really basic benchmarks that show the first patch, which removes the cStringIO module completely, is much faster on unicode data, but slower on binary data than the second patch, which wraps the cStringIO module. I don't know how conclusive these tests are though, because they are with small datasets.

Test script (unicode):

    from django.core.files.base import ContentFile
    import timeit
    print timeit.timeit(lambda: ContentFile(u'test').read())

Test script (bytes):

    from django.core.files.base import ContentFile
    import timeit
    print timeit.timeit(lambda: ContentFile('test').read())

I ran each script 3 times on each patch. timeit runs the function 1,000,000 times and returns the number of seconds elapsed.

11739_with_test.diff (No cStringIO)

Unicode results:
7.600399017333984
7.397403955459595
7.361830949783325

Bytes results:
8.416419982910156
7.377918004989624
7.391348123550415

11739-3.diff (Wrapped cStringIO)

Unicode results:
14.868388175964355
14.914705991744995
14.742434024810791

Bytes results:
5.290870904922485
4.712155818939209
5.349420070648193

comment:14 Changed 3 years ago by claudep

Thanks for those tests!

What worries me in your patch, Cody, is that you are excluding cStringIO not only for ContentFile, but for the plain File class. I guess that ContentFile is mainly used in tests, where performance is less a concern (I see only one use in Django code, related to the collectstatic management command).

That's why I think the code which should also be tested performance-wise is the iter method of the File object.

Changed 3 years ago by claudep

Third alternative

comment:15 Changed 3 years ago by claudep

While thinking about this, I had an idea of a third way to workaround this unicode problem, which should always use the fastest StringIO available. Naming might not be accurate, but at least you get the idea.

comment:16 Changed 3 years ago by lrekucki

Did a little research on this: The docs show the use of unicode with ContentFile only once and it's with explicit encoding to UTF-8. Even after fixing the cStringIO bug (It has been fixed in 2.7.3, see http://bugs.python.org/issue1548891) passing unicode would still work only if the unicode string can be encoded to ASCII.

Note that in Python 3, there is no StringIO that accepts both bytes and unicode. One must rather explicitly choose between BytesIO and StringIO.

As the workaround for this bug is very simple and this has been for a while now, I propose to document ContentFile being bytes only and throwing a ValueError if given unicode. In Django 1.5, we can start using the new io and introduce TextContentFile that will use io.TextIO.

PS. This is also what py3k branch does: https://code.djangoproject.com/browser/django/branches/features/py3k/django/core/files/base.py#L8

Last edited 3 years ago by lrekucki (previous) (diff)

comment:17 Changed 3 years ago by adamnelson

+1

Very progressive. That's a similar direction to requests 0.10.0 (http://pypi.python.org/pypi/requests) moving to 'text' for unicode and 'content' for bytes.

comment:18 Changed 3 years ago by Claude Paroz <claude@…>

  • Resolution set to fixed
  • Status changed from reopened to closed

In [361d6738f89f8443855d4378d3241566d9fca6e9]:

Fixed #11739 -- Made ContentFile support Unicode input

comment:19 Changed 3 years ago by aaugustin

I'm late to the party (I only noticed this ticket because it was fixed) but I liked lrekucki's proposal -- file content *is* binary.

Note: See TracTickets for help on using tickets.
Back to Top