Opened 11 years ago

Closed 11 years ago

Last modified 11 years ago

#19357 closed Bug (fixed)

Unicode errors when Django is installed under a non-ASCII path

Reported by: kujiu Owned by: Aymeric Augustin
Component: Python 2 Version: dev
Severity: Release blocker Keywords:
Cc: ua_django_bugzilla@… Triage Stage: Accepted
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: yes
Easy pickings: no UI/UX: no

Description

With Python 2.7, when using the unicode_literals import, all strings in the current files are interpreted as unicode. But, all strings coming from an other file that doesn't contain this import are str. An UnicodeDecodeError occurs when concatenating directly strings from a file with unicode_literals and an other without. But all works fine with Python 3.x.

If we start django with django installed on a path containing unicode character (like asian character), a UnicodeDecodeError occurs on file django/utils/translation/trans_real.py (line 112). If an app have a path with unicode character, the error occurs on the same file at line 155. Each time, the traceback stops in posixpath.py (in python lib), with the command "path += '/' + b". Django doesn't start, tests don't start, shell doesn't start.

Some files in tests have the same issue like :

  • django/contrib/formtools/tests/init.py
  • django/utils/translation/trans_real.py tests/regressiontests/fixtures_regress/tests.py
  • tests/regressiontests/i18n/contenttypes/tests.py tests/regressiontests/i18n/patterns/tests.py
  • tests/regressiontests/i18n/tests.py tests/regressiontests/servers/tests.py
  • tests/regressiontests/staticfiles_tests/tests.py tests/regressiontests/templates/response.py
  • tests/regressiontests/test_client_regress/tests.py

Example of traceback (with "runtests.py --settings=test_sqlite"):

Traceback (most recent call last):
  File "./runtests.py", line 327, in <module>
    options.failfast, args)
  File "./runtests.py", line 170, in django_tests
    failures = test_runner.run_tests(test_labels, extra_tests=extra_tests)
  File "/home/kujiu/仕事/labs/projects/django-trunk/django/test/simple.py", line 366, in run_tests
    suite = self.build_suite(test_labels, extra_tests)
  File "/home/kujiu/仕事/labs/projects/django-trunk/django/test/simple.py", line 260, in build_suite
    suite.addTest(build_suite(app))
  File "/home/kujiu/仕事/labs/projects/django-trunk/django/test/simple.py", line 68, in build_suite
    test_module = get_tests(app_module)
  File "/home/kujiu/仕事/labs/projects/django-trunk/django/test/simple.py", line 25, in get_tests
    test_module = import_module('.'.join(prefix + [TEST_MODULE]))
  File "/home/kujiu/仕事/labs/projects/django-trunk/django/utils/importlib.py", line 35, in import_module
    __import__(name)
  File "/home/kujiu/仕事/labs/projects/django-trunk/django/contrib/flatpages/tests/__init__.py", line 2, in <mo
dule>
    from django.contrib.flatpages.tests.forms import *
  File "/home/kujiu/仕事/labs/projects/django-trunk/django/contrib/flatpages/tests/forms.py", line 4, in <modul
e>
    from django.contrib.flatpages.forms import FlatpageForm
  File "/home/kujiu/仕事/labs/projects/django-trunk/django/contrib/flatpages/forms.py", line 6, in <module>
    class FlatpageForm(forms.ModelForm):
  File "/home/kujiu/仕事/labs/projects/django-trunk/django/contrib/flatpages/forms.py", line 10, in FlatpageForm
    error_message = _("This value must contain only letters, numbers,"
  File "/home/kujiu/仕事/labs/projects/django-trunk/django/forms/fields.py", line 441, in __init__
    if error_message:
  File "/home/kujiu/仕事/labs/projects/django-trunk/django/utils/functional.py", line 117, in __wrapper__
    res = func(*self.__args, **self.__kw)
  File "/home/kujiu/仕事/labs/projects/django-trunk/django/utils/translation/__init__.py", line 71, in ugettext
    return _trans.ugettext(message)
  File "/home/kujiu/仕事/labs/projects/django-trunk/django/utils/translation/trans_real.py", line 275, in ugett
ext
    return do_translate(message, 'ugettext')
  File "/home/kujiu/仕事/labs/projects/django-trunk/django/utils/translation/trans_real.py", line 257, in do_tr
anslate
    _default = translation(settings.LANGUAGE_CODE)
  File "/home/kujiu/仕事/labs/projects/django-trunk/django/utils/translation/trans_real.py", line 112, in trans
lation
    globalpath = os.path.join(os.path.dirname(sys.modules[settings.__module__].__file__), 'locale')
  File "/usr/lib/python2.7/posixpath.py", line 80, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 12: ordinal not in range(128)

Attachments (3)

19357-wip.patch (62.0 KB ) - added by Aymeric Augustin 11 years ago.
19357-wip-2.patch (79.5 KB ) - added by Aymeric Augustin 11 years ago.
19357-upath.diff (76.4 KB ) - added by Claude Paroz 11 years ago.
Wrapping file to return unicode

Download all attachments as: .zip

Change History (24)

comment:1 by Julien Phalip, 11 years ago

Severity: NormalRelease blocker
Triage Stage: UnreviewedAccepted

This seems like a potential important regression. The OP verified that it worked fine in 1.4 but this UnicodeDecodeError occurs in master.

comment:2 by Aymeric Augustin, 11 years ago

Owner: changed from nobody to Aymeric Augustin

comment:3 by Aymeric Augustin, 11 years ago

Indeed, it appears that this regression was introduced by the unicode literals patch.

The fix may be as simple as using str('locale') on line 112 of django/utils/translation/trans_real.py.

We should audit for other instances of this problem.

comment:4 by Aymeric Augustin, 11 years ago

Summary: Unicode errors with mixing files with unicode_literals and othersUnicode errors when Django is installed under a non-ASCII path

comment:5 by Aymeric Augustin, 11 years ago

In fact, when the unicode_literals path was applied, filesystem paths operations weren't changed accordingly. Python uses native strings for filesystem paths.

In files that have unicode_literals turned on, every string involved in a path operation must be wrapped with str() or force_str(). I'm currently at 1000 lines of diff, and counting...

Last edited 11 years ago by Aymeric Augustin (previous) (diff)

comment:6 by Karen Tracey, 11 years ago

Wait, what? It ought to be OK to pass unicode on Python 2.X to Python os path functions. Yes, internally Python is going to have to turn unicode into a bytestring for passing into the system functions...but it should be able to do that, we should not have to do it before calling the os path functions. We've gone back and forth on this before, see for example #11030. Django code CANNOT be responsible for figuring out what the right encoding is for file system paths -- utf8 is going to be wrong on Asian windows systems, for example. Underlying Python should be configured correctly to figure that out. Or am I misunderstanding what you are having to do here....?

comment:7 by Aymeric Augustin, 11 years ago

As discussed on IRC, I'm just wrapping ASCII string literals in str(...) to obtain native strings where necessary, under Python 2 and 3, in modules that have unicode_literals turned on. I'm not guessing encodings.

by Aymeric Augustin, 11 years ago

Attachment: 19357-wip.patch added

comment:8 by Aymeric Augustin, 11 years ago

With the patch I just attached, the tests suite result is: FAILED (failures=5, errors=85, skipped=181, expected failures=5) when the checkout is under a non-ASCII path (ie. mv django djángò).

Last edited 11 years ago by Aymeric Augustin (previous) (diff)

comment:9 by Aymeric Augustin, 11 years ago

I forgot to mention that when we eventually deprecate support for Python 2, it should be easy to remove most superfluous str() calls with a custom 2to3 fixer that rewrites str(<STRING LITERAL>) to <STRING LITERAL>.

comment:10 by Aymeric Augustin, 11 years ago

Currently this depends #19398.

by Aymeric Augustin, 11 years ago

Attachment: 19357-wip-2.patch added

comment:11 by Aymeric Augustin, 11 years ago

The patch I just attached requires applying the pull request for #19398 first. Clocking at over 1500 lines, it improves the test results under Python 2.6 + SQLite to FAILED (failures=2, errors=9, skipped=187, expected failures=7). Remaining failures are in the forms, templates, urlpatterns_reverse, and file_storage tests.

If we go ahead with this amount of changes in the 1.5 branch, it would be more reasonable to make a second beta :(

I didn't remove the unicode_literals import, even in modules where many strings are wrapped in str(...), because I don't have a good rule to chose which modules should have unicode_literals and which shouldn't.

by Claude Paroz, 11 years ago

Attachment: 19357-upath.diff added

Wrapping file to return unicode

comment:12 by Claude Paroz, 11 years ago

As discussed on IRC, the attached patch tries to convert all __file__ occurrences to unicode. As for me, I think this is less invasive, but testing required!

comment:13 by Aymeric Augustin, 11 years ago

Here's another random example (file /tmp/café.txt exists on disk):

Django 1.4.2

>>> from django.http import HttpRequest
>>> from django.views.static import serve
>>> serve(HttpRequest(), 'café.txt', '/tmp')
<django.http.HttpResponse object at 0x104ebf310>

master

>>> from django.http import HttpRequest
>>> from django.views.static import serve
>>> serve(HttpRequest(), 'café.txt', '/tmp')
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/Users/myk/Documents/dev/django/django/views/static.py", line 38, in serve
    path = path.lstrip('/')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)

(This happens because unicode_literals is enabled => '/' is unicode => path.lstrip('/') attempts to convert path to unicode.)

We both spent hours working on this, and none of our patches fixes that regression!


From the users' point of view, it's the amount of breakage between 1.4 and 1.5 that matters, more than the amount of changes we have to do between now and 1.5. The real question is — how do we minimize the number of such regressions in modules where unicode_literals is enabled?

I don't have an answer. My instincts tell me that switching all filesystem path handling to unicode is risky — all the more since it happened by accident rather than for a good reason, and since no one anticipated the consequences — but that's nowhere near a scientific argument!

comment:14 by Matthias Dahl, 11 years ago

Cc: ua_django_bugzilla@… added

comment:15 by Claude Paroz, 11 years ago

Yes, we will probably have to add some more force_text calls in several places.

We choosed at some point to go the unicode route (and this was more or less forced by the option to be able to run Django with both Python 2 and Python 3). Now I think we should pursue this option. Frankly, I don't think one path is more risky that another, it's a real "design decision".

comment:16 by Aymeric Augustin, 11 years ago

This mailing-list discussion makes it clear that we should use Claude's technique.

Claude, if you'd like me to proof read your patch, could you create a pull request?

comment:18 by Aymeric Augustin, 11 years ago

Has patch: set
Patch needs improvement: set

I left a few comments on the pull request, but nothing fundamental.

comment:19 by Aymeric Augustin, 11 years ago

Also, I'm still getting one encoding-related error with the patch on master:

======================================================================
ERROR: test_warnings_capture (regressiontests.logging_tests.tests.WarningLoggerTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/myk/Documents/dev/djángo/django/test/utils.py", line 213, in inner
    return test_func(*args, **kwargs)
  File "/Users/myk/Documents/dev/djángo/tests/regressiontests/logging_tests/tests.py", line 157, in test_warnings_capture
    self.assertTrue('Foo Deprecated' in output.getvalue())
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 27: ordinal not in range(128)

----------------------------------------------------------------------

comment:20 by Claude Paroz <claude@…>, 11 years ago

Resolution: fixed
Status: newclosed

In c91667338a4e774e2819ccf4da852dc7b759bc19:

Fixed #19357 -- Allow non-ASCII chars in filesystem paths

Thanks kujiu for the report and Aymeric Augustin for the review.

comment:21 by Claude Paroz <claude@…>, 11 years ago

In 4214a22e060f9d692b77294659023633c63e1647:

[1.5.x] Fixed #19357 -- Allow non-ASCII chars in filesystem paths

Thanks kujiu for the report and Aymeric Augustin for the review.
Backport of c91667338 from master.

Note: See TracTickets for help on using tickets.
Back to Top