Code

Opened 17 months ago

Closed 17 months ago

Last modified 17 months ago

#19357 closed Bug (fixed)

Unicode errors when Django is installed under a non-ASCII path

Reported by: kujiu Owned by: aaugustin
Component: Python 2 Version: master
Severity: Release blocker Keywords:
Cc: ua_django_bugzilla@… Triage Stage: Accepted
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: yes
Easy pickings: no UI/UX: no

Description

With Python 2.7, when using the unicode_literals import, all strings in the current files are interpreted as unicode. But, all strings coming from an other file that doesn't contain this import are str. An UnicodeDecodeError occurs when concatenating directly strings from a file with unicode_literals and an other without. But all works fine with Python 3.x.

If we start django with django installed on a path containing unicode character (like asian character), a UnicodeDecodeError occurs on file django/utils/translation/trans_real.py (line 112). If an app have a path with unicode character, the error occurs on the same file at line 155. Each time, the traceback stops in posixpath.py (in python lib), with the command "path += '/' + b". Django doesn't start, tests don't start, shell doesn't start.

Some files in tests have the same issue like :

  • django/contrib/formtools/tests/init.py
  • django/utils/translation/trans_real.py tests/regressiontests/fixtures_regress/tests.py
  • tests/regressiontests/i18n/contenttypes/tests.py tests/regressiontests/i18n/patterns/tests.py
  • tests/regressiontests/i18n/tests.py tests/regressiontests/servers/tests.py
  • tests/regressiontests/staticfiles_tests/tests.py tests/regressiontests/templates/response.py
  • tests/regressiontests/test_client_regress/tests.py

Example of traceback (with "runtests.py --settings=test_sqlite"):

Traceback (most recent call last):
  File "./runtests.py", line 327, in <module>
    options.failfast, args)
  File "./runtests.py", line 170, in django_tests
    failures = test_runner.run_tests(test_labels, extra_tests=extra_tests)
  File "/home/kujiu/仕事/labs/projects/django-trunk/django/test/simple.py", line 366, in run_tests
    suite = self.build_suite(test_labels, extra_tests)
  File "/home/kujiu/仕事/labs/projects/django-trunk/django/test/simple.py", line 260, in build_suite
    suite.addTest(build_suite(app))
  File "/home/kujiu/仕事/labs/projects/django-trunk/django/test/simple.py", line 68, in build_suite
    test_module = get_tests(app_module)
  File "/home/kujiu/仕事/labs/projects/django-trunk/django/test/simple.py", line 25, in get_tests
    test_module = import_module('.'.join(prefix + [TEST_MODULE]))
  File "/home/kujiu/仕事/labs/projects/django-trunk/django/utils/importlib.py", line 35, in import_module
    __import__(name)
  File "/home/kujiu/仕事/labs/projects/django-trunk/django/contrib/flatpages/tests/__init__.py", line 2, in <mo
dule>
    from django.contrib.flatpages.tests.forms import *
  File "/home/kujiu/仕事/labs/projects/django-trunk/django/contrib/flatpages/tests/forms.py", line 4, in <modul
e>
    from django.contrib.flatpages.forms import FlatpageForm
  File "/home/kujiu/仕事/labs/projects/django-trunk/django/contrib/flatpages/forms.py", line 6, in <module>
    class FlatpageForm(forms.ModelForm):
  File "/home/kujiu/仕事/labs/projects/django-trunk/django/contrib/flatpages/forms.py", line 10, in FlatpageForm
    error_message = _("This value must contain only letters, numbers,"
  File "/home/kujiu/仕事/labs/projects/django-trunk/django/forms/fields.py", line 441, in __init__
    if error_message:
  File "/home/kujiu/仕事/labs/projects/django-trunk/django/utils/functional.py", line 117, in __wrapper__
    res = func(*self.__args, **self.__kw)
  File "/home/kujiu/仕事/labs/projects/django-trunk/django/utils/translation/__init__.py", line 71, in ugettext
    return _trans.ugettext(message)
  File "/home/kujiu/仕事/labs/projects/django-trunk/django/utils/translation/trans_real.py", line 275, in ugett
ext
    return do_translate(message, 'ugettext')
  File "/home/kujiu/仕事/labs/projects/django-trunk/django/utils/translation/trans_real.py", line 257, in do_tr
anslate
    _default = translation(settings.LANGUAGE_CODE)
  File "/home/kujiu/仕事/labs/projects/django-trunk/django/utils/translation/trans_real.py", line 112, in trans
lation
    globalpath = os.path.join(os.path.dirname(sys.modules[settings.__module__].__file__), 'locale')
  File "/usr/lib/python2.7/posixpath.py", line 80, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 12: ordinal not in range(128)

Attachments (3)

19357-wip.patch (62.0 KB) - added by aaugustin 17 months ago.
19357-wip-2.patch (79.5 KB) - added by aaugustin 17 months ago.
19357-upath.diff (76.4 KB) - added by claudep 17 months ago.
Wrapping file to return unicode

Download all attachments as: .zip

Change History (24)

comment:1 Changed 17 months ago by julien

  • Needs documentation unset
  • Needs tests unset
  • Patch needs improvement unset
  • Severity changed from Normal to Release blocker
  • Triage Stage changed from Unreviewed to Accepted

This seems like a potential important regression. The OP verified that it worked fine in 1.4 but this UnicodeDecodeError occurs in master.

comment:2 Changed 17 months ago by aaugustin

  • Owner changed from nobody to aaugustin

comment:3 Changed 17 months ago by aaugustin

Indeed, it appears that this regression was introduced by the unicode literals patch.

The fix may be as simple as using str('locale') on line 112 of django/utils/translation/trans_real.py.

We should audit for other instances of this problem.

comment:4 Changed 17 months ago by aaugustin

  • Summary changed from Unicode errors with mixing files with unicode_literals and others to Unicode errors when Django is installed under a non-ASCII path

comment:5 Changed 17 months ago by aaugustin

In fact, when the unicode_literals path was applied, filesystem paths operations weren't changed accordingly. Python uses native strings for filesystem paths.

In files that have unicode_literals turned on, every string involved in a path operation must be wrapped with str() or force_str(). I'm currently at 1000 lines of diff, and counting...

Last edited 17 months ago by aaugustin (previous) (diff)

comment:6 Changed 17 months ago by kmtracey

Wait, what? It ought to be OK to pass unicode on Python 2.X to Python os path functions. Yes, internally Python is going to have to turn unicode into a bytestring for passing into the system functions...but it should be able to do that, we should not have to do it before calling the os path functions. We've gone back and forth on this before, see for example #11030. Django code CANNOT be responsible for figuring out what the right encoding is for file system paths -- utf8 is going to be wrong on Asian windows systems, for example. Underlying Python should be configured correctly to figure that out. Or am I misunderstanding what you are having to do here....?

comment:7 Changed 17 months ago by aaugustin

As discussed on IRC, I'm just wrapping ASCII string literals in str(...) to obtain native strings where necessary, under Python 2 and 3, in modules that have unicode_literals turned on. I'm not guessing encodings.

Changed 17 months ago by aaugustin

comment:8 Changed 17 months ago by aaugustin

With the patch I just attached, the tests suite result is: FAILED (failures=5, errors=85, skipped=181, expected failures=5) when the checkout is under a non-ASCII path (ie. mv django djángò).

Version 0, edited 17 months ago by aaugustin (next)

comment:9 Changed 17 months ago by aaugustin

I forgot to mention that when we eventually deprecate support for Python 2, it should be easy to remove most superfluous str() calls with a custom 2to3 fixer that rewrites str(<STRING LITERAL>) to <STRING LITERAL>.

comment:10 Changed 17 months ago by aaugustin

Currently this depends #19398.

Changed 17 months ago by aaugustin

comment:11 Changed 17 months ago by aaugustin

The patch I just attached requires applying the pull request for #19398 first. Clocking at over 1500 lines, it improves the test results under Python 2.6 + SQLite to FAILED (failures=2, errors=9, skipped=187, expected failures=7). Remaining failures are in the forms, templates, urlpatterns_reverse, and file_storage tests.

If we go ahead with this amount of changes in the 1.5 branch, it would be more reasonable to make a second beta :(

I didn't remove the unicode_literals import, even in modules where many strings are wrapped in str(...), because I don't have a good rule to chose which modules should have unicode_literals and which shouldn't.

Changed 17 months ago by claudep

Wrapping file to return unicode

comment:12 Changed 17 months ago by claudep

As discussed on IRC, the attached patch tries to convert all __file__ occurrences to unicode. As for me, I think this is less invasive, but testing required!

comment:13 Changed 17 months ago by aaugustin

Here's another random example (file /tmp/café.txt exists on disk):

Django 1.4.2

>>> from django.http import HttpRequest
>>> from django.views.static import serve
>>> serve(HttpRequest(), 'café.txt', '/tmp')
<django.http.HttpResponse object at 0x104ebf310>

master

>>> from django.http import HttpRequest
>>> from django.views.static import serve
>>> serve(HttpRequest(), 'café.txt', '/tmp')
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/Users/myk/Documents/dev/django/django/views/static.py", line 38, in serve
    path = path.lstrip('/')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)

(This happens because unicode_literals is enabled => '/' is unicode => path.lstrip('/') attempts to convert path to unicode.)

We both spent hours working on this, and none of our patches fixes that regression!


From the users' point of view, it's the amount of breakage between 1.4 and 1.5 that matters, more than the amount of changes we have to do between now and 1.5. The real question is — how do we minimize the number of such regressions in modules where unicode_literals is enabled?

I don't have an answer. My instincts tell me that switching all filesystem path handling to unicode is risky — all the more since it happened by accident rather than for a good reason, and since no one anticipated the consequences — but that's nowhere near a scientific argument!

comment:14 Changed 17 months ago by BinaryKhaos

  • Cc ua_django_bugzilla@… added

comment:15 Changed 17 months ago by claudep

Yes, we will probably have to add some more force_text calls in several places.

We choosed at some point to go the unicode route (and this was more or less forced by the option to be able to run Django with both Python 2 and Python 3). Now I think we should pursue this option. Frankly, I don't think one path is more risky that another, it's a real "design decision".

comment:16 Changed 17 months ago by aaugustin

This mailing-list discussion makes it clear that we should use Claude's technique.

Claude, if you'd like me to proof read your patch, could you create a pull request?

comment:18 Changed 17 months ago by aaugustin

  • Has patch set
  • Patch needs improvement set

I left a few comments on the pull request, but nothing fundamental.

comment:19 Changed 17 months ago by aaugustin

Also, I'm still getting one encoding-related error with the patch on master:

======================================================================
ERROR: test_warnings_capture (regressiontests.logging_tests.tests.WarningLoggerTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/myk/Documents/dev/djángo/django/test/utils.py", line 213, in inner
    return test_func(*args, **kwargs)
  File "/Users/myk/Documents/dev/djángo/tests/regressiontests/logging_tests/tests.py", line 157, in test_warnings_capture
    self.assertTrue('Foo Deprecated' in output.getvalue())
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 27: ordinal not in range(128)

----------------------------------------------------------------------

comment:20 Changed 17 months ago by Claude Paroz <claude@…>

  • Resolution set to fixed
  • Status changed from new to closed

In c91667338a4e774e2819ccf4da852dc7b759bc19:

Fixed #19357 -- Allow non-ASCII chars in filesystem paths

Thanks kujiu for the report and Aymeric Augustin for the review.

comment:21 Changed 17 months ago by Claude Paroz <claude@…>

In 4214a22e060f9d692b77294659023633c63e1647:

[1.5.x] Fixed #19357 -- Allow non-ASCII chars in filesystem paths

Thanks kujiu for the report and Aymeric Augustin for the review.
Backport of c91667338 from master.

Add Comment

Modify Ticket

Change Properties
<Author field>
Action
as closed
as The resolution will be set. Next status will be 'closed'
The resolution will be deleted. Next status will be 'new'
Author


E-mail address and user name can be saved in the Preferences.

 
Note: See TracTickets for help on using tickets.