Opened 13 years ago
Closed 13 years ago
#19398 closed Bug (wontfix)
django.utils._os.safe_join should return a native string
| Reported by: | Aymeric Augustin | Owned by: | nobody |
|---|---|---|---|
| Component: | Python 2 | Version: | 1.5-beta-1 |
| Severity: | Release blocker | Keywords: | |
| Cc: | Triage Stage: | Unreviewed | |
| Has patch: | no | Needs documentation: | no |
| Needs tests: | no | Patch needs improvement: | no |
| Easy pickings: | no | UI/UX: | no |
Description
By default, filesystem paths are represented with native strings (ie. str objects) in Python 2 and Python 3.
% python2
>>> import os
>>> type(os.listdir('.')[0])
<type 'str'>
% python3
>>> import os
>>> type(os.listdir('.')[0])
<class 'str'>
In other words, they were switch from bytestrings to unicode in Python 3.
A brief interlude for perfectionists and pedants :)
In Python 2, it's possible to use unicode for filesystem paths, when os.path.supports_unicode_filenames = True, but that's not the default mode of operation.
In Python 3, it's possible to use bytestrings for filesystem paths, because not all supported platforms sport unicode-aware filesystems: see http://docs.python.org/3/library/os.path:
The path parameters can be passed as either strings, or bytes. Applications are encouraged to represent file names as (Unicode) character strings.
My initial statement still reflects the intent of Python's developers, from which Django shouldn't deviate.
The conversion to unicode was introduced 4 years ago in 8fb1459b5294fb9327b241fffec8576c5aa3fc7e. This commit was fixing an issue with the reporting of template loading errors.
In hindsight, it would have been better to keep safe_join similar to os.path.join, and preprocess the arguments or introduce a safe_joinu method.
Excluding tests, safe_join is used in four places in Django. Auditing these for proper use of bytestrings vs. unicode strings seems doable.
safe_join isn't documented and the name _os is a strong hint that it's a private API.
Therefore, I propose:
- to remove the coercion to unicode — which is incorrect anyway, because it doesn't honor
sys.getfilesystemencoding(), and thus fails on non-utf-8 filesystems; - to perform the coercion in callers that need it, or remove it altogether if possible.
This bug is preventing me from fixing #19357, which is a release blocker, because the static files finders are using safe_join. It is also the root cause for #17686.
Change History (4)
comment:2 by , 13 years ago
I've created a pull request that:
1) Introduces path_as_str and path_as_text functions:
- They do nothing under Python 3; they're only needed under Python 2; the following only applies to Python 2.
- Their primary goal is to make conversions to and from filesystem paths explicit. For instance, if
TEMPLATE_DIRScontains an unicode string, it must be converted to the filesystem's charset before being used for filesystem-related operations. (If we were strong typing fanatics, we'd have a FileSystemPath type, distinct from the regular string types.) - Currently, they're used at all safe_join and abspathu call sites, to guarantee backwards-compatibility. But the goal is to remove as many uses as possible in the near future (ie. #19357).
2) Drops _os.abspathu(...) in favor of the equivalent path_as_text(os.path.abspath(path_as_str(...)).
3) Modifies the implementation _os.safe_join(..., ...) to be consistent with os.path.join typewise, and replaces its current uses by the new equivalent path_as_text(_os.safe_join(path_as_str(...), path_as_str(...))
4) Tests that safe_join doesn't coerce to unicode any longer.
comment:3 by , 13 years ago
Pull request updated to accept bytestrings as paths under Python 3; path_as_str isn't a no-op any longer.
Tests pass with Python 2.6 and 3.2.
comment:4 by , 13 years ago
| Resolution: | → wontfix |
|---|---|
| Status: | new → closed |
The design decision for #19357 makes this unnecessary.
Digging deeper, it appears that Django tries to manipulate filesystem paths as unicode in Python 2, which doesn't work so well.
>>> from django.utils import _os >>> _os.abspathu('café') Traceback (most recent call last): File "<stdin>", line 1, in <module> File "django/utils/_os.py", line 28, in abspathu path = join(os.getcwdu(), path) File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/posixpath.py", line 71, in join path += '/' + b UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128) >>> import os >>> os.path.abspath('café') '/Users/aaugustin/Documents/dev/dj\xc3\xa1ng\xc3\xb2/caf\xc3\xa9'This exception is all the more entertaining since the docstring of
abspathusays "... thus avoiding a UnicodeDecodeError in join when the cwd has non-ASCII characters ..." ;)Also:
>>> _os.safe_join(_os.abspathu('.'), 'café') Traceback (most recent call last): File "<stdin>", line 1, in <module> File "django/utils/_os.py", line 39, in safe_join final_path = abspathu(join(base, *paths)) File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/posixpath.py", line 71, in join path += '/' + b UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128) >>> os.path.join(os.path.abspath('.'), 'café') '/Users/aaugustin/Documents/dev/dj\xc3\xa1ng\xc3\xb2/caf\xc3\xa9'OK, I'm being mean, I understand that the intended use case is:
>>> _os.safe_join(_os.abspathu('.'), u'café') u'/Users/aaugustin/Documents/dev/dj\xe1ng\xf2/caf\xe9'Still, supporting unicode filesystems path on Python 2 is a worthy goal, but not at the expense of breaking the default.
In order to avoid UnicodeErrors in all cases, I think we should convert values to
strexplicitly, usingsys.getfilesystemencoding()as charset, where necessary. Having this conversion happen as a side effect insafe_joinisn't explicit or predictable; it'd better be done at the call site or with a dedicated utility function.For the record,
abspathuis used insafe_joinand incore.files.storage; once again there aren't many places to fix.The next step is to study dfa90aec1bed28f581b0f0471dc95860bb166cc9. Its goal was to allow template to load directories under non-ASCII paths, see #9579. That commit didn't include a test, so we should exercise caution here. The best test is to run Django's tests from a checkout under a non-ASCII path.