Opened 12 years ago
Closed 12 years ago
#19398 closed Bug (wontfix)
django.utils._os.safe_join should return a native string
Reported by: | Aymeric Augustin | Owned by: | nobody |
---|---|---|---|
Component: | Python 2 | Version: | 1.5-beta-1 |
Severity: | Release blocker | Keywords: | |
Cc: | Triage Stage: | Unreviewed | |
Has patch: | no | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
By default, filesystem paths are represented with native strings (ie. str
objects) in Python 2 and Python 3.
% python2 >>> import os >>> type(os.listdir('.')[0]) <type 'str'>
% python3 >>> import os >>> type(os.listdir('.')[0]) <class 'str'>
In other words, they were switch from bytestrings to unicode in Python 3.
A brief interlude for perfectionists and pedants :)
In Python 2, it's possible to use unicode for filesystem paths, when os.path.supports_unicode_filenames = True
, but that's not the default mode of operation.
In Python 3, it's possible to use bytestrings for filesystem paths, because not all supported platforms sport unicode-aware filesystems: see http://docs.python.org/3/library/os.path:
The path parameters can be passed as either strings, or bytes. Applications are encouraged to represent file names as (Unicode) character strings.
My initial statement still reflects the intent of Python's developers, from which Django shouldn't deviate.
The conversion to unicode was introduced 4 years ago in 8fb1459b5294fb9327b241fffec8576c5aa3fc7e. This commit was fixing an issue with the reporting of template loading errors.
In hindsight, it would have been better to keep safe_join
similar to os.path.join
, and preprocess the arguments or introduce a safe_joinu
method.
Excluding tests, safe_join
is used in four places in Django. Auditing these for proper use of bytestrings vs. unicode strings seems doable.
safe_join
isn't documented and the name _os
is a strong hint that it's a private API.
Therefore, I propose:
- to remove the coercion to unicode — which is incorrect anyway, because it doesn't honor
sys.getfilesystemencoding()
, and thus fails on non-utf-8 filesystems; - to perform the coercion in callers that need it, or remove it altogether if possible.
This bug is preventing me from fixing #19357, which is a release blocker, because the static files finders are using safe_join
. It is also the root cause for #17686.
Change History (4)
comment:2 by , 12 years ago
I've created a pull request that:
1) Introduces path_as_str
and path_as_text
functions:
- They do nothing under Python 3; they're only needed under Python 2; the following only applies to Python 2.
- Their primary goal is to make conversions to and from filesystem paths explicit. For instance, if
TEMPLATE_DIRS
contains an unicode string, it must be converted to the filesystem's charset before being used for filesystem-related operations. (If we were strong typing fanatics, we'd have a FileSystemPath type, distinct from the regular string types.) - Currently, they're used at all safe_join and abspathu call sites, to guarantee backwards-compatibility. But the goal is to remove as many uses as possible in the near future (ie. #19357).
2) Drops _os.abspathu(...)
in favor of the equivalent path_as_text(os.path.abspath(path_as_str(...))
.
3) Modifies the implementation _os.safe_join(..., ...)
to be consistent with os.path.join
typewise, and replaces its current uses by the new equivalent path_as_text(_os.safe_join(path_as_str(...), path_as_str(...))
4) Tests that safe_join
doesn't coerce to unicode any longer.
comment:3 by , 12 years ago
Pull request updated to accept bytestrings as paths under Python 3; path_as_str isn't a no-op any longer.
Tests pass with Python 2.6 and 3.2.
comment:4 by , 12 years ago
Resolution: | → wontfix |
---|---|
Status: | new → closed |
The design decision for #19357 makes this unnecessary.
Digging deeper, it appears that Django tries to manipulate filesystem paths as unicode in Python 2, which doesn't work so well.
This exception is all the more entertaining since the docstring of
abspathu
says "... thus avoiding a UnicodeDecodeError in join when the cwd has non-ASCII characters ..." ;)Also:
OK, I'm being mean, I understand that the intended use case is:
Still, supporting unicode filesystems path on Python 2 is a worthy goal, but not at the expense of breaking the default.
In order to avoid UnicodeErrors in all cases, I think we should convert values to
str
explicitly, usingsys.getfilesystemencoding()
as charset, where necessary. Having this conversion happen as a side effect insafe_join
isn't explicit or predictable; it'd better be done at the call site or with a dedicated utility function.For the record,
abspathu
is used insafe_join
and incore.files.storage
; once again there aren't many places to fix.