django.utils._os.safe_join should return a native string
|Reported by:||aaugustin||Owned by:||nobody|
|Has patch:||no||Needs documentation:||no|
|Needs tests:||no||Patch needs improvement:||no|
By default, filesystem paths are represented with native strings (i.e. str objects) in both Python 2 and Python 3.
    % python2
    >>> import os
    >>> type(os.listdir('.')[0])
    <type 'str'>

    % python3
    >>> import os
    >>> type(os.listdir('.')[0])
    <class 'str'>
In other words, they were switched from bytestrings to unicode in Python 3.
A brief interlude for perfectionists and pedants :)
In Python 2, it's possible to use unicode for filesystem paths when os.path.supports_unicode_filenames is True, but that's not the default mode of operation.
In Python 3, it's possible to use bytestrings for filesystem paths, because not all supported platforms sport unicode-aware filesystems: see http://docs.python.org/3/library/os.path:
The path parameters can be passed as either strings, or bytes. Applications are encouraged to represent file names as (Unicode) character strings.
My initial statement still reflects the intent of Python's developers, from which Django shouldn't deviate.
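As a rough illustration of the Python 3 side of this interlude, the sketch below lists the same directory once with a str path and once with a bytes path. The helper name entry_types and the file name example.txt are made up for this example; only os.listdir, os.fsencode and tempfile come from the standard library.

```python
import os
import tempfile


def entry_types(directory):
    """Return the set of types that os.listdir yields for this directory argument."""
    return {type(name) for name in os.listdir(directory)}


with tempfile.TemporaryDirectory() as tmp:
    # Create one file so that the listing is not empty.
    open(os.path.join(tmp, "example.txt"), "w").close()

    # A str argument gives str entries; this is the default, native mode.
    str_types = entry_types(tmp)

    # A bytes argument gives bytes entries; supported, but not the default.
    bytes_types = entry_types(os.fsencode(tmp))
```

In both cases the entries come back in the type of the argument, which is the behaviour safe_join would mirror if it stopped coercing.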
The conversion to unicode was introduced 4 years ago in 8fb1459b5294fb9327b241fffec8576c5aa3fc7e. That commit fixed an issue with the reporting of template loading errors.
In hindsight, it would have been better to keep safe_join similar to os.path.join, and preprocess the arguments or introduce a safe_joinu method.
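To make the idea concrete, here is a minimal sketch of a join helper that behaves like os.path.join with a containment check and no coercion. It is not Django's implementation: the name safe_join_native is hypothetical, and the sketch assumes all arguments are native strings (str on Python 3).

```python
import os
from os.path import abspath, join, normcase, sep


def safe_join_native(base, *paths):
    """Join paths under base without changing the string type.

    Raises ValueError if the result would fall outside base.
    Hypothetical helper for illustration, not Django's safe_join.
    """
    base_path = abspath(base)
    final_path = abspath(join(base_path, *paths))

    # Compare case-normalized forms so the check also works on
    # case-insensitive filesystems.
    base_cmp = normcase(base_path)
    final_cmp = normcase(final_path)

    # A root directory already ends with a separator; do not double it.
    prefix = base_cmp if base_cmp.endswith(sep) else base_cmp + sep

    if final_cmp != base_cmp and not final_cmp.startswith(prefix):
        raise ValueError(
            "The joined path (%s) is located outside of the base path (%s)"
            % (final_path, base_path)
        )
    return final_path
```

The return value is whatever abspath produces, so a caller passing native strings gets a native string back, as with os.path.join.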
Excluding tests, safe_join is used in four places in Django. Auditing these for proper use of bytestrings vs. unicode strings seems doable.
safe_join isn't documented and the name _os is a strong hint that it's a private API.
Therefore, I propose:
- to remove the coercion to unicode — which is incorrect anyway, because it doesn't honor sys.getfilesystemencoding(), and thus fails on non-utf-8 filesystems;
- to perform the coercion in callers that need it, or remove it altogether if possible.
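For callers that really do need text, the coercion could look roughly like the sketch below, which decodes with sys.getfilesystemencoding() instead of assuming UTF-8. The helper name path_to_text is an assumption made for this example, and the sketch is written for Python 3, where bytes and str are distinct types.

```python
import sys


def path_to_text(path):
    """Return path as text, decoding bytestrings with the filesystem encoding.

    Hypothetical caller-side helper; text input is returned unchanged.
    """
    if isinstance(path, bytes):
        # Honor the platform's filesystem encoding rather than hardcoding UTF-8.
        return path.decode(sys.getfilesystemencoding())
    return path
```

On Python 3 the standard library's os.fsdecode covers a similar need, so a helper like this would mostly matter for code that must also run on Python 2.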