id	summary	reporter	owner	description	type	status	component	version	severity	resolution	keywords	cc	stage	has_patch	needs_docs	needs_tests	needs_better_patch	easy	ui_ux
31225	Use NFD normalization in get_valid_filename().	Guillaume Thomas	nobody	"Django uses the function [https://github.com/django/django/blob/6b178a3e930f72069f3cda2e6a09d1b320fc09ec/django/utils/text.py#L221-L232 `get_valid_filename`] to get a 'clean filename' from any input string.

In theory, this function keeps only Unicode word characters (underscore included), dashes (`-`) and dots (`.`). It relies on the standard `re` module to match word characters.

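There are several Unicode normalization forms (https://docs.python.org/3.8/library/unicodedata.html#unicodedata.normalize), and after some tests it appears that `\w` in `re` only matches accented letters in their NFC form. In NFD, `é` is decomposed into `e` followed by the combining acute accent (U+0301), which `\w` does not match.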

For instance:
{{{
import re
import unicodedata

re.match(""^\w$"", unicodedata.normalize(""NFC"", ""é""), re.UNICODE)
# <_sre.SRE_Match object; span=(0, 1), match='é'>

re.match(""^\w$"", unicodedata.normalize(""NFD"", ""é""), re.UNICODE)
# None
}}}
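
The failure above can be traced code point by code point. The following is a minimal sketch using only the standard library; the variable names are illustrative and not taken from Django:

```python
import re
import unicodedata

nfc = unicodedata.normalize('NFC', '\u00e9')  # single code point U+00E9
nfd = unicodedata.normalize('NFD', '\u00e9')  # 'e' followed by U+0301

# NFD splits the accented letter into a base letter and a combining mark.
nfd_codepoints = [hex(ord(ch)) for ch in nfd]

# \w accepts the base letter but rejects the combining mark.
word_matches = [bool(re.match(r'\w', ch)) for ch in nfd]

# The combining acute accent is a nonspacing mark (category Mn).
mark_category = unicodedata.category(nfd[1])
```

So `^\w$` fails on the NFD form for two reasons: the string is two code points long, and the second one is not a word character.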

This makes `get_valid_filename` behave differently according to the unicode normalization of the input string. Thus:
{{{
import unicodedata
from django.utils.text import get_valid_filename

get_valid_filename(unicodedata.normalize(""NFC"", ""é""))
# é

get_valid_filename(unicodedata.normalize(""NFD"", ""é""))
# e
}}}

It appears that the normalization of filenames depends on the operating system. macOS uses a [https://ss64.com/osx/syntax-filenames.html nearly NFD] form; on Unix, it's NFC. As a result, filenames coming from macOS systems are ""slugified"", which is not the case for other operating systems. My feeling at this stage is that this complexity could be abstracted away from the developer by giving this function ""normalization independent"" handling of strings.
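
One possible reading of normalization-independent handling is to compose the input to NFC before filtering. This is only a sketch: `get_valid_filename_nfc` is a hypothetical helper, not part of Django, and its filter is an approximation of the one in `get_valid_filename`:

```python
import re
import unicodedata


def get_valid_filename_nfc(name):
    # Hypothetical helper, not part of Django: compose to NFC first so that
    # NFC and NFD spellings of the same name are filtered identically.
    name = unicodedata.normalize('NFC', str(name)).strip().replace(' ', '_')
    # Approximation of the get_valid_filename filter: keep word characters,
    # dashes and dots, drop everything else.
    return re.sub(r'[^-\w.]', '', name)


from_nfc = get_valid_filename_nfc(unicodedata.normalize('NFC', 'caf\u00e9 menu.txt'))
from_nfd = get_valid_filename_nfc(unicodedata.normalize('NFD', 'caf\u00e9 menu.txt'))
```

Both calls return the same name with the accent preserved. Combining sequences that have no precomposed form would still lose their marks with this approach.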

I also think we could go further and force filenames to contain only ASCII characters. We found this quirk after an issue with our setup, which consists of a Django app behind nginx. To serve private media files, Django returns an empty HTTP response and provides the internal filename in the [https://www.nginx.com/resources/wiki/start/topics/examples/x-accel/#x-accel-redirect `X-Accel-Redirect` header]. The problem was that nginx does not seem to like non-ASCII characters there.

In the end, I think a lot of bugs could be avoided by forcing NFD normalization in the `get_valid_filename` function. The behaviour would be roughly the same as `slugify` with `allow_unicode=False`.
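
A rough sketch of what this could look like, assuming the same decompose-then-drop-non-ASCII approach that `slugify` takes (NFKD here rather than plain NFD). `ascii_filename` is a hypothetical name and the filter is an approximation of `get_valid_filename`, not the actual Django code:

```python
import re
import unicodedata


def ascii_filename(name):
    # Hypothetical helper, not part of Django.
    # Decompose accented letters into a base letter plus combining marks.
    name = unicodedata.normalize('NFKD', str(name))
    # Drop every non-ASCII code point, including the combining marks.
    name = name.encode('ascii', 'ignore').decode('ascii')
    name = name.strip().replace(' ', '_')
    # Keep word characters, dashes and dots only.
    return re.sub(r'[^-\w.]', '', name)


from_nfc = ascii_filename(unicodedata.normalize('NFC', 'caf\u00e9 menu.txt'))
from_nfd = ascii_filename(unicodedata.normalize('NFD', 'caf\u00e9 menu.txt'))
```

Both inputs give `cafe_menu.txt`. The trade-off is that characters with no ASCII decomposition are dropped entirely instead of being transliterated.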

What do you think?

"	Cleanup/optimization	closed	Utilities	3.0	Normal	wontfix	text		Unreviewed	0	0	0	0	0	0