Opened 4 years ago

Closed 4 years ago

#31225 closed Cleanup/optimization (wontfix)

Use NFD normalization in get_valid_filename().

Reported by: Guillaume Thomas
Owned by: nobody
Component: Utilities
Version: 3.0
Severity: Normal
Keywords: text
Cc:
Triage Stage: Unreviewed
Has patch: no
Needs documentation: no
Needs tests: no
Patch needs improvement: no
Easy pickings: no
UI/UX: no

Description

Django uses the function `get_valid_filename` to get a 'clean filename' from any input string.

In theory, this function keeps only Unicode word characters (letters, digits, and the underscore), dashes (-), and periods (.). It relies on the standard re module to match Unicode characters.

There are several Unicode normalization forms (https://docs.python.org/3.8/library/unicodedata.html#unicodedata.normalize), and after some tests, it appears that re only handles the NFC form.

For instance:

import re
import unicodedata

re.match(r"^\w$", unicodedata.normalize("NFC", "é"), re.UNICODE)
# <_sre.SRE_Match object; span=(0, 1), match='é'>

re.match(r"^\w$", unicodedata.normalize("NFD", "é"), re.UNICODE)
# None
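
To see why the second match fails, it may help to look at the individual code points. The following is a minimal sketch using only the standard library; the `describe` helper is hypothetical and exists only for illustration. It suggests that the NFD form is two characters long and that the combining accent is not a word character as far as `\w` is concerned.

```python
import re
import unicodedata

nfc = unicodedata.normalize("NFC", "é")
nfd = unicodedata.normalize("NFD", "é")


def describe(s):
    """Hypothetical helper: list (code point, Unicode category, is a word
    character for re) for every character of the string."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch), bool(re.match(r"\w", ch)))
        for ch in s
    ]


nfc_info = describe(nfc)  # a single precomposed letter
nfd_info = describe(nfd)  # a base letter followed by a combining accent
print(nfc_info)
print(nfd_info)
```

In other words, the pattern anchored with ^ and $ cannot match the NFD string because it contains two characters, and the second one (category Mn) is not matched by the word-character class.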

This makes get_valid_filename behave differently according to the unicode normalization of the input string. Thus:

import unicodedata
from django.utils.text import get_valid_filename

get_valid_filename(unicodedata.normalize("NFC", "é"))
# é

get_valid_filename(unicodedata.normalize("NFD", "é"))
# e

It appears that the normalization form depends on the operating system. macOS uses a form close to NFD; on Unix, it's NFC. As a result, filenames coming from macOS systems are effectively "slugified" (their accents are stripped), which is not the case for other operating systems. My feeling at this stage is that this complexity could be abstracted away from the developer, so that this function handles strings in a "normalization-independent" way.
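
A minimal sketch of what such normalization-independent handling could look like, assuming the input is normalized to NFC before cleaning. To keep the example self-contained, `get_valid_filename_standin` is a stand-in that only approximates the Django 3.0 behaviour of `get_valid_filename`, and `normalized_valid_filename` is a hypothetical helper, not an existing Django API.

```python
import re
import unicodedata


def get_valid_filename_standin(s):
    # Stand-in approximating django.utils.text.get_valid_filename (Django 3.0):
    # trim, turn spaces into underscores, drop anything that is not a word
    # character, a dash, or a period.
    s = str(s).strip().replace(" ", "_")
    return re.sub(r"[^-\w.]", "", s)


def normalized_valid_filename(s, form="NFC"):
    # Hypothetical helper: normalize first, so that NFC and NFD inputs
    # produce the same cleaned filename.
    return get_valid_filename_standin(unicodedata.normalize(form, str(s)))


nfc_name = unicodedata.normalize("NFC", "résumé 2020.pdf")
nfd_name = unicodedata.normalize("NFD", "résumé 2020.pdf")

# Without normalization, the two inputs are cleaned differently.
print(get_valid_filename_standin(nfc_name) == get_valid_filename_standin(nfd_name))
# With normalization, both inputs give the same result.
print(normalized_valid_filename(nfc_name) == normalized_valid_filename(nfd_name))
```

Normalizing to NFC keeps the accented letters, so this variant only removes the dependency on the input form; it does not make the filename ASCII-only.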

I also think we could go further and force filenames to contain only ASCII characters. We found this quirk after an issue with our setup, which consists of a Django app behind nginx. To serve private media files, Django returns an empty HTTP response and provides the internal filename in the `X-Accel-Redirect` header. The problem was that nginx does not seem to like non-ASCII characters there.

In the end, I think a lot of bugs could be avoided by forcing NFD normalization in the get_valid_filename function. It would behave roughly like slugify with allow_unicode=False.
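
A rough sketch of the ASCII-only variant, assuming the same decompose-then-drop approach that slugify is understood to use when allow_unicode=False (NFKD normalization followed by encoding to ASCII with errors ignored). `ascii_valid_filename` is a hypothetical name, and the cleaning step only approximates what get_valid_filename does.

```python
import re
import unicodedata


def ascii_valid_filename(s):
    # Hypothetical helper, not part of Django.
    # 1. Decompose accented letters into base letter + combining marks.
    s = unicodedata.normalize("NFKD", str(s))
    # 2. Drop everything that is not ASCII (combining marks included).
    s = s.encode("ascii", "ignore").decode("ascii")
    # 3. Apply a cleaning step similar to get_valid_filename.
    s = s.strip().replace(" ", "_")
    return re.sub(r"[^-\w.]", "", s)


from_nfc = ascii_valid_filename(unicodedata.normalize("NFC", "été 2020.txt"))
from_nfd = ascii_valid_filename(unicodedata.normalize("NFD", "été 2020.txt"))
no_ascii_equivalent = ascii_valid_filename("日本語.txt")

print(from_nfc, from_nfd, no_ascii_equivalent)
```

One trade-off of this approach is that characters without an ASCII decomposition are removed entirely, so a name written in a non-Latin script can be reduced to just its extension.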

What do you think?

Change History (2)

comment:1 by Guillaume Thomas, 4 years ago

Summary: Use slugify in get_valid_filename → Use NFD normalization in get_valid_filename

comment:2 by Mariusz Felisiak, 4 years ago

Resolution: wontfix
Status: new → closed
Summary: Use NFD normalization in get_valid_filename → Use NFD normalization in get_valid_filename().
Type: Uncategorized → Cleanup/optimization

Thanks for this ticket; however, I don't think that Django should normalize filenames. The current behavior is tested and documented (see #16315 for a discussion and arguments against a similar change in FileSystemStorage). You can start a discussion on DevelopersMailingList if you don't agree.
