Opened 13 years ago
Closed 12 years ago
#16315 closed Bug (wontfix)
FileSystemStorage.listdir returns names with unicode normalization form that is different from names in database
Reported by: | philomat | Owned by: | nobody |
---|---|---|---|
Component: | File uploads/storage | Version: | 1.3 |
Severity: | Normal | Keywords: | storage unicode normalization |
Cc: | | Triage Stage: | Accepted |
Has patch: | no | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
Suppose you want to write a function that finds files on disk that are no longer referenced in the database, and you use FileSystemStorage.listdir to compare what it returns with what's in the database. You will not be able to compare the strings without normalizing them first, because the same unicode characters can be encoded using different normalization forms.
This problem is best demonstrated with some example code:
    # Assuming that my storage root contains one folder named u'ä'
    >>> import os
    >>> from django.core.files.storage import FileSystemStorage
    >>> import unicodedata
    >>> # listdir returns u'a' followed by 'COMBINING DIAERESIS' (U+0308)
    >>> FileSystemStorage().listdir('')[0][0]
    u'a\u0308'
    # in the database, this character is stored using a different normalization form:
    >>> os.path.basename(FileSystemStorage().path(u'ä'))
    u'\xe4'
    # the values should be normalized:
    >>> unicodedata.normalize('NFC', FileSystemStorage().listdir('')[0][0])
    u'\xe4'
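For illustration, here is a minimal sketch of the comparison described above, written in Python 3 syntax. The function name `find_orphaned_names` and the sample data are hypothetical and not part of Django; in practice the on-disk names would come from `FileSystemStorage().listdir('')` and the other list from a query. It assumes NFC is an acceptable common form for both sides.

```python
import unicodedata


def find_orphaned_names(names_on_disk, names_in_db, form="NFC"):
    """Return on-disk names that have no counterpart in the database.

    Both sides are normalized to the same form before comparing, so the
    decomposed and precomposed spellings of a name count as equal.
    The returned names keep their original on-disk spelling.
    """
    known = {unicodedata.normalize(form, name) for name in names_in_db}
    return [
        name
        for name in names_on_disk
        if unicodedata.normalize(form, name) not in known
    ]


# Hypothetical data: what listdir might return vs. what the database holds.
on_disk = ["a\u0308", "report.txt", "old.txt"]
in_db = ["\xe4", "report.txt"]

orphans = find_orphaned_names(on_disk, in_db)
print(orphans)
```

Only `old.txt` is reported here, because the decomposed folder name matches its precomposed database entry once both are normalized. Normalization is used only for the comparison; the names themselves are left untouched.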
Change History (9)
comment:1 by , 13 years ago
comment:2 by , 13 years ago
Resolution: | → needsinfo |
---|---|
Status: | new → closed |
At this point, we don't have enough information to assess if this is a bug in Django. Please reopen the ticket if you can provide the answers to my questions above.
follow-up: 4 comment:3 by , 13 years ago
Resolution: | needsinfo |
---|---|
Status: | closed → reopened |
- writing a file called u'\xe4', then listdir(), and see if it has turned into u'a\u0308'
This indeed turns into u'a\u0308'. The file system is Mac OS Extended (Journaled).
- saving the string u'a\u0308' in the database (in any CharField), then retrieve it, and see if it has turned into u'\xe4'
This fails, MySQL gives me an OperationalError: (1267, "Illegal mix of collations (latin1_swedish_ci,IMPLICIT) and (utf8_general_ci,COERCIBLE) for operation '='")
comment:4 by , 13 years ago
Resolution: | → needsinfo |
---|---|
Status: | reopened → closed |
Replying to anonymous:
This fails, MySQL gives me an OperationalError: (1267, "Illegal mix of collations (latin1_swedish_ci,IMPLICIT) and (utf8_general_ci,COERCIBLE) for operation '='")
I'm not a MySQL expert, but this looks like a configuration error. As the original reporter didn't provide more info, pushing this back to "needsinfo".
comment:5 by , 12 years ago
Resolution: | needsinfo |
---|---|
Status: | closed → reopened |
Triage Stage: | Unreviewed → Accepted |
I'm getting some test failures for the django staticfiles tests on my Ubuntu 10.04.4 LTS box due to unicode issues similar to those reported in this ticket.
The root of the problem is that my box's filesystem apparently uses combining diacritical marks for encoding certain characters when creating filenames. In the particular case of these failing tests, my box encodes the character u'ş' (u'\u015f') from the filename fişier.txt as u's\u0327', that is, as an 's' followed by a combining cedilla.
Here's one way of illustrating the problem:
    >>> u'ş'
    u'\u015f'
    >>> print u'\u015f'
    ş
    >>> print u's\u0327' # Combining cedilla
    ş
It seems like the right approach would be to make Django normalize filenames:
    >>> import unicodedata
    >>> unicodedata.normalize("NFC", u"s\u0327")
    u'\u015f'
comment:6 by , 12 years ago
On second thought, rather than systematically normalizing, Django should preserve the original encoding.
comment:7 by , 12 years ago
Indeed, in my experience, preserving the encoding is easier.
The reason is that if you normalize the filename, you cannot look up the file on disk by name any longer, you have to try all normalization forms.
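To make the cost concrete, here is a rough sketch (Python 3 syntax) of what "trying all normalization forms" could look like if only a normalized name were kept. The helpers `candidate_names` and `find_on_disk` are hypothetical, and the directory listing is passed in as a plain set so the example does not depend on any particular filesystem. It would still miss a name that mixes forms, which is part of why preserving the original is simpler.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def candidate_names(name):
    """Return the distinct spellings of name, original first."""
    candidates = [name]
    for form in FORMS:
        variant = unicodedata.normalize(form, name)
        if variant not in candidates:
            candidates.append(variant)
    return candidates


def find_on_disk(name, names_on_disk):
    """Return the on-disk spelling that matches name, or None."""
    for candidate in candidate_names(name):
        if candidate in names_on_disk:
            return candidate
    return None


# Hypothetical case: the stored name is precomposed, the disk name is not.
stored = "\xe4"
listing = {"a\u0308", "report.txt"}

match = find_on_disk(stored, listing)
print(repr(match))
```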
comment:8 by , 12 years ago
For reference, this issue is discussed here: http://nedbatchelder.com/blog/201106/filenames_with_accents.html
comment:9 by , 12 years ago
Resolution: | → wontfix |
---|---|
Status: | reopened → closed |
If I'm reading this ticket correctly, our decision is *not* to perform any normalization.
That's what Django does currently: it relies on the fact that normalization is preserved both in the database and in the filesystem (which wasn't true for the reporter; maybe the files were moved from one filesystem to another one with different normalization; maybe his database wasn't set up correctly; etc.)
This may not always hold true, but I'm convinced that it isn't a problem that can (or should) be fixed at the framework level.
If I understand correctly, the bug is the fact that the file name is normalized in NFC form in the database and in NFD form on the filesystem.
Django doesn't do any unicode normalisation — well, it does in two places, but they're obviously unrelated to the situation you describe.
Maybe the normalization in NFC form appears when the string round-trips in the database. Or maybe the normalization in NFD form appears when the file is written on the file system. In both cases, that's outside the control of Django, but I'd like to understand what happens.
Can you test:
- writing a file called u'\xe4', then listdir(), and see if it has turned into u'a\u0308'?
- saving the string u'a\u0308' in the database (in any CharField), then retrieve it, and see if it has turned into u'\xe4'
Also, which database and which filesystem are you using?
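When running checks like these, it may help to see which normalization forms a given string already satisfies at each step. A small sketch in Python 3 syntax follows; the helper `normalization_forms` is hypothetical and relies only on the standard `unicodedata.normalize` call.

```python
import unicodedata


def normalization_forms(name):
    """Return the normalization forms under which name is unchanged."""
    return [
        form
        for form in ("NFC", "NFD", "NFKC", "NFKD")
        if unicodedata.normalize(form, name) == name
    ]


precomposed = normalization_forms("\xe4")     # single code point
decomposed = normalization_forms("a\u0308")   # 'a' + COMBINING DIAERESIS
print(precomposed)  # ['NFC', 'NFKC']
print(decomposed)   # ['NFD', 'NFKD']
```

Applying this to the value before and after the filesystem or database round trip shows at which step the form changes.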