#3119 closed defect (duplicate)
Problem uploading files with non-ASCII file names.
Reported by: | Owned by: | Leah Culver | |
---|---|---|---|
Component: | Database layer (models, ORM) | Version: | dev |
Severity: | normal | Keywords: | fs-rf-docs |
Cc: | Triage Stage: | Design decision needed | |
Has patch: | yes | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
When a file name composed only of non-ASCII characters is passed to FileField or ImageField, most of the file name is lost.
As a result, the column's character limit is exceeded after 100 or fewer such uploads.
Ex.
ééééé.txt -> .txt
ààààà.txt -> _.txt
àéàéé.txt -> __.txt
Attachments (2)
Change History (16)
follow-up: 3 comment:1 by , 18 years ago
Has patch: | set |
---|
by , 18 years ago
Attachment: | store_filename_as_punycode_for_trunk_4462.diff added |
---|
comment:2 by , 18 years ago
Why punycode? I'd think that most filesystems these days support UTF-8 (though, with different normalization, which *is* a problem).
- Wouldn't it be better to support any arbitrary settings.FILE_SYSTEM_ENCODING?
- What encoding does python use if you pass unicode to open()?
comment:3 by , 18 years ago
As I mentioned in django-dev: http://groups.google.com/group/django-developers/browse_thread/thread/d9d590962817fd78 ,
It would be better to just provide a hook for customizing the filename normalizer rather than pursuing a single flawless normalization
scheme. Here I post a patch that adds a filename_normalizer option to FileField's constructor so that developers can specify their own
normalization function. The patch includes code changes in django/db/models/fields/__init__.py and docs/model-api.txt.
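A minimal sketch of what such a hook could look like, written as plain Python rather than against FileField itself. The names `nfc_normalizer` and `save_name` are hypothetical, and the signature of the proposed filename_normalizer option (one string in, one string out) is an assumption, since the patch is not reproduced here.

```python
import unicodedata


def nfc_normalizer(filename):
    """Hypothetical normalizer: NFC-normalize and replace path separators."""
    filename = unicodedata.normalize("NFC", filename.strip())
    return filename.replace("/", "_").replace("\\", "_")


def save_name(filename, filename_normalizer=None):
    """Stand-in for the field logic: apply the caller's normalizer if given."""
    if filename_normalizer is None:
        return filename
    return filename_normalizer(filename)


# "e" followed by a combining acute accent is composed into a single character.
print(save_name("e\u0301.txt", filename_normalizer=nfc_normalizer) == "\u00e9.txt")
```

Passing the callable in keeps the choice of normalization (NFC, punycode, URL quoting, and so on) with the developer instead of fixing one scheme for every project.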
by , 18 years ago
Attachment: | filename_nomalizer_fix.diff added |
---|
Adds filename_normalizer to django.db.models.fields.FileField. Also includes a short description in model-api.txt.
comment:4 by , 18 years ago
Triage Stage: | Unreviewed → Design decision needed |
---|
comment:6 by , 17 years ago
Cc: | added |
---|
comment:7 by , 17 years ago
Cc: | removed |
---|
#5361 (Pluggable backends for FileField) could solve this problem.
Until the pluggable backends are available, I use my own version which uses django.utils.http.urlquote()
for the filenames.
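A rough sketch of the URL-quoting idea, using the standard library's `urllib.parse.quote` as a stand-in for django.utils.http.urlquote (an assumption; the commenter's own version is not shown). The helper name `quoted_filename` is hypothetical.

```python
from urllib.parse import quote, unquote


def quoted_filename(name):
    """Percent-encode every byte outside the unreserved ASCII set."""
    return quote(name, safe="")


encoded = quoted_filename("\u00e9.txt")  # "é.txt"
print(encoded)           # %C3%A9.txt
print(unquote(encoded))  # é.txt
```

The encoding is reversible, so the original name can be recovered, at the cost of a stored name up to three times as long per UTF-8 byte.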
comment:8 by , 17 years ago
Keywords: | fs-r added |
---|
comment:9 by , 17 years ago
Keywords: | fs-rf added; fs-r removed |
---|
comment:10 by , 17 years ago
Keywords: | fs-rf-docs added; fs-rf removed |
---|
comment:11 by , 17 years ago
milestone: | → 1.0 beta |
---|
comment:12 by , 17 years ago
Owner: | changed from | to
---|---|
Status: | new → assigned |
Multibyte characters in a filename are lost in get_valid_filename().
In django.db.models.fields, FileField and its subtypes call django.utils.text.get_valid_filename() to remove all "filename-unsafe" characters from the given filename.
The resulting filename consists of letters, numbers, hyphens and underscores.
However, this behaviour has undesirable effects in countries where multibyte filenames are used.
For example, if the original filename consists entirely of multibyte characters plus a '.txt' extension (such as 'ファイル.txt'), the resulting filename becomes '.txt' (only the extension, with no filename body).
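The effect can be reproduced with a small ASCII-only filter. This is an approximation of the behaviour described above, not Django's actual code; the function name `ascii_only_valid_filename` and the exact regular expression are assumptions.

```python
import re


def ascii_only_valid_filename(name):
    """Approximation of an ASCII-only filename filter (not Django's code)."""
    name = name.strip().replace(" ", "_")
    # Keep only ASCII letters, digits, hyphen, underscore and dot.
    return re.sub(r"[^-A-Za-z0-9_.]", "", name)


print(ascii_only_valid_filename("ファイル.txt"))      # .txt
print(ascii_only_valid_filename("report 2007.txt"))  # report_2007.txt
```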
Underscore-suffix uniquification breaks down easily
Things get worse if we have many such files: since FileField appends underscores to the filename until it becomes unique, if we have the files ['壱号文書.doc', '弐号文書.doc', '参号文書.doc', ...],
then the stored filenames become ['.doc', '_.doc', '__.doc', ...].
When the number of underscores reaches the maxlength of the filename field (100 or so), FileField begins to raise errors because the filename exceeds the length limit.
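The collapse can be simulated in a few lines. This is a sketch under stated assumptions: `MAX_LENGTH = 100` follows the "100 or so" figure above, the set `taken` stands in for the files already on disk, and the helper names are hypothetical.

```python
import re

MAX_LENGTH = 100  # assumed column limit, per the "100 or so" above


def strip_non_ascii(name):
    """Approximate the ASCII-only filtering of the filename."""
    return re.sub(r"[^-A-Za-z0-9_.]", "", name.strip().replace(" ", "_"))


def add_underscore(name):
    """Insert one underscore before the extension (or append if none)."""
    dot = name.rfind(".")
    if dot == -1:
        return name + "_"
    return name[:dot] + "_" + name[dot:]


def store(original, taken):
    """Return the name that would be stored, raising once it is too long."""
    name = strip_non_ascii(original)
    while name in taken:
        name = add_underscore(name)
    if len(name) > MAX_LENGTH:
        raise ValueError("filename exceeds %d characters" % MAX_LENGTH)
    taken.add(name)
    return name


taken = set()
print([store(n, taken) for n in ["壱号文書.doc", "弐号文書.doc", "参号文書.doc"]])
# ['.doc', '_.doc', '__.doc']
```

Because every all-multibyte name collapses to the same stem, each upload costs one more underscore, so the length limit is reached after fewer than 100 uploads sharing an extension.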
Proposed solution: punycode conversion before calling django.utils.text.get_valid_filename.
Add STORE_FILENAME_AS_PUNYCODE to global_settings, False by default.
If STORE_FILENAME_AS_PUNYCODE is True, encode the given string, except for the extension, in punycode.
Then generate a clean file name with get_valid_filename and return it.
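A minimal sketch of these steps, using Python's built-in `punycode` codec. It is not the attached patch: the function name `punycode_filename` is hypothetical, the proposed STORE_FILENAME_AS_PUNYCODE setting is modelled as a plain keyword argument, and the final cleaning step is an approximation of get_valid_filename.

```python
import re


def punycode_filename(name, store_as_punycode=True):
    """Punycode-encode the stem (not the extension), then clean the result."""
    if store_as_punycode:
        dot = name.rfind(".")
        stem, ext = (name, "") if dot == -1 else (name[:dot], name[dot:])
        # The punycode codec emits ASCII bytes only.
        name = stem.encode("punycode").decode("ascii") + ext
    # Approximate cleaning step: keep ASCII letters, digits, '-', '_' and '.'.
    return re.sub(r"[^-A-Za-z0-9_.]", "", name.strip().replace(" ", "_"))


stored = punycode_filename("ファイル.txt")
print(stored)
# The stem can be decoded back to the original characters.
print(stored[:-len(".txt")].encode("ascii").decode("punycode"))
```

Distinct multibyte stems encode to distinct ASCII stems, so the names no longer all collapse to a bare extension and the underscore suffixing is only needed for genuine duplicates.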