Opened 18 years ago

Closed 16 years ago

Last modified 13 years ago

#3119 closed defect (duplicate)

Problem uploading files with non-ASCII file names.

Reported by: makoto tsuyuki <mtsuyuki@…>
Owned by: Leah Culver
Component: Database layer (models, ORM)
Version: dev
Severity: normal
Keywords: fs-rf-docs
Cc:
Triage Stage: Design decision needed
Has patch: yes
Needs documentation: no
Needs tests: no
Patch needs improvement: no
Easy pickings: no
UI/UX: no

Description

When a file name composed only of non-ASCII characters is passed to FileField or ImageField, most of the file name is lost.

As a result, the character limit of the column is exceeded after 100 or fewer such uploads.

Ex.
ééééé.txt -> .txt
ààààà.txt -> _.txt
àéàéé.txt -> __.txt

Attachments (2)

store_filename_as_punycode_for_trunk_4462.diff (4.0 KB) - added by makoto tsuyuki <mtsuyuki@…> 18 years ago.
filename_nomalizer_fix.diff (2.4 KB) - added by ymasuda <ymasuda@…> 18 years ago.
adds filename_normalizer to django.db.models.fields.FileField. also includes short description on model-api.txt.

Change History (16)

comment:1 by makoto tsuyuki <mtsuyuki@…>, 18 years ago

Has patch: set

Multibyte characters in a filename are lost in get_valid_filename().

In django.db.models.fields, FileField and its subclasses call django.utils.text.get_valid_filename() to remove all "filename-unsafe" characters from the given filename.

The resulting filename consists only of ASCII letters, digits, hyphens, underscores and dots.

However, this behaviour has undesirable effects for countries that use multibyte filenames.

For example, if the original filename consists entirely of multibyte characters plus a '.txt' extension (such as 'ファイル.txt'), the resulting filename becomes '.txt' (no filename body, only the extension).
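A minimal sketch of the behaviour described above. The function `ascii_only_valid_filename` is a hypothetical stand-in written for illustration, assuming the sanitizer keeps only ASCII letters, digits, hyphens, underscores and dots; it is not a copy of Django's code.

```python
import re


def ascii_only_valid_filename(name):
    """Hypothetical stand-in for an ASCII-only filename sanitizer."""
    name = name.strip().replace(" ", "_")
    # Drop every character that is not an ASCII letter, digit, '-', '_' or '.'.
    return re.sub(r"[^-A-Za-z0-9_.]", "", name)


print(ascii_only_valid_filename("report 2007.txt"))  # report_2007.txt
print(ascii_only_valid_filename("ファイル.txt"))  # .txt
```

Under that assumption, any name whose body is entirely non-ASCII collapses to its bare extension.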

Underscore-suffix uniquification breaks down easily

Things get worse if we have a lot of such files: FileField adds underscores to the filename until it becomes unique, so if we have the files ['壱号文書.doc', '弐号文書.doc', '参号文書.doc', ...],
the filename records become ['.doc', '_.doc', '__.doc', ...].

When the number of underscores reaches the maxlength of the filename field (100 or so), FileField starts raising errors because the filename exceeds the length limit.
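The collapse can be simulated with a short sketch. `add_underscore` and `store_all` are hypothetical helpers, and the assumptions that the underscore is inserted just before the extension and that the limit is exactly 100 characters are made for illustration only.

```python
def add_underscore(name):
    """Insert one underscore before the extension (assumed behaviour)."""
    dot = name.rfind(".")
    if dot == -1:
        return name + "_"
    return name[:dot] + "_" + name[dot:]


def store_all(cleaned_names, max_length=100):
    """Simulate uniquification of names that were already sanitized."""
    taken = set()
    stored = []
    for name in cleaned_names:
        # Keep adding underscores until the name no longer collides.
        while name in taken:
            name = add_underscore(name)
        if len(name) > max_length:
            raise ValueError("filename too long: %d characters" % len(name))
        taken.add(name)
        stored.append(name)
    return stored


print(store_all([".doc", ".doc", ".doc"]))  # ['.doc', '_.doc', '__.doc']
```

With these assumptions, 97 files that all sanitize to '.doc' still fit, and the 98th raises because it needs 97 underscores plus the four-character extension.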

Proposed solution: punycode conversion before calling django.utils.text.get_valid_filename.

Add STORE_FILENAME_AS_PUNYCODE to global_settings, False by default.

If STORE_FILENAME_AS_PUNYCODE is True, encode the given filename, except for its extension, in punycode.

Then generate a clean file name with get_valid_filename and return it.
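A rough sketch of the proposed steps, not the attached patch itself. `punycode_filename` is a hypothetical helper, the sanitizing regex is an assumed stand-in for get_valid_filename, and the settings flag is left out; the `punycode` codec used here ships with Python's standard library.

```python
import os
import re


def punycode_filename(name):
    """Punycode the filename body, keep the extension, then sanitize."""
    stem, ext = os.path.splitext(name)
    # Punycode output for non-ASCII input uses only ASCII letters, digits and '-'.
    encoded = stem.encode("punycode").decode("ascii")
    # Assumed stand-in for get_valid_filename: keep ASCII-safe characters only.
    return re.sub(r"[^-A-Za-z0-9_.]", "", encoded + ext)


names = ["壱号文書.doc", "弐号文書.doc", "参号文書.doc"]
print([punycode_filename(n) for n in names])
```

Because distinct bodies encode to distinct ASCII strings, the three documents no longer collide on '.doc'.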

by makoto tsuyuki <mtsuyuki@…>, 18 years ago

comment:2 by Michael Radziej <mir@…>, 18 years ago

Why punycode? I'd think that most filesystems these days support UTF-8 (though, with different normalization, which *is* a problem).

  • Wouldn't it be better to support any arbitrary settings.FILE_SYSTEM_ENCODING?
  • What encoding does Python use if you pass unicode to open()?

in reply to:  1 comment:3 by ymasuda <ymasuda@…>, 18 years ago

As I mentioned in django-dev (http://groups.google.com/group/django-developers/browse_thread/thread/d9d590962817fd78),
it would be better to provide a hook for customizing the filename normalizer than to pursue a single flawless normalization
scheme. Here I post a patch that adds a filename_normalizer option to FileField's constructor so that developers can specify their own
normalization function. The patch includes code changes in django/db/models/fields/__init__.py and docs/model-api.txt.
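The idea of a pluggable normalizer can be sketched as follows. This is not the code from the attached patch: `FileFieldSketch`, `default_normalizer` and `unicode_normalizer` are hypothetical names, and only the `filename_normalizer` option name comes from the comment above.

```python
import re
import unicodedata


def default_normalizer(name):
    """Assumed ASCII-only default, similar to the behaviour reported here."""
    return re.sub(r"[^-A-Za-z0-9_.]", "", name.strip().replace(" ", "_"))


def unicode_normalizer(name):
    """Example custom normalizer: NFC-normalize and keep Unicode word characters."""
    name = unicodedata.normalize("NFC", name.strip().replace(" ", "_"))
    return re.sub(r"[^\w.-]", "", name)


class FileFieldSketch:
    """Hypothetical field that delegates filename cleaning to a callable."""

    def __init__(self, filename_normalizer=default_normalizer):
        self.filename_normalizer = filename_normalizer

    def clean_filename(self, name):
        return self.filename_normalizer(name)


print(FileFieldSketch().clean_filename("ファイル.txt"))  # .txt
print(FileFieldSketch(filename_normalizer=unicode_normalizer).clean_filename("ファイル.txt"))  # ファイル.txt
```

Passing the policy in as a callable leaves the choice between punycode, percent-encoding, or plain Unicode names to each project.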

by ymasuda <ymasuda@…>, 18 years ago

Attachment: filename_nomalizer_fix.diff added

adds filename_normalizer to django.db.models.fields.FileField. also includes short description on model-api.txt.

comment:4 by MichaelRadziej <mir@…>, 18 years ago

Triage Stage: Unreviewed → Design decision needed

comment:5 by James Bennett, 17 years ago

#1355 was a duplicate.

comment:6 by Thomas Güttler <hv@…>, 17 years ago

Cc: hv@… added

comment:7 by Thomas Guettler (Home), 17 years ago

Cc: hv@… removed

#5361 (Pluggable backends for FileField) could solve this problem.
Until the pluggable backends are available, I use my own version which uses django.utils.http.urlquote()
for the filenames.
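A sketch of the percent-encoding approach, using the standard library's `urllib.parse.quote` as a stand-in for django.utils.http.urlquote(). `quote_filename` is a hypothetical helper, and how the commenter's own version wires this into FileField is not shown in this ticket.

```python
from urllib.parse import quote, unquote


def quote_filename(name):
    """Percent-encode a filename so that only ASCII characters remain."""
    # safe="" also escapes "/" so the result cannot contain a path separator.
    return quote(name, safe="")


print(quote_filename("é.txt"))  # %C3%A9.txt
print(unquote(quote_filename("ファイル.txt")))  # ファイル.txt
```

The encoding is reversible, but every non-ASCII character grows to several ASCII characters, so long names reach the column's length limit sooner.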

comment:8 by Jacob, 17 years ago

Keywords: fs-r added

comment:9 by Marty Alchin, 17 years ago

Keywords: fs-rf added; fs-r removed

comment:10 by Marty Alchin, 17 years ago

Keywords: fs-rf-docs added; fs-rf removed

comment:11 by Marty Alchin, 17 years ago

milestone: 1.0 beta

comment:12 by Leah Culver, 16 years ago

Owner: changed from nobody to Leah Culver
Status: new → assigned

comment:13 by Leah Culver, 16 years ago

Resolution: duplicate
Status: assigned → closed

dupe of #6009

comment:14 by Jacob, 13 years ago

milestone: 1.0 beta

Milestone 1.0 beta deleted
