
Opened 7 years ago

Closed 6 years ago

Last modified 3 years ago

#3119 closed defect (duplicate)

Problem uploading files with non-ASCII file names.

Reported by: makoto tsuyuki <mtsuyuki@…>
Owned by: leahculver
Component: Database layer (models, ORM)
Version: master
Severity: normal
Keywords: fs-rf-docs
Cc:
Triage Stage: Design decision needed
Has patch: yes
Needs documentation: no
Needs tests: no
Patch needs improvement: no
Easy pickings:
UI/UX:

Description

When a file name composed only of non-ASCII characters is passed to FileField or ImageField, most of the file name is lost.

In addition, after 100 or fewer such uploads, the stored file name exceeds the character limit of the column.

Ex.
ééééé.txt -> .txt
ààààà.txt -> _.txt
àéàéé.txt -> __.txt

Attachments (2)

store_filename_as_punycode_for_trunk_4462.diff (4.0 KB) - added by makoto tsuyuki <mtsuyuki@…> 7 years ago.
filename_nomalizer_fix.diff (2.4 KB) - added by ymasuda <ymasuda@…> 7 years ago.
adds filename_normalizer to django.db.models.fields.FileField. also includes short description on model-api.txt.


Change History (16)

comment:1 follow-up: Changed 7 years ago by makoto tsuyuki <mtsuyuki@…>

  • Has patch set

Multibyte characters in a filename are lost in get_valid_filename().

In django.db.models.fields, FileField and its subtypes call django.utils.text.get_valid_filename() to remove all "filename-unsafe" characters from the given filename.

The resulting filename consists only of ASCII letters, digits, hyphens, underscores and dots.

However, this behaviour has undesirable effects in countries where filenames use multibyte characters.

For example, if the original filename consists entirely of multibyte characters plus a '.txt' extension (such as 'ファイル.txt'), the resulting filename becomes '.txt' (only the extension, with no filename body).
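The loss described above can be reproduced with a small stand-alone sketch. The function ascii_only_filename is a hypothetical stand-in written for this illustration; it assumes the filter keeps only ASCII letters, digits, hyphens, underscores and dots, and it is not a copy of Django's get_valid_filename().

```python
import re


def ascii_only_filename(name):
    """Hypothetical stand-in for the filtering described in this ticket.

    Assumption: spaces become underscores and every character outside
    ASCII letters, digits, '-', '_' and '.' is dropped.
    """
    name = name.strip().replace(" ", "_")
    return re.sub(r"[^-A-Za-z0-9_.]", "", name)


print(ascii_only_filename("report 2007.txt"))  # report_2007.txt
print(ascii_only_filename("ファイル.txt"))       # .txt
```

An ASCII name passes through almost unchanged, while a name made up entirely of multibyte characters keeps nothing but its extension.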

Underscore-suffix uniquification easily breaks down

Things get worse if we have a lot of such files: since FileField adds underscores to the filename until it becomes unique, if we upload ['壱号文書.doc', '弐号文書.doc', '参号文書.doc', ...],
then the stored filenames become ['.doc', '_.doc', '__.doc', ...].

When the number of underscores reaches the maxlength of the filename field (100 or so), FileField starts raising errors because the length of the filename exceeds the limit.
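To make the failure mode concrete, here is a minimal simulation. Everything in it is an assumption made for illustration: MAX_LENGTH = 100 stands for the "100 or so" column limit, ascii_only_filename approximates the filtering, and uniquify inserts one underscore before the extension per collision, which matches the example names in this ticket but is not Django's actual code.

```python
import re

MAX_LENGTH = 100  # assumed column limit ("100 or so" in the report)


def ascii_only_filename(name):
    """Hypothetical filter: keep ASCII letters, digits, '-', '_' and '.'."""
    return re.sub(r"[^-A-Za-z0-9_.]", "", name.strip().replace(" ", "_"))


def uniquify(name, taken):
    """Insert underscores before the extension until the name is unused."""
    while name in taken:
        stem, dot, ext = name.rpartition(".")
        if dot:
            name = stem + "_" + dot + ext
        else:
            name = name + "_"
    return name


taken = set()
first_overflow = None
for i in range(200):
    # 200 distinct file names, each a single CJK character plus ".doc"
    original = chr(0x4E00 + i) + ".doc"
    stored = uniquify(ascii_only_filename(original), taken)
    taken.add(stored)
    if first_overflow is None and len(stored) > MAX_LENGTH:
        first_overflow = i + 1

print(first_overflow)  # 98
```

Under these assumptions the 98th upload is the first whose stored name (97 underscores plus '.doc') is longer than 100 characters.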

Proposed solution: convert the filename to punycode before calling django.utils.text.get_valid_filename.

Add STORE_FILENAME_AS_PUNYCODE to global_settings, False by default.

If STORE_FILENAME_AS_PUNYCODE is True, encode the given filename, except for its extension, in punycode.

Then generate a clean filename with get_valid_filename and return it.
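The proposal can be sketched roughly as follows. This is not the attached patch: punycode_filename and ascii_only_filename are hypothetical names, the setting is modelled as a module-level constant rather than a Django setting, and Python's built-in "punycode" codec is assumed as the encoder.

```python
import os
import re

# Assumption: modelled as a plain constant; the proposal puts it in settings.
STORE_FILENAME_AS_PUNYCODE = True


def ascii_only_filename(name):
    """Hypothetical filter: keep ASCII letters, digits, '-', '_' and '.'."""
    return re.sub(r"[^-A-Za-z0-9_.]", "", name.strip().replace(" ", "_"))


def punycode_filename(name):
    """Encode the stem (not the extension) as punycode, then filter."""
    stem, ext = os.path.splitext(name)
    if STORE_FILENAME_AS_PUNYCODE:
        # The punycode codec emits only ASCII letters, digits and hyphens
        # for the non-ASCII part, so the filter no longer discards it.
        stem = stem.encode("punycode").decode("ascii")
    return ascii_only_filename(stem + ext)


stored = punycode_filename("ファイル.txt")
print(stored)
# The original stem can be recovered from the stored name:
print(os.path.splitext(stored)[0].encode("ascii").decode("punycode"))
```

Because distinct stems encode to distinct ASCII strings, two different multibyte names no longer collapse to the same stored name.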

Changed 7 years ago by makoto tsuyuki <mtsuyuki@…>

comment:2 Changed 7 years ago by Michael Radziej <mir@…>

Why punycode? I'd think that most filesystems these days support UTF-8 (though, with different normalization, which *is* a problem).

  • Wouldn't it be better to support any arbitrary settings.FILE_SYSTEM_ENCODING?
  • What encoding does python use if you pass unicode to open()?

comment:3 in reply to: ↑ 1 Changed 7 years ago by ymasuda <ymasuda@…>

As I mentioned in django-dev: http://groups.google.com/group/django-developers/browse_thread/thread/d9d590962817fd78 ,
it would be better to simply provide a hook for customizing the filename normalizer rather than pursuing a single flawless normalization
scheme. Here I post a patch that adds a filename_normalizer option to FileField's constructor so that developers can specify their own
normalization function. The patch includes code changes in django/db/models/fields/__init__.py and docs/model-api.txt.
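The idea of a normalizer hook can be illustrated without Django. The sketch below is not the attached patch: FileFieldSketch is a hypothetical stand-in class, and only the option name filename_normalizer is taken from this comment. The keep_unicode function is one example of what a project might plug in.

```python
import re
import unicodedata


def default_normalizer(name):
    """Hypothetical default: keep ASCII letters, digits, '-', '_' and '.'."""
    return re.sub(r"[^-A-Za-z0-9_.]", "", name.strip().replace(" ", "_"))


class FileFieldSketch:
    """Stand-in for a file field; this is NOT Django's FileField."""

    def __init__(self, filename_normalizer=None):
        # Fall back to the ASCII-only behaviour when no hook is given.
        self.filename_normalizer = filename_normalizer or default_normalizer

    def stored_name(self, original):
        return self.filename_normalizer(original)


def keep_unicode(name):
    """Example project-specific hook: NFC-normalize the name and drop only
    path separators and control/format characters."""
    name = unicodedata.normalize("NFC", name)
    return "".join(
        c for c in name
        if c not in "/\\" and not unicodedata.category(c).startswith("C")
    )


print(FileFieldSketch().stored_name("ファイル.txt"))                                   # .txt
print(FileFieldSketch(filename_normalizer=keep_unicode).stored_name("ファイル.txt"))  # ファイル.txt
```

With a hook like this, each project chooses the scheme that suits its filesystem instead of the framework committing to one.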

Changed 7 years ago by ymasuda <ymasuda@…>

adds filename_normalizer to django.db.models.fields.FileField. also includes short description on model-api.txt.

comment:4 Changed 7 years ago by MichaelRadziej <mir@…>

  • Triage Stage changed from Unreviewed to Design decision needed

comment:5 Changed 7 years ago by ubernostrum

#1355 was a duplicate.

comment:6 Changed 7 years ago by Thomas Güttler <hv@…>

  • Cc hv@… added

comment:7 Changed 6 years ago by Thomas Guettler (Home)

  • Cc hv@… removed

#5361 (Pluggable backends for FileField) could solve this problem.
Until the pluggable backends are available, I use my own version which uses django.utils.http.urlquote()
for the filenames.
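A rough equivalent of this URL-quoting workaround, using only the standard library, might look like the sketch below. It assumes urllib.parse.quote as a stand-in for django.utils.http.urlquote(), and quoted_filename is a hypothetical helper name.

```python
from urllib.parse import quote, unquote


def quoted_filename(name):
    """Percent-encode every character that is not URL-unreserved.

    safe="" means '/' is encoded too, so the result cannot contain a
    path separator.
    """
    return quote(name, safe="")


stored = quoted_filename("ファイル.txt")
print(stored)
print(unquote(stored))  # ファイル.txt
```

The stored name is pure ASCII and reversible, but each character that takes three bytes in UTF-8 becomes nine characters, so long names reach a 100-character column limit much sooner.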

comment:8 Changed 6 years ago by jacob

  • Keywords fs-r added

comment:9 Changed 6 years ago by Gulopine

  • Keywords fs-rf added; fs-r removed

comment:10 Changed 6 years ago by Gulopine

  • Keywords fs-rf-docs added; fs-rf removed

comment:11 Changed 6 years ago by Gulopine

  • milestone set to 1.0 beta

comment:12 Changed 6 years ago by leahculver

  • Owner changed from nobody to leahculver
  • Status changed from new to assigned

comment:13 Changed 6 years ago by leahculver

  • Resolution set to duplicate
  • Status changed from assigned to closed

dupe of #6009

comment:14 Changed 3 years ago by jacob

  • milestone 1.0 beta deleted
