Ticket #3119 (closed: duplicate)

Opened 2 years ago

Last modified 5 months ago

Problem uploading files with non-ASCII file names.

Reported by: makoto tsuyuki <mtsuyuki@gmail.com>
Assigned to: leahculver
Milestone: 1.0 beta
Component: Database layer (models, ORM)
Version: SVN
Keywords: fs-rf-docs
Cc:
Triage Stage: Design decision needed
Has patch: 1
Needs documentation: 0
Needs tests: 0
Patch needs improvement: 0

Description

When a file name composed only of non-ASCII characters is passed to FileField or ImageField, most of the file name is lost.

As a result, the character limit of the column is exceeded after 100 or fewer such uploads.

Ex.
ééééé.txt -> .txt
ààààà.txt -> _.txt
àéàéé.txt -> __.txt

Attachments

store_filename_as_punycode_for_trunk_4462.diff (4.0 kB) - added by makoto tsuyuki <mtsuyuki@gmail.com> on 02/06/07 22:57:24.
filename_nomalizer_fix.diff (2.4 kB) - added by ymasuda <ymasuda@ethercube.com> on 02/10/07 23:56:58.
Adds filename_normalizer to django.db.models.fields.FileField. Also includes a short description in model-api.txt.

Change History

(follow-up: ↓ 3 ) 02/06/07 22:55:52 changed by makoto tsuyuki <mtsuyuki@gmail.com>

  • has_patch set to 1.

Multibyte characters in a filename are lost in get_valid_filename().

In django.db.models.fields, FileField and its subclasses call django.utils.text.get_valid_filename() to remove all "filename-unsafe" characters from the given filename.

The resulting filename consists only of ASCII letters, digits, hyphens, underscores and dots.

However, this behaviour has an undesirable effect for users in countries where multibyte filenames are common.

For example, if the original filename consists entirely of multibyte characters plus a '.txt' extension (such as 'ファイル.txt'), the resulting filename becomes '.txt' (only the extension, with no filename body).
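
The stripping described above can be reproduced with a small stand-alone sketch. The function name and the regular expression below are assumptions that approximate the described behaviour; they are not copied from the Django source.

```python
import re


def get_valid_filename_sketch(name):
    """Approximate the cleaning step described in this ticket."""
    # Spaces become underscores, then every character that is not an
    # ASCII letter, digit, hyphen, underscore or dot is dropped.
    name = name.strip().replace(" ", "_")
    return re.sub(r"[^-A-Za-z0-9_.]", "", name)


# A purely multibyte body disappears; only the extension survives.
print(get_valid_filename_sketch("ファイル.txt"))
# An ASCII name passes through almost unchanged.
print(get_valid_filename_sketch("report 2007.txt"))
```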

Underscore-suffix uniquification breaks down quickly

Things get worse when there are many such files: FileField appends underscores to the filename until it becomes unique, so uploading ['壱号文書.doc', '弐号文書.doc', '参号文書.doc', ...] produces the filename records ['.doc', '_.doc', '__.doc', ...].

Once the number of underscores approaches the maxlength of the filename field (100 or so), FileField starts raising errors because the filename exceeds the length limit.
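
A rough simulation of this collapse, assuming a 100-character limit and one underscore added per name collision. The helpers add_underscore and store_uploads are hypothetical and only model the behaviour described here, not FileField itself.

```python
def add_underscore(name):
    """Insert one underscore in front of the extension (hypothetical helper)."""
    body, dot, ext = name.rpartition(".")
    if not dot:  # no extension at all
        return name + "_"
    return body + "_" + dot + ext


def store_uploads(cleaned_names, max_length=100):
    """Simulate underscore-based uniquification of already-cleaned names."""
    taken = set()
    stored = []
    for name in cleaned_names:
        # Keep adding underscores until the name is unused.
        while name in taken:
            name = add_underscore(name)
        if len(name) > max_length:
            raise ValueError("filename exceeds %d characters" % max_length)
        taken.add(name)
        stored.append(name)
    return stored


# Three uploads whose bodies were stripped down to '.doc'.
print(store_uploads([".doc"] * 3))
```

Under these assumptions, 97 such uploads still fit, and the 98th raises an error because its name would need 101 characters.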

Proposed solution: convert the filename to Punycode before calling django.utils.text.get_valid_filename.

Add STORE_FILENAME_AS_PUNYCODE to global_settings, set to False by default.

If STORE_FILENAME_AS_PUNYCODE is True, encode the given filename, except for its extension, in Punycode.

Then generate a clean filename with get_valid_filename and return it.
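
The proposed steps could look roughly like the sketch below. It is not the attached patch: punycode_filename is a hypothetical name, the cleaning regex is an approximation, and the setting is switched on here only for demonstration. It relies on Python's built-in "punycode" codec.

```python
import re

# Setting name taken from the proposal; the proposal defaults it to False.
STORE_FILENAME_AS_PUNYCODE = True


def punycode_filename(name):
    """Encode the body in Punycode, keep the extension, then clean the result."""
    body, dot, ext = name.rpartition(".")
    if not dot:  # no extension: the whole name is the body
        body, ext = name, ""
    if STORE_FILENAME_AS_PUNYCODE:
        # Punycode output is plain ASCII (letters, digits and hyphens).
        body = body.encode("punycode").decode("ascii")
    cleaned = (body + dot + ext).strip().replace(" ", "_")
    return re.sub(r"[^-A-Za-z0-9_.]", "", cleaned)


for original in ["壱号文書.doc", "弐号文書.doc", "参号文書.doc"]:
    print(original, "->", punycode_filename(original))
```

Because the encoded body is reversible, the original name can be recovered from the stored one, and distinct originals no longer collapse to the same '.doc'.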

02/06/07 22:57:24 changed by makoto tsuyuki <mtsuyuki@gmail.com>

  • attachment store_filename_as_punycode_for_trunk_4462.diff added.

02/06/07 23:19:16 changed by Michael Radziej <mir@noris.de>

Why punycode? I'd think that most filesystems these days support UTF-8 (though, with different normalization, which *is* a problem).

  • Wouldn't it be better to support an arbitrary settings.FILE_SYSTEM_ENCODING?
  • What encoding does Python use if you pass a unicode string to open()?

(in reply to: ↑ 1 ) 02/10/07 23:54:35 changed by ymasuda <ymasuda@ethercube.com>

As I mentioned on django-dev ( http://groups.google.com/group/django-developers/browse_thread/thread/d9d590962817fd78 ), it would be better to just provide a hook for customizing the filename normalizer rather than pursuing a single flawless normalization scheme. Here is a patch that adds a filename_normalizer option to FileField's constructor so that developers can specify their own normalization function. The patch includes code changes in django/db/models/fields/__init__.py and docs/model-api.txt.
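
To illustrate the idea of such a hook, here is a minimal sketch. UploadFieldSketch is a hypothetical stand-in, not Django's FileField, and the exact signature used by the attached patch is not reproduced here; only the option name filename_normalizer comes from this ticket.

```python
import hashlib


def identity_normalizer(name):
    """Default behaviour in this sketch: leave the name untouched."""
    return name


def hashing_normalizer(name):
    """Example custom normalizer: replace a non-ASCII body with a short hash."""
    body, dot, ext = name.rpartition(".")
    if not dot:  # no extension: the whole name is the body
        body, ext = name, ""
    if all(ord(ch) < 128 for ch in body):
        return name
    digest = hashlib.sha1(body.encode("utf-8")).hexdigest()[:16]
    return digest + dot + ext


class UploadFieldSketch:
    """Hypothetical field that delegates normalization to a callable."""

    def __init__(self, filename_normalizer=identity_normalizer):
        self.filename_normalizer = filename_normalizer

    def get_filename(self, raw_name):
        return self.filename_normalizer(raw_name)


field = UploadFieldSketch(filename_normalizer=hashing_normalizer)
print(field.get_filename("壱号文書.doc"))
print(field.get_filename("plain.doc"))
```

The point of the design is that each project picks its own scheme (hashing, transliteration, Punycode) without the framework having to choose one.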

02/10/07 23:56:58 changed by ymasuda <ymasuda@ethercube.com>

  • attachment filename_nomalizer_fix.diff added.

Adds filename_normalizer to django.db.models.fields.FileField. Also includes a short description in model-api.txt.

04/24/07 06:26:12 changed by MichaelRadziej <mir@noris.net>

  • stage changed from Unreviewed to Design decision needed.

09/16/07 10:22:46 changed by ubernostrum

#1355 was a duplicate.

10/10/07 08:17:33 changed by Thomas Güttler <hv@tbz-pariv.de>

  • cc set to hv@tbz-pariv.de.

10/28/07 15:39:54 changed by Thomas Guettler (Home)

  • cc deleted.

#5361 (Pluggable backends for FileField) could solve this problem. Until the pluggable backends are available, I use my own version which uses django.utils.http.urlquote() for the filenames.
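
A rough idea of what percent-quoting a filename looks like, using the standard library's urllib.parse.quote as a stand-in for django.utils.http.urlquote(). The helper name quote_filename is an assumption, and this is not the commenter's actual code.

```python
from urllib.parse import quote, unquote


def quote_filename(name):
    """Percent-encode every character outside the unreserved ASCII set."""
    # safe="" also quotes '/', so the result cannot contain a path separator.
    return quote(name, safe="")


quoted = quote_filename("ファイル.txt")
print(quoted)
# The mapping is reversible, so the original name is not lost.
print(unquote(quoted))
```

Each non-ASCII character expands to several '%XX' escapes, so quoted names grow quickly and can still run into the length limit of the filename column.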

12/01/07 07:58:14 changed by jacob

  • keywords set to fs-r.

12/11/07 13:26:23 changed by Gulopine

  • keywords changed from fs-r to fs-rf.

12/16/07 17:29:25 changed by Gulopine

  • keywords changed from fs-rf to fs-rf-docs.

06/16/08 13:28:48 changed by Gulopine

  • milestone set to 1.0 beta.

07/18/08 11:42:58 changed by leahculver

  • owner changed from nobody to leahculver.
  • status changed from new to assigned.

07/18/08 13:25:38 changed by leahculver

  • status changed from assigned to closed.
  • resolution set to duplicate.

dupe of #6009

