Django

Ticket #6009 (new)

Opened 6 months ago

Last modified 1 month ago

UnicodeDecodeError when uploading file with non-english filename.

Reported by: bear330
Assigned to: nobody
Component: Internationalization
Version: SVN
Keywords: files, unicode, FileBackend fs-rf-docs
Cc:
Triage Stage: Accepted
Has patch: 1
Needs documentation: 0
Needs tests: 0
Patch needs improvement: 0

Description

When I upload a file using newforms, I get an UploadedFile object that contains the filename and content of the uploaded file.

If I upload a file with an English filename ('abcd.jpg'), everything works. But if the filename is not English (for example '中文.jpg'), then when I assign the UploadedFile object's filename to an ImageField or FileField in a model and save it, I get a UnicodeDecodeError.

This is because django.http.parse_file_upload treats the filename as a 'str' object, not a 'unicode' object.

I have to do this manually to avoid the bug:

filename = uploadedFileObj.filename.decode('utf8')
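The workaround above can be sketched on a current Python 3 interpreter, where the split between bytes and text is explicit. This is only an illustration under stated assumptions: `raw_filename` is a hypothetical stand-in for the byte string the upload parser hands over, and the explicit ASCII decode mimics the implicit conversion that Python 2 performed when a byte string was mixed with unicode.

```python
# Hypothetical stand-in for the bytes of a filename submitted from a UTF-8 page.
raw_filename = "中文.jpg".encode("utf-8")

# Decoding as ASCII (what Python 2 did implicitly) fails with the
# error named in the ticket title.
try:
    raw_filename.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# Decoding with the encoding the browser actually used yields proper text.
filename = raw_filename.decode("utf-8")
print(ascii_failed, filename)
```

The sketch assumes the form was submitted as UTF-8; a page served in another charset would need that charset instead.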

After that, the UnicodeDecodeError no longer occurs, but the FileField's value in the database is '.jpg'.

Oh, terrible! That is because django.utils.text.get_valid_filename does this:

re.sub(r'[^-A-Za-z0-9_.]', '', s)

This works for English filenames, but not for other languages. After the re.sub, '中文.jpg' (u'\u4e2d\u6587.jpg') becomes u'.jpg'.
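The stripping described above can be reproduced with a short sketch. The first function uses a negated ASCII character class like the one in get_valid_filename; the second is a hypothetical alternative (not Django's code, and both function names are made up for illustration) that keeps any character `\w` matches under Unicode rules.

```python
import re

def ascii_only_filename(s):
    # Drop every character outside the listed ASCII set.
    return re.sub(r"[^-A-Za-z0-9_.]", "", s)

def unicode_aware_filename(s):
    # Hypothetical alternative: with re.UNICODE, \w also matches
    # non-ASCII letters and digits, so they survive the substitution.
    return re.sub(r"[^-\w.]", "", s, flags=re.UNICODE)

name = "\u4e2d\u6587.jpg"  # '中文.jpg'
print(repr(ascii_only_filename(name)))
print(repr(unicode_aware_filename(name)))
```

The Unicode-aware variant still removes spaces and path separators, so it keeps the sanitizing intent while preserving non-English letters. Whether the underlying filesystem accepts such names is a separate question.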

For me, this is a very serious problem. For now, I can work around it by calling decode('utf8') and overriding get_valid_filename manually, but I hope this bug will be fixed in Django officially.

Thanks for your effort. :)

Attachments

patch.diff (1.6 kB) - added by sema on 03/17/08 17:26:12.

Change History

12/01/07 22:12:58 changed by Simon G <dev@simon.net.nz>

  • needs_better_patch changed.
  • stage changed from Unreviewed to Accepted.
  • summary changed from Error while upload file with non-english filename. to UnicodeDecodeError when uploading file with non-english filename..
  • needs_tests changed.
  • needs_docs changed.

(follow-up: ↓ 3 ) 02/26/08 01:42:01 changed by lukestebbing

I tried uploading 中文.jpg using Safari and Firefox on the Mac:

Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-us) AppleWebKit/523.15.1 (KHTML, like Gecko) Version/3.0.4 Safari/523.15
Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9b3) Gecko/2008020511 Firefox/3.0b3

Here's the associated content disposition header that showed up both times in the raw post data:

Content-Disposition: form-data; name="file"; filename="&#20013;&#25991;.jpg"

This interacts with a three year old bug in cgi.parse_header, so Django sees the filename as &#20013.

These two user agents seem to encode filenames in Latin-1, and if a character doesn't fit in that charset, it's encoded as an HTML character entity. Allowing all of Latin-1 looks like a violation of RFC 2388/5.4 to me, but I suppose it's due to a trend set by some old browser.
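To illustrate how a header parser can end up with &#20013 as the filename, here is a rough sketch. It is not the code of cgi.parse_header; both functions are hypothetical and only contrast splitting on every semicolon with splitting on semicolons that fall outside double quotes.

```python
def _to_params(parts):
    # Turn 'key="value"' fragments into a dict, skipping the leading disposition type.
    params = {}
    for part in parts[1:]:
        if "=" in part:
            key, value = part.split("=", 1)
            params[key.strip()] = value.strip().strip('"')
    return params

def naive_params(header):
    # Splits on every ';', even inside a quoted value.
    return _to_params(header.split(";"))

def quote_aware_params(header):
    # Treats ';' as a separator only when it is outside double quotes.
    parts, current, in_quotes = [], [], False
    for ch in header:
        if ch == '"':
            in_quotes = not in_quotes
        if ch == ";" and not in_quotes:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return _to_params(parts)

header = 'form-data; name="file"; filename="&#20013;&#25991;.jpg"'
print(naive_params(header)["filename"])
print(quote_aware_params(header)["filename"])
```

The naive version cuts the filename at the first semicolon of the first character entity, which matches the truncated value reported above.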

(in reply to: ↑ 2 ) 02/26/08 14:14:11 changed by lukestebbing

Oops, I was missing a <meta http-equiv> directive. I see what the reporter sees when I use a unicode charset. Disregard my comment.

03/17/08 12:36:48 changed by sema

  • owner changed from nobody to sema.

03/17/08 17:25:34 changed by sema

  • has_patch set to 1.

This ticket touches two problems with non-latin filenames.

  1. Models throw an exception. This is a result of the UploadedFile being given a normal string by Form and serving it as unicode, disregarding the fact that the data given in request.FILES can contain non-latin characters.
  2. The model tries to sanitize the filename, stripping all non-latin characters. This is a problem if the filename only contains non-latin characters.

I have attached a patch and unittests showing and solving problem number one.

03/17/08 17:26:12 changed by sema

  • attachment patch.diff added.

03/19/08 11:18:08 changed by sema

  • owner changed from sema to nobody.

03/25/08 22:47:45 changed by axiak

  • keywords set to files, unicode, FileBackend.

As sema said above, there are actually two distinct issues going on here:

  1. The filename is not being correctly parsed into its correct form.
  2. The filename is being rejected and/or mangled when the file is being saved.

The first problem is easy (that is, it can be solved without providing hooks). My latest patch in #2070 solves this (I think) correctly, allowing things like:

Content-Disposition: form-data; name="file"; filename="&#20013;&#25991;.jpg"

to be parsed correctly. I have added a test to that patch which will hopefully prevent that feature from breaking.
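As a rough illustration of what parsing such a filename correctly could mean, the sketch below decodes the numeric character references with the Python 3 standard library. It is not the patch from #2070, and the function name is hypothetical.

```python
import html

def decode_entity_filename(raw):
    # html.unescape converts numeric character references such as
    # &#20013; into the characters they stand for.
    return html.unescape(raw)

decoded = decode_entity_filename("&#20013;&#25991;.jpg")
print(decoded)
```

Note that html.unescape also converts named references such as &amp;, so a filename that legitimately contains that literal text would be altered as well.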

I think the second problem really doesn't belong in #2070 for a few reasons. The main issue is that most people will probably want the current behavior, and it's pretty special-case behavior to say "I want the filesystem to accept non-ascii file names". That being said, I think it should be easy to alter this behavior, and I think that's what the FileField backends over at #5361 will help with. (I've spoken with Gul on IRC and he seems to acknowledge that it's easy to provide the hooks in #5361.)

04/02/08 21:28:23 changed by Gulopine

  • keywords changed from files, unicode, FileBackend to files, unicode, FileBackend fs-rf-docs.
