Opened 17 years ago

Closed 16 years ago

Last modified 13 years ago

#6009 closed (fixed)

UnicodeDecodeError when uploading file with non-english filename.

Reported by: bear330
Owned by: Leah Culver
Component: Internationalization
Version: dev
Severity:
Keywords: files, unicode, FileBackend fs-rf
Cc:
Triage Stage: Ready for checkin
Has patch: yes
Needs documentation: no
Needs tests: no
Patch needs improvement: no
Easy pickings: no
UI/UX: no

Description

When I upload a file using newforms, I get an UploadedFile object that contains the filename and content of the uploaded file.

If I upload a file with an English filename ('abcd.jpg'), everything works.
But if the filename is not English (for example '中文.jpg'), then when I assign the UploadedFile object's filename to an ImageField or FileField in a model and save it, I get a UnicodeDecodeError.

This is because django.http.parse_file_upload treats the filename as a 'str' object, not a 'unicode' object.

I have to do this manually to work around the bug:

filename = uploadedFileObj.filename.decode('utf8')

After that, the UnicodeDecodeError no longer occurs, but the FileField's value in the database becomes '.jpg'.

Oh, terrible! That is because django.utils.text.get_valid_filename does this:

re.sub(r'[^-A-Za-z0-9_.]', '', s)

This works for English filenames, but not for other languages.
After the re.sub, '中文.jpg' (u'\u4e2d\u6587.jpg') becomes u'.jpg'.
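
The two steps described above can be reproduced in isolation. This is a minimal sketch in modern Python, not Django's actual code: `ascii_only_filename` is a hypothetical stand-in for the substitution quoted from get_valid_filename, and the upload is assumed to arrive as UTF-8 bytes.

```python
import re

# Bytes as they might arrive from a UTF-8 encoded upload (an assumption).
raw_filename = "中文.jpg".encode("utf-8")

# Step 1: decode the byte string to text, as in the manual workaround.
filename = raw_filename.decode("utf-8")


def ascii_only_filename(s):
    """Hypothetical stand-in: drop everything except ASCII letters,
    digits, '-', '_' and '.'."""
    return re.sub(r"[^-A-Za-z0-9_.]", "", s)


# Step 2: the ASCII-only character class removes both CJK characters.
stripped = ascii_only_filename(filename)
ascii_kept = ascii_only_filename("abcd.jpg")

print(repr(stripped))    # only the extension survives
print(repr(ascii_kept))  # an ASCII name passes through unchanged
```

Decoding alone avoids the exception, but the name is still lost in the second step, which is why both parts need attention.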

For me, this is a very serious problem.
For now, I can work around it by calling decode('utf8') and overriding get_valid_filename manually.
But I hope this bug will be fixed officially in Django.

Thanks for your effort. :)

Attachments (2)

patch.diff (1.6 KB) - added by Casper Jensen 17 years ago.
patch-6009-1.diff (3.0 KB) - added by Leah Culver 16 years ago.
added tests for unicode filenames in forms and model field

Change History (21)

comment:1 by Simon G <dev@…>, 17 years ago

Summary: Error while upload file with non-english filename. → UnicodeDecodeError when uploading file with non-english filename.
Triage Stage: Unreviewed → Accepted

comment:2 by lukestebbing, 17 years ago

I tried uploading 中文.jpg using Safari and Firefox on the Mac:

Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-us) AppleWebKit/523.15.1 (KHTML, like Gecko) Version/3.0.4 Safari/523.15
Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9b3) Gecko/2008020511 Firefox/3.0b3

Here's the associated content disposition header that showed up both times in the raw post data:

Content-Disposition: form-data; name="file"; filename="&#20013;&#25991;.jpg"

This interacts with a three year old bug in cgi.parse_header, so Django sees the filename as &#20013.

These two user agents seem to encode filenames in Latin-1, and if a character doesn't fit in that charset, it's encoded as an HTML character entity. Allowing all of Latin-1 looks like a violation of RFC 2388/5.4 to me, but I suppose it's due to a trend set by some old browser.
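
The behaviour described here can be imitated with a short sketch. It assumes the browser's encoding step is equivalent to Python's `xmlcharrefreplace` error handler, and `naive_params` is a hypothetical parser that splits on every semicolon; it is not the real cgi.parse_header.

```python
filename = "中文.jpg"

# Characters that do not fit in Latin-1 become numeric character references.
encoded = filename.encode("latin-1", errors="xmlcharrefreplace")

header_value = 'form-data; name="file"; filename="%s"' % encoded.decode("latin-1")


def naive_params(value):
    """Hypothetical parser that splits on every ';', including those
    inside quoted values."""
    params = {}
    for part in value.split(";")[1:]:
        if "=" in part:
            key, _, val = part.strip().partition("=")
            params[key] = val.strip('"')
    return params


parsed = naive_params(header_value)
print(encoded)             # the entity-encoded filename bytes
print(parsed["filename"])  # truncated at the first ';' inside the quotes
```

Because each character reference ends in a semicolon, a parser that ignores quoting cuts the filename off after the first reference.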

in reply to:  2 comment:3 by lukestebbing, 17 years ago

Oops, I was missing a <meta http-equiv> directive. I see what the reporter sees when I use a unicode charset. Disregard my comment.

comment:4 by Casper Jensen, 17 years ago

Owner: changed from nobody to Casper Jensen

comment:5 by Casper Jensen, 17 years ago

Has patch: set

This ticket touches two problems with non-latin filenames.

  1. Models throw an exception. This is a result of the UploadedFile being given a normal string by Form and serving it as unicode, disregarding the fact that the data given in request.FILES can contain non-latin characters.
  2. The model tries to sanitize the filename, stripping all non-latin characters. This is a problem if the filename only contains non-latin characters.

I have attached a patch and unittests showing and solving problem number one.
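
For problem number two, one possible direction is a sanitizer that keeps word characters from any script. This is only a sketch in modern Python: `unicode_valid_filename` is a hypothetical name, and it is not part of the attached patch.

```python
import re


def unicode_valid_filename(s):
    """Hypothetical variant: keep word characters in any script, plus
    '-' and '.'; drop everything else (spaces, slashes, punctuation)."""
    # For str patterns in Python 3, \w matches Unicode letters and digits
    # as well as the underscore.
    return re.sub(r"[^-\w.]", "", s)


print(unicode_valid_filename("中文.jpg"))
print(unicode_valid_filename("my photo?.jpg"))
print(unicode_valid_filename("../etc/passwd"))
```

Path separators are still removed, so the result remains a single path component, but whether non-ASCII names are acceptable depends on the filesystem and locale in use.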

by Casper Jensen, 17 years ago

Attachment: patch.diff added

comment:6 by Casper Jensen, 17 years ago

Owner: changed from Casper Jensen to nobody

comment:7 by Michael Axiak, 17 years ago

Keywords: files unicode FileBackend added

As sema said above, there are actually two distinct issues going on here:

  1. The filename is not being correctly parsed into its correct form.
  2. The filename is being rejected and/or mangled when the file is being saved.

The first problem is easy (that is, it can be solved without providing hooks). My latest patch in #2070 solves this (I think) correctly, allowing things like:

Content-Disposition: form-data; name="file"; filename="&#20013;&#25991;.jpg"

to be parsed correctly. I have added a test to that patch which will hopefully prevent that feature from breaking.
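
As a rough illustration of what "parsed correctly" could mean for such a value, the numeric character references can be turned back into text with the standard library. This is an assumption about the desired result, not a description of what the patch in #2070 actually does.

```python
import html

# Filename value as it appears in the header above.
raw = "&#20013;&#25991;.jpg"

# html.unescape converts numeric (and named) character references to text.
decoded = html.unescape(raw)
print(decoded)
```

Note that this would also rewrite a filename that legitimately contains text such as `&amp;`, so a real implementation would need to decide when unescaping is appropriate.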

I think the second problem really doesn't belong in #2070 for a few reasons. The main issue is that most people will probably want the current behavior, and it's pretty special-case behavior to say I want the filesystem to accept non-ASCII file names. That being said, I think it should be easy to alter this behavior, and I think that's what the FileField backends over at #5361 will help with. (I've spoken with Gul on IRC and he seems to acknowledge that it's easy to provide the hooks in #5361.)

comment:8 by Marty Alchin, 17 years ago

Keywords: fs-rf-docs added

comment:9 by Marc Garcia, 17 years ago

If I'm not wrong, this ticket is a duplicate of #3119 and should be closed as duplicated.

comment:10 by anonymous, 17 years ago

I have tried this patch, but it doesn't help with uploading a unicode filename.
I have also tried #3119's patch, which turned the unicode filename into an unrecognizable one.
Finally, I changed the line "return re.sub(r'[^-A-Za-z0-9_.]', '', s)" in get_valid_filename(s) to "return s".
That works now, but I don't know if it is harmful.

comment:11 by Marty Alchin, 17 years ago

milestone: 1.0 beta

comment:12 by Leah Culver, 16 years ago

Owner: changed from nobody to Leah Culver
Status: new → assigned

by Leah Culver, 16 years ago

Attachment: patch-6009-1.diff added

added tests for unicode filenames in forms and model field

comment:13 by Leah Culver, 16 years ago

Triage Stage: Accepted → Ready for checkin

comment:14 by Marty Alchin, 16 years ago

Keywords: fs-rf added; fs-rf-docs removed

I had originally planned to simply document how this behavior could be supported after #5361 lands, without actually implementing it in core, but with all the tests supplied, I'll go ahead and wrap these things up into my next patch to be submitted tonight. I'm changing the tag back to fs-rf for now, and I'll update it to fs-rf-fixed once I get it integrated into the patch.

comment:15 by Malcolm Tredinnick, 16 years ago

@Gulopine: this particular issue should never be a case of "could be supported". It should Just Work(tm). Django supports unicode always. Leah's tests show that it now does work in trunk and that should hopefully not change with your patch, either.

comment:16 by Marty Alchin, 16 years ago

@mtredinnick: I'll admit, I took the easy way out, acknowledging that a fix was easy, while I waited for someone else to come along and verify how it *should* be done, and that there weren't any unfortunate consequences, before I included it in my own patch. I had always wanted it to be included, I just didn't think I was the best person to verify that the patches covered all the bases. Now that it's been fully fleshed out, I'll be adding it to the next patch for #5361 first thing tonight.

comment:17 by Malcolm Tredinnick, 16 years ago

Resolution: fixed
Status: assigned → closed

(In [7987]) Fixed #6009 -- Added regression tests to show that uploading non-ASCII
filenames now works properly. Patch from Leah Culver.

comment:18 by Antti Kaihola, 16 years ago

Just in case someone else lands here wondering why they still get this error:

Debian runs Apache with the LANG=C locale by default, which breaks uploading files with special characters in their names at least when running with mod_wsgi. Activating a UTF-8 locale in /etc/apache2/envvars should resolve the issue.
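
To check whether a deployment is affected, it may help to log what encodings the running Python process actually sees. The sketch below uses only the standard library; `describe_encodings` and `can_encode_filename` are hypothetical helper names, and recent Python versions may pick UTF-8 even under LANG=C, so the output should be read as a hint rather than a diagnosis.

```python
import locale
import os
import sys


def describe_encodings():
    """Collect the locale-related settings visible to this process."""
    return {
        "LANG": os.environ.get("LANG"),
        "LC_ALL": os.environ.get("LC_ALL"),
        "filesystem_encoding": sys.getfilesystemencoding(),
        "preferred_encoding": locale.getpreferredencoding(False),
    }


def can_encode_filename(name):
    """Return True if `name` can be encoded with the filesystem encoding."""
    try:
        name.encode(sys.getfilesystemencoding())
    except UnicodeEncodeError:
        return False
    return True


info = describe_encodings()
for key, value in sorted(info.items()):
    print(key, "=", value)
print("can encode 中文.jpg:", can_encode_filename("中文.jpg"))
```

Running this from inside the WSGI application, rather than from an interactive shell, shows the environment that Apache actually passes to the process.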

comment:19 by Jacob, 13 years ago

milestone: 1.0 beta

Milestone 1.0 beta deleted