
Opened 6 years ago

Closed 6 years ago

Last modified 3 years ago

#6009 closed (fixed)

UnicodeDecodeError when uploading file with non-english filename.

Reported by: bear330
Owned by: leahculver
Component: Internationalization
Version: master
Severity:
Keywords: files, unicode, FileBackend fs-rf
Cc:
Triage Stage: Ready for checkin
Has patch: yes
Needs documentation: no
Needs tests: no
Patch needs improvement: no
Easy pickings:
UI/UX:

Description

When I upload a file using newforms, I get an UploadedFile object that contains the filename and content of the uploaded file.

If the file has an English file name ('abcd.jpg'), everything works.
But if it does not (for example, '中文.jpg'), then when I assign the UploadedFile object's filename to an ImageField or FileField in a model and save it, I get a UnicodeDecodeError.

This is because django.http.parse_file_upload treats the filename as a 'str' object, not a 'unicode' object.
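The failure can be reproduced outside Django. The following is a minimal sketch in Python 3; the ticket itself concerns Python 2, where the ASCII decode happened implicitly whenever a byte string was combined with unicode. The variable names are illustrative and are not taken from Django.

```python
# The UTF-8 bytes of the filename, as they would arrive in the request body.
raw_filename = '\u4e2d\u6587.jpg'.encode('utf-8')  # '中文.jpg'

# Python 2 decoded byte strings as ASCII when mixing them with unicode.
# Doing the same decode explicitly shows the error from the report.
error = None
try:
    raw_filename.decode('ascii')
except UnicodeDecodeError as exc:
    error = exc

# Decoding with the encoding the browser actually used succeeds.
decoded = raw_filename.decode('utf-8')
```

The explicit UTF-8 decode in the last line corresponds to the manual workaround the reporter describes.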

I must do this manually to avoid this bug:

filename = uploadedFileObj.filename.decode('utf8')

After that, the UnicodeDecodeError no longer occurs, but the FileField's value in the database will be '.jpg'.

Oh, terrible! That is because django.utils.text.get_valid_filename does this:

re.sub(r'[^-A-Za-z0-9_.]', '', s)

This works for English file names, but not for other languages.
After the re.sub, '中文.jpg' (u'\u4e2d\u6587.jpg') becomes u'.jpg'.
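The stripping can be checked in isolation. This is a sketch in Python 3, assuming the pattern is the negated character class that the described behaviour implies (everything outside `-A-Za-z0-9_.` is removed). Both function names are hypothetical, and the Unicode-aware variant is one possible alternative, not the fix that was committed for this ticket.

```python
import re

def ascii_only_filename(name):
    # Drop every character that is not an ASCII letter, digit, '-', '_' or '.'.
    return re.sub(r'[^-A-Za-z0-9_.]', '', name)

def unicode_aware_filename(name):
    # Hypothetical alternative: for str patterns in Python 3, \w also matches
    # non-ASCII word characters such as CJK ideographs.
    return re.sub(r'[^-\w.]', '', name)

english = ascii_only_filename('abcd.jpg')               # 'abcd.jpg'
stripped = ascii_only_filename('\u4e2d\u6587.jpg')      # '.jpg'
preserved = unicode_aware_filename('\u4e2d\u6587.jpg')  # '中文.jpg'
```

The Unicode-aware class still removes path separators and whitespace, so it keeps the sanitizing intent while leaving non-ASCII letters in place.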

For me, this is a very serious problem.
For now, I can work around it by calling decode('utf8') and overriding get_valid_filename manually.
But I hope this bug will be fixed in Django officially.

Thanks for your effort. :)

Attachments (2)

patch.diff (1.6 KB) - added by sema 6 years ago.
patch-6009-1.diff (3.0 KB) - added by leahculver 6 years ago.
added tests for unicode filenames in forms and model field


Change History (21)

comment:1 Changed 6 years ago by Simon G <dev@…>

  • Needs documentation unset
  • Needs tests unset
  • Patch needs improvement unset
  • Summary changed from Error while upload file with non-english filename. to UnicodeDecodeError when uploading file with non-english filename.
  • Triage Stage changed from Unreviewed to Accepted

comment:2 follow-up: Changed 6 years ago by lukestebbing

I tried uploading 中文.jpg using Safari and Firefox on the Mac:

Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-us) AppleWebKit/523.15.1 (KHTML, like Gecko) Version/3.0.4 Safari/523.15
Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9b3) Gecko/2008020511 Firefox/3.0b3

Here's the associated content disposition header that showed up both times in the raw post data:

Content-Disposition: form-data; name="file"; filename="&#20013;&#25991;.jpg"

This interacts with a three-year-old bug in cgi.parse_header, so Django sees the filename as &#20013.

These two user agents seem to encode filenames in Latin-1, and if a character doesn't fit in that charset, it's encoded as an HTML character entity. Allowing all of Latin-1 looks like a violation of RFC 2388/5.4 to me, but I suppose it's due to a trend set by some old browser.
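The truncation can be illustrated with a deliberately naive parser. This is not the code of cgi.parse_header. It is a hypothetical sketch (the name `naive_params` is made up) that splits on every semicolon, which is enough to reproduce the symptom.

```python
HEADER = 'form-data; name="file"; filename="&#20013;&#25991;.jpg"'

def naive_params(value):
    # Splitting on every ';' ignores quoting, so the quoted filename is cut
    # at the semicolon that terminates the first character entity.
    params = {}
    for part in value.split(';')[1:]:
        if '=' in part:
            key, _, val = part.strip().partition('=')
            params[key] = val.strip('"')
    return params

parsed = naive_params(HEADER)
# parsed['filename'] is '&#20013': everything after the first entity is lost.
```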

comment:3 in reply to: ↑ 2 Changed 6 years ago by lukestebbing

Oops, I was missing a <meta http-equiv> directive. I see what the reporter sees when I use a unicode charset. Disregard my comment.

comment:4 Changed 6 years ago by sema

  • Owner changed from nobody to sema

comment:5 Changed 6 years ago by sema

  • Has patch set

This ticket touches two problems with non-latin filenames.

  1. Models throw an exception. This happens because Form gives UploadedFile a plain byte string, which UploadedFile then serves as unicode, disregarding the fact that the data in request.FILES can contain non-latin characters.
  2. The model tries to sanitize the filename by stripping all non-latin characters. This is a problem if the filename contains only non-latin characters.

I have attached a patch and unittests showing and solving problem number one.

Changed 6 years ago by sema

comment:6 Changed 6 years ago by sema

  • Owner changed from sema to nobody

comment:7 Changed 6 years ago by axiak

  • Keywords files, unicode, FileBackend added

As sema said above, there are actually two distinct issues going on here:

  1. The filename is not being correctly parsed into its correct form.
  2. The filename is being rejected and/or mangled when the file is being saved.

The first problem is easy (that is, it can be solved without providing hooks). My latest patch in #2070 solves this (I think) correctly, allowing things like:

Content-Disposition: form-data; name="file"; filename="&#20013;&#25991;.jpg"

to be parsed correctly. I have added a test to that patch which will hopefully prevent that feature from breaking.
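One way such a header could be parsed is sketched below. This is not the patch from #2070. The helper name `quote_aware_params` is hypothetical, and unescaping the numeric character references with `html.unescape` is an assumption based on the browser behaviour described in comment 2.

```python
import html
import re

HEADER = 'form-data; name="file"; filename="&#20013;&#25991;.jpg"'

# Matches ;key="quoted value" or ;key=token, so a ';' inside quotes is kept.
PARAM_RE = re.compile(r';\s*([^=;\s]+)=(?:"([^"]*)"|([^;]*))')

def quote_aware_params(value):
    params = {}
    for key, quoted, token in PARAM_RE.findall(value):
        params[key] = quoted if quoted else token.strip()
    return params

params = quote_aware_params(HEADER)
raw_name = params['filename']    # '&#20013;&#25991;.jpg'
filename = html.unescape(raw_name)  # '中文.jpg'
```

The sketch does not handle backslash-escaped quotes inside a quoted value; a real parser would need to.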

I think the second problem really doesn't belong in #2070 for a few reasons. The main issue is that most people will probably want the current behavior, and it's pretty special-case behavior to say I want the filesystem to accept non-ascii file names. That being said, I think it should be easy to alter this behavior, and I think that's what the FileField backends over at #5361 will help with. (I've spoken with Gul on IRC and he seems to acknowledge that it's easy to provide the hooks in #5361.)

comment:8 Changed 6 years ago by Gulopine

  • Keywords fs-rf-docs added

comment:9 Changed 6 years ago by garcia_marc

If I'm not wrong, this ticket is a duplicate of #3119 and should be closed as a duplicate.

comment:10 Changed 6 years ago by anonymous

I have tried this patch, but it does not help when uploading a file with a unicode filename.
I have also tried #3119's patch, which turned the unicode filename into an unrecognizable one.
Finally, I changed the line "return re.sub(r'[^-A-Za-z0-9_.]', '', s)" in get_valid_filename(s) to "return s".
That works now, but I don't know whether it is harmful.

comment:11 Changed 6 years ago by Gulopine

  • milestone set to 1.0 beta

comment:12 Changed 6 years ago by leahculver

  • Owner changed from nobody to leahculver
  • Status changed from new to assigned

Changed 6 years ago by leahculver

added tests for unicode filenames in forms and model field

comment:13 Changed 6 years ago by leahculver

  • Triage Stage changed from Accepted to Ready for checkin

comment:14 Changed 6 years ago by Gulopine

  • Keywords fs-rf added; fs-rf-docs removed

I had originally planned to simply document how this behavior could be supported after #5361 lands, without actually implementing it in core, but with all the tests supplied, I'll go ahead and wrap these things up into my next patch to be submitted tonight. I'm changing the tag back to fs-rf for now, and I'll update it to fs-rf-fixed once I get it integrated into the patch.

comment:15 Changed 6 years ago by mtredinnick

@Gulopine: this particular issue should never be a case of "could be supported". It should Just Work(tm). Django supports unicode always. Leah's tests show that it now does work in trunk and that should hopefully not change with your patch, either.

comment:16 Changed 6 years ago by Gulopine

@mtredinnick: I'll admit, I took the easy way out, acknowledging that a fix was easy, while I waited for someone else to come along and verify how it *should* be done, and that there weren't any unfortunate consequences, before I included it in my own patch. I had always wanted it to be included, I just didn't think I was the best person to verify that the patches covered all the bases. Now that it's been fully fleshed out, I'll be adding it to the next patch for #5361 first thing tonight.

comment:17 Changed 6 years ago by mtredinnick

  • Resolution set to fixed
  • Status changed from assigned to closed

(In [7987]) Fixed #6009 -- Added regression tests to show that uploading non-ASCII
filenames now works properly. Patch from Leah Culver.

comment:18 Changed 5 years ago by akaihola

Just in case someone else lands here wondering why they still get this error:

Debian runs Apache with the LANG=C locale by default, which breaks uploading files with special characters in their names at least when running with mod_wsgi. Activating a UTF-8 locale in /etc/apache2/envvars should resolve the issue.
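To see which encodings the Python process actually picks up, a small diagnostic along these lines can be run under the same environment as the server. This is a sketch: the function name is made up, and the values reported depend on the Python version, the platform and the locale settings.

```python
import locale
import sys

def describe_encodings():
    # Both values are derived from the process environment (LANG, LC_ALL, ...).
    return {
        'filesystem': sys.getfilesystemencoding(),
        'preferred': locale.getpreferredencoding(False),
    }

encodings = describe_encodings()
```

If the filesystem encoding reported is an ASCII variant rather than UTF-8, file names with non-ASCII characters cannot be encoded for the filesystem, which matches the symptom described above.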

comment:19 Changed 3 years ago by jacob

  • milestone 1.0 beta deleted

Milestone 1.0 beta deleted
