Opened 7 years ago

Closed 6 years ago

Last modified 4 years ago

#10254 closed (fixed)

The use of 'FileField' Upload file missing Chinese characters (non-ascii)

Reported by: isoemail@… Owned by: kmtracey
Component: File uploads/storage Version: 1.0
Severity: Keywords: FileField non-ascii
Cc: Triage Stage: Accepted
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: UI/UX:

Description

file : /home/jane/铅笔画.xcf

use of 'FileField' Upload file,

get "Image/.xcf" lost (non-ascii)characters part

Attachments (1)

unicodefn.diff (505 bytes) - added by kmtracey 6 years ago.

Download all attachments as: .zip

Change History (10)

comment:1 Changed 7 years ago by kmtracey

  • Needs documentation unset
  • Needs tests unset
  • Patch needs improvement unset
  • Triage Stage changed from Unreviewed to Accepted

This has been reported before, see #3119 and #6009, and apparently thought to have been fixed, but it isn't fixed. (Also, near as I can tell this is not a recent regression, this part of the issues identified in those earlier tickets just was never addressed, so far as I can see). Earlier there were two problems: an exception on upload of a file with non-ASCII chars in the name, and stripping of non-ASCII chars from the filename. The 1st seems to have been fixed, the 2nd is still in place.

django/utils/text.py get_valid_filename still strips non-ASCII chars from the filename, and this routine is called by core/files/storage.py to generate a "valid" file name when a FileField is saved. Though r7987 added a test to ensure that uploading files with non-ASCII names "works", the part of the test that checks that the name is preserved simply checks the file name in cleaned_data, and this stripping of non-ASCII chars is done long after form validation. So while the file name remains "correct" in what can be seen in the form's cleaned_data, that name is stripped of its non-ASCII chars (and perhaps spaces are turned into underscores and additional underscores are appended to the name to avoid overwriting an existing file of the same name). So the name you can see in cleaned_data isn't necessarily the name the file will be stored under.

I don't know the right way to fix this, but the existing get_valid_filename in django/utils/text.py certainly seems over-restrictive in terms of what is an allowed character in a file name.

comment:2 Changed 6 years ago by gulliver

I changed get_valid_filename(s) of utils/text.py,

the last line of get_valid_filename(s) is became

return re.sub(r'(?u)[-\w_.]', , s)

So it worked.

comment:3 Changed 6 years ago by kmtracey

  • milestone set to 1.1

Can anyone with more history than I have comment on the proposed change to get_valid_filename? The very restrictive version that strips out all but A-Z, a-z, dot, dash, and underscore has been there since day 1 as far as the public Django repository goes, so I can't get any sense for the reason why it was made so restrictive.

I don't see the harm in allowing unicode alphanumerics, and I get the impression this was thought to have already been fixed -- see #6009, where tests were added that attempt to verify files with non-ASCII chars can be uploaded...only the tests don't go so far as to actually save the file to the file system, which is when the non-ASCII chars are being stripped. I've tested uploading files with non-ASCII chars on Ubuntu and Windows, and they both work fine. The only potential problem I see is that an English Windows box doesn't necessarily display really exotic (e.g. Japanese and Chinese) characters in file names correctly in either a command prompt or the file explorer, so it allows them but doesn't really fully support them. I'd expect there are versions of Windows that handle this better but all I have is plain English Windows...and even there all you get are files that may be a little difficult to deal with since you cannot type their names, but you can still get auto-complete to do it for you, or select them via a GUI, etc.

At any rate it seems we should be able to resolve this one way or the other for 1.1.

Changed 6 years ago by kmtracey

comment:4 Changed 6 years ago by kmtracey

Also reported here: http://groups.google.com/group/django-users/browse_thread/thread/337848b9d16f79de#

so I added the patch as an attachment so as to have something to point this user to for a possible fix. (Note the patch as-is isn't something we'd want to check in, as it doesn't update the comment to match the changed code.)

comment:5 Changed 6 years ago by mtredinnick

That patch looks fine to me, Karen. I (and apparently everybody else with more motivation to find this, since they would have been using non-ASCII filenames) overlooked this in the unicode changes. That's why it wasn't changed then.

I can't think of any security problems it will introduce, as directory traversal still isn't possible with that change (for any of the os.sep values on Mac (even older ones), Windows, Linux, ....)

comment:6 Changed 6 years ago by kmtracey

  • Owner changed from nobody to kmtracey
  • Status changed from new to assigned

OK, I will make the regex change. Looks like there is also some part of a test in file_uploads that is attempting to verify that upload with unicode chars works, but it only tests the value seen in the POST data, it doesn't verify the actual file field after save still has the unicode characters in its name. I'll fix up that test to follow through with the save and thus verify the whole process is working.

comment:7 Changed 6 years ago by kmtracey

  • Resolution set to fixed
  • Status changed from assigned to closed

(In [10388]) Fixed #10254: Changed the regex in get_valid_filename to allow unicode alphanumerics (thanks gulliver). Also updated the file_uploads test for this case to check the name after saving the uploaded file. As it was the test ensured that files with unicode characters in their names could be uploaded, but it wasn't actually ensuring that the unicode characters were preserved through save.

comment:8 Changed 6 years ago by kmtracey

(In [10389]) [1.0.X] Fixed #10254: Changed the regex in get_valid_filename to allow unicode alphanumerics (thanks gulliver). Also updated the file_uploads test for this case to check the name after saving the uploaded file. As it was the test ensured that files with unicode characters in their names could be uploaded, but it wasn't actually ensuring that the unicode characters were preserved through save.

Backport of [10388]

comment:9 Changed 4 years ago by jacob

  • milestone 1.1 deleted

Milestone 1.1 deleted

Note: See TracTickets for help on using tickets.
Back to Top