#9893 closed Bug (fixed)
Filename + path length greater than 100 truncated on database insertion in Core.Storage
Reported by: | Refefer | Owned by: | Pavel Shpilev |
---|---|---|---|
Component: | File uploads/storage | Version: | dev |
Severity: | Normal | Keywords: | storage, filename |
Cc: | drmeers@…, jeroen@…, walter+django@…, Simon Charette | Triage Stage: | Ready for checkin |
Has patch: | yes | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | yes |
Easy pickings: | no | UI/UX: | no |
Description
In core.files.storage, the storage object doesn't check the length of filename + upload_to to ensure that it's less than 100 characters. If the path length is greater than 100, it is truncated to 100 characters when inserted into the database. With filename collision mitigation appending an '_' to the end of the filename, popular filenames can easily reach lengths that exceed the size of the.
To reproduce the issue, try uploading a file/image with a length over 100 characters.
Solution:
Here's some quick hackery that attempts truncate the filename. Note that it assumes the upload_to length to be less than 30 characters long. Also note that this should probably be divided up into a couple of different methods and this is more to get the ball rolling than anything.
def get_available_name(self, name): """ Returns a filename that's free on the target storage system, and available for new content to be written to. """ # If the filename already exists, append an incrementing integer # to the file until the filename doesn't exist. dir_name, file_name = os.path.split(name) flength = len(file_name) if flength > 70: # If filenameis longer than 70, truncate filename offset = flength - (flength % 40 + 20) # modulus of file name + 20 to prevent file type truncation file_name = file_name[offset:] name = os.path.join(dir_name, file_name) if self.exists(name): # filename exists, get dot index try: dot_index = file_name.rindex('.') except ValueError: # filename has no dot dot_index = -1 inc = 0 # Set incrementer to zero while self.exists(name): inc += 1 if dot_index == -1: # If no dot, append to end tname = file_name + str(inc) else: tname = file_name[:dot_index] + str(inc) + file_name[dot_index:] name = os.path.join(dir_name, tname) return name
Attachments (1)
Change History (48)
comment:1 by , 16 years ago
comment:3 by , 16 years ago
milestone: | → 1.1 |
---|---|
Triage Stage: | Unreviewed → Accepted |
comment:4 by , 16 years ago
I ran into this issue when using the django-attachments re-usable app, but in my case the file name wasn't truncated – an exception was thrown instead. I'm storing files in subdirectories named after the user who uploads (max 30 characters), and apparently one of my users was uploading a file with a really long name so len("attachments/<username>/<filename.ext>")
exceeded 100 characters.
I wonder why an exception happened instead of truncation. I'm using PostgreSQL, maybe that's significant?
Here's the relevant part of the traceback:
File "attachments/attachments/models.py", line 137, in save super(Attachment, self).save(force_insert, force_update) File "django/db/models/base.py", line 408, in save self.save_base(force_insert=force_insert, force_update=force_update) File "django/db/models/base.py", line 484, in save_base result = manager._insert(values, return_id=update_pk) File "django/db/models/manager.py", line 177, in _insert return insert_query(self.model, values, **kwargs) File "django/db/models/query.py", line 1035, in insert_query return query.execute_sql(return_id) File "django/db/models/sql/subqueries.py", line 320, in execute_sql cursor = super(InsertQuery, self).execute_sql(None) File "django/db/models/sql/query.py", line 2290, in execute_sql cursor.execute(sql, params) DataError: value too long for type character varying(100)
comment:5 by , 16 years ago
Issues to consider:
Smart truncation can't be done in the storage object
since it doesn't know about the max_length=
of the FileField
.
The maximum length can be set to a different value than 100 in the model.
One possibility is to extend the signature of
Storage.get_valid_name()
and .get_available_name()
to accept
a maximum file name length argument, but I assume that's sub-optimal.
Truncation should avoid collisions with existing file names.
Simply shortening the file name isn't sufficient,
but something similar to what VFAT did with its FILENA~1.EXT
8.3-character filename representations should be used.
Storage.exists()
can be called to check for collisions.
The upload_to=
kwarg of FileField
can be either
- a string specifying the target directory, or
- a callable which replaces
FileField.generate_filename
and returns the path to the file to be saved.
In case of a callable, do we leave length checking as responsibility of the callable,
or should we prepare to truncate the generated path?
Truncation could be done in FileField.generate_filename()
,
but how to ensure the generated filename is still valid for the storage?
This is an issue if a numbering scheme must be used to avoid file name collisions.
What if the directory is already too long without the actual file name?
This shouldn't be an issue when upload_to=
is a string,
but a callable might generate a long path if not coded carefully.
follow-ups: 7 17 comment:6 by , 16 years ago
What are the drawbacks to just making a FileField use Text instead of varchar in the database? It seems like a pretty simple solution that would completely resolve the issue.
comment:7 by , 16 years ago
comment:8 by , 16 years ago
milestone: | 1.1 → 1.2 |
---|
I think ultimately akaihola's idea of modifying the get_valid_name
and get_available_name
to accept a max length is the right way to go. Using a text field is a bad idea for a bunch of reason -- speed, storage size, compatibility with Oracle -- but any hardcoded length could possibly be too short.
At this point past feature-freeze, though, changing that signature could cause too much heartache. I'm going to punt this to 1.2; for now, it's a simple workaround: use FileField(max_length=SOMETHING_BIGGER)
.
comment:9 by , 15 years ago
Has patch: | unset |
---|
comment:10 by , 15 years ago
Cc: | added |
---|
comment:11 by , 15 years ago
Owner: | changed from | to
---|---|
Status: | new → assigned |
comment:12 by , 15 years ago
Owner: | changed from | to
---|---|
Status: | assigned → new |
comment:13 by , 15 years ago
Owner: | changed from | to
---|---|
Status: | new → assigned |
comment:14 by , 15 years ago
Jacob,
To elaborate the changes needed:
- the following methods need a
max_length=100
kwarg:Storage.get_valid_name
Storage.get_available_name
FileSystemStorage._save
Storage.save
FileField.get_filename
needs to callStorage.get_valid_name
with the field'smax_length
FileField.save
needs to callself.storage.save
with the field'smax_length
Storage.save
needs to pass that value toself._save
FileSystemStorage._save
needs to pass it on toself.get_available_name
In addition, shouldn't a smart filename truncation / collision prevention mechanism be put in place anyway? As pointed out in the ticket description, appending underscores will cause collisions if truncation happens.
I propose appending an underscore and a counter integer to the first part of a dot-separated filename. The first part should be truncated before appending to ensure that no truncation will happen at the end. Examples (with a max_length of 36):
/long/path/long-file-name.ext1.ext2
(35 characters)/long/path/long-file-nam_1.ext1.ext2
(36 characters; first part truncated)/long/path/long-file-nam_2.ext1.ext2
(36 chars) and so on until.../long/path/long-file-na_10.ext1.ext2
(36 chars; first part further truncated)
This solution assumes that storage backends which don't allow underscores or digits in file names will implement their own get_available_name
(just like the same is currently assumed for underscores only).
It doesn't solve the case when the length of the directory path approaches max_length
and causes even truncated file names to overflow.
streetcleaner, do you intend to write a patch? If not, I volunteer to do that.
comment:16 by , 15 years ago
milestone: | 1.2 |
---|
With no concrete patch at this point, I don't see this having a chance to make it into 1.2.
comment:17 by , 14 years ago
Replying to wwinham:
What are the drawbacks to just making a FileField use Text instead of varchar in the database? It seems like a pretty simple solution that would completely resolve the issue.
Another point from Shai Berger in the googlegroups thread mentioned earlier:
In many engines, text fields are kept out-of-table; then, what's kept in the
table, is (effectively) the name of a file where the text is kept. Even if
this doesn't seriously affect performance, using such a field to keep a
filename must raise a few eyebrows.
by , 14 years ago
Attachment: | filefield_length_validation.diff added |
---|
comment:18 by , 14 years ago
I just ran into this problem last week and decided to try and write a patch. The problem was reported before model validation was incorporated, so it seemed like an easy fix to just add a validate method to the FileField model class now that those are validated. The patch just runs generate_filename to get the full length of the path that will be stored in the database and ensure it is less than max_length. Be kind, it's my first attempt at a patch.
comment:19 by , 14 years ago
Has patch: | set |
---|
comment:20 by , 14 years ago
Severity: | → Normal |
---|---|
Type: | → Bug |
comment:21 by , 14 years ago
Needs tests: | set |
---|
I'd be nice if there was also some way of passing the maximum length allowed to the storage engine too so when it generates a unique name, it can truncate it down.
But even without that, this is a worthy addition.
comment:24 by , 13 years ago
Cc: | added |
---|
comment:25 by , 13 years ago
Owner: | changed from | to
---|---|
Status: | assigned → new |
comment:26 by , 13 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
comment:28 by , 12 years ago
Resolution: | fixed |
---|---|
Status: | closed → new |
I'm going to revert that commit because of the regression reported in #19525.
Here are the relevant bits of that ticket:
As of dcd43831 (fixing #9893), a FileField will call generate_filename()
as part of the validation step for a FileField on a form. This was then updated in 05d333ba to address #18515.
Unfortunately, this means that the filename function provided as an argument to upload_to can no longer reference any field with a pre-save behavior.
The common use case for this is to organize files on disk according to upload date. For example:
def user_filename(instance, filename): return os.path.join('user_files', instance.uploaded_timestamp.strftime('%Y-%m-%d'), filename) class UserFile(models.Model): uploaded_timestamp = models.DateTimeField(auto_now_add=True) data = models.FileField(upload_to=user_filename)
Under Django 1.5, attempting to call is_valid() on a Modelform for this model will raise a "'NoneType' object has no attribute 'strftime'" exception, because instance.uploaded_timestamp hasn't been instantiated yet. This is despite the fact that the uploaded data has been provided, the generated filename would be valid, and the upload timestamp can be computed.
In Django 1.4 and earlier, this works because no validation was performed for FileFields filenames; the uploaded_timestamp was evaluated as part of the model pre-save, and the persistence of the file to disk occurred after the model was saved.
To my reading, the documentation is ambiguous on whether this is expected behavior or not. It says that the model may not be saved to the database yet, but points at AutoField as the cause for problems. However, it also explicitly talks about using strftime as part of file paths. A file datetimes of 'now' would seem to be an obvious usage of this feature.
For the record, I discovered this by upgrading a commercial project to Django 1.5, so there is at least one project in the wild that will be affected by this change. Although I've discovered it with a auto_now_add FileField, it's not hard to see that this change also affects any field with a pre_save behaviour.
It also has the potential to lead to incorrect validation. Consider the case of a field with a pre_save behavior that updates the field (auto_now is one example, but any denormalization/summary field would be an example). The call to validate occurs *before* the call to pre_save is made, which means that you're going to get the pre_save value used as part of your validation. If you then save the model, the pre_save() will be called, and the actual filename that is used for saving the file will be different to the one used for validation.
Some initial thoughts about possible solutions:
- Document that you can't use a field with pre-save behaviour. Not ideal IMHO, since it rules out an obvious use case for upload_to.
- Roll back the fix. Also less than ideal; #9893 is an edge case bug, but it's something that has been seen in the wild, and isn't *too* hard to generate.
- Invoke pre_save on all model fields prior to validation. Given that most validation doesn't need this, this approach seems a little excessive.
I've just checked with my production code, and yes, default=timezone.now
works for the auto_now_add case
. However, it won't address auto_now
, or the pre_save
conflict problem.
comment:32 by , 12 years ago
Status: | new → assigned |
---|
comment:33 by , 12 years ago
Owner: | removed |
---|---|
Status: | assigned → new |
I don't know how to solve this properly.
comment:34 by , 12 years ago
Component: | Core (Other) → File uploads/storage |
---|
comment:35 by , 12 years ago
Cc: | added |
---|
comment:36 by , 12 years ago
So, I ran into bug #13314. These are all similar but not quite the same.
#9893 -- Filename + path length greater than 100 (when making filenames unique)
#10410 -- FileField saves one filename on disk and another on DB
#13314 -- validation does not account for "upload_to" when counting characters
This ticket #9893 is about the changed filenames when making them unique.
Ticket #13314 is just about the upload_to parameter. For that problem the fix could be done like this:
--- django/db/models/fields/files.py.orig 2013-04-01 17:03:59.752332630 +0200 +++ django/db/models/fields/files.py 2013-04-01 17:08:15.833870239 +0200 @@ -290,6 +290,8 @@ if 'initial' in kwargs: defaults['required'] = False defaults.update(kwargs) + # And deduct the upload_to length from the max_length. + defaults['max_length'] -= len(self.upload_to) return super(FileField, self).formfield(**defaults) class ImageFileDescriptor(FileDescriptor):
With this environment:
- ImageField with max_length default 100
- An upload_to value of "profile/" (8 characters)
Without the fix, I get a DatabaseError for a 99-byte length filename.
With the fix, I get a nice ValidationError until I reduce the filename to 92 bytes.
That should be at least be one less problem, right?
This was tested with Django 1.3.7.
comment:37 by , 11 years ago
Owner: | set to |
---|---|
Status: | new → assigned |
comment:38 by , 11 years ago
Hey guys
Long time no activity on the ticket...
I have recently run into this bug on production myself. We needed a solution ASAP, so I ended up with an overwrite for get_available_name(), providing max_filename_length as an argument. Pretty much what @akaihola first suggested in https://code.djangoproject.com/ticket/9893#comment:5. Seem to be working fine for us. I can try implement this as a general solution now, if you think this would be a proper fix.
Cheers.
comment:39 by , 10 years ago
Cc: | added |
---|---|
Needs documentation: | set |
Patch needs improvement: | set |
pavel_shpilev put together a PR based on ticket:9893#comment:1 that still requires some adjustments.
comment:41 by , 10 years ago
Needs documentation: | unset |
---|---|
Needs tests: | unset |
Triage Stage: | Accepted → Ready for checkin |
PR #3369 LGTM.
comment:42 by , 10 years ago
Patch needs improvement: | set |
---|---|
Triage Stage: | Ready for checkin → Accepted |
Comment for improvement is on the PR.
comment:43 by , 10 years ago
Triage Stage: | Accepted → Ready for checkin |
---|
comment:44 by , 10 years ago
Resolution: | → fixed |
---|---|
Status: | assigned → closed |
comment:46 by , 8 years ago
Hello,
could it be that this bug still exists?
Both on Django 1.8.18 and 1.11.2 I see this behavior isn't corrected yet.
The validation of max_length takes only file_name into account and ignores upload_to, causing truncation on the database level, exactly as described in https://code.djangoproject.com/ticket/13314
When I check out the relevant code I strongly suspect this bug was never fixed:
https://github.com/django/django/blob/master/django/forms/fields.py#L553
Wouldn't it be easy to change the length of the field to > 100 characters? It's not like increasing the space to 255 would hurt anybody and it would surely make the problem go away for a huge percentage of file names. I just ran in to this myself.