Opened 16 years ago

Closed 10 years ago

Last modified 8 years ago

#9893 closed Bug (fixed)

Filename + path length greater than 100 truncated on database insertion in Core.Storage

Reported by: Refefer Owned by: Pavel Shpilev
Component: File uploads/storage Version: dev
Severity: Normal Keywords: storage, filename
Cc: drmeers@…, jeroen@…, walter+django@…, Simon Charette Triage Stage: Ready for checkin
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: yes
Easy pickings: no UI/UX: no

Description

In core.files.storage, the storage object doesn't check the length of filename + upload_to to ensure that it's less than 100 characters. If the path length is greater than 100, it is truncated to 100 characters when inserted into the database. With filename collision mitigation appending an '_' to the end of the filename, popular filenames can easily reach lengths that exceed the size of the.

To reproduce the issue, try uploading a file/image with a length over 100 characters.

Solution:
Here's some quick hackery that attempts truncate the filename. Note that it assumes the upload_to length to be less than 30 characters long. Also note that this should probably be divided up into a couple of different methods and this is more to get the ball rolling than anything.

    def get_available_name(self, name):
        """
        Returns a filename that's free on the target storage system, and
        available for new content to be written to.
        """
        # If the filename already exists, append an incrementing integer
        # to the file until the filename doesn't exist.
        dir_name, file_name = os.path.split(name)
        flength = len(file_name)
	if flength > 70: # If filenameis longer than 70, truncate filename
		offset = flength - (flength % 40 + 20) # modulus of file name + 20 to prevent file type truncation
		file_name = file_name[offset:]
		name = os.path.join(dir_name, file_name)

        if self.exists(name): # filename exists, get dot index
            try:
                dot_index = file_name.rindex('.')
            except ValueError: # filename has no dot
                dot_index = -1

            inc = 0 # Set incrementer to zero
            while self.exists(name):
                inc += 1
                if dot_index == -1: # If no dot, append to end
                    tname = file_name + str(inc)
                else:
                    tname = file_name[:dot_index] + str(inc) + file_name[dot_index:]
                name = os.path.join(dir_name, tname)

        return name

Attachments (1)

filefield_length_validation.diff (1.7 KB ) - added by Carson Gee 14 years ago.

Download all attachments as: .zip

Change History (48)

comment:1 by wwinham, 16 years ago

Wouldn't it be easy to change the length of the field to > 100 characters? It's not like increasing the space to 255 would hurt anybody and it would surely make the problem go away for a huge percentage of file names. I just ran in to this myself.

comment:2 by (none), 16 years ago

milestone: post-1.0

Milestone post-1.0 deleted

comment:3 by Jacob, 16 years ago

milestone: 1.1
Triage Stage: UnreviewedAccepted

comment:4 by Antti Kaihola, 16 years ago

I ran into this issue when using the django-attachments re-usable app, but in my case the file name wasn't truncated – an exception was thrown instead. I'm storing files in subdirectories named after the user who uploads (max 30 characters), and apparently one of my users was uploading a file with a really long name so len("attachments/<username>/<filename.ext>") exceeded 100 characters.

I wonder why an exception happened instead of truncation. I'm using PostgreSQL, maybe that's significant?

Here's the relevant part of the traceback:

  File "attachments/attachments/models.py", line 137, in save
    super(Attachment, self).save(force_insert, force_update)

  File "django/db/models/base.py", line 408, in save
    self.save_base(force_insert=force_insert, force_update=force_update)

  File "django/db/models/base.py", line 484, in save_base
    result = manager._insert(values, return_id=update_pk)

  File "django/db/models/manager.py", line 177, in _insert
    return insert_query(self.model, values, **kwargs)

  File "django/db/models/query.py", line 1035, in insert_query
    return query.execute_sql(return_id)

  File "django/db/models/sql/subqueries.py", line 320, in execute_sql
    cursor = super(InsertQuery, self).execute_sql(None)

  File "django/db/models/sql/query.py", line 2290, in execute_sql
    cursor.execute(sql, params)

DataError: value too long for type character varying(100)

comment:5 by Antti Kaihola, 16 years ago

Issues to consider:

Smart truncation can't be done in the storage object
since it doesn't know about the max_length= of the FileField.
The maximum length can be set to a different value than 100 in the model.

One possibility is to extend the signature of
Storage.get_valid_name() and .get_available_name() to accept
a maximum file name length argument, but I assume that's sub-optimal.

Truncation should avoid collisions with existing file names.
Simply shortening the file name isn't sufficient,
but something similar to what VFAT did with its FILENA~1.EXT
8.3-character filename representations should be used.
Storage.exists() can be called to check for collisions.

The upload_to= kwarg of FileField can be either

  • a string specifying the target directory, or
  • a callable which replaces FileField.generate_filename and returns the path to the file to be saved.

In case of a callable, do we leave length checking as responsibility of the callable,
or should we prepare to truncate the generated path?

Truncation could be done in FileField.generate_filename(),
but how to ensure the generated filename is still valid for the storage?
This is an issue if a numbering scheme must be used to avoid file name collisions.

What if the directory is already too long without the actual file name?
This shouldn't be an issue when upload_to= is a string,
but a callable might generate a long path if not coded carefully.

comment:6 by wwinham, 16 years ago

What are the drawbacks to just making a FileField use Text instead of varchar in the database? It seems like a pretty simple solution that would completely resolve the issue.

in reply to:  6 comment:7 by Antti Kaihola, 16 years ago

Replying to wwinham:

What are the drawbacks to just making a FileField use Text instead of varchar in the database? It seems like a pretty simple solution that would completely resolve the issue.

I started a thread with your question on the django-developers list.

comment:8 by Jacob, 16 years ago

milestone: 1.11.2

I think ultimately akaihola's idea of modifying the get_valid_name and get_available_name to accept a max length is the right way to go. Using a text field is a bad idea for a bunch of reason -- speed, storage size, compatibility with Oracle -- but any hardcoded length could possibly be too short.

At this point past feature-freeze, though, changing that signature could cause too much heartache. I'm going to punt this to 1.2; for now, it's a simple workaround: use FileField(max_length=SOMETHING_BIGGER).

comment:9 by Chris Beaven, 15 years ago

Has patch: unset

comment:10 by Simon Meers, 15 years ago

Cc: drmeers@… added

comment:11 by j.c.sackett, 15 years ago

Owner: changed from nobody to j.c.sackett
Status: newassigned

comment:12 by j.c.sackett, 15 years ago

Owner: changed from j.c.sackett to nobody
Status: assignednew

comment:13 by streetcleaner, 15 years ago

Owner: changed from nobody to streetcleaner
Status: newassigned

comment:14 by Antti Kaihola, 15 years ago

Jacob,

To elaborate the changes needed:

  • the following methods need a max_length=100 kwarg:
    • Storage.get_valid_name
    • Storage.get_available_name
    • FileSystemStorage._save
    • Storage.save
  • FileField.get_filename needs to call Storage.get_valid_name with the field's max_length
  • FileField.save needs to call self.storage.save with the field's max_length
  • Storage.save needs to pass that value to self._save
  • FileSystemStorage._save needs to pass it on to self.get_available_name

In addition, shouldn't a smart filename truncation / collision prevention mechanism be put in place anyway? As pointed out in the ticket description, appending underscores will cause collisions if truncation happens.

I propose appending an underscore and a counter integer to the first part of a dot-separated filename. The first part should be truncated before appending to ensure that no truncation will happen at the end. Examples (with a max_length of 36):

  • /long/path/long-file-name.ext1.ext2 (35 characters)
  • /long/path/long-file-nam_1.ext1.ext2 (36 characters; first part truncated)
  • /long/path/long-file-nam_2.ext1.ext2 (36 chars) and so on until...
  • /long/path/long-file-na_10.ext1.ext2 (36 chars; first part further truncated)

This solution assumes that storage backends which don't allow underscores or digits in file names will implement their own get_available_name (just like the same is currently assumed for underscores only).

It doesn't solve the case when the length of the directory path approaches max_length and causes even truncated file names to overflow.

streetcleaner, do you intend to write a patch? If not, I volunteer to do that.

comment:15 by Adam Nelson, 15 years ago

#10149 is a related ticket.

comment:16 by Karen Tracey, 15 years ago

milestone: 1.2

With no concrete patch at this point, I don't see this having a chance to make it into 1.2.

in reply to:  6 comment:17 by Antti Kaihola, 14 years ago

Replying to wwinham:

What are the drawbacks to just making a FileField use Text instead of varchar in the database? It seems like a pretty simple solution that would completely resolve the issue.

Another point from Shai Berger in the googlegroups thread mentioned earlier:

In many engines, text fields are kept out-of-table; then, what's kept in the
table, is (effectively) the name of a file where the text is kept. Even if
this doesn't seriously affect performance, using such a field to keep a
filename must raise a few eyebrows.

by Carson Gee, 14 years ago

comment:18 by Carson Gee, 14 years ago

I just ran into this problem last week and decided to try and write a patch. The problem was reported before model validation was incorporated, so it seemed like an easy fix to just add a validate method to the FileField model class now that those are validated. The patch just runs generate_filename to get the full length of the path that will be stored in the database and ensure it is less than max_length. Be kind, it's my first attempt at a patch.

comment:19 by Carson Gee, 14 years ago

Has patch: set

comment:20 by Chris Beaven, 14 years ago

Severity: Normal
Type: Bug

comment:21 by Chris Beaven, 14 years ago

Needs tests: set

I'd be nice if there was also some way of passing the maximum length allowed to the storage engine too so when it generates a unique name, it can truncate it down.

But even without that, this is a worthy addition.

comment:22 by Aymeric Augustin, 13 years ago

UI/UX: unset

Change UI/UX from NULL to False.

comment:23 by Aymeric Augustin, 13 years ago

Easy pickings: unset

Change Easy pickings from NULL to False.

comment:24 by Jeroen Dekkers, 13 years ago

Cc: jeroen@… added

comment:25 by Aymeric Augustin, 13 years ago

Owner: changed from streetcleaner to Aymeric Augustin
Status: assignednew

comment:26 by Aymeric Augustin <aymeric.augustin@…>, 13 years ago

Resolution: fixed
Status: newclosed

In [dcd4383107d96c18bcb53312ca4de070374b334c]:

Fixed #9893 -- Validated the length of file names

after the full file name is generated by the storage class.

Thanks Refefer for the report, carsongee for the patch, and
everyone else involved in the discussion.

comment:27 by Claude Paroz <claude@…>, 12 years ago

In [05d333ba3bb16af024c11966d2072de38fe9f82f]:

Fixed #18515 -- Conditionally regenerated filename in FileField validation

When a FileField value has been saved, a new validation should not
regenerate a new filename when checking the length. Refs #9893.

comment:28 by Aymeric Augustin, 12 years ago

Resolution: fixed
Status: closednew

I'm going to revert that commit because of the regression reported in #19525.

Here are the relevant bits of that ticket:


As of dcd43831 (fixing #9893), a FileField will call generate_filename() as part of the validation step for a FileField on a form. This was then updated in 05d333ba to address #18515.

Unfortunately, this means that the filename function provided as an argument to upload_to can no longer reference any field with a pre-save behavior.

The common use case for this is to organize files on disk according to upload date. For example:

def user_filename(instance, filename):
    return os.path.join('user_files', instance.uploaded_timestamp.strftime('%Y-%m-%d'), filename)

class UserFile(models.Model):
    uploaded_timestamp = models.DateTimeField(auto_now_add=True)
    data = models.FileField(upload_to=user_filename)

Under Django 1.5, attempting to call is_valid() on a Modelform for this model will raise a "'NoneType' object has no attribute 'strftime'" exception, because instance.uploaded_timestamp hasn't been instantiated yet. This is despite the fact that the uploaded data has been provided, the generated filename would be valid, and the upload timestamp can be computed.

In Django 1.4 and earlier, this works because no validation was performed for FileFields filenames; the uploaded_timestamp was evaluated as part of the model pre-save, and the persistence of the file to disk occurred after the model was saved.

To my reading, the documentation is ambiguous on whether this is expected behavior or not. It says that the model may not be saved to the database yet, but points at AutoField as the cause for problems. However, it also explicitly talks about using strftime as part of file paths. A file datetimes of 'now' would seem to be an obvious usage of this feature.


For the record, I discovered this by upgrading a commercial project to Django 1.5, so there is at least one project in the wild that will be affected by this change. Although I've discovered it with a auto_now_add FileField, it's not hard to see that this change also affects any field with a pre_save behaviour.

It also has the potential to lead to incorrect validation. Consider the case of a field with a pre_save behavior that updates the field (auto_now is one example, but any denormalization/summary field would be an example). The call to validate occurs *before* the call to pre_save is made, which means that you're going to get the pre_save value used as part of your validation. If you then save the model, the pre_save() will be called, and the actual filename that is used for saving the file will be different to the one used for validation.

Some initial thoughts about possible solutions:

  • Document that you can't use a field with pre-save behaviour. Not ideal IMHO, since it rules out an obvious use case for upload_to.
  • Roll back the fix. Also less than ideal; #9893 is an edge case bug, but it's something that has been seen in the wild, and isn't *too* hard to generate.
  • Invoke pre_save on all model fields prior to validation. Given that most validation doesn't need this, this approach seems a little excessive.

I've just checked with my production code, and yes, default=timezone.now works for the auto_now_add case. However, it won't address auto_now, or the pre_save conflict problem.

comment:29 by Aymeric Augustin <aymeric.augustin@…>, 12 years ago

In db278c3bf9177043c42a9ed8b529a5c117938460:

Fixed #19525 -- Reverted dcd4383107 and 05d333ba3b.

Refs #9893, #18515.

Thanks Russell for the report.

comment:30 by Aymeric Augustin <aymeric.augustin@…>, 12 years ago

In 3cb87ec605fb9e7785aa0580bd8220797b622e0c:

[1.5.x] Fixed #19525 -- Reverted dcd4383107 and 05d333ba3b.

Refs #9893, #18515.

Thanks Russell for the report.

Backport of db278c3 from master.

comment:31 by Preston Holmes <preston@…>, 12 years ago

In 0e431e5dd7ed19aa2119ceba9ebed050c2988844:

[1.5.x] Fixed #19525 -- Reverted dcd4383107 and 05d333ba3b.

Refs #9893, #18515.

Thanks Russell for the report.

Backport of db278c3 from master.

comment:32 by Aymeric Augustin, 12 years ago

Status: newassigned

comment:33 by Aymeric Augustin, 12 years ago

Owner: Aymeric Augustin removed
Status: assignednew

I don't know how to solve this properly.

comment:34 by Aymeric Augustin, 12 years ago

Component: Core (Other)File uploads/storage

comment:35 by Walter Doekes, 12 years ago

Cc: walter+django@… added

comment:36 by Walter Doekes, 12 years ago

So, I ran into bug #13314. These are all similar but not quite the same.

#9893 -- Filename + path length greater than 100 (when making filenames unique)
#10410 -- FileField saves one filename on disk and another on DB
#13314 -- validation does not account for "upload_to" when counting characters

This ticket #9893 is about the changed filenames when making them unique.

Ticket #13314 is just about the upload_to parameter. For that problem the fix could be done like this:

--- django/db/models/fields/files.py.orig	2013-04-01 17:03:59.752332630 +0200
+++ django/db/models/fields/files.py	2013-04-01 17:08:15.833870239 +0200
@@ -290,6 +290,8 @@
         if 'initial' in kwargs:
             defaults['required'] = False
         defaults.update(kwargs)
+        # And deduct the upload_to length from the max_length.
+        defaults['max_length'] -= len(self.upload_to)
         return super(FileField, self).formfield(**defaults)
 
 class ImageFileDescriptor(FileDescriptor):

With this environment:

  • ImageField with max_length default 100
  • An upload_to value of "profile/" (8 characters)

Without the fix, I get a DatabaseError for a 99-byte length filename.
With the fix, I get a nice ValidationError until I reduce the filename to 92 bytes.

That should be at least be one less problem, right?

This was tested with Django 1.3.7.

comment:37 by Pavel Shpilev, 11 years ago

Owner: set to Pavel Shpilev
Status: newassigned

comment:38 by Pavel Shpilev, 11 years ago

Hey guys
Long time no activity on the ticket...
I have recently run into this bug on production myself. We needed a solution ASAP, so I ended up with an overwrite for get_available_name(), providing max_filename_length as an argument. Pretty much what @akaihola first suggested in https://code.djangoproject.com/ticket/9893#comment:5. Seem to be working fine for us. I can try implement this as a general solution now, if you think this would be a proper fix.
Cheers.

comment:39 by Simon Charette, 10 years ago

Cc: Simon Charette added
Needs documentation: set
Patch needs improvement: set

pavel_shpilev put together a PR based on ticket:9893#comment:1 that still requires some adjustments.

comment:40 by Collin Anderson, 10 years ago

Patch needs improvement: unset

Pavel Shpilev wants another review.

comment:41 by Berker Peksag, 10 years ago

Needs documentation: unset
Needs tests: unset
Triage Stage: AcceptedReady for checkin

PR #3369 LGTM.

comment:42 by Tim Graham, 10 years ago

Patch needs improvement: set
Triage Stage: Ready for checkinAccepted

Comment for improvement is on the PR.

comment:43 by Pavel Shpilev, 10 years ago

Triage Stage: AcceptedReady for checkin

comment:44 by Tim Graham <timograham@…>, 10 years ago

Resolution: fixed
Status: assignedclosed

In a7c256cb5491bf2a77abdff01638239db5bfd9d5:

Fixed #9893 -- Allowed using a field's max_length in the Storage.

comment:45 by Tim Graham <timograham@…>, 9 years ago

In 1bb6ecf:

Refs #9893 -- Removed shims for lack of max_length support in file storage per deprecation timeline.

comment:46 by Jan Geboers, 8 years ago

Hello,

could it be that this bug still exists?
Both on Django 1.8.18 and 1.11.2 I see this behavior isn't corrected yet.
The validation of max_length takes only file_name into account and ignores upload_to, causing truncation on the database level, exactly as described in https://code.djangoproject.com/ticket/13314

When I check out the relevant code I strongly suspect this bug was never fixed:

https://github.com/django/django/blob/master/django/forms/fields.py#L553

comment:47 by Tim Graham, 8 years ago

#13314 is reopened per the last comment.

Note: See TracTickets for help on using tickets.
Back to Top