Opened 6 years ago

Closed 3 months ago

#9893 closed Bug (fixed)

Filename + path length greater than 100 truncated on database insertion in Core.Storage

Reported by: Refefer Owned by: pavel_shpilev
Component: File uploads/storage Version: master
Severity: Normal Keywords: storage, filename
Cc: drmeers@…, jeroen@…, walter+django@…, charettes Triage Stage: Ready for checkin
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: yes
Easy pickings: no UI/UX: no

Description

In core.files.storage, the storage object doesn't check the length of filename + upload_to to ensure that it's less than 100 characters. If the path length is greater than 100, it is truncated to 100 characters when inserted into the database. With filename collision mitigation appending an '_' to the end of the filename, popular filenames can easily reach lengths that exceed the size of the.

To reproduce the issue, try uploading a file/image with a length over 100 characters.

Solution:
Here's some quick hackery that attempts truncate the filename. Note that it assumes the upload_to length to be less than 30 characters long. Also note that this should probably be divided up into a couple of different methods and this is more to get the ball rolling than anything.

    def get_available_name(self, name):
        """
        Returns a filename that's free on the target storage system, and
        available for new content to be written to.
        """
        # If the filename already exists, append an incrementing integer
        # to the file until the filename doesn't exist.
        dir_name, file_name = os.path.split(name)
        flength = len(file_name)
	if flength > 70: # If filenameis longer than 70, truncate filename
		offset = flength - (flength % 40 + 20) # modulus of file name + 20 to prevent file type truncation
		file_name = file_name[offset:]
		name = os.path.join(dir_name, file_name)

        if self.exists(name): # filename exists, get dot index
            try:
                dot_index = file_name.rindex('.')
            except ValueError: # filename has no dot
                dot_index = -1

            inc = 0 # Set incrementer to zero
            while self.exists(name):
                inc += 1
                if dot_index == -1: # If no dot, append to end
                    tname = file_name + str(inc)
                else:
                    tname = file_name[:dot_index] + str(inc) + file_name[dot_index:]
                name = os.path.join(dir_name, tname)

        return name

Attachments (1)

filefield_length_validation.diff (1.7 KB) - added by carsongee 4 years ago.

Download all attachments as: .zip

Change History (45)

comment:1 Changed 6 years ago by wwinham

  • Needs documentation unset
  • Needs tests unset
  • Patch needs improvement unset

Wouldn't it be easy to change the length of the field to > 100 characters? It's not like increasing the space to 255 would hurt anybody and it would surely make the problem go away for a huge percentage of file names. I just ran in to this myself.

comment:2 Changed 6 years ago by anonymous

  • milestone post-1.0 deleted

Milestone post-1.0 deleted

comment:3 Changed 6 years ago by jacob

  • milestone set to 1.1
  • Triage Stage changed from Unreviewed to Accepted

comment:4 Changed 6 years ago by akaihola

I ran into this issue when using the django-attachments re-usable app, but in my case the file name wasn't truncated – an exception was thrown instead. I'm storing files in subdirectories named after the user who uploads (max 30 characters), and apparently one of my users was uploading a file with a really long name so len("attachments/<username>/<filename.ext>") exceeded 100 characters.

I wonder why an exception happened instead of truncation. I'm using PostgreSQL, maybe that's significant?

Here's the relevant part of the traceback:

  File "attachments/attachments/models.py", line 137, in save
    super(Attachment, self).save(force_insert, force_update)

  File "django/db/models/base.py", line 408, in save
    self.save_base(force_insert=force_insert, force_update=force_update)

  File "django/db/models/base.py", line 484, in save_base
    result = manager._insert(values, return_id=update_pk)

  File "django/db/models/manager.py", line 177, in _insert
    return insert_query(self.model, values, **kwargs)

  File "django/db/models/query.py", line 1035, in insert_query
    return query.execute_sql(return_id)

  File "django/db/models/sql/subqueries.py", line 320, in execute_sql
    cursor = super(InsertQuery, self).execute_sql(None)

  File "django/db/models/sql/query.py", line 2290, in execute_sql
    cursor.execute(sql, params)

DataError: value too long for type character varying(100)

comment:5 Changed 6 years ago by akaihola

Issues to consider:

Smart truncation can't be done in the storage object
since it doesn't know about the max_length= of the FileField.
The maximum length can be set to a different value than 100 in the model.

One possibility is to extend the signature of
Storage.get_valid_name() and .get_available_name() to accept
a maximum file name length argument, but I assume that's sub-optimal.

Truncation should avoid collisions with existing file names.
Simply shortening the file name isn't sufficient,
but something similar to what VFAT did with its FILENA~1.EXT
8.3-character filename representations should be used.
Storage.exists() can be called to check for collisions.

The upload_to= kwarg of FileField can be either

  • a string specifying the target directory, or
  • a callable which replaces FileField.generate_filename and returns the path to the file to be saved.

In case of a callable, do we leave length checking as responsibility of the callable,
or should we prepare to truncate the generated path?

Truncation could be done in FileField.generate_filename(),
but how to ensure the generated filename is still valid for the storage?
This is an issue if a numbering scheme must be used to avoid file name collisions.

What if the directory is already too long without the actual file name?
This shouldn't be an issue when upload_to= is a string,
but a callable might generate a long path if not coded carefully.

comment:6 follow-ups: Changed 6 years ago by wwinham

What are the drawbacks to just making a FileField use Text instead of varchar in the database? It seems like a pretty simple solution that would completely resolve the issue.

comment:7 in reply to: ↑ 6 Changed 6 years ago by akaihola

Replying to wwinham:

What are the drawbacks to just making a FileField use Text instead of varchar in the database? It seems like a pretty simple solution that would completely resolve the issue.

I started a thread with your question on the django-developers list.

comment:8 Changed 6 years ago by jacob

  • milestone changed from 1.1 to 1.2

I think ultimately akaihola's idea of modifying the get_valid_name and get_available_name to accept a max length is the right way to go. Using a text field is a bad idea for a bunch of reason -- speed, storage size, compatibility with Oracle -- but any hardcoded length could possibly be too short.

At this point past feature-freeze, though, changing that signature could cause too much heartache. I'm going to punt this to 1.2; for now, it's a simple workaround: use FileField(max_length=SOMETHING_BIGGER).

comment:9 Changed 6 years ago by SmileyChris

  • Has patch unset

comment:10 Changed 6 years ago by DrMeers

  • Cc drmeers@… added

comment:11 Changed 5 years ago by jcsackett

  • Owner changed from nobody to jcsackett
  • Status changed from new to assigned

comment:12 Changed 5 years ago by jcsackett

  • Owner changed from jcsackett to nobody
  • Status changed from assigned to new

comment:13 Changed 5 years ago by streetcleaner

  • Owner changed from nobody to streetcleaner
  • Status changed from new to assigned

comment:14 Changed 5 years ago by akaihola

Jacob,

To elaborate the changes needed:

  • the following methods need a max_length=100 kwarg:
    • Storage.get_valid_name
    • Storage.get_available_name
    • FileSystemStorage._save
    • Storage.save
  • FileField.get_filename needs to call Storage.get_valid_name with the field's max_length
  • FileField.save needs to call self.storage.save with the field's max_length
  • Storage.save needs to pass that value to self._save
  • FileSystemStorage._save needs to pass it on to self.get_available_name

In addition, shouldn't a smart filename truncation / collision prevention mechanism be put in place anyway? As pointed out in the ticket description, appending underscores will cause collisions if truncation happens.

I propose appending an underscore and a counter integer to the first part of a dot-separated filename. The first part should be truncated before appending to ensure that no truncation will happen at the end. Examples (with a max_length of 36):

  • /long/path/long-file-name.ext1.ext2 (35 characters)
  • /long/path/long-file-nam_1.ext1.ext2 (36 characters; first part truncated)
  • /long/path/long-file-nam_2.ext1.ext2 (36 chars) and so on until...
  • /long/path/long-file-na_10.ext1.ext2 (36 chars; first part further truncated)

This solution assumes that storage backends which don't allow underscores or digits in file names will implement their own get_available_name (just like the same is currently assumed for underscores only).

It doesn't solve the case when the length of the directory path approaches max_length and causes even truncated file names to overflow.

streetcleaner, do you intend to write a patch? If not, I volunteer to do that.

comment:15 Changed 5 years ago by adamnelson

#10149 is a related ticket.

comment:16 Changed 5 years ago by kmtracey

  • milestone 1.2 deleted

With no concrete patch at this point, I don't see this having a chance to make it into 1.2.

comment:17 in reply to: ↑ 6 Changed 4 years ago by akaihola

Replying to wwinham:

What are the drawbacks to just making a FileField use Text instead of varchar in the database? It seems like a pretty simple solution that would completely resolve the issue.

Another point from Shai Berger in the googlegroups thread mentioned earlier:

In many engines, text fields are kept out-of-table; then, what's kept in the
table, is (effectively) the name of a file where the text is kept. Even if
this doesn't seriously affect performance, using such a field to keep a
filename must raise a few eyebrows.

Changed 4 years ago by carsongee

comment:18 Changed 4 years ago by carsongee

I just ran into this problem last week and decided to try and write a patch. The problem was reported before model validation was incorporated, so it seemed like an easy fix to just add a validate method to the FileField model class now that those are validated. The patch just runs generate_filename to get the full length of the path that will be stored in the database and ensure it is less than max_length. Be kind, it's my first attempt at a patch.

comment:19 Changed 4 years ago by carsongee

  • Has patch set

comment:20 Changed 4 years ago by SmileyChris

  • Severity set to Normal
  • Type set to Bug

comment:21 Changed 4 years ago by SmileyChris

  • Needs tests set

I'd be nice if there was also some way of passing the maximum length allowed to the storage engine too so when it generates a unique name, it can truncate it down.

But even without that, this is a worthy addition.

comment:22 Changed 3 years ago by aaugustin

  • UI/UX unset

Change UI/UX from NULL to False.

comment:23 Changed 3 years ago by aaugustin

  • Easy pickings unset

Change Easy pickings from NULL to False.

comment:24 Changed 3 years ago by dekkers

  • Cc jeroen@… added

comment:25 Changed 3 years ago by aaugustin

  • Owner changed from streetcleaner to aaugustin
  • Status changed from assigned to new

comment:26 Changed 3 years ago by Aymeric Augustin <aymeric.augustin@…>

  • Resolution set to fixed
  • Status changed from new to closed

In [dcd4383107d96c18bcb53312ca4de070374b334c]:

Fixed #9893 -- Validated the length of file names

after the full file name is generated by the storage class.

Thanks Refefer for the report, carsongee for the patch, and
everyone else involved in the discussion.

comment:27 Changed 3 years ago by Claude Paroz <claude@…>

In [05d333ba3bb16af024c11966d2072de38fe9f82f]:

Fixed #18515 -- Conditionally regenerated filename in FileField validation

When a FileField value has been saved, a new validation should not
regenerate a new filename when checking the length. Refs #9893.

comment:28 Changed 2 years ago by aaugustin

  • Resolution fixed deleted
  • Status changed from closed to new

I'm going to revert that commit because of the regression reported in #19525.

Here are the relevant bits of that ticket:


As of dcd43831 (fixing #9893), a FileField will call generate_filename() as part of the validation step for a FileField on a form. This was then updated in 05d333ba to address #18515.

Unfortunately, this means that the filename function provided as an argument to upload_to can no longer reference any field with a pre-save behavior.

The common use case for this is to organize files on disk according to upload date. For example:

def user_filename(instance, filename):
    return os.path.join('user_files', instance.uploaded_timestamp.strftime('%Y-%m-%d'), filename)

class UserFile(models.Model):
    uploaded_timestamp = models.DateTimeField(auto_now_add=True)
    data = models.FileField(upload_to=user_filename)

Under Django 1.5, attempting to call is_valid() on a Modelform for this model will raise a "'NoneType' object has no attribute 'strftime'" exception, because instance.uploaded_timestamp hasn't been instantiated yet. This is despite the fact that the uploaded data has been provided, the generated filename would be valid, and the upload timestamp can be computed.

In Django 1.4 and earlier, this works because no validation was performed for FileFields filenames; the uploaded_timestamp was evaluated as part of the model pre-save, and the persistence of the file to disk occurred after the model was saved.

To my reading, the documentation is ambiguous on whether this is expected behavior or not. It says that the model may not be saved to the database yet, but points at AutoField as the cause for problems. However, it also explicitly talks about using strftime as part of file paths. A file datetimes of 'now' would seem to be an obvious usage of this feature.


For the record, I discovered this by upgrading a commercial project to Django 1.5, so there is at least one project in the wild that will be affected by this change. Although I've discovered it with a auto_now_add FileField, it's not hard to see that this change also affects any field with a pre_save behaviour.

It also has the potential to lead to incorrect validation. Consider the case of a field with a pre_save behavior that updates the field (auto_now is one example, but any denormalization/summary field would be an example). The call to validate occurs *before* the call to pre_save is made, which means that you're going to get the pre_save value used as part of your validation. If you then save the model, the pre_save() will be called, and the actual filename that is used for saving the file will be different to the one used for validation.

Some initial thoughts about possible solutions:

  • Document that you can't use a field with pre-save behaviour. Not ideal IMHO, since it rules out an obvious use case for upload_to.
  • Roll back the fix. Also less than ideal; #9893 is an edge case bug, but it's something that has been seen in the wild, and isn't *too* hard to generate.
  • Invoke pre_save on all model fields prior to validation. Given that most validation doesn't need this, this approach seems a little excessive.

I've just checked with my production code, and yes, default=timezone.now works for the auto_now_add case. However, it won't address auto_now, or the pre_save conflict problem.

comment:29 Changed 2 years ago by Aymeric Augustin <aymeric.augustin@…>

In db278c3bf9177043c42a9ed8b529a5c117938460:

Fixed #19525 -- Reverted dcd4383107 and 05d333ba3b.

Refs #9893, #18515.

Thanks Russell for the report.

comment:30 Changed 2 years ago by Aymeric Augustin <aymeric.augustin@…>

In 3cb87ec605fb9e7785aa0580bd8220797b622e0c:

[1.5.x] Fixed #19525 -- Reverted dcd4383107 and 05d333ba3b.

Refs #9893, #18515.

Thanks Russell for the report.

Backport of db278c3 from master.

comment:31 Changed 2 years ago by Preston Holmes <preston@…>

In 0e431e5dd7ed19aa2119ceba9ebed050c2988844:

[1.5.x] Fixed #19525 -- Reverted dcd4383107 and 05d333ba3b.

Refs #9893, #18515.

Thanks Russell for the report.

Backport of db278c3 from master.

comment:32 Changed 2 years ago by aaugustin

  • Status changed from new to assigned

comment:33 Changed 2 years ago by aaugustin

  • Owner aaugustin deleted
  • Status changed from assigned to new

I don't know how to solve this properly.

comment:34 Changed 2 years ago by aaugustin

  • Component changed from Core (Other) to File uploads/storage

comment:35 Changed 2 years ago by wdoekes

  • Cc walter+django@… added

comment:36 Changed 2 years ago by wdoekes

So, I ran into bug #13314. These are all similar but not quite the same.

#9893 -- Filename + path length greater than 100 (when making filenames unique)
#10410 -- FileField saves one filename on disk and another on DB
#13314 -- validation does not account for "upload_to" when counting characters

This ticket #9893 is about the changed filenames when making them unique.

Ticket #13314 is just about the upload_to parameter. For that problem the fix could be done like this:

--- django/db/models/fields/files.py.orig	2013-04-01 17:03:59.752332630 +0200
+++ django/db/models/fields/files.py	2013-04-01 17:08:15.833870239 +0200
@@ -290,6 +290,8 @@
         if 'initial' in kwargs:
             defaults['required'] = False
         defaults.update(kwargs)
+        # And deduct the upload_to length from the max_length.
+        defaults['max_length'] -= len(self.upload_to)
         return super(FileField, self).formfield(**defaults)
 
 class ImageFileDescriptor(FileDescriptor):

With this environment:

  • ImageField with max_length default 100
  • An upload_to value of "profile/" (8 characters)

Without the fix, I get a DatabaseError for a 99-byte length filename.
With the fix, I get a nice ValidationError until I reduce the filename to 92 bytes.

That should be at least be one less problem, right?

This was tested with Django 1.3.7.

comment:37 Changed 11 months ago by pavel_shpilev

  • Owner set to pavel_shpilev
  • Status changed from new to assigned

comment:38 Changed 10 months ago by pavel_shpilev

Hey guys
Long time no activity on the ticket...
I have recently run into this bug on production myself. We needed a solution ASAP, so I ended up with an overwrite for get_available_name(), providing max_filename_length as an argument. Pretty much what @akaihola first suggested in https://code.djangoproject.com/ticket/9893#comment:5. Seem to be working fine for us. I can try implement this as a general solution now, if you think this would be a proper fix.
Cheers.

comment:39 Changed 6 months ago by charettes

  • Cc charettes added
  • Needs documentation set
  • Patch needs improvement set

pavel_shpilev put together a PR based on ticket:9893#comment:1 that still requires some adjustments.

comment:40 Changed 5 months ago by collinanderson

  • Patch needs improvement unset

Pavel Shpilev wants another review.

comment:41 Changed 5 months ago by berkerpeksag

  • Needs documentation unset
  • Needs tests unset
  • Triage Stage changed from Accepted to Ready for checkin

PR #3369 LGTM.

comment:42 Changed 4 months ago by timgraham

  • Patch needs improvement set
  • Triage Stage changed from Ready for checkin to Accepted

Comment for improvement is on the PR.

comment:43 Changed 3 months ago by pavel_shpilev

  • Triage Stage changed from Accepted to Ready for checkin

comment:44 Changed 3 months ago by Tim Graham <timograham@…>

  • Resolution set to fixed
  • Status changed from assigned to closed

In a7c256cb5491bf2a77abdff01638239db5bfd9d5:

Fixed #9893 -- Allowed using a field's max_length in the Storage.

Note: See TracTickets for help on using tickets.
Back to Top