Code

Opened 5 years ago

Closed 4 years ago

Last modified 19 months ago

#11663 closed Uncategorized (wontfix)

Delete orphaned replaced files

Reported by: SmileyChris Owned by: SmileyChris
Component: File uploads/storage Version: master
Severity: Normal Keywords:
Cc: joh Triage Stage: Design decision needed
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

When using a model instance with a FileField (or subclass), you can update the image by setting a new file as an instance matching the field. This leaves the potential of an orphaned file.

A common case is that you have a ModelForm (bound to an existing instance) for which you use to upload a new file:

  • The form is saved in your view which overwrites the linked FieldFile with a new one (which will be based on the data from the UploadedFile).
  • Now there is no reference to the old file, so even custom storage can't touch it (unless you use an upload_to callable which names it the same and your custom storage is changed to dangerously blindly overwrite files with the same name).

Note, the delete orphan logic already exists - FileField attaches a post_delete signal to it's Model class which does this.

This is obviously a change in functionality. To preserve backwards compatibility, I recommend that this is a new argument on FileField (off by default).

Attachments (3)

11663.diff (19.6 KB) - added by SmileyChris 5 years ago.
Patch with comprehensive tests
11663.2.diff (21.0 KB) - added by SmileyChris 5 years ago.
11663.3.diff (23.1 KB) - added by SmileyChris 4 years ago.
With some simple docs (and updated to r13350)

Download all attachments as: .zip

Change History (25)

Changed 5 years ago by SmileyChris

Patch with comprehensive tests

comment:1 Changed 5 years ago by SmileyChris

  • Needs documentation set
  • Needs tests unset
  • Owner changed from nobody to SmileyChris
  • Patch needs improvement unset
  • Status changed from new to assigned

comment:2 Changed 5 years ago by Gulopine

On the two prior tickets where Chris has been pushing this, there were notes indicating that this behavior was easy to achieve with the file-storage refactor that went into Django 1.0. Unfortunately, what I had been discussing was simply the ability to overwrite existing files easily enough. What Chris -- and many others -- are asking for here is the ability to make that decision on a per-instance basis, which currently isn't easy.

To clarify the comment about "dangerously" blindly overwriting files, Chris is referring to the situation where one user uploads a file and another uploads another with the same name, overwriting the first user's file. That was my suggestion in the past, and I do have to agree with Chris that it was a naive approach that has some pretty significant downsides.

comment:3 Changed 5 years ago by mtredinnick

I know naming things is hard in computer science, but come on... safe_delete_replaced?! Implies there is also dangerous_delete_replaced. All the usages of safe_ prefixes in this patch make my brain hurt. Either we're deleting or we're not. It's neither safe nor dangerous, so let's try to avoid confusing things with non-useful information in the parameter name.

Why does the save() method needs an extra parameter? Deleting is a property of the field itself (if you only want to delete sometimes, it's pretty easy already to override the save() method). I'm still fairly under-enthused by this idea, but at least let's make the patch correct so that we can see what's being evaluated.

Changed 5 years ago by SmileyChris

comment:4 Changed 5 years ago by SmileyChris

Thanks for your initial comments, Malcolm. I've uploaded a new version addressing these.

  • "safe" implied only deleting orphans. But you are correct, it wasn't a very good term.
  • This went through several iterations - save() doesn't need an extra property now and this has been removed.

I'm a bit saddened as to lack of enthusiasm for inclusion.

comment:5 Changed 5 years ago by Alex

  • Triage Stage changed from Unreviewed to Design decision needed

comment:6 Changed 5 years ago by olau

Just two quick comments from a luser: I fail to see the rationale in leaving orphaned files around, why would you make this configurable and turned off instead of just assuming it to be default behaviour? (On a similar note, I fail to see how two different objects can point to the same file unless someone has hacked the field paths, in which case they asked for it. So the database lookup appears unnecessary. But I guess it's just a carry over. By the way, does this field install an index? Otherwise, what happens when you try to delete user profiles with images on a site with 1.000.000 users?)

The other comment is that this still doesn't not fix issue 4339 since there is no easy knob for telling Django "this filename, and I mean it, no underscores" for fields where you do not use the original filename but auto-generate one (say based on DB id).

But it's great to see you're working on it. :)

comment:7 Changed 5 years ago by SmileyChris

It is currently optional because it is a change in functionality. Currently, files aren't touched if the model is updated (they are already removed if the model is deleted). A site may be relying on this behaviour. I didn't want to rock the boat more than possible.

But let's leave design discussion on the Django Developers google group. Add a comment here with a link to discussion thread.

Regarding #4339, if that's what it was after, then it's a straight dupe of the original meaning of #2983, which was closed as a wontfix (as this is possible by using custom FileStorage). Again, if you contest this then bring it up in the google group.

comment:9 Changed 4 years ago by joh

  • Cc joh added

Is there any progress on this ticket?

I was just hit by the behavior of FileField, and did notice it only notice it by accident. So if there are good reasons not to change the API, it may be a good thing to add a note or a warning note to the documentation of FileField, so users of FileField get a real chance to notice this behavior.

This just would have saved me some hours of googling around, so I am willing to prepare a patch for the docs if it is appreciated, this patch should be easier to integrate (no side effects in functionality and no API change) until a design decision will be done in the far future. ;-)

comment:10 follow-up: Changed 4 years ago by metamemetics <dholl57@…>

Hello this problem was bothering me A LOT.
I just added a custom save method for my model and it seems to be working fine though:

def save(self, *args, **kwargs):
    try:
        this = MyModelName.objects.get(id=self.id)
        if this.MyImageFieldName != self.MyImageFieldName:
            this.MyImageFieldName.delete()
    except: pass
    super(MyModelName, self).save(*args, **kwargs)

Hopefully this helps anyone looking for a quick work around, let me know if it doesn't work.

Changed 4 years ago by SmileyChris

With some simple docs (and updated to r13350)

comment:11 Changed 4 years ago by SmileyChris

  • Needs documentation unset

comment:12 Changed 4 years ago by dogada

IMO when you just need to update existing file, Django should be able to do this (with a flag like reuse_file=True) instead of creating new and then deleting old file. To achieve this I use following workaround for unfortunately missed base functionality.

import os
from base64 import b32encode
from django.db.models import FileField
from django.db.models.fields.files import FileDescriptor

def generate_random_filename(request_filename, default_ext=''):
    ext = os.path.splitext(request_filename)[1] or default_ext
    # base32 produces url-friendly strings, but shorter than hex.                                                              
    rnd_string = b32encode(os.urandom(settings.SAFE_IMAGE_BYTE_LENGTH))
    filename = rnd_string.lower() + ext
    return filename

def _initial_field_name(field_name):
    return '_initial_%s_name' % field_name

class StableFileDescriptor(FileDescriptor):
    """                                                                                                                        
    The descriptor allows to remember original filename of a FileField.                                                      
    """

    def __set__(self, instance, value):
        if isinstance(value, basestring): # filename of current file loaded from database                                      
            instance.__dict__[_initial_field_name(self.field.name)] = value
        super(StableFileDescriptor, self).__set__(instance, value)


class StableFileField(FileField):
    descriptor_class = StableFileDescriptor

    def generate_filename(self, instance, request_filename):
        filename = getattr(instance, _initial_field_name(self.name), None)
        if not filename:
            filename = generate_random_filename(request_filename)
        return super(StableFileField, self).generate_filename(instance, filename)

comment:13 Changed 4 years ago by SmileyChris

  • Resolution set to wontfix
  • Status changed from assigned to closed

Talking with malcolm, this has about a 0.01% chance of making it into django. On that note, I'm wontfixing the issue.

Oh well, I did learn a lot about the internals :)

comment:14 Changed 4 years ago by mtredinnick

For the record (since I just got a Trac collision whilst also wontfix'ing this), my logic is: Django shouldn't be in the business of deleting files, which is more of a system-level task. We have no way of knowing what else on the system is using those files. If we *don't* delete a file and the user wants to remove it, they can use the os module. If we *do* delete a file and that wasn't the right thing (from the point of view of something else on the system), the user is screwed. We cannot guarantee that removal is correct, so we shouldn't do it.

Writing a utility function to check for orphaned files and clean them up is a simple enough operation. Whether to include it in Django or not is debateable (it's kind of like the cleanup sessions management command, I guess) and the topic for another ticket. Doing it automatically feels too dangerous to be correct at any time.

comment:15 Changed 3 years ago by anonymous

People who need the cleanup command that Malcolm mentions will perhaps find https://github.com/mrts/django-commands/blob/devel/django_commands/management/commands/file_cleanup.py useful.

comment:16 in reply to: ↑ 10 Changed 3 years ago by webinvent

  • Easy pickings unset
  • Severity set to Normal
  • Type set to Uncategorized
  • UI/UX unset

Hi,
The solution by metamemetics works perfectly for me!
Thank you

Last edited 3 years ago by webinvent (previous) (diff)

comment:17 Changed 3 years ago by banerjs.sid

Having recently had this problem on my application, I will risk the wrath of the community in adding onto an already over-discussed, and perhaps a dead thread.

Malcolm's concerns regarding the deletion of files is valid, and honestly, I would think twice about deleting any file from my server (I am speaking from personal experience). However that said, I think that it is important to realize that more often than not, the files that are deleted are unimportant and a useless waste of disk space that many of us are paying for.

I think that as "The Web framework for perfectionists with deadlines" it is important that Django makes the development process quicker by allowing us developers to simply toggle one or more field options for file and image uploads rather than having us set up cron jobs or the like to deal with the orphaned files. It may be simple, but it takes time, and makes the process much more tedious.

Also to lay some of other Malcolm's fears to rest, I do not think that any developer worth any amount of programming talent would blame Django if an important file was deleted. The blame will lay solely on the developers inability to keep mission critical files separate from not as important ones.

Please do consider applying this as a fix to a future Django release.

comment:18 Changed 3 years ago by DanielClemente

Django cannot know whether the file is used somewhere else, but sometimes *the developer* does know this for certain and wants to delete old files. There should be an easy way to do it (say, by toggling a parameter after reading the documentation for some seconds) because there are enough interested users who are currently spending hours with this issue.

The proposed half-baked workarounds are not trivial enough; none worked for me at the first try.
I propose having FileField(upload_to='sw',replace=True), where replace=False would be the default.

comment:19 Changed 2 years ago by anonymous

+1 for beeing able to do something like this:

FileField(upload_to='sw',replace=True)

@DanielClemente:
The only "quick and dirty workaround" i found for this is: (see #4339 )

from django.core.files.storage import FileSystemStorage
class OverwriteStorage(FileSystemStorage):
    def _save(self, name, content):
        if self.exists(name):
            self.delete(name)
        return super(OverwriteStorage, self)._save(name, content)
    def get_available_name(self, name):
        return name

comment:20 Changed 2 years ago by olau

I hate to add to the comment spam, but really, people, seeing that this is WONTFIX and that 1.3 has removed feature that a file field will delete it's associated file when the object is deleted, it's clear that Malcolm's idea of never touching the file system has won this battle.

As much as you may disagree with this, please don't comment here. It's useless and will just spam all of us. If you want to do something about this, maybe use your energy to code a new file field that does in fact clean up?

Regarding the file_cleanup.py command link posted above: there's a possible race condition there if a file ends up on disk before the corresponding object is saved in the database. If you want to make a non-racy garbage collector for this, you need to be really careful.

comment:21 Changed 2 years ago by anonymous

There is a package for this.

http://pypi.python.org/pypi/django-orphaned

comment:22 Changed 19 months ago by anonymous

why not to compare files also by CRC together with assesing filename existence and returning name from _save is both are the same?

Add Comment

Modify Ticket

Change Properties
<Author field>
Action
as closed
as The resolution will be set. Next status will be 'closed'
The resolution will be deleted. Next status will be 'new'
Author


E-mail address and user name can be saved in the Preferences.

 
Note: See TracTickets for help on using tickets.