Opened 2 years ago

Last modified 2 years ago

#26562 new New feature

Introduce Storage and FileSystemStorage alternate save behaviour

Reported by: Matt C Owned by: nobody
Component: File uploads/storage Version: master
Severity: Normal Keywords: file FileField storage overwrite filepath
Cc: piethon@… Triage Stage: Accepted
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

I have read through many discussions about justifying why uploaded files should not be deleted or overwritten, but I don't agree with them at all (I can get into that if needed) and I think the current implementation of django.core.files.storage.Storage.get_available_name() and django.core.files.storage.FileSystemStorage._save() are quite undesirable.

The behaviour I am referring to is the fact that the Storage.get_available_name() method is so crucial to the whole storage and FileField game and yet its implementation is susceptible to race conditions and that it is the root cause of a dev not being granted total control of the final file/storage path when saving a file using the storage API.

Flow on from that, the fact that the FileSystemStorage._save() method infinitely loops until a file path collision does not occur (to open a file) and then uses low-level locking mechanisms to try and retain ownership of the file for writing.

If the core devs are keen on retaining this behaviour, so be it.
However, can we please build upon Django's reputation of configurability and allow for an alternative (and non-default, to appease backwards compatibility) behaviour where Storage.get_available_name() effectively does nothing (except maybe length truncation) and FileSystemStorage._save() does not have the infinite loop and does not have the low-level file lock.

My proposals for configurability are:

  • Global setting in the settings file
  • Static attribute of the Storage class, which can be overridden

Change History (14)

comment:1 Changed 2 years ago by Aymeric Augustin

Triage Stage: UnreviewedAccepted

I don't see why a new configurability mechanism would be needed. There's already DEFAULT_FILE_STORAGE for this purpose.

Even if it's entirely doable to subclass FileSystemStorage to implement this behavior and configure it in DEFAULT_FILE_STORAGE, I think Django could provide such a storage backend natively e.g. as django.core.files.storage.OverwritingFileSystemStorage. (Better names welcome.)

OverwritingFileSystemStorage would likely be a superclass of FileSystemStorage because it has fewer features.

I'm accepting the ticket but I'd like another core dev to confirm before we move forwards with a patch.

comment:2 Changed 2 years ago by Claude Paroz

+1 from me.

comment:3 Changed 2 years ago by Matt C

A "native" OverwritingFileSystemStorage class would be all well and good for file-system storage backends, but then what about other kinds of backends (e.g. key-value cloud stores like Amazon S3)?

Half of the problem (as far as I see it), would still exist in the Storage.get_available_name() method.

Please refer to #26058 for another issue which relates to my concerns about coupling Storage with FileSystemStorage.

comment:4 Changed 2 years ago by Aymeric Augustin

Ah, indeed, my proposal wouldn't solve the "change the behavior of Storage.get_available_name()" part of the problem.

Perhaps we can make a backwards-incompatible change there and move the "try to find an available name" logic in a mixin? This would make it reasonably easy for projets that extend Storage and depend on that logic to keep it.

comment:5 Changed 2 years ago by Tim Graham

Maybe I'm misunderstanding but I don't think changing the default to overwriting files of the same name is acceptable. Allowing untrusted users (or even trusted users, accidentally) to clobber existing files is dangerous.

This isn't a concern if you're using a callable FileField.upload_to and generating a file name instead of using the one provided by the user, but we can't assume that.

The use case needs a bit more explanation for me to buy into the idea of including another Storage class in Django.

comment:6 Changed 2 years ago by Matt C

Tim, are you referring to changing the default behaviour of Storage?
If so, I never said to make this alternative behaviour default. I suggested the current behaviour be kept default, but just to have this alternative behaviour as a configurable option.

If you are saying this behaviour shouldn't exist in Django at all, I would like justification as to why, please.

Your argument about allowing users to clobber existing files, it doesn't make sense to me.

If you believe you need to be that protective, why does Django offer a delete() method on models?
Why aren't you overly protective of existing database records?
The answer is, because it makes no sense to natively not allow the ability for database records to be deleted.
Why should countless Django client devs have to duplicate the behaviour themselves, just because you think they are too careless to be able to prevent misuse?

That might be a bit of an extreme analogy, but the ability to make raw queries in Django, is less extreme, but still a valid analogy.
Yes it has the potential to be more risky than using model queries in the traditional way, but you still enable the clients to be able to utilise it if they need it (albeit with a lot of warnings).
Why not the same here?

I understand your (and other core devs') reservations, but I don't agree that they are justification for not allowing the behaviour to exist natively, at all.

comment:7 Changed 2 years ago by Aymeric Augustin

I suggested the current behaviour be kept default, but just to have this alternative behaviour as a configurable option.

I don't like the idea of piling up configuration mechanisms. There's already a setting for selecting the storage backend.

comment:8 Changed 2 years ago by Matt C

I don't like the idea of piling up configuration mechanisms.

Fair enough.

There's already a setting for selecting the storage backend.

Yes, I am currently using this to implement the desired behaviour using subclasses, but there are two problems with putting the responsibility on the client:

  • Code has to be duplicated from the FileSystemStorage._save() method, in order to override it with the desired behaviour.
  • These overriding subclasses will be duplicated by every other client who wants the desired behaviour.

How do you propose introducing the behaviour natively as a non-default option, without breaking backwards compability and without a configuration option?

comment:9 Changed 2 years ago by Matt C

Personally, I see the storage API as too coupled to traditional file systems and (along with the FileField implementation) as too protective in an undesirable and unpreditable way.

#26058 is step in the right direction (along with this ticket) in my opinion, but I can see how the storage API and FileField have backed you into a backwards compatibility corner.

I only opened this ticket for the sake of all current and future Django clients, not because I can't achieve the desired behaviour myself.

comment:10 Changed 2 years ago by Tim Graham

The concern about clobbering files is this: if user A uploads a file named "foo.txt" for model instance A, then user B could upload a file of the same name for model instance B and it would overwrite the first file.

Again, I'd ask if you could please elaborate on your use case so we can better understand exactly what problem needs to be solved.

comment:11 in reply to:  8 Changed 2 years ago by Aymeric Augustin

Replying to freshquiz:

Yes, I am currently using this to implement the desired behaviour using subclasses, but there are two problems with putting the responsibility on the client:

We're debating different ideas here, let's not mix them.

I didn't suggest to let each project implement its own storage backend. I originally accepted the ticket and I said it would be useful for Django to provide the behavior you're describing natively.

I also said that this behavior should be selected with the current configuration mechanism, the DEFAULT_FILE_STORAGE setting, not with an additional setting.

In my opinion, it would be a net negative for Django to add an ad-hoc configuration mechanism just for this feature. Irregular configuration options are hard to discover and easy to misconfigure. I'd rather not have the feature requested in this ticket than have it controlled with a new setting. That's my opinion; of course others may disagree :-)

How do you propose introducing the behaviour natively as a non-default option, without breaking backwards compability and without a configuration option?

There's a variety of ways to do this, the easiest being to add a BaseStorage or SimpleStorage or OverwritingStorage or UnsafeStorage class as a superclass of the current Storage class, which would keep its behavior. I'm happy to leave that part of the design to the person who writes the patch.

comment:12 in reply to:  10 Changed 2 years ago by Matt C

Replying to timgraham:

The concern about clobbering files is this: if user A uploads a file named "foo.txt" for model instance A, then user B could upload a file of the same name for model instance B and it would overwrite the first file.

I don't understand why you see that as too dangerous/hard for the client to be able to handle.
Again, I ask why Django provides Model.delete() and raw() if you think the clients are so careless?

Again, I'd ask if you could please elaborate on your use case so we can better understand exactly what problem needs to be solved.

I'm sorry, I missed the part about you asking for my use case.

Yes, I am using a callable upload_to, which does allow me to partially control the storage path/key, but the current behaviour of mangling the final path/key (with random characters) prevents me from having total control.

I use a combination of model name, field name and UUID to generate unique storage paths/keys, via upload_to.
My use case goal is to ensure the following tuple does not result in clobbering/overwriting:
(model_name, field_name, UUID), where the UUID part is just a substitute for a primary key.

I don't take into account file name for uniqueness, but that's just because of my use case.
E.g. If there is a model instance with model ModelABC that has FileField named xyz and gets a UUID of 123, regardless of how many files are uploaded and what their end-user specified names are, there will only ever be one file present in the storage system.

Please let me know if you need more info about my use case.

Maybe this use case is too esoteric to warrant a "native" new feature.
I just thought it would benefit others, given how many discussions and clunky workarounds I came across while trying to cook up a neat workaround/solution for myself.

comment:13 Changed 2 years ago by Tim Graham

My concern about file clobbering was related to Aymeric suggesting making a backwards-incompatible change in comment:4 (again, maybe I mis understood the ramifications of his proposal).

Anyway, I don't mind looking a patch that might make the use case easier. In the interim, it looks like django-cleanup might fit your needs.

comment:14 in reply to:  13 Changed 2 years ago by Matt C

Replying to timgraham:

My concern about file clobbering was related to Aymeric suggesting making a backwards-incompatible change in comment:4 (again, maybe I mis understood the ramifications of his proposal).

I see...

Anyway, I don't mind looking a patch that might make the use case easier. In the interim, it looks like django-cleanup might fit your needs.

That doesn't fit my needs unfortunately, because my use case warrants the preservation of the original end-user specified file name, i.e. the current file name mangling (to avoid overwrite) is undesirable.
Sorry, I forgot to mention this directly, even though I repeatedly said that the current implementation does not ensure the client-dev has total control over the final path/key.

So I guess for me it's two problems:

  • Not having total control over the storage path/key
  • Not being able to overwrite files

...where the latter is caused by the former.

I have a workaround (albeit I'm still finishing it off), which will also take into account the fact that there is no post_save() method on model Field sub-classes (meaning that files get saved into storage potentially before a database save is about to fail; another undesirable trait of the current implementation, to me), but it does not take into consideration backwards compatibility (otherwise I would have submitted a patch as well as the ticket).

Note: See TracTickets for help on using tickets.
Back to Top