#19021 closed New feature (wontfix)
collectstatic should support checksums as method to determine a file's changed state
Reported by: | Dan Loewenherz | Owned by: | Dan Loewenherz |
---|---|---|---|
Component: | Uncategorized | Version: | 1.4 |
Severity: | Normal | Keywords: | |
Cc: | django@… | Triage Stage: | Unreviewed |
Has patch: | no | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
When running ./manage.py collectstatic, all files in all static directories are copied to the location specified by STATICFILES_STORAGE, regardless of whether they have already been copied or not.
I propose that collectstatic should only copy files to the destination if they have changed or don't yet exist. I wrote my own solution which doesn't incorporate staticfiles, but I'd like to see this in Django proper. Without this feature, it can take ages to upload static media for a large project. It makes sense to only update those assets which have changed between deploys.
I currently solve this problem by creating a file containing metadata of all the static media at the root of the destination. This file is a JSON object that contains file paths as keys and checksums as values. When an upload is started, the uploader checks to see if the file path exists as a key in the dictionary. If it does, it compares the stored checksum with the checksum of the local file. If they match, the uploader skips the file. At the end of the upload, the checksum file is updated on the destination.
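A minimal sketch of the manifest approach described above. This is not the reporter's actual code and not part of staticfiles; `sync_with_manifest` and the `upload` callable are hypothetical names, and SHA-256 is an assumed choice of checksum.

```python
import hashlib
from pathlib import Path


def file_checksum(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def sync_with_manifest(source_dir, manifest, upload):
    """Upload only new or changed files and return the updated manifest.

    manifest maps relative paths to the checksums recorded by the last run.
    upload is a hypothetical callable taking (relative_path, local_path).
    """
    source_dir = Path(source_dir)
    updated = {}
    for local_path in sorted(source_dir.rglob("*")):
        if not local_path.is_file():
            continue
        relative = local_path.relative_to(source_dir).as_posix()
        checksum = file_checksum(local_path)
        # Skip the upload when the recorded checksum matches the local file.
        if manifest.get(relative) != checksum:
            upload(relative, local_path)
        updated[relative] = checksum
    return updated
```

The returned dictionary is what would be serialized (for example with `json.dumps`) and written back to the root of the destination once the upload finishes.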
Change History (8)
comment:1 by , 12 years ago
comment:2 by , 12 years ago
Resolution: | → needsinfo |
---|---|
Status: | new → closed |
I agree with comment #1.
comment:4 by , 12 years ago
Cc: | added |
---|
comment:5 by , 12 years ago
Resolution: | needsinfo |
---|---|
Status: | closed → reopened |
Weird, I can't reproduce this (initially I commented saying I might have overlooked something).
First of all, when working on a team with multiple people, the current solution doesn't work. Every computer is on its own. Here's a real world scenario:
- Person A runs collectstatic. All files are uploaded.
- Person B makes a change to one file and runs collectstatic. Again, all files are uploaded.
This is untenable for a team size N > 1.
Secondly, there is a bug in the heuristic used to identify changed files. I just changed a js file in one of my static folders after running collectstatic earlier. When I run collectstatic again, nothing is re-uploaded, even though the file has changed.
If you look at the source, this is because collectstatic only checks if a file has changed by checking its path name. The contents of the files themselves are ignored.
comment:6 by , 12 years ago
Alright--I'm being educated here...I missed a bunch of things, but there are still a couple of problems. I noted them in an email to the developers list.
The heuristic is not file name. Last modified time is the heuristic, but some backends don't have a reliable implementation of it (or don't support it at all) and therefore this feature doesn't work for those backends.
Additionally, in any sort of source control, when a user updates their repo, local files that were updated remotely show up as modified at the time the repo is cloned or updated, not when the file was actually last saved by the last author. You then have the same scenario I pointed to earlier: when multiple people work on a project, they will re-upload the same files multiple times.
For the reasons noted above, I would propose moving towards checksums and away from last modified times to check if a file has been modified.
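To illustrate the source control point, here is a small sketch using only the standard library. It simulates a checkout by bumping a file's modification time without touching its content; `describe_change` is a hypothetical helper, and SHA-256 is an assumed choice of checksum.

```python
import hashlib
import os


def content_checksum(path):
    """Return the SHA-256 hex digest of the file's content."""
    with open(path, "rb") as handle:
        return hashlib.sha256(handle.read()).hexdigest()


def describe_change(path, touch_seconds=3600):
    """Bump mtime without changing content, as a checkout or pull might.

    Returns a tuple (mtime_changed, checksum_changed).
    """
    before_mtime = os.path.getmtime(path)
    before_sum = content_checksum(path)
    new_time = before_mtime + touch_seconds
    # Only the timestamps are rewritten; the bytes on disk stay the same.
    os.utime(path, (new_time, new_time))
    mtime_changed = os.path.getmtime(path) != before_mtime
    checksum_changed = content_checksum(path) != before_sum
    return mtime_changed, checksum_changed
```

A comparison based on modification time would treat such a file as changed, while a comparison based on checksums would not.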
comment:7 by , 12 years ago
Resolution: | → wontfix |
---|---|
Status: | reopened → closed |
Summary: | collectstatic should only copy files if they have changed or don't exist in destination → collectstatic should support checksums as method to determine a file's changed state |
In an ideal world - checksums would be a perfectly viable way to compare files
for collectstatic. However, practically - they can't be well supported by the
collectstatic API.
While md5 may be somewhat common - it is neither universal nor standard. For
cloud based storage backends to support a comparison metric other than
modification times for use by collectstatic, they would need to provide that
value as a remote/api call. It would do no good if the only way to support
this involved retrieving the remote object to get a hash on it (even if you had
drastically asymmetrical bandwidth - this is just poor design).
Checksums have a compute cost that modification dates don't - so a checksum
comparison would always need to be an alternate, not the primary comparison.
Without a good universal way for a range of storage backends to provide some
sort of fingerprint/hash - there is no good way for collectstatic to take
advantage of that approach.
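
As a rough sketch of what "alternate, not primary" could look like when both sides are local paths: try the cheap checks first and hash only when they are inconclusive. `needs_copy` is a hypothetical helper, not the collectstatic API, and it assumes the target can be read cheaply, which is exactly what a remote backend may not offer.

```python
import hashlib
import os


def sha256_of(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_copy(source, target):
    """Decide whether source should be copied over target."""
    if not os.path.exists(target):
        return True
    # Cheap check: a different size always means different content.
    if os.path.getsize(source) != os.path.getsize(target):
        return True
    # Cheap check: a target at least as new as the source is treated as current.
    if os.path.getmtime(target) >= os.path.getmtime(source):
        return False
    # Inconclusive: the source looks newer, so pay the compute cost of hashing.
    return sha256_of(source) != sha256_of(target)
```
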
In cases where your modification dates are rendered invalid because of some
specific environment set up (like the git based team issue), there are a couple
workarounds. Perhaps the best is to use collectstatic locally - where the
performance of copying every file isn't as bad, and then use a sync tool
(such as rsync --checksum) or something home built that can do the checksum
based comparison knowing the specific remote storage you are working with in
your project.
A couple other links that might prove useful for those working with git
(provided without endorsement or review)
This does not sound right. When I run collectstatic it does not copy unmodified files: