Opened 19 months ago

Closed 19 months ago

Last modified 19 months ago

#19021 closed New feature (wontfix)

collectstatic should support checksums as method to determine a file's changed state

Reported by: dloewenherz
Owned by: dloewenherz
Component: Uncategorized
Version: 1.4
Severity: Normal
Keywords:
Cc: django@…
Triage Stage: Unreviewed
Has patch: no
Needs documentation: no
Needs tests: no
Patch needs improvement: no
Easy pickings: no
UI/UX: no

Description

When running ./manage.py collectstatic, all files in all static directories are copied to the location specified by STATICFILES_STORAGE, regardless of whether they have already been copied or not.

I propose that collectstatic should only copy files to the destination if they have changed or don't yet exist. I wrote my own solution which doesn't incorporate staticfiles, but I'd like to see this in Django proper. Without this feature, it can take ages to upload static media for a large project. It makes sense to only update those assets which have changed between deploys.

I currently solve this problem by creating a file containing metadata for all the static media at the root of the destination. This file is a JSON object that contains file paths as keys and checksums as values. When an upload is started, the uploader checks whether the file path exists as a key in the dictionary. If it does, it checks whether the checksum has changed. If it hasn't changed, the uploader skips the file. At the end of the upload, the checksum file is updated on the destination.
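The manifest approach described above can be sketched roughly as follows. This is not the reporter's actual code, which is not shown in the ticket; the function names `file_checksum` and `files_to_upload`, the choice of MD5, and the use of relative POSIX-style paths as keys are all assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def file_checksum(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_to_upload(source_root, manifest):
    """Compare local files against a {relative_path: checksum} manifest.

    Returns (changed, new_manifest): the relative paths that are new or
    whose checksum differs, and the manifest to store after the upload.
    """
    source_root = Path(source_root)
    changed = []
    new_manifest = {}
    for path in sorted(source_root.rglob("*")):
        if not path.is_file():
            continue
        relative = path.relative_to(source_root).as_posix()
        checksum = file_checksum(path)
        new_manifest[relative] = checksum
        # A missing key (new file) and a different checksum both count
        # as "needs upload".
        if manifest.get(relative) != checksum:
            changed.append(relative)
    return changed, new_manifest
```

In this sketch the manifest is a plain dictionary, so it can be written to and read from the destination with `json.dump` and `json.load`. Fetching and storing that file on a particular storage backend is left out.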

Attachments (0)

Change History (7)

comment:1 Changed 19 months ago by anonymous

  • Needs documentation unset
  • Needs tests unset
  • Patch needs improvement unset

This does not sound right. When I run collectstatic it does not copy unmodified files:

$ ./manage.py collectstatic
You have requested to collect static files at the destination
location as specified in your settings file.

This will overwrite existing files.
Are you sure you want to do this?

Type 'yes' to continue, or 'no' to cancel: yes

0 static files copied to '/var/www/html/static' (83 unmodified).

comment:2 Changed 19 months ago by aaugustin

  • Resolution set to needsinfo
  • Status changed from new to closed

I agree with comment 1.

Last edited 19 months ago by aaugustin

comment:3 Changed 19 months ago by dloewenherz

.

Last edited 19 months ago by dloewenherz

comment:4 Changed 19 months ago by streeter

  • Cc django@… added

comment:5 Changed 19 months ago by dloewenherz

  • Resolution needsinfo deleted
  • Status changed from closed to reopened

Weird, I can't reproduce this (initially I commented saying I might have overlooked something).

First of all, when working on a team with multiple people, the current solution doesn't work. Every computer is on its own. Here's a real world scenario:

  1. Person A runs collectstatic. All files are uploaded.
  2. Person B makes a change to one file and runs collectstatic. Again, all files are uploaded.

This is untenable for a team size N > 1.

Secondly, there is a bug in the heuristic used to identify changed files. I just changed a JS file in one of my static folders after running collectstatic earlier. When I run collectstatic again, nothing is re-uploaded, even though the file has changed.

If you look at the source, this is because collectstatic only checks whether a file has changed by checking its path name. The contents of the files themselves are ignored.

Version 0, edited 19 months ago by dloewenherz

comment:6 Changed 19 months ago by dloewenherz

Alright--I'm being educated here...I missed a bunch of things, but there are still a couple of problems. I noted them in an email to the developers list.

The heuristic is not file name. Last modified time is the heuristic, but some backends don't have a reliable implementation of it (or don't support it at all) and therefore this feature doesn't work for those backends.

Additionally, in any sort of source control, when a user updates their repo, local files that were updated remotely show up as modified at the time the repo is cloned or updated, not when the file was actually last saved by the last author. You then have the same scenario I pointed to earlier: when multiple people work on a project, they will re-upload the same files multiple times.

For the reasons noted above, I would propose moving towards checksums and away from last modified times to check if a file has been modified.
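The source-control scenario above can be reproduced in a small local experiment. This is only a sketch: `os.utime` stands in for a checkout that rewrites the file with a newer timestamp, and the helper names `content_digest` and `looks_modified_by_mtime` are hypothetical, not part of collectstatic.

```python
import hashlib
import os
import tempfile


def content_digest(path):
    """Return the hex MD5 digest of a file's contents."""
    with open(path, "rb") as handle:
        return hashlib.md5(handle.read()).hexdigest()


def looks_modified_by_mtime(path, recorded_mtime):
    """Treat a file as modified if its mtime is newer than the recorded one."""
    return os.path.getmtime(path) > recorded_mtime


with tempfile.TemporaryDirectory() as workdir:
    asset = os.path.join(workdir, "app.js")
    with open(asset, "w") as handle:
        handle.write("console.log('hello');\n")

    # State recorded at the time of the previous upload.
    recorded_mtime = os.path.getmtime(asset)
    recorded_digest = content_digest(asset)

    # Simulate a fresh checkout: identical bytes, newer timestamp.
    os.utime(asset, (recorded_mtime + 3600, recorded_mtime + 3600))

    mtime_says_changed = looks_modified_by_mtime(asset, recorded_mtime)
    digest_says_changed = content_digest(asset) != recorded_digest

print("modified according to mtime:", mtime_says_changed)
print("modified according to checksum:", digest_says_changed)
```

The timestamp comparison flags the file as modified although its bytes are identical, while the checksum comparison does not.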

comment:7 Changed 19 months ago by ptone

  • Resolution set to wontfix
  • Status changed from reopened to closed
  • Summary changed from collectstatic should only copy files if they have changed or don't exist in destination to collectstatic should support checksums as method to determine a file's changed state

In an ideal world - checksums would be a perfectly viable way to compare files
for collectstatic. However, practically - they can't be well supported by the
collectstatic API.

While md5 may be somewhat common - it is neither universal nor standard. For
cloud based storage backends to support a comparison metric other than
modification times for use by collectstatic, they would need to provide that
value as a remote/API call. It would do no good if the only way to support
this involved retrieving the remote object to get a hash on it (even if you had
drastically asymmetrical bandwidth - this is just poor design).

Checksums have a compute cost that modification dates don't - so a checksum
comparison would always need to be an alternate, not the primary comparison.
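As a rough illustration of checksums as an alternate rather than the primary comparison, the sketch below consults modification times first and only hashes when they suggest a change. It assumes both files are local paths; the names `md5_of` and `needs_copy` are hypothetical, and a remote backend would have to expose a hash through its own API for the last step to be practical.

```python
import hashlib
import os


def md5_of(path):
    """Return the hex MD5 digest of a local file."""
    with open(path, "rb") as handle:
        return hashlib.md5(handle.read()).hexdigest()


def needs_copy(source, target):
    """Decide whether source should be copied over target.

    Cheap checks run first; the checksum is only computed when the
    modification times alone would trigger a copy.
    """
    if not os.path.exists(target):
        return True
    if os.path.getmtime(target) >= os.path.getmtime(source):
        # Target is at least as new as the source: treat as unmodified.
        return False
    # Source looks newer; confirm with a content comparison.
    return md5_of(source) != md5_of(target)
```

With this ordering, a file whose timestamp was bumped but whose content is unchanged is skipped, at the cost of hashing only the files that look modified. A target that is newer than the source is never hashed, so a content difference in that case goes unnoticed, just as it would with a pure modification-time check.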

Without a good universal way for a range of storage backends to provide some
sort of fingerprint/hash - there is no good way for collectstatic to take
advantage of that approach.

In cases where your modification dates are rendered invalid because of some
specific environment set up (like the git based team issue), there are a couple
workarounds. Perhaps the best is to use collectstatic locally - where the
performance of copying every file isn't as bad, and then use a sync tool
(such as rsync --checksum) or something home built that can do the checksum
based comparison knowing the specific remote storage you are working with in
your project.

A couple other links that might prove useful for those working with git
(provided without endorsement or review)

http://gitorious.org/sstamp

http://toroid.org/ams/etc/git-last-modified

http://repo.or.cz/w/metastore.git

Last edited 19 months ago by ptone
