Opened 11 years ago

Closed 21 hours ago

#23517 closed New feature (wontfix)

Collect static files in parallel

Reported by: thenewguy Owned by: Carles Barrobés i Meix
Component: contrib.staticfiles Version: dev
Severity: Normal Keywords:
Cc: wgordonw1@…, Carles Barrobés i Meix Triage Stage: Unreviewed
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

It would really speed up collectstatic on remote storages to copy files in parallel.

It shouldn't be too complicated to refactor the command to work with multiprocessing.

I am submitting this ticket as a reminder to myself for when I have a free moment. Would this be accepted into Django?

Change History (8)

comment:1 by Aymeric Augustin, 11 years ago

Resolution: needsinfo
Status: new → closed

I'm afraid we'll be reluctant to hardcode concurrent behavior in Django if there's another solution.

You should be able to implement parallel uploads in the storage backend with:

  • a save method that enqueues the operation for processing by a thread pool and returns immediately,
  • a post_process method that waits until the thread pool has completed all uploads.
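
For illustration, a minimal sketch of that pattern (the mixin name, pool size, and return of an empty iterator from post_process are assumptions here, not an existing API):

from concurrent.futures import ThreadPoolExecutor

class ParallelSaveMixin:
    """Queue saves on a thread pool; block in post_process()."""
    max_workers = 10

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._pool = ThreadPoolExecutor(max_workers=self.max_workers)
        self._pending = []

    def _save(self, name, content):
        # Enqueue the upload and return the intended name immediately.
        # Caveat: content may be closed before the worker runs, so a
        # real implementation must keep it readable; it also assumes
        # the backend does not rename on collision.
        self._pending.append(self._pool.submit(super()._save, name, content))
        return name

    def post_process(self, paths, **options):
        # Wait until the pool has completed every queued upload;
        # .result() re-raises any exception from a worker thread.
        for future in self._pending:
            future.result()
        self._pending = []
        return iter(())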

Can you try that approach, and if it doesn't work, reopen this ticket?

Thanks!

comment:2 by thenewguy, 11 years ago

Just wanted to post back on this. I was able to write a quick 20-line proof of concept using the threading module. The speedup was significant, so I figured I would reopen this ticket. I could be wrong, but I imagine something like this would benefit the general Django user base. Granted, I don't know if others get as restless as I do while waiting for static files to upload.

I quickly tested collectstatic with 957 static files. All files are post-processed in some fashion (at least hashed by ManifestFilesMixin), and a gzipped copy is created whenever the saved file benefits from gzip compression. The storage backend stored the files on AWS S3. The AWS S3 console showed 3254 files deleted when I cleaned up after each test, so in total 3254 files were created during collectstatic per case.

The following times come from the command line and should not be read as rigorous benchmarks, but they are good enough to show the significance of the difference.

set startTime=%time%
python manage.py collectstatic --noinput
echo Start Time:  %startTime%
echo Finish Time: %time%

Times (keep in mind collectstatic does not output a count for the gzipped files, so roughly 957*2 more files were uploaded than it reports):

957 static files copied, 957 post-processed.
	async using 100 threads (ParallelUploadStaticS3Storage)
		Start Time:  16:43:57.01
		Finish Time: 16:49:30.31
		Duration: 5.55500 minutes

	sync using regular s3 storage (StaticS3Storage)
		Start Time:  16:19:24.21
		Finish Time: 16:41:46.78
		Duration: 22.3761667 minutes

This storage is derived from ManifestFilesMixin and from a subclass of S3BotoStorage (django-storages) that creates gzipped copies and checks for file changes before saving, to keep modification dates reliable:

import threading

from django.core.files.base import ContentFile


class ParallelUploadStaticS3Storage(StaticS3Storage):
    """
    THIS STORAGE ASSUMES THAT UPLOADS ONLY OCCUR
    FROM CALLS TO THE COLLECTSTATIC MANAGEMENT
    COMMAND. SAVING TO THIS STORAGE DIRECTLY IS
    NOT RECOMMENDED BECAUSE THE UPLOAD THREADS
    ARE NOT JOINED UNTIL POST_PROCESS IS CALLED.
    """

    active_uploads = []
    thread_count = 100

    def remove_completed_uploads(self):
        # Iterate in reverse so deleting by index stays valid.
        for i, thread in reversed(list(enumerate(self.active_uploads))):
            if not thread.is_alive():
                del self.active_uploads[i]

    def _save_content(self, key, content, **kwargs):
        # Throttle: busy-wait until a worker slot frees up.
        while self.thread_count < len(self.active_uploads):
            self.remove_completed_uploads()

        # Copy the file into memory for the moment to get around
        # file-closed errors -- BAD HACK, FIXME.
        content = ContentFile(content.read(), name=content.name)

        f = super(ParallelUploadStaticS3Storage, self)._save_content
        thread = threading.Thread(target=f, args=(key, content), kwargs=kwargs)

        self.active_uploads.append(thread)
        thread.start()

    def post_process(self, *args, **kwargs):
        # Perform the usual post-processing (hashing, etc.).
        for post_processed in super(ParallelUploadStaticS3Storage, self).post_process(*args, **kwargs):
            yield post_processed

        # Wait for the remaining uploads to finish.
        print("Post processing completed. Now waiting for the remaining uploads to finish.")
        for thread in self.active_uploads:
            thread.join()

comment:3 by thenewguy, 11 years ago

Resolution: needsinfo deleted
Status: closed → new

comment:4 by thenewguy, 11 years ago

Cc: wgordonw1@… added

comment:5 by Tim Graham, 11 years ago

Resolution: wontfix
Status: new → closed

I think Aymeric was trying to say that if Django has sufficient hooks for users to implement this on their own, then that's enough. Maybe StaticS3Storage would like to include this in its code, but it's not obvious to me that we should include this in Django itself.

comment:6 by Carles Barrobés i Meix, 5 days ago

Cc: Carles Barrobés i Meix added
Has patch: set
Resolution: wontfix deleted
Status: closed → new

Discussed and worked on this during the Django on the Med event (https://djangomed.eu/).

One conclusion was that, although the current hooks do make it possible to implement this at the storage backend, it is a non-trivial endeavour that every backend would have to repeat. It can instead be solved relatively simply within the collectstatic command, in a way that supports any existing and future backend.

This PR shows one implementation based on a thread pool: https://github.com/django/django/pull/19935
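
Roughly, the idea is for the command itself to fan copies out to a pool. A sketch of the approach (not the PR's actual code; found_files and copy_file mirror collectstatic's internals, and the helper name and worker count are illustrative):

from concurrent.futures import ThreadPoolExecutor

def copy_in_parallel(command, found_files, max_workers=8):
    # found_files maps prefixed_path -> (source_storage, path), as
    # built by collectstatic; copy_file() is the existing per-file
    # handler on the command.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(command.copy_file, path, prefixed_path, source_storage)
            for prefixed_path, (source_storage, path) in found_files.items()
        ]
        for future in futures:
            future.result()  # surface any worker exception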


comment:7 by Jacob Walls, 26 hours ago

Owner: changed from nobody to Carles Barrobés i Meix
Status: new → assigned
Triage Stage: Unreviewed → Someday/Maybe
Type: Uncategorized → Cleanup/optimization
Version: 1.7 → dev

This could be worth adding so that any backend can benefit, but there is a concern on the PR about race conditions. Carles responds:

I had a go at adding workers to django-manifeststaticfiles-enhanced and ended up with an approach of finding the files first and then processing them with workers.
It required more code changes: I had to make more use of locks to control access to the shared state, and I had to figure out how to handle thread-safety issues with umask when making directories. Have a look and see if it has any value.

If we can advance the proof of concept so that it does not have race conditions and doesn't come with a high complexity cost, I think we can move this out of Maybe status.
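
For context on the umask point: os.umask() is process-wide state, so concurrent workers that create directories with explicit permissions have to serialize around it. An illustrative guard (an assumption about one way to handle it, not code from the linked project):

import os
import threading

_umask_lock = threading.Lock()

def makedirs_with_mode(path, mode):
    # os.umask() affects the whole process, so hold a lock while
    # clearing it to make the explicit mode apply exactly.
    with _umask_lock:
        old_umask = os.umask(0)
        try:
            os.makedirs(path, mode, exist_ok=True)
        finally:
            os.umask(old_umask)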

comment:8 by Natalia Bidart, 21 hours ago

Resolution: wontfix
Status: assigned → closed
Triage Stage: Someday/Maybe → Unreviewed
Type: Cleanup/optimization → New feature

I think this change should go through the new-feature process. While it improves performance, it would also change how collectstatic behaves, from sequential to concurrent execution. This can affect execution order, error timing, and resource usage, and may introduce thread-safety or race-condition issues for third-party or custom storage backends (of which there are plenty).

Since it revisits a prior design decision (Django intentionally left concurrency to backends) and could require new configuration options (worker count, failure handling, opt-in), it is more than a simple optimization to me. While I really appreciate the discussions around this ticket during the Django on the Med sprint, I think it should be treated as a new feature, so could you please open an issue for it on the new-feature tracker? If accepted, we can reopen this ticket.
