Opened 7 years ago

Closed 7 years ago

#28055 closed Cleanup/optimization (wontfix)

Staticfiles HashedFilesMixin postprocess optimization

Reported by: Konrad Lisiczyński Owned by: Konrad Lisiczyński
Component: contrib.staticfiles Version: 1.11
Severity: Normal Keywords: staticfiles HashedFilesMixin post_process
Cc: David Sanders Triage Stage: Unreviewed
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description (last modified by Konrad Lisiczyński)

Django 1.11 refactored the way how collectstatic postprocessing is handled to calculate content based hashes and saving those files.
Good point of this refactoring is, than now handling of nested css is done correctly, for example here, in style.css:

@import 'substyles.css';

body {
  background-color: green;
}

when content of substyles.css in changed, the hash of substyles.css will also change, which at the same time will change the content of style.css (updated for example to @import 'substyles.some-hash.css';) and now Django will take into consideration by upgrading hash of style.css as well. This is good, and in previous versions of Django hash of style.css wouldnt be updated.

However, the problem is, that HashedFilesMixin.postprocess is written in such a way, that adjustable_paths, so with the default implementation - css files - will be always saved 4 times! Below snippet taken from django source code explains this:

## Snippet from HashedFilesMixin.post_process

# Do a single pass first. Post-process all files once, then repeat for
# adjustable files.
for name, hashed_name, processed, _ in self._post_process(paths, adjustable_paths, hashed_files):
    yield name, hashed_name, processed

paths = {path: paths[path] for path in adjustable_paths}
print('adjusted paths are', paths)

for i in range(self.max_post_process_passes):
    substitutions = False
    for name, hashed_name, processed, subst in self._post_process(paths, adjustable_paths, hashed_files):


## Snippet from HashedFilesMixin._post_process

if name in adjustable_paths:
    old_hashed_name = hashed_name
    content = original_file.read().decode(settings.FILE_CHARSET)
    for extension, patterns in iteritems(self._patterns):
        if matches_patterns(path, (extension,)):
            for pattern, template in patterns:
                converter = self.url_converter(name, hashed_files, template)
                try:
                    content = pattern.sub(converter, content)
                except ValueError as exc:
                    yield name, None, exc, False
    if hashed_file_exists:
        self.delete(hashed_name)
    # then save the processed result
    content_file = ContentFile(force_bytes(content))
    # Save intermediate file for reference
    saved_name = self._save(hashed_name, content_file)
    hashed_name = self.hashed_name(name, content_file)

    if self.exists(hashed_name):
        self.delete(hashed_name)

    saved_name = self._save(hashed_name, content_file)
    hashed_name = force_text(self.clean_name(saved_name))
    # If the file hash stayed the same, this file didn't change
    if old_hashed_name == hashed_name:
        substitutions = False
    processed = True

After looking at the above code carefully, you will see that all css will be saved 4 times, even when there is no point doing so in the first place, like for example when there is no @import statement in your css. I am also not sure why _save method is called twice in _postprocess.

Generally it would't be maybe extremely big problem for staticstorage which extends local file storage, but for remote storages like AWS this is unacceptable in my opinion. It would be great to change postprocessing algorythm to reduce number of save calls to minimum.

Change History (10)

comment:1 by Konrad Lisiczyński, 7 years ago

Description: modified (diff)

comment:2 by Tim Graham, 7 years ago

Triage Stage: UnreviewedAccepted

If you can propose a patch, optimizations would certainly be welcome.

comment:3 by Tim Graham, 7 years ago

Cc: David Sanders added

David, maybe you could provide some guidance or explanations here.

comment:4 by David Sanders, 7 years ago

There definitely appears to be a duplicate delete and save around line 304, I'm not sure how that slipped through code review. Must have been introduced during some refactoring of the code.

It should be possible to clean up that section and trim the extra delete/save. Might be possible to also drop a save if the hash hasn't changed as well since the hash can be calculated from the ContentFile without saving it to disk. Since the tests now expect the correct hashes (and have a few more cases) as long as tests pass it should be fairly safe. I don't have the time to look at it deeply, however.

comment:5 by Konrad Lisiczyński, 7 years ago

Triage Stage: AcceptedUnreviewed

Thanks for your suggestions, David. I will try to provide a patch today.

comment:6 by Konrad Lisiczyński, 7 years ago

Owner: changed from nobody to Konrad Lisiczyński
Status: newassigned

comment:7 by Konrad Lisiczyński, 7 years ago

I tried to comment line 304: saved_name = self._save(hashed_name, content_file).

Unfortunately it is not so easy, as several tests fail now:

ERROR: test_cache_invalidation (staticfiles_tests.test_storage.TestCollectionCachedStorage)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python3.4/unittest/case.py", line 58, in testPartExecutor
    yield
  File "/usr/lib/python3.4/unittest/case.py", line 577, in run
    testMethod()
  File "/home/konrad/PycharmProjects/django/tests/staticfiles_tests/test_storage.py", line 252, in test_cache_invalidation
    self.assertEqual(self.hashed_file_path(name), hashed_name)
  File "/home/konrad/PycharmProjects/django/tests/staticfiles_tests/test_storage.py", line 21, in hashed_file_path
    fullpath = test.render_template(test.static_template_snippet(path))
  File "/home/konrad/PycharmProjects/django/tests/staticfiles_tests/cases.py", line 33, in render_template
    return template.render(Context(**kwargs)).strip()
  File "/home/konrad/PycharmProjects/django/django/template/base.py", line 176, in render
    return self._render(context)
  File "/home/konrad/PycharmProjects/django/django/test/utils.py", line 101, in instrumented_test_render
    return self.nodelist.render(context)
  File "/home/konrad/PycharmProjects/django/django/template/base.py", line 944, in render
    bit = node.render_annotated(context)
  File "/home/konrad/PycharmProjects/django/django/template/base.py", line 911, in render_annotated
    return self.render(context)
  File "/home/konrad/PycharmProjects/django/django/templatetags/static.py", line 106, in render
    url = self.url(context)
  File "/home/konrad/PycharmProjects/django/django/templatetags/static.py", line 103, in url
    return self.handle_simple(path)
  File "/home/konrad/PycharmProjects/django/django/templatetags/static.py", line 118, in handle_simple
    return staticfiles_storage.url(path)
  File "/home/konrad/PycharmProjects/django/django/contrib/staticfiles/storage.py", line 155, in url
    return self._url(self.stored_name, name, force)
  File "/home/konrad/PycharmProjects/django/django/contrib/staticfiles/storage.py", line 134, in _url
    hashed_name = hashed_name_func(*args)
  File "/home/konrad/PycharmProjects/django/django/contrib/staticfiles/storage.py", line 364, in stored_name
    self.hashed_name(name, content=None, filename=intermediate_name)
  File "/home/konrad/PycharmProjects/django/django/contrib/staticfiles/storage.py", line 94, in hashed_name
    raise ValueError("The file '%s' could not be found with %r." % (filename, self))
ValueError: The file 'cached/styles.bb84a0240107.css' could not be found with <django.contrib.staticfiles.storage.CachedStaticFilesStorage object at 0x7f6b90e00978>.

======================================================================
ERROR: test_path_with_querystring (staticfiles_tests.test_storage.TestCollectionCachedStorage)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python3.4/unittest/case.py", line 58, in testPartExecutor
    yield
  File "/usr/lib/python3.4/unittest/case.py", line 577, in run
    testMethod()
  File "/home/konrad/PycharmProjects/django/tests/staticfiles_tests/test_storage.py", line 79, in test_path_with_querystring
    relpath = self.hashed_file_path("cached/styles.css?spam=eggs")
  File "/home/konrad/PycharmProjects/django/tests/staticfiles_tests/test_storage.py", line 21, in hashed_file_path
    fullpath = test.render_template(test.static_template_snippet(path))
  File "/home/konrad/PycharmProjects/django/tests/staticfiles_tests/cases.py", line 33, in render_template
    return template.render(Context(**kwargs)).strip()
  File "/home/konrad/PycharmProjects/django/django/template/base.py", line 176, in render
    return self._render(context)
  File "/home/konrad/PycharmProjects/django/django/test/utils.py", line 101, in instrumented_test_render
    return self.nodelist.render(context)
  File "/home/konrad/PycharmProjects/django/django/template/base.py", line 944, in render
    bit = node.render_annotated(context)
  File "/home/konrad/PycharmProjects/django/django/template/base.py", line 911, in render_annotated
    return self.render(context)
  File "/home/konrad/PycharmProjects/django/django/templatetags/static.py", line 106, in render
    url = self.url(context)
  File "/home/konrad/PycharmProjects/django/django/templatetags/static.py", line 103, in url
    return self.handle_simple(path)
  File "/home/konrad/PycharmProjects/django/django/templatetags/static.py", line 118, in handle_simple
    return staticfiles_storage.url(path)
  File "/home/konrad/PycharmProjects/django/django/contrib/staticfiles/storage.py", line 155, in url
    return self._url(self.stored_name, name, force)
  File "/home/konrad/PycharmProjects/django/django/contrib/staticfiles/storage.py", line 134, in _url
    hashed_name = hashed_name_func(*args)
  File "/home/konrad/PycharmProjects/django/django/contrib/staticfiles/storage.py", line 364, in stored_name
    self.hashed_name(name, content=None, filename=intermediate_name)
  File "/home/konrad/PycharmProjects/django/django/contrib/staticfiles/storage.py", line 94, in hashed_name
    raise ValueError("The file '%s' could not be found with %r." % (filename, self))
ValueError: The file 'cached/styles.bb84a0240107.css' could not be found with <django.contrib.staticfiles.storage.CachedStaticFilesStorage object at 0x7f6b910014e0>.

======================================================================
FAIL: test_corrupt_intermediate_files (staticfiles_tests.test_storage.TestCollectionCachedStorage)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python3.4/unittest/case.py", line 58, in testPartExecutor
    yield
  File "/usr/lib/python3.4/unittest/case.py", line 577, in run
    testMethod()
  File "/home/konrad/PycharmProjects/django/tests/staticfiles_tests/test_storage.py", line 298, in test_corrupt_intermediate_files
    self.hashed_file_path('cached/styles.css')
  File "/usr/lib/python3.4/contextlib.py", line 77, in __exit__
    self.gen.throw(type, value, traceback)
  File "/home/konrad/PycharmProjects/django/django/test/testcases.py", line 591, in _assert_raises_message_cm
    self.assertIn(expected_message, str(cm.exception))
  File "/usr/lib/python3.4/unittest/case.py", line 1056, in assertIn
    self.fail(self._formatMessage(msg, standardMsg))
  File "/usr/lib/python3.4/unittest/case.py", line 642, in fail
    raise self.failureException(msg)
AssertionError: "The name 'cached/styles.css' could not be hashed with <django.contrib.staticfiles.storage.CachedStaticFilesStorage object at 0x7f6b90a03fd0>." not found in "The file 'cached
/styles.bb84a0240107.css' could not be found with <django.contrib.staticfiles.storage.CachedStaticFilesStorage object at 0x7f6b90a03fd0>."

This implies, that this line is needed and it is not some kind of duplicate after all. Maybe we could contact the author of this code so he or she could provide us some insight why it was done this way and is it possible to optimize it somehow.

comment:8 by David Sanders, 7 years ago

Maybe we could contact the author of this code so he or she could provide us some insight why it was done this way and is it possible to optimize it somehow.

I am the original author of some of the code (the patch that refactored the code you mentioned), it's just been a while and some details have slipped, and I don't have time to get back into it deeply.

Looking at it a tiny bit more, though, I think line 304 is necessary, and to make this more clear it should be using old_hashed_name for the _save call instead of hashed_name. It is saving the file content with the old hashed name before recalculating the new hash on the (potentially) changed content.

The test failures you saw when removing it are due to the fact that CachedStaticFilesStorage relies on these intermediate files for proper behavior. The alternative would be for CachedStaticFilesStorage to recalculate all hashes on a single cache miss, which is very not good. With the intermediate files it can re-calculate the hash for a single file with a few file accesses.

If the concern is using a storage backend like S3, maybe a final upload step could be refactored in? The collectstatic process is never going to be very efficient if it's writing and reading directly to S3. The post-processing requires lots of file reads and writes, so remote storage isn't great for that.

comment:9 by Konrad Lisiczyński, 7 years ago

Thank you for elaboration, David. I think using current collectstatic as prestep for saving static files to a remote storage is really good solution. So I guess we could close this ticket?

comment:10 by Tim Graham, 7 years ago

Resolution: wontfix
Status: assignedclosed
Note: See TracTickets for help on using tickets.
Back to Top