#8307 closed Cleanup/optimization (fixed)
ImageFile use of width_field and height_field is slow with remote storage backends
Reported by: | Owned by: Jacob
Component: File uploads/storage | Version: dev
Severity: Normal | Keywords: imagefile height width
Cc: Alan Justino da Silva, wiktor@…, walter+django@… | Triage Stage: Accepted
Has patch: no | Needs documentation: no
Needs tests: no | Patch needs improvement: no
Easy pickings: no | UI/UX: no
Description
I'm attaching a patch that could make it work.
Attachments (3)
Change History (28)
by , 16 years ago
Attachment: images.py.diff added
comment:1 by , 16 years ago
milestone: → 1.0
comment:2 by , 16 years ago
Has patch: set
Keywords: imagefile height width added
Triage Stage: Unreviewed → Design decision needed
comment:4 by , 16 years ago
No, it is not the same bug. This one is related to the use of the width_field and height_field.
comment:5 by , 16 years ago
Owner: changed
Status: new → assigned
Triage Stage: Design decision needed → Accepted
comment:6 by , 16 years ago
Needs documentation: set
This is a tough one. On one hand, I can appreciate the performance gain of using the cache fields, since they're already available, and they *should* always be in sync with what's on the file. But I can't help thinking that I like the idea of ImageFile.width and ImageFile.height always returning the actual dimensions of the file, no matter what.
If they used the cache fields and those ever get out of sync with the file itself (which is possible any number of ways, depending on your application), the only way to get the true dimensions would be to use django.core.files.images.get_image_dimensions manually. And, after all, isn't the point of having properties for width and height to avoid having to call that manually? Plus, using get_image_dimensions directly means you'll have to cache those values manually if you expect to need them more than once.
I suppose I'm -0 on this ticket; someone with a decent argument might be able to sway me. Either way, though, I think the docs could be a bit more explicit about where the width and height values come from.
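For context, the manual fallback mentioned above looks roughly like this; the Photo model and image field names are illustrative, not from the ticket:

    from django.core.files.images import get_image_dimensions

    # Manual fallback: read the true dimensions from the file itself and cache
    # them yourself if you expect to need them more than once.
    photo = Photo.objects.get(pk=1)  # illustrative model
    photo.image.open()
    true_width, true_height = get_image_dimensions(photo.image)
    photo.image.close()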
comment:7 by , 16 years ago
About your worries: the width and height values are updated every time the ImageField save method is fired, so they could only get out of sync through a bug or a wrong use of the FileStorage mechanism (like setting the ._name field attribute manually).
Also, I have been using the patch I submitted and it has been working fine for me.
Without this patch, people who use remote file storage backends (like S3, Nirvanix or MogileFS) suffer a big performance hit from having to load the entire file every time a model with an ImageField gets loaded. In this scenario that means a lot more traffic, less performance, and also a monetary cost.
English is not my primary language, sorry if something is not clear.
PS: I'm attaching another patch that could be safer.
by , 16 years ago
Attachment: images.1.diff added
comment:8 by , 16 years ago
For the record, here are a few of the ways I know of that the two could get out of sync:
- A bug in Django (as you mention) which shouldn't happen, but is still possible
- Improperly using Django's API (as you mention) to cause a mismatch
- An admin or staff member could manually update a file directly, without using Django
- A cron job could rename, resize or otherwise alter files
- An admin or staff member could manually update the database directly, without using the model's save() method
- A monkeypatch (mistakenly or not) altered or removed the code to update width_field and height_field
- A custom ImageFile subclass (mistakenly or not) sets the width_field and height_field to the wrong values
- ... and the list goes on
As for the performance (and potentially monetary) hit when accessing files, the width and height shouldn't be loaded when retrieving the file; they only get populated when actually accessing them. If you find they are in fact getting accessed every time regardless of whether they're being used, that's a legitimate bug that should be filed separately.
I have no doubt that the attached patch works correctly for the behavior you want. I'm just not sure it's the behavior Django should use.
comment:9 by , 16 years ago
Resolution: → wontfix
Status: assigned → closed
Yeah, I like the access methods always getting the *real* file sizes.
Think of it this way: Model.image_width is a number the model knows about, but to get the real story, check Model.image.width -- all sorts of things can happen on the disk behind your back!
comment:11 by , 13 years ago
Cc: added
Easy pickings: unset
Resolution: wontfix
Severity: → Normal
Status: closed → reopened
Summary: ImageFile doesn't use the width_field and height_field for cache → ImageFile use of width_field and height_field is slow with remote storage backends
Type: → Cleanup/optimization
UI/UX: unset
What about letting the storage choose whether it is worth touching the file just to get its height/width?
This solves the problem for me. And it is a huge problem: my EC2 instance with an S3 backend wastes about 4 seconds more on each pageview just to learn the width/height of 8 photos. And it looks like the absence of this info does not harm my app.
Is this an acceptable solution?
    def _get_image_dimensions(self):
        if not hasattr(self, '_dimensions_cache'):
            if getattr(self.storage, 'IGNORE_IMAGE_DIMENSIONS', False):
                self._dimensions_cache = (0, 0)  # Too costly to hit the file?
            else:
                close = self.closed
                self.open()
                self._dimensions_cache = get_image_dimensions(self, close=close)
        return self._dimensions_cache
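For illustration, a storage backend would opt in to the proposed behavior simply by exposing the flag checked above; IGNORE_IMAGE_DIMENSIONS is the attribute introduced by this patch, not an existing Django setting, and the class name is hypothetical:

    from django.core.files.storage import FileSystemStorage

    class SkipDimensionsStorage(FileSystemStorage):
        # The patched _get_image_dimensions() above checks this attribute and
        # returns (0, 0) instead of opening the file from the backend.
        IGNORE_IMAGE_DIMENSIONS = True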
by , 13 years ago
Attachment: images.py.2.diff added
Let the storage backend disable touching the image file to get height/width
comment:12 by , 12 years ago
Status: reopened → new
comment:13 by , 11 years ago
Cc: added
comment:14 by , 11 years ago
Cc: added
comment:16 by , 10 months ago
Here are the contents of my duplicate issue.
I have prepped a GitHub repo here with the basic tests: https://github.com/john-parton/django-image-field-extra-read
Conditions for behavior
You must have an ImageField with the width_field or height_field arguments set.
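For reference, a minimal model meeting these conditions (model and field names are illustrative):

    from django.db import models

    class Photo(models.Model):
        # width_field/height_field name companion columns that cache the image
        # dimensions so they can be read without opening the file.
        image = models.ImageField(
            upload_to="photos/",
            width_field="image_width",
            height_field="image_height",
        )
        image_width = models.PositiveIntegerField(null=True, blank=True)
        image_height = models.PositiveIntegerField(null=True, blank=True)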
Description of current behavior
When a model is saved, the image file is written out using the Storage API, and then
in order for the width and height fields to be updated, the file is read back out
and then the width and height are extracted from the image.
In the case where the storage is local, the performance impact is probably negligible unless the application is seriously IO constrained. However, if the storage is remote, the performance impact is significant, and there can be other impacts on operations.
For instance, if using S3 as backing, more GET operations are performed than
strictly necessary. This could be a few dollars a day of operational costs if your
scale is small, but can be significant if egress charges are high.
As another example, CloudFlare Images rate-limits requests. This effectively cuts the rate limit in half because every save operation requires an additional GET.
Proposed behavior
The proposed behavior is to simply read the image file that is resident in memory, without reading it back out from the storage backend.
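As a rough illustration of the idea (not the actual patch), the dimensions can be taken from the upload while it is still in hand, rather than re-fetched from the backend:

    from django.core.files.images import get_image_dimensions

    def dimensions_from_upload(content):
        # `content` is the in-memory/local File object that is about to be
        # (or was just) handed to storage.save(); measure it directly instead
        # of reading the file back out of the storage backend.
        content.seek(0)
        width, height = get_image_dimensions(content)
        content.seek(0)  # leave the file positioned for any further use
        return width, height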
Possible breaking issues
The vast majority of storage backends and use cases likely guarantee that if you
write a file into storage, and then retrieve it, you will get the same file back.
However, some image-specific services will compress or crush larger images.
Users who specifically have this use case may end up with the width_field and height_field not representing the actual size of the image in the store, but rather the size of the image at the time of upload.
Explanation of current behavior
It looks like when a model's save() method is called, the field's pre_save() method
is called which results in the descriptor for the field having its get method
called and then immediately having its set method called with a similar value.
The effect is to coerce the value of the field to ImageFieldFile, which is a subclass of ImageFile. The ImageFieldFile instance is assigned a property, file, which is the wrapped original value.
The image field then saves and persists the data using the storage API, but the wrapped file isn't referred to afterwards to get the width and height. When the width and height are requested, the file is read back out of storage.
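A paraphrased sketch of that flow may make the extra read easier to see; this is not Django's actual source, and the names are simplified:

    # Paraphrased sketch, not Django's actual source.
    class ImageFileDescriptorSketch:
        def __init__(self, field):
            self.field = field

        def __set__(self, instance, value):
            instance.__dict__[self.field.attname] = value
            # Assigning through the descriptor also refreshes the dimension
            # fields...
            update_dimension_fields_sketch(self.field, instance, force=True)

    def update_dimension_fields_sketch(field, instance, force=False):
        file = getattr(instance, field.attname)
        # ...and reading .width/.height opens the file via the storage
        # backend -- the extra remote read this ticket is about.
        setattr(instance, field.width_field, file.width)
        setattr(instance, field.height_field, file.height)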
Proposed fix
No specific fix at this time.
Mitigating breaking issues
Considering how unusual this use case is, it may be sufficient to document the change
in behavior and provide a code snippet to wire up a signal handler to do the
additional read for those users who have the unusual storage backends and actually
care about the width/height being what is on disk. This would also allow users to
customize the behavior. For instance, maybe if the image is under a certain resolution,
the storage provider guarantees they don't mangle the image. A user could enshrine
that logic in the signal handler, so they could still get the performance uplift where
appropriate.
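A hedged sketch of such a signal handler, reusing the illustrative Photo model from above (this is not code from the ticket or the PR):

    from django.core.files.images import get_image_dimensions
    from django.db.models.signals import post_save
    from django.dispatch import receiver

    @receiver(post_save, sender=Photo)  # Photo is the illustrative model above
    def refresh_dimensions_from_storage(sender, instance, **kwargs):
        # Re-read the stored file so width/height reflect whatever the backend
        # actually persisted (e.g. after server-side recompression).
        with instance.image.storage.open(instance.image.name) as f:
            width, height = get_image_dimensions(f)
        # Use a queryset update to avoid re-triggering post_save.
        type(instance)._default_manager.filter(pk=instance.pk).update(
            image_width=width, image_height=height
        )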
Summary
It seems pretty unlikely that this is the intended behavior; it seems largely to be a byproduct of heavy descriptor use and "magic" in the code. I've tried mitigating this in my application code, but it requires me to monkey-patch some private methods, and I'm not even sure I got it right. Any attempt to "fix" this at a level higher than Django's internals will just result in an unmaintainable mess. Back when Django was new, I'm not sure I would make this argument, but now that storage is usually backed by some API like S3, I think it makes sense to make avoiding an extra network request the "sane default."
follow-up: 18 comment:17 by , 10 months ago
I've got a fix that I think should pretty much always work.
Like I alluded to earlier, it requires some deep knowledge of Django's internals. You have to know exactly when to side-step a descriptor and what the purpose of the descriptor is.
Basically, the FileField.save() method will call the descriptor's set method with the value being a string. That's required, because the storage backend might return a different string than you originally fed in. I believe the motivating reason for this was to avoid accidentally overwriting files. The other function of the descriptor is to calculate and store the width and height values if it's configured. However, the descriptor's set method has to have already been called to even get to this point in the save flow. Because the set method of the descriptor basically only serves this one function, and that function has been satisfied earlier, I just side-step the descriptor by manipulating the instance's dict directly.
Here's the meat of the code: https://github.com/john-parton/django-image-field-extra-read/blob/bugfix/proposed-fix/test_app/fields.py
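For readers who don't want to open the repo, the side-step boils down to something like the following (a paraphrase, not the branch's actual code):

    # Paraphrase of the "side-step": writing straight into instance.__dict__
    # skips ImageFileDescriptor.__set__, so the width/height fields (already
    # populated from the in-memory upload) are not recomputed by re-reading
    # the file from the storage backend.
    def assign_saved_name(field, instance, saved_name):
        instance.__dict__[field.attname] = saved_name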
Of course, I wouldn't want to make a change like this without some tests. As I said, I believe the primary purpose of the descriptor use there is to update the string value of the file field, and I have made sure that it's updated to a non-conflicting value on subsequent saves.
Here's a test case that I wrote which captures some of the spirit of what I'm trying to do: https://github.com/john-parton/django-image-field-extra-read/blob/bugfix/proposed-fix/test_app/tests.py
I'm happy to take feedback, or I can work on building a proper patch.
follow-ups: 19 24 comment:18 by , 10 months ago
Replying to john-parton:
I've got a fix that I think should pretty much always work.
...
Basically, the FileField.save() method will call the descriptor's set method with the value being a string.
First, thanks for refreshing this 15yo issue, john-parton.
That said, I do not see how this solution fixes a template page rendering ~10 images from a remote storage, as there is no .save() involved when using existing images:
When I reopened the issue, my application needed only the URL of the files, already stored in the database, thus an O(1) call to the DB. That is because I did not need to put height and width in the <img /> HTML tag, as the browser deals with whatever it fetches from the URL. If grabbing the URL keeps forcing a call to .get_image_dimensions(), then this becomes O(10) calls to the storage, meaning O(10) HTTP HEAD calls to S3. That is why I proposed letting the storage decide whether this should be done, or cached, or faked.
Do you think that letting the storage proxy the image dimensions would be another way to fix your issues as well?
comment:19 by , 9 months ago
Replying to Alan Justino da Silva:
For existing images, if you have a width_field and height_field set, then .get_image_dimensions() is not ordinarily called. Usually the width_field and height_field are referred to. That's the point of those fields.
My patch does nothing if your ImageField doesn't have a width_field or height_field, and that's intentional. Adding the width_field and height_field should already solve the behavior you're describing.
If you could provide an actual test case where you think the behavior isn't correct, I'd really appreciate it. Ideally a mini Django project using my branch with a clear, minimal reproduction of the undesired behavior you're referring to.
Thanks.
EDIT: Here's the current working PR: https://github.com/django/django/pull/17775
I have a test for the regression and explicit tests for the behavior I'm trying to fix. It's VERY limited in scope because I don't want to break backwards compat.
DOUBLE EDIT I think the behavior that you're describing has been fixed elsewhere in the descriptor logic some time between twelve years ago and today.
comment:20 by , 9 months ago
I added some tests. The big test is that there's a model with a Storage backend that raises an exception if you try to read from it. The test suite appears to be passing. I also added a small note about the old behavior in the docs.
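Roughly, the guard storage described here could look like this (a sketch, not the PR's actual test code):

    from django.core.files.storage import FileSystemStorage

    class NoReadStorage(FileSystemStorage):
        # Storage that fails loudly if anything tries to read a file back,
        # which is how the test catches an unwanted round-trip to storage.
        def open(self, name, mode="rb"):
            if "r" in mode and "+" not in mode:
                raise AssertionError(f"Unexpected read of {name!r} from storage")
            return super().open(name, mode)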
comment:21 by , 9 months ago
Needs documentation: unset
I added a small amount of documentation: a 'version changed' note describing the old behavior and the circumstances under which the new behavior might affect a user.
comment:22 by , 7 months ago
I've created a different pull request with a different approach, which is better than my other patch in at least two ways:
- It removes an extra container class I was using as a signal for when to update
- It removes an extra keyword argument I had to add (returning the function signature to what it was before the patch)
https://github.com/django/django/pull/18070
I think we need to come to a consensus on whether this is a bug that we can just patch, or whether this change is significant enough that we need to introduce a setting to opt in to the behavior (most likely changing the setting in a future release).
comment:23 by , 6 months ago
Patch needs improvement: set
follow-up: 25 comment:24 by , 6 months ago
Has patch: unset
Patch needs improvement: unset
Resolution: → fixed
Status: new → closed
Replying to Alan Justino da Silva:
That said, I do not see how this solution fixes a template page rendering ~10 images from a remote storage, as there is no .save() involved when using existing images:
When I reopened the issue, my application needed only the URL of the files, already stored in the database, thus an O(1) call to the DB. That is because I did not need to put height and width in the <img /> HTML tag, as the browser deals with whatever it fetches from the URL. If grabbing the URL keeps forcing a call to .get_image_dimensions(), then this becomes O(10) calls to the storage, meaning O(10) HTTP HEAD calls to S3. That is why I proposed letting the storage decide whether this should be done, or cached, or faked.
Hi Alan, the issue you are describing was fixed in #34517.
I will close this issue and reopen #35139 as this is a separate concern.
comment:25 by , 6 months ago
I think it's very important that they provide a minimal test.
Based on my research, I do not think the scenario they are describing is actually a problem anymore; it was fixed sometime in the last twelve years by another change.
I could be wrong, of course, but without actual code to pore over, my gut says it's fixed.
Replying to Sarah Boyce:
Replying to Alan Justino da Silva:
That said, I do not see how this solution fixes a template page rendering ~10 images from a remote storage, as there is no .save() involved when using existing images:
When I reopened the issue, my application needed only the URL of the files, already stored in the database, thus an O(1) call to the DB. That is because I did not need to put height and width in the <img /> HTML tag, as the browser deals with whatever it fetches from the URL. If grabbing the URL keeps forcing a call to .get_image_dimensions(), then this becomes O(10) calls to the storage, meaning O(10) HTTP HEAD calls to S3. That is why I proposed letting the storage decide whether this should be done, or cached, or faked.
Hi Alan, the issue you are describing was fixed in #34517.
I will close this issue and reopen #35139 as this is a separate concern.
Seems like a bug to me