Opened 13 years ago

Closed 12 years ago

#16035 closed Bug (fixed)

GZipMiddleware doesn't change an ETag

Reported by: anonymous Owned by: Jakub Wiśniowski
Component: HTTP handling Version: 1.4
Severity: Normal Keywords:
Cc: Maciej Wiśniowski Triage Stage: Accepted
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: yes UI/UX: no

Description

The GZipMiddleware does not alter an ETag, if one is present. This means that e.g. if USE_ETAGS is True and CommonMiddleware is used, or if UpdateCacheMiddleware is used, that the same ETag is returned for both gzipped and non-gzipped responses. This is against spec (RFC2616 section 13.3.3 says a strong validator should change if any entity header or the body changes - when gzipped, the body has changed and Content-Encoding is an entity header). This is the second pitfall listed here: http://www.mnot.net/blog/2007/08/07/etags

A patch to gzip.py would, I think, simply see if an ETag is set, and if so add ";gzip" to the end (inside the quote, of course) or similar.

Attachments (3)

gzip.py.patch (484 bytes ) - added by anonymous 13 years ago.
Patch to gzip.py to update ETag upon gzipping
gzip.patch.tests.diff (3.5 KB ) - added by Matthew Somerville 13 years ago.
Patch from right directory, and some tests
patch_for_16035.diff (2.8 KB ) - added by Jakub Wiśniowski 13 years ago.
Uses new way to override settings

Download all attachments as: .zip

Change History (20)

by anonymous, 13 years ago

Attachment: gzip.py.patch added

Patch to gzip.py to update ETag upon gzipping

comment:1 by anonymous, 13 years ago

Has patch: set

comment:2 by Aymeric Augustin, 13 years ago

Needs tests: set
Triage Stage: UnreviewedAccepted

by Matthew Somerville, 13 years ago

Attachment: gzip.patch.tests.diff added

Patch from right directory, and some tests

comment:3 by Matthew Somerville, 13 years ago

(I submitted the bug not logged in, sorry.) I've attached the same patch run from the right directory with svn diff this time, and also added some tests of the GZipMiddleware - one that tests short things aren't gzipped, and one to test the ETag does indeed differ for gzipped and non-gzipped versions. Hope that's helpful.

comment:4 by Aymeric Augustin, 13 years ago

Needs tests: unset
Patch needs improvement: set

Your patch should use the new way to alter settings during tests: https://docs.djangoproject.com/en/dev/topics/testing/#overriding-settings

comment:5 by anonymous, 13 years ago

Easy pickings: set
UI/UX: unset

by Jakub Wiśniowski, 13 years ago

Attachment: patch_for_16035.diff added

Uses new way to override settings

comment:6 by Maciej Wiśniowski, 13 years ago

Owner: changed from nobody to Jakub Wiśniowski
Patch needs improvement: unset
Triage Stage: AcceptedReady for checkin

comment:7 by Maciej Wiśniowski, 13 years ago

Cc: Maciej Wiśniowski added

comment:8 by Jannis Leidel, 13 years ago

Resolution: fixed
Status: newclosed

In [17471]:

Fixed #16035 -- Appended the Etag response header if the GZipMiddleware is in use to follow RFC2616 better. Thanks, ext and dracos2.

comment:9 by anonymous, 13 years ago

Resolution: fixed
Status: closedreopened

The path does not work since the gzip middeware does strip the ;gzip etag in process_request. CommonMiddleware generates an etag for the response (this excludes the ";gzip") and compares this to HTTP_IF_NONE_MATCH (this includes the ";gzip").

comment:10 by Aymeric Augustin, 12 years ago

Triage Stage: Ready for checkinUnreviewed

comment:11 by Jannis Leidel, 12 years ago

Feel free to provide a test case, I can't completely follow.

comment:12 by anonymous, 12 years ago

I request an endpoint with

Accept-Encoding: gzip 

I get the following headers back

Status Code: 200
Date: Tue, 22 May 2012 08:04:13 GMT
Content-Encoding: gzip
Content-Length: 192
Server: WSGIServer/0.1 Python/2.7.1
ETag: "ea57c4c505a7d588dd03d7251a33f846;gzip"
Vary: Accept-Encoding
Content-Type: application/json; charset=utf-8

Now, GETing the same endpoint again, this time passing the ETag

Accept-Encoding: gzip 
If-None-Match: "0dbb5568198250a9cb983ad6c201e02a;gzip"

still gives me back a 200

Status Code: 200
Date: Tue, 22 May 2012 08:04:13 GMT
Content-Encoding: gzip
Content-Length: 192
Server: WSGIServer/0.1 Python/2.7.1
ETag: "ea57c4c505a7d588dd03d7251a33f846;gzip"
Vary: Accept-Encoding
Content-Type: application/json; charset=utf-8

Passing the ETag without the ";gzip"-part, I get a 304

Accept-Encoding: gzip 
If-None-Match: "ea57c4c505a7d588dd03d7251a33f846"

response

Status Code: 304
Date: Tue, 22 May 2012 08:07:53 GMT
Content-Length: 0
Server: WSGIServer/0.1 Python/2.7.1
Content-Type: text/html; charset=utf-8

The problem is that the ";gzip" part is not stripped when processing a new request as mentioned above.

comment:13 by Aymeric Augustin, 12 years ago

Triage Stage: UnreviewedAccepted

I didn't reproduce the sequence above myself, but it's a totally plausible bug.

comment:14 by anonymous, 12 years ago

Version: 1.31.4

This bug is in 1.4

comment:15 by Tai Lee, 12 years ago

This is definitely a bug, and it depends on the position of GZipMiddleware relative to CommonMiddleware (and probably other middleware).

CommonMiddleware generates an etag based on response.content, and returns an HttpResponseNotModified when the If-None-Match header is the same as the generated etag.

GZipMiddleware alters an existing etag (adding ";gzip"), but it never checks If-None-Match and never returns an HttpResponseNotModified.

If GZipMiddleware comes before CommonMiddleware in the MIDDLEWARE_CLASSES setting (GZipMiddleware.process_response runs after CommonMiddleware.process_response), the browser will supply a ";gzip" etag for the If-None-Match header in subsequent requests, that ContentMiddleware will see as different to the etag that it generates, and will never return an HttpResponseNotModified.

If GZipMiddleware comes after CommonMiddleware, and the user hasn't explicitly set an etag on their response (which is usually the case -- people expect CommonMiddleware to do this for them), then things get crazy. You will likely get a different etag on every request, even though the content is the same.

The reason is because GZipMiddleware will not an etag when there is none. CommonMiddleware will then try to generate an etag for the compressed content. But all gzip streams must include a timestamp, and GzipFile will use the current timestamp if none is supplied. So the etag that CommonMiddleware generates will be different for the compressed version of the same content, if the requests are a few seconds apart.

We can resolve this side-issue by adding an arbitrary timestamp to GzipFile. The GzipFile docs say that the module ignores the timestamp when decompressing, but some programs such as gunzip make use of it. I'm not sure if this is the right thing to do here.

Back to the main issue. GZipMiddleware, and any other middleware that alters the etag header, should also generate an etag when there is none. And GZipMiddleware needs to do it *before* it compresses the content (unless we fudge the timestamp).

And generally speaking, if we're supposed to generate a different etag if content OR headers change, then shouldn't we be generating etags based on response.serialize(), not response.content?

But I'm not convinced that GZipMiddleware (or any middleware) should alter existing etags at all. If we allow any middleware to alter an etag (or encourage by example), that middleware must also set etags when there are none and compare the new or updated etag against the If-None-Match header and return HttpResponseNotModified when they match (including moving cookies from the old response to the HttpResponseNotModified). This is repetitive code (basically half of CommonMiddleware.process_response) that shouldn't need to be repeated.

Perhaps the creation of etags should be moved out of middleware and be taken care of in the response.content setter, and have it take into account the full HTTP message including headers? We would need to make sure that compresesd content is deterministic in this case (by fudging the timestamp), but middleware would no longer need to worry about altering or setting etags. If middleware (or view code) assigns to response.content, the etag would be updated (if etags are enabled).

This would leave the issue of when to return an HttpResponseNotModified. I would argue that this should happen after all middleware is applied, and should not happen inside any middleware classes. This way, the etag of the final response (after having passed through all response middleware) is the one that is compared to the If-None-Match header, and there is no need for repetitive code in middleware to make this comparison and return HttpResponseNotModified.

comment:16 by Tai Lee, 12 years ago

I've added a branch on GitHub with a fix and tests.

https://github.com/thirstydigital/django/tree/tickets/16035-gzip-middleware-etag

The test has a 1 second sleep in it to demonstrate the mtime issue with compress_string(). I'm not sure if there's a better way to test that.

comment:17 by Aymeric Augustin, 12 years ago

Resolution: fixed
Status: reopenedclosed

This is a different problem, and it will be solved by #19705. For the sake of clarity let's continue the discussion in that ticket.

The comment and the patch above are too complicated and ignore ConditionalGetMiddleware. I'm -1 on triplicating ETag handling in gzip middleware.

I think the problem can be solved by removing some code and adding some docs.

Note: See TracTickets for help on using tickets.
Back to Top