Django

Code

Ticket #7581 (new)

Opened 1 year ago

Last modified 1 month ago

Middleware accessing HttpResponse.content breaks streaming HttpResponse objects.

Reported by: mrmachine Assigned to: nobody
Milestone: 1.2 Component: Core framework
Version: SVN Keywords: stream HttpResponse Content-Length
Cc: real.human@mrmachine.net Triage Stage: Design decision needed
Has patch: 1 Needs documentation: 0
Needs tests: 0 Patch needs improvement: 0

Description

ConditionalGetMiddleware, GZipMiddleware, CommonMiddleware, and CsrfMiddleware all access response.content directly or indirectly. This prevents a streaming response from being initiated until any generator passed to the HttpResponse as content is consumed, which can cause a timeout when trying to stream large dynamically generated content.

I've already put together a patch based on the assumption that HttpResponse._is_string being False indicates content has been passed in as a dynamic generator and thus we shouldn't delay streaming a response to the browser while consuming the entire generator.

The patch implements the following:

* Allow middleware to assign a new generator to HttpResponse.content without consuming it (e.g. GZipMiddleware)

* Compress chunks of content yielded by HttpResponse._container progressively in GZipMiddleware to allow streaming GZipped content

* Only generate an ETag in CommonMiddleware from HttpResponse.content if HttpResponse._is_string is True

* Only check that the length of HttpResponse.content is less than 200 in GZipMiddleware if HttpResponse._is_string is True

* Only set the Content-Length header in GZipMiddleware and ConditionalGetMiddleware if HttpResponse._is_string is True

With CommonMiddleware enabled by default and breaking the streaming response functionality if ETags are enabled in settings.py, I'd consider this a bug. It can be worked around by manually specifying a bogus ETag before returning the response, which doesn't seem ideal.

With this patch, users still have the option of consuming a generator before passing it to the HttpResponse in order to enable ETag and Content-Length headers, and conditional GZipping when the content length is less than 200.

With CsrfMiddleware, the generator is only consumed when the Content-Type header is text/html or application/xhtml+xml, which may be an acceptable compromise - no streaming response for HTML content if you choose to use django.contrib.csrf.

This patch accesses HttpResponse._is_string and HttpResponse._container from external middleware classes. Perhaps these properties could be made public and/or renamed to be more logical in this context?

Attachments

streaming_response.diff (6.8 kB) - added by mrmachine on 07/03/08 21:13:53.
add HttpResponse.content_generator, consume generators only once, set Content-Length in WSGIHander.
streaming_response.2.diff (7.5 kB) - added by Tai Lee <real.human@mrmachine.net> on 07/05/08 23:19:38.
fix django.contrib.csrf to work with content generators.
streaming_response.3.diff (8.1 kB) - added by mrmachine on 07/13/08 21:58:48.
add stream_content argument.
text.py.diff (0.7 kB) - added by didditdoug on 05/21/09 13:45:59.
Bug fix for invalid compression output in compress_sequence

Change History

07/03/08 21:09:51 changed by mrmachine

  • needs_better_patch changed.
  • needs_docs changed.
  • stage changed from Unreviewed to Design decision needed.
  • needs_tests changed.
  • milestone changed from 1.0 to post-1.0.

Changed to DDN after comments from Malcolm and Ivan on google groups.

The arguments against are that: 3rd party middleware authors will be unable to access the content for an HttpResponse because it's content may be an un-consumable generator; middleware will repetitively be required to replace HttpResponse.content with a string if a generator is consumed; Content-Length *must* always be set with WSGI.

The argument for is that 3rd party middleware should have the *option* of working with HttpResponse objects which have an un-consumable generator (e.g. GZipMiddleware). If middleware authors fail or choose not to account for this possibility, behaviour will remain as it is today, as it would be if generators were consumed immediately when instantiating an HttpResponse object.

A possible alternative raised by Malcolm is to add a way to bypass process_response in middleware (or bypass all middleware) for specific HttpResponse objects. However, this would disallow potentially useful middleware that *can* work without consuming the content generator (e.g. GZipMiddleware).

I've updated the patch with a few improvements:

* Added HttpResponse.content_generator which will return either a generator if one is present. Middleware authors can use this to check for the existence of, or actually get the content generator. We no longer need to access the private attributes HttpResponse._is_string or HttpResponse._container from middleware. We can still assign a string or generator to HttpResponse.content in the case where we need to replace the content or generator in middleware.

* Accessing HttpResponse.content will now consume any content generator that exists and replace HttpResponse._container with the resulting string, ensuring that the generator is only consumed once.

* Added a set_content_length method to WSGIHandler.response_fixes to ensure that Content-Length is *always* set with WSGI. Although I have found some discussion to support the removal of Content-Length by WSGI servers with specific examples to GZipped content.

Requiring Content-Length on every single response which has content with WSGI will make streaming a response impossible under any circumstances. This is quite useful functionality (exporting large sets of data), so if it turns out to be acceptable to stream content through WSGI without a Content-Length header (as opposed to an incorrect Content-Length header), or the WSGI spec changes to allow for this, I'd love to see the WSGIHandler.set_content_length method removed.

FYI, I have tested this without HttpResponse.set_content_length and with GZipMiddleware enabled on Apache with mod_wsgi and it does work (stream) as intended.

Personally I don't feel that it is an unreasonable restriction to place on middleware authors by asking them to make use of HttpResponse.content_generator *if* they want or need their middleware to work with streaming content. It's also not unreasonable for developers to check that the middleware they are using will work with streaming content, if they require streaming content. After all, some middleware by their very nature will not work with streaming content.

07/03/08 21:13:53 changed by mrmachine

  • attachment streaming_response.diff added.

add HttpResponse.content_generator, consume generators only once, set Content-Length in WSGIHander.

07/05/08 16:04:31 changed by Ivan Sagalaev <Maniac@SoftwareManiacs.Org>

From reading the patch I'm not sure why _get_content_generator is needed. You can always iterate over HttpResponse by calling iter() on it. What am I missing?

(follow-up: ↓ 4 ) 07/05/08 23:08:50 changed by Tai Lee <real.human@mrmachine.net>

content_generator (and thus _get_content_generator) are needed so middleware have a public attribute to check for the existence of and return the actual content generator that was passed to HttpResponse. We could change _get_content_generator to return iter(self) instead, but we'd still need _get_content_generator to return a generator only when one was passed to HttpResponse.

I think there might also be a problem if we return iter(self) to middleware. Because HttpResponse.next encodes each chunk as a bytestring, this could potentially encode each chunk multiple times if middleware sets a new content generator which yields unicode (or other) data, e.g. response.content = (something(e) for e in iter(response)) if something returned unicode. We'd also be passing a bytestring to something instead of the original content chunk. Even if something did not return unicode, it would at least call str on each chunk multiple times.

Isn't it better to pass the original content generator to middleware, and only call HttpResponse.next on each chunk once when HttpResponse.content is finally accessed, which can only happen once as the generator is replaced with string content when that happens?

07/05/08 23:19:38 changed by Tai Lee <real.human@mrmachine.net>

  • attachment streaming_response.2.diff added.

fix django.contrib.csrf to work with content generators.

(in reply to: ↑ 3 ; follow-up: ↓ 5 ) 07/06/08 05:58:13 changed by Ivan Sagalaev <Maniac@SoftwareManiacs.Org>

Replying to Tai Lee <real.human@mrmachine.net>:

content_generator (and thus _get_content_generator) are needed so middleware have a public attribute to check for the existence of and return the actual content generator that was passed to HttpResponse. We could change _get_content_generator to return iter(self) instead, but we'd still need _get_content_generator to return a generator only when one was passed to HttpResponse.

Sorry, I wasn't clear. By "why" I meant a deeper meaning: why a middleware would ever want to know if a response has a generator as a source or something else. From the middleware's point of view content of a response can be obtained either wholly with response.content or as a stream with iter(response). After discussion on the list I agree with Malcolm that middleware already knows if it wants the whole content or can stream it.

I think there might also be a problem if we return iter(self) to middleware. Because HttpResponse.next encodes each chunk as a bytestring, this could potentially encode each chunk multiple times if middleware sets a new content generator which yields unicode

I have as many as 3 points to say about it, please don't fall asleep! :-)

1. As matters stand now, only byte strings should appear on the output side of the response. Which means that every middleware that does something to a response should work on byte strings and return byte strings. Actually it might be useful to have this conversion as late as possible -- in the WSGIHandler -- but it's another thing anyway.

2. When HttpResponse iterates over itself the first time it should just save already encoded stream to avoid encoding multiple times as well as iterating. I mean instead of self._container = [''.join(self._container)] it should be like self._container = [''.join([s for s in self])]

3. More importantly, I think every piece of code that can access response in a streaming fashion should never exhaust it entirely. Not because it wouldn't work but because it's not efficient. As an example GzipMiddleware? can replace response's generator with its own but shouldn't read a single chunk from it.

BTW I now understand why you're exposing content_generator publicly: a middleware would want to replace it leaving other aspects of a response intact, right? If yes I think it could be done in a more functional fashion: provide a helper response wrapper that will have its own generator wrapper but will delegate everything else to a wrapped response. Like this:

class HttpResponseWrapper(object):
    def __init__(self, response, iterable):
        self.response, self.iterable = response, iterable
    def __iter__(self):
        return iter(self.iterable)
    def __getattr__(self, name):
        return getattr(self.response, name)

response = HttpResponseWrapper(response, gzip_content_generator(response))

(in reply to: ↑ 4 ) 07/06/08 10:13:10 changed by Tai Lee <real.human@mrmachine.net>

Replying to Ivan Sagalaev <Maniac@SoftwareManiacs.Org>:

Sorry, I wasn't clear. By "why" I meant a deeper meaning: why a middleware would ever want to know if a response has a generator as a source or something else. From the middleware's point of view content of a response can be obtained either wholly with response.content or as a stream with iter(response). After discussion on the list I agree with Malcolm that middleware already knows if it wants the whole content or can stream it.

I think that middleware needs to be able to modify the response in different ways (or not at all) if the content is derived from a generator, so there needs to be a public attribute we can check to determine if the content is derived from a generator or not. An example is that CommonMiddleware should not generate an ETag or ConditionalGetMiddleware should not set Content-Length if content is derived from a generator. Another is that GZipMiddleware can compress the entire content in one operation if content is not derived from a generator, which may be more efficient or result in higher compression then compressing each chunk sequentially?

1. As matters stand now, only byte strings should appear on the output side of the response. Which means that every middleware that does something to a response should work on byte strings and return byte strings. Actually it might be useful to have this conversion as late as possible -- in the WSGIHandler -- but it's another thing anyway.

By giving middleware access to the original content generator we can delay the conversion to byte string until HttpResponse.content is accessed, either by a middleware that requires the entire content or by WSGIHandler, and be sure that it occurs only once.

2. When HttpResponse iterates over itself the first time it should just save already encoded stream to avoid encoding multiple times as well as iterating. I mean instead of self._container = [''.join(self._container)] it should be like self._container = [''.join([s for s in self])]

If encoding should be performed in next, then should _get_content be altered to simply return ''.join(self._container) after consuming the generator with self._container = [''.join(self)]? And should next behave like _get_content does now and not encode if Content-Encoding is set, otherwise use smart_str?

3. More importantly, I think every piece of code that can access response in a streaming fashion should never exhaust it entirely. Not because it wouldn't work but because it's not efficient. As an example GzipMiddleware? can replace response's generator with its own but shouldn't read a single chunk from it.

Indeed, those middleware that can operate on chunks of content should not consume the content generator, but should only replace it with a new generator. This is the problem solved by this patch where middleware that are not aware of a content generator consume it and break streaming.

BTW I now understand why you're exposing content_generator publicly: a middleware would want to replace it leaving other aspects of a response intact, right? If yes I think it could be done in a more functional fashion: provide a helper response wrapper that will have its own generator wrapper but will delegate everything else to a wrapped response. Like this:

Middleware can already replace the content generator with response.content = (something(chunk) for chunk in response.content_generator), because _set_content will check if the value is an iterable or a string.

As mentioned above, I think we need to expose content_generator for two purposes. The first is to indicate to middleware that the response is derived from a generator so that the middleware can skip incompatible functionality (e.g. ETag generation) or replace the content generator instead of replacing the content directly.

The second is to provide access to the original content generator. Currently next will encode any value that is unicode. If a middleware sets a new content generator which yields a unicode value (e.g. response.content = (unicode(chunk) for chunk in response), the encoding will occur twice.

07/09/08 05:58:45 changed by ubernostrum

The more I think about this, the more I'm -1 on it. The complications of trying to force lazily-evaluated responses are far too numerous (ranging all the way from resource leaks on the Django side out to things like WSGI forbidding useful HTTP features which help with this) and the existence of specialized solutions for doing long-lived HTTP and/or asynchronous stuff are too plentiful to justify Django special-casing enough stuff to make this work.

07/09/08 18:55:58 changed by mrmachine

I don't understand the resistence. Streaming HttpResponse functionality has been in trunk for around 2 years now. All I'm suggesting is that we fix certain middleware that ships with Django to work with the streaming functionality that's already present in trunk where possible and appropriate - e.g. GZip, ETag, etc., and to add a public attribute to HttpResponse to aid the conditional execution of middleware code.

If there are certain middleware that cannot work with dynamic content generators, they continue to work as they do now. If there are certain handlers (WSGI) which don't at present work without a Content-Length header, they continue to work as they do now. Middleware authors can continue to develop middleware without even being aware that anything has changed. Existing 3rd party middleware will continue to work as it does now. This should be backwards compatible.

I don't see how what's been added (essentially just skipping certain middleware functionality if a generator was used, or being able to replace an existing generator with a new one) would cause any of the numerous complications you mention that aren't already and haven't already been present for the past 2 years.

07/13/08 14:08:44 changed by isagalaev

I was thinking about this some time and changed my mind a little...

I agree with Tai Lee that this ticket is not about some abstract ultimate way for keeping streaming responses remain streaming after middleware. It's about fixing some points that can be fixed. However I now think that current patch does it the wrong way for two reasons.

1. It relies on how the response was constructed which doesn't always correlate with the need to apply certain middleware. For example one may still want to count ETag or gzip content of small files which will be treated as generators.

2. Gzipping a stream just can't work because WSGI wants Content-length to be set for response and it's impossible to count without gzipping the whole contents.

So may be what's really needed here is a simple way for a view to explicitly flag the response with "don't gzip it" or "don't etag it".

07/13/08 19:23:44 changed by mrmachine

I agree that it would also be useful to be able to disable certain middleware per HttpResponse. You are correct in saying that the way a response is created won't always indicate which middleware we want to disable. However, if a certain middleware CAN run without breaking a content generator (e.g. GZip when not using WSGI), shouldn't it do so?

How would such a feature be implemented? Pass a list of middleware (e.g. ['django.middleware.common.CommonMiddleware', 'django.middleware.gzip.GZipMiddleware'])? That could be prone to error, requiring people to remember which middleware are required to be removed in order to preserve streaming functionality.

What about setting a boolean attribute consume_generator on HttpResponse which defaults to True. If this is True, then the generator is consumed immediately and replaced with the resulting content string. If it's False, then either any middleware or partial functionality within some middleware that can't work without consuming the content generator are skipped?

07/13/08 21:58:48 changed by mrmachine

  • attachment streaming_response.3.diff added.

add stream_content argument.

07/13/08 22:01:18 changed by mrmachine

Updated patch to add a stream_content argument (defaults to False) to specify whether or not content generator should be consumed immediately. Middleware can then assume that if content generator exists, user intended it to be preserved. Default behaviour will be to consume content, and if middleware authors want to (and can) work with streaming content, all they need to do is set response.content to a new generator (and we make response.content_generator) available publicly so they can alter the original generator if one is set.

07/14/08 01:41:59 changed by isagalaev

if a certain middleware CAN run without breaking a content generator (e.g. GZip when not using WSGI), shouldn't it do so?

I don't think so. It would break feauture parity between mod_python and WSGI and thus will make it harder for users to switch deployment scheme. Furthermore WSGI requiring Content-Length is a good thing because otherwise browsers won't show download progress. I think it's a bad side effect.

What about setting a boolean attribute consume_generator on HttpResponse which defaults to True.

The point of my previous comment was that there shouldn't be a generic attribute at all. Because generically it doesn't make any sense. I really meant just this:

def my_view(request):
    response = HttpResponse(open('somefile'))
    if os.path.getsize('somefile') > 1024:
        response.dont_gzip = True

class GzipMiddleware:
    def process_response(self, request, response):
          if getattr(response, 'dont_gzip'):
              return

And similar attribute for CommonMiddleware? about etag.

In other words, I think that the fact that certain two middleware can skip their action in a very certain conditions should remain their own detail and shouldn't affect other parts of the framework.

07/14/08 05:46:29 changed by grahamd

The WSGI specification does NOT require that Content-Length be specified for response. If you are making assumptions based on that belief, they will be wrong.

02/25/09 13:51:44 changed by

  • milestone deleted.

Milestone post-1.0 deleted

02/25/09 22:47:44 changed by mrmachine

  • keywords changed from stream HttpResponse to stream HttpResponse Content-Length.

PEP333 may be relevant here, and seems to indicate that Content-Length may be empty or absent.

03/31/09 06:33:15 changed by mrts

  • milestone set to 1.2.

Re-targeting for 1.2.

05/21/09 13:45:15 changed by didditdoug

There is a bug in the function utils.text.compress_sequence provided in the patch. Because it never closes the zfile, it misses a few trailing bytes of output, causing invalid output. Some decompressors (e.g. many browsers) seem tolerant of this, but many (e.g. Googlebot, gunzip) are not. The fix is straightforward; patch to text.py attached. This also has a warning (feel free to exclude if you wish!) for users of the function that its compression performance is not as good as compress_string (though obviously it has other advantages). As a final note, in our local installation we use the Z_FULL_FLUSH option on zfile.flush(), with some penalty in compression performance, so that browsers can start rendering the streaming output immediately, without having to wait for further bytes to complete decompression. This may not be an appropriate change for all installations, so I didn't include it here, though it might be useful as an option.

Hope this is helpful-

05/21/09 13:45:59 changed by didditdoug

  • attachment text.py.diff added.

Bug fix for invalid compression output in compress_sequence


Add/Change #7581 (Middleware accessing HttpResponse.content breaks streaming HttpResponse objects.)




Change Properties
Action