Opened 16 years ago

Closed 12 years ago

Last modified 11 years ago

#7581 closed New feature (fixed)

Middleware accessing HttpResponse.content breaks streaming HttpResponse objects.

Reported by: Tai Lee Owned by: ccahoon
Component: Core (Other) Version: dev
Severity: Normal Keywords: stream HttpResponse Content-Length
Cc: real.human@…, mmitar@…, mindsocket Triage Stage: Accepted
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

ConditionalGetMiddleware, GZipMiddleware, CommonMiddleware, and CsrfMiddleware all access response.content directly or indirectly. This prevents a streaming response from being initiated until any generator passed to the HttpResponse as content is consumed, which can cause a timeout when trying to stream large dynamically generated content.

I've already put together a patch based on the assumption that HttpResponse._is_string being False indicates content has been passed in as a dynamic generator and thus we shouldn't delay streaming a response to the browser while consuming the entire generator.

The patch implements the following:

  • Allow middleware to assign a new generator to HttpResponse.content without consuming it (e.g. GZipMiddleware)
  • Compress chunks of content yielded by HttpResponse._container progressively in GZipMiddleware to allow streaming GZipped content
  • Only generate an ETag in CommonMiddleware from HttpResponse.content if HttpResponse._is_string is True
  • Only check that the length of HttpResponse.content is less than 200 in GZipMiddleware if HttpResponse._is_string is True
  • Only set the Content-Length header in GZipMiddleware and ConditionalGetMiddleware if HttpResponse._is_string is True

With CommonMiddleware enabled by default and breaking the streaming response functionality if ETags are enabled in settings.py, I'd consider this a bug. It can be worked around by manually specifying a bogus ETag before returning the response, which doesn't seem ideal.

With this patch, users still have the option of consuming a generator before passing it to the HttpResponse in order to enable ETag and Content-Length headers, and conditional GZipping when the content length is less than 200.

With CsrfMiddleware, the generator is only consumed when the Content-Type header is text/html or application/xhtml+xml, which may be an acceptable compromise - no streaming response for HTML content if you choose to use django.contrib.csrf.

This patch accesses HttpResponse._is_string and HttpResponse._container from external middleware classes. Perhaps these properties could be made public and/or renamed to be more logical in this context?

Attachments (8)

streaming_response.diff (6.8 KB ) - added by Tai Lee 16 years ago.
add HttpResponse.content_generator, consume generators only once, set Content-Length in WSGIHander.
streaming_response.2.diff (7.5 KB ) - added by Tai Lee <real.human@…> 16 years ago.
fix django.contrib.csrf to work with content generators.
streaming_response.3.diff (8.1 KB ) - added by Tai Lee 16 years ago.
add stream_content argument.
text.py.diff (686 bytes ) - added by didditdoug 16 years ago.
Bug fix for invalid compression output in compress_sequence
7581-Streaming-HttpResponse-r11381.diff (7.2 KB ) - added by Tai Lee 15 years ago.
Updated for trunk r11381.
7581-Streaming-HttpResponse-r16445.diff (6.6 KB ) - added by Tai Lee 13 years ago.
Update for trunk r16445.
7581-Streaming-HttpResponse-1.3.1.diff (6.6 KB ) - added by Chris Ghormley <chris@…> 13 years ago.
Patch for Django 1.3.1 (uses md5_constructor instead of hashlib.md5)
7581-20121020.patch (35.9 KB ) - added by Aymeric Augustin 12 years ago.

Download all attachments as: .zip

Change History (57)

comment:1 by Tai Lee, 16 years ago

milestone: 1.0post-1.0
Triage Stage: UnreviewedDesign decision needed

Changed to DDN after comments from Malcolm and Ivan on google groups.

The arguments against are that: 3rd party middleware authors will be unable to access the content for an HttpResponse because it's content may be an un-consumable generator; middleware will repetitively be required to replace HttpResponse.content with a string if a generator is consumed; Content-Length *must* always be set with WSGI.

The argument for is that 3rd party middleware should have the *option* of working with HttpResponse objects which have an un-consumable generator (e.g. GZipMiddleware). If middleware authors fail or choose not to account for this possibility, behaviour will remain as it is today, as it would be if generators were consumed immediately when instantiating an HttpResponse object.

A possible alternative raised by Malcolm is to add a way to bypass process_response in middleware (or bypass all middleware) for specific HttpResponse objects. However, this would disallow potentially useful middleware that *can* work without consuming the content generator (e.g. GZipMiddleware).

I've updated the patch with a few improvements:

  • Added HttpResponse.content_generator which will return either a generator if one is present. Middleware authors can use this to check for the existence of, or actually get the content generator. We no longer need to access the private attributes HttpResponse._is_string or HttpResponse._container from middleware. We can still assign a string or generator to HttpResponse.content in the case where we need to replace the content or generator in middleware.
  • Accessing HttpResponse.content will now consume any content generator that exists and replace HttpResponse._container with the resulting string, ensuring that the generator is only consumed once.
  • Added a set_content_length method to WSGIHandler.response_fixes to ensure that Content-Length is *always* set with WSGI. Although I have found some discussion to support the removal of Content-Length by WSGI servers with specific examples to GZipped content.

Requiring Content-Length on every single response which has content with WSGI will make streaming a response impossible under any circumstances. This is quite useful functionality (exporting large sets of data), so if it turns out to be acceptable to stream content through WSGI without a Content-Length header (as opposed to an incorrect Content-Length header), or the WSGI spec changes to allow for this, I'd love to see the WSGIHandler.set_content_length method removed.

FYI, I have tested this without HttpResponse.set_content_length and with GZipMiddleware enabled on Apache with mod_wsgi and it does work (stream) as intended.

Personally I don't feel that it is an unreasonable restriction to place on middleware authors by asking them to make use of HttpResponse.content_generator *if* they want or need their middleware to work with streaming content. It's also not unreasonable for developers to check that the middleware they are using will work with streaming content, if they require streaming content. After all, some middleware by their very nature will not work with streaming content.

by Tai Lee, 16 years ago

Attachment: streaming_response.diff added

add HttpResponse.content_generator, consume generators only once, set Content-Length in WSGIHander.

comment:2 by Ivan Sagalaev <Maniac@…>, 16 years ago

From reading the patch I'm not sure why _get_content_generator is needed. You can always iterate over HttpResponse by calling iter() on it. What am I missing?

comment:3 by Tai Lee <real.human@…>, 16 years ago

content_generator (and thus _get_content_generator) are needed so middleware have a public attribute to check for the existence of and return the actual content generator that was passed to HttpResponse. We could change _get_content_generator to return iter(self) instead, but we'd still need _get_content_generator to return a generator only when one was passed to HttpResponse.

I think there might also be a problem if we return iter(self) to middleware. Because HttpResponse.next encodes each chunk as a bytestring, this could potentially encode each chunk multiple times if middleware sets a new content generator which yields unicode (or other) data, e.g. response.content = (something(e) for e in iter(response)) if something returned unicode. We'd also be passing a bytestring to something instead of the original content chunk. Even if something did not return unicode, it would at least call str on each chunk multiple times.

Isn't it better to pass the original content generator to middleware, and only call HttpResponse.next on each chunk once when HttpResponse.content is finally accessed, which can only happen once as the generator is replaced with string content when that happens?

by Tai Lee <real.human@…>, 16 years ago

Attachment: streaming_response.2.diff added

fix django.contrib.csrf to work with content generators.

in reply to:  3 ; comment:4 by Ivan Sagalaev <Maniac@…>, 16 years ago

Replying to Tai Lee <real.human@mrmachine.net>:

content_generator (and thus _get_content_generator) are needed so middleware have a public attribute to check for the existence of and return the actual content generator that was passed to HttpResponse. We could change _get_content_generator to return iter(self) instead, but we'd still need _get_content_generator to return a generator only when one was passed to HttpResponse.

Sorry, I wasn't clear. By "why" I meant a deeper meaning: why a middleware would ever want to know if a response has a generator as a source or something else. From the middleware's point of view content of a response can be obtained either wholly with response.content or as a stream with iter(response). After discussion on the list I agree with Malcolm that middleware already knows if it wants the whole content or can stream it.

I think there might also be a problem if we return iter(self) to middleware. Because HttpResponse.next encodes each chunk as a bytestring, this could potentially encode each chunk multiple times if middleware sets a new content generator which yields unicode

I have as many as 3 points to say about it, please don't fall asleep! :-)

  1. As matters stand now, only byte strings should appear on the output side of the response. Which means that every middleware that does something to a response should work on byte strings and return byte strings. Actually it might be useful to have this conversion as late as possible -- in the WSGIHandler -- but it's another thing anyway.
  1. When HttpResponse iterates over itself the first time it should just save already encoded stream to avoid encoding multiple times as well as iterating. I mean instead of self._container = [''.join(self._container)] it should be like self._container = [''.join([s for s in self])]
  1. More importantly, I think every piece of code that can access response in a streaming fashion should never exhaust it entirely. Not because it wouldn't work but because it's not efficient. As an example GzipMiddleware can replace response's generator with its own but shouldn't read a single chunk from it.

BTW I now understand why you're exposing content_generator publicly: a middleware would want to replace it leaving other aspects of a response intact, right? If yes I think it could be done in a more functional fashion: provide a helper response wrapper that will have its own generator wrapper but will delegate everything else to a wrapped response. Like this:

class HttpResponseWrapper(object):
    def __init__(self, response, iterable):
        self.response, self.iterable = response, iterable
    def __iter__(self):
        return iter(self.iterable)
    def __getattr__(self, name):
        return getattr(self.response, name)

response = HttpResponseWrapper(response, gzip_content_generator(response))

in reply to:  4 comment:5 by Tai Lee <real.human@…>, 16 years ago

Replying to Ivan Sagalaev <Maniac@SoftwareManiacs.Org>:

Sorry, I wasn't clear. By "why" I meant a deeper meaning: why a middleware would ever want to know if a response has a generator as a source or something else. From the middleware's point of view content of a response can be obtained either wholly with response.content or as a stream with iter(response). After discussion on the list I agree with Malcolm that middleware already knows if it wants the whole content or can stream it.

I think that middleware needs to be able to modify the response in different ways (or not at all) if the content is derived from a generator, so there needs to be a public attribute we can check to determine if the content is derived from a generator or not. An example is that CommonMiddleware should not generate an ETag or ConditionalGetMiddleware should not set Content-Length if content is derived from a generator. Another is that GZipMiddleware can compress the entire content in one operation if content is not derived from a generator, which may be more efficient or result in higher compression then compressing each chunk sequentially?

  1. As matters stand now, only byte strings should appear on the output side of the response. Which means that every middleware that does something to a response should work on byte strings and return byte strings. Actually it might be useful to have this conversion as late as possible -- in the WSGIHandler -- but it's another thing anyway.

By giving middleware access to the original content generator we can delay the conversion to byte string until HttpResponse.content is accessed, either by a middleware that requires the entire content or by WSGIHandler, and be sure that it occurs only once.

  1. When HttpResponse iterates over itself the first time it should just save already encoded stream to avoid encoding multiple times as well as iterating. I mean instead of self._container = [''.join(self._container)] it should be like self._container = [''.join([s for s in self])]

If encoding should be performed in next, then should _get_content be altered to simply return ''.join(self._container) after consuming the generator with self._container = [''.join(self)]? And should next behave like _get_content does now and not encode if Content-Encoding is set, otherwise use smart_str?

  1. More importantly, I think every piece of code that can access response in a streaming fashion should never exhaust it entirely. Not because it wouldn't work but because it's not efficient. As an example GzipMiddleware can replace response's generator with its own but shouldn't read a single chunk from it.

Indeed, those middleware that can operate on chunks of content should not consume the content generator, but should only replace it with a new generator. This is the problem solved by this patch where middleware that are not aware of a content generator consume it and break streaming.

BTW I now understand why you're exposing content_generator publicly: a middleware would want to replace it leaving other aspects of a response intact, right? If yes I think it could be done in a more functional fashion: provide a helper response wrapper that will have its own generator wrapper but will delegate everything else to a wrapped response. Like this:

Middleware can already replace the content generator with response.content = (something(chunk) for chunk in response.content_generator), because _set_content will check if the value is an iterable or a string.

As mentioned above, I think we need to expose content_generator for two purposes. The first is to indicate to middleware that the response is derived from a generator so that the middleware can skip incompatible functionality (e.g. ETag generation) or replace the content generator instead of replacing the content directly.

The second is to provide access to the original content generator. Currently next will encode any value that is unicode. If a middleware sets a new content generator which yields a unicode value (e.g. response.content = (unicode(chunk) for chunk in response), the encoding will occur twice.

comment:6 by James Bennett, 16 years ago

The more I think about this, the more I'm -1 on it. The complications of trying to force lazily-evaluated responses are far too numerous (ranging all the way from resource leaks on the Django side out to things like WSGI forbidding useful HTTP features which help with this) and the existence of specialized solutions for doing long-lived HTTP and/or asynchronous stuff are too plentiful to justify Django special-casing enough stuff to make this work.

comment:7 by Tai Lee, 16 years ago

I don't understand the resistence. Streaming HttpResponse functionality has been in trunk for around 2 years now. All I'm suggesting is that we fix certain middleware that ships with Django to work with the streaming functionality that's already present in trunk where possible and appropriate - e.g. GZip, ETag, etc., and to add a public attribute to HttpResponse to aid the conditional execution of middleware code.

If there are certain middleware that cannot work with dynamic content generators, they continue to work as they do now. If there are certain handlers (WSGI) which don't at present work without a Content-Length header, they continue to work as they do now. Middleware authors can continue to develop middleware without even being aware that anything has changed. Existing 3rd party middleware will continue to work as it does now. This should be backwards compatible.

I don't see how what's been added (essentially just skipping certain middleware functionality if a generator was used, or being able to replace an existing generator with a new one) would cause any of the numerous complications you mention that aren't already and haven't already been present for the past 2 years.

comment:8 by Ivan Sagalaev, 16 years ago

I was thinking about this some time and changed my mind a little...

I agree with Tai Lee that this ticket is not about some abstract ultimate way for keeping streaming responses remain streaming after middleware. It's about fixing some points that can be fixed. However I now think that current patch does it the wrong way for two reasons.

  1. It relies on how the response was constructed which doesn't always correlate with the need to apply certain middleware. For example one may still want to count ETag or gzip content of small files which will be treated as generators.
  1. Gzipping a stream just can't work because WSGI wants Content-length to be set for response and it's impossible to count without gzipping the whole contents.

So may be what's really needed here is a simple way for a view to explicitly flag the response with "don't gzip it" or "don't etag it".

comment:9 by Tai Lee, 16 years ago

I agree that it would also be useful to be able to disable certain middleware per HttpResponse. You are correct in saying that the way a response is created won't always indicate which middleware we want to disable. However, if a certain middleware CAN run without breaking a content generator (e.g. GZip when not using WSGI), shouldn't it do so?

How would such a feature be implemented? Pass a list of middleware (e.g. ['django.middleware.common.CommonMiddleware', 'django.middleware.gzip.GZipMiddleware'])? That could be prone to error, requiring people to remember which middleware are required to be removed in order to preserve streaming functionality.

What about setting a boolean attribute consume_generator on HttpResponse which defaults to True. If this is True, then the generator is consumed immediately and replaced with the resulting content string. If it's False, then either any middleware or partial functionality within some middleware that can't work without consuming the content generator are skipped?

by Tai Lee, 16 years ago

Attachment: streaming_response.3.diff added

add stream_content argument.

comment:10 by Tai Lee, 16 years ago

Updated patch to add a stream_content argument (defaults to False) to specify whether or not content generator should be consumed immediately. Middleware can then assume that if content generator exists, user intended it to be preserved. Default behaviour will be to consume content, and if middleware authors want to (and can) work with streaming content, all they need to do is set response.content to a new generator (and we make response.content_generator) available publicly so they can alter the original generator if one is set.

comment:11 by Ivan Sagalaev, 16 years ago

if a certain middleware CAN run without breaking a content generator (e.g. GZip when not using WSGI), shouldn't it do so?

I don't think so. It would break feauture parity between mod_python and WSGI and thus will make it harder for users to switch deployment scheme. Furthermore WSGI requiring Content-Length is a good thing because otherwise browsers won't show download progress. I think it's a bad side effect.

What about setting a boolean attribute consume_generator on HttpResponse which defaults to True.

The point of my previous comment was that there shouldn't be a generic attribute at all. Because generically it doesn't make any sense. I really meant just this:

def my_view(request):
    response = HttpResponse(open('somefile'))
    if os.path.getsize('somefile') > 1024:
        response.dont_gzip = True

class GzipMiddleware:
    def process_response(self, request, response):
          if getattr(response, 'dont_gzip'):
              return

And similar attribute for CommonMiddleware about etag.

In other words, I think that the fact that certain two middleware can skip their action in a very certain conditions should remain their own detail and shouldn't affect other parts of the framework.

comment:12 by Graham Dumpleton, 16 years ago

The WSGI specification does NOT require that Content-Length be specified for response. If you are making assumptions based on that belief, they will be wrong.

comment:13 by (none), 16 years ago

milestone: post-1.0

Milestone post-1.0 deleted

comment:14 by Tai Lee, 16 years ago

Keywords: Content-Length added

PEP333 may be relevant here, and seems to indicate that Content-Length may be empty or absent.

comment:15 by mrts, 16 years ago

milestone: 1.2

Re-targeting for 1.2.

comment:16 by didditdoug, 16 years ago

There is a bug in the function utils.text.compress_sequence provided in the patch. Because it never closes the zfile, it misses a few trailing bytes of output, causing invalid output. Some decompressors (e.g. many browsers) seem tolerant of this, but many (e.g. Googlebot, gunzip) are not. The fix is straightforward; patch to text.py attached. This also has a warning (feel free to exclude if you wish!) for users of the function that its compression performance is not as good as compress_string (though obviously it has other advantages). As a final note, in our local installation we use the Z_FULL_FLUSH option on zfile.flush(), with some penalty in compression performance, so that browsers can start rendering the streaming output immediately, without having to wait for further bytes to complete decompression. This may not be an appropriate change for all installations, so I didn't include it here, though it might be useful as an option.

Hope this is helpful-

by didditdoug, 16 years ago

Attachment: text.py.diff added

Bug fix for invalid compression output in compress_sequence

comment:17 by ccahoon, 15 years ago

Owner: changed from nobody to ccahoon

See also #2504.

comment:18 by Tai Lee, 15 years ago

I just updated the patch to work with trunk r11381 and also rolled in the compress_sequence fix. I also removed the set_content_length parts for WSGI, since it appears to be valid in HTTP 1.0/1.1 and WSGI to omit this header when the content length is not known. The full test suite passes. I've been using this patch in my local branch of Django for over a year now, first with Apache/mod_python, and now with Apache/mod_wsgi, and have not noticed any ill-effects.

I've discussed this with ccahoon recently who is working on the http-wsgi-improvements branch, and I cannot see a way for any middleware to ever work with a streaming response (when possible and appropriate, e.g. gzip) without having a hook back into the HttpResponse object where it can alter the existing generator in-place. I think this is a fairly useful ability for middleware, so I'd still like to see any generator passed in as content to HttpResponse exposed for middleware to inspect and replace, or automatically disable any functionality that is irrelevant to generator content (e.g. etags).

I don't believe that this is the complete solution. There will still be cases where middleware *must* consume the content in order to function (e.g. csrf, which is conditionally applied to HTML content types). For these cases there needs to be a way for users to disable specific middleware for a specific response, so that streaming can still function, and this may also be useful for other purposes.

I suggest that we simply add an attribute HttpResponse.disabled_middleware (list or tuple). Any installed middleware specified there should be bypassed when applying response middleware. Django could then ship with an HttpStreamingResponse class that disables any included middleware that is known not to work with generator content, and this could easily be subclassed and updated by users who need to disable any 3rd party middleware to create a streaming response.

My own use-case for streaming responses is that I often need to export large sets of model data in CSV format, sometimes manipulating the data as I go which can result in a significant wait before the entire content is consumed. Besides the also significant memory usage, the delay before content delivery can start often causes browser timeouts. I'm hesitant to simply increase the timeout because it's not really solving the problem, and users will often think that something is wrong if a download doesn't start quickly.

My only alternative without streaming responses is to look into an offline job queue solution, which seems further outside of the scope of django, and introduces a requirement for notifying users when their download is complete.

by Tai Lee, 15 years ago

Updated for trunk r11381.

comment:19 by ccahoon, 15 years ago

Changeset [11449] on the branch sco2009/http-wsgi-improvements has my attempt at solving this problem, using a new response class (http.HttpResponseStreaming) and class attributes for streaming-safe response middleware. It has tests and docs. I am not sure how I feel about the test coverage, but it does show that at least it operates correctly as an HttpResponse, and that the content doesn't get consumed on initialization of the response. I used code and ideas from the existing patches here with some additional changes.

comment:20 by Luke Plant, 15 years ago

FYI - the use of HttpResponse.content in the CSRF middleware should hopefully go away completely in Django 1.4. Before then (Django 1.2 hopefully), it will be isolated in a deprecated middleware, replaced by a streaming friendly method. See CsrfProtection for more.

comment:21 by anonymous, 15 years ago

For people with problems with file streaming you can use a simple workaround http://www.djangosnippets.org/snippets/1782/

comment:22 by Jacob, 15 years ago

milestone: 1.2

Bumping from 1.2: there's still no agreement about the correct approach here.

comment:23 by Luke Plant, 14 years ago

Severity: Normal
Type: New feature

by Tai Lee, 13 years ago

Update for trunk r16445.

comment:24 by Tai Lee, 13 years ago

Easy pickings: unset
UI/UX: unset

Just updated the patch to apply cleanly against trunk again. Removed (now) redundant changes to CSRF middleware (it no longer breaks streaming). Also made _get_content_generator() consistent with _get_content() in that it now calls smart_str() on each chunk. This was causing me problems when passing a generator that yielded unicode strings instead of bytecode strings, which was breaking compress_sequence().

comment:25 by Paul McMillan, 13 years ago

Patch needs improvement: set

I'm afraid this patch broke again with r16829. I think that change should make this patch quite a bit cleaner in the long run.

comment:26 by Jacob, 13 years ago

Triage Stage: Design decision neededAccepted

It's clear to me that we have to do *something* about this (though *what* is far from obvious).

by Chris Ghormley <chris@…>, 13 years ago

Patch for Django 1.3.1 (uses md5_constructor instead of hashlib.md5)

comment:27 by Chris Ghormley <chris@…>, 13 years ago

It doesn't advance the discussion on what to do next, but the patch I just added should let people like me who wanted to painlessly apply mrmachine's patch to Django 1.3.1, do so.

comment:28 by Mitar, 13 years ago

Cc: mmitar@… added

comment:29 by Tai Lee, 12 years ago

I've updated this patch on a branch at GitHub.

https://github.com/thirstydigital/django/tree/tickets/7581-streaming-response-middleware

If somebody can provide some concrete direction on the desired approach to resolve this (BDFL decision?), I'll try to add docs and tests as well.

FYI, I've been using this patch in various incarnations for 4 years in production environments without any problems.

My vote is still for allowing middleware to ask the response if it is streaming (generator as content) or not, and act accordingly. Either by changing their behaviour (to be compatible with streaming responses), doing nothing (to avoid breaking a streaming response), or forcing the response to consume the content (breaking the streaming response, but ensuring that the middleware always runs).

comment:30 by Tai Lee, 12 years ago

Updated this patch with some minor tweaks arising from IRC discussion with akaariai. Consume generator content on access, not assignment, so that generator content can still stream if no middleware has to access response.content. Raise PendingDeprecationWarning if you explicitly create an HttpResponse with streaming_content=True and then you (or some middleware) accesses response.content.

Also see recent google developers discussion at https://groups.google.com/d/topic/django-developers/RrNPfuJxnlM/discussion

comment:31 by mindsocket, 12 years ago

Cc: mindsocket added

comment:32 by Tai Lee, 12 years ago

Needs documentation: set

I've updated the patch on my branch after further feedback from Aymeric Augustin and Anssi Kääriäinen on google groups discussion.

The new patch refactors HttpResponse into HttpResponseBase and HttpResponse classes, and adds a new HttpStreamingResponse class.

The HttpResponse class is simplified, and will consume iterator content whenever it is assigned, so that content can be safely accessed multiple times.

The HttpStreamingResponse class exposes a new attribute, streaming_content. If you assign an iterator as streaming content, it will not be consumed until the response is iterated. The streaming content can be wrapped with a new generator (e.g. GZipMiddleware). The class prevents accidental assignment to the content attribute, to avoid confusing middleware that checks for a content attribute on response objects.

Both classes use HttpResponseBase.make_bytes() to yield bytestrings when iterated, or a single bytestring when HttpResponse.content is accessed. This has simplified the implementation a little and removed some duplication.

The close() method will now close all "file-like" content objects that were assigned to content or streaming_content, even if they are later replaced by different content. I think the old code would have left unclosed objects in this case.

As an added bonus, you can now write() to both regular and streaming responses, when iterable or non-iterable content is assigned. For streaming responses, this just wraps the old streaming content with a new generator that yields an additional chunk at the end. For regular responses, assigned content is always normalised to a list now, so we can always append additional content.

Middleware have been updated for compatibility with streaming responses. They now check responses for content or streaming_content before attempting to read or replace those attributes. New and 3rd party middleware should follow this example.

We now have tests, too, and the full test suite passes here (2.7, OS X, SQLite). If this approach is accepted, I will add docs as well.

comment:33 by Tai Lee, 12 years ago

I've updated the branch again slightly, to use HttpStreamingResponse in the django.views.static.serve view. This is MASSIVELY faster on my local machine. It was slightly less of an improvement with GZipMiddleware enabled, but still a huge improvement compared to regular responses.

I noticed when I was testing a site that hosts a 16MB iPad app. When I tried to install the app on the iPad from my local dev/test site, it was *extremely* slow. So I tried to download the file directly with Firefox on the same machine. Firefox was pulling it down at 60-70KB/s from localhost.

When using HttpStreamingResponse instead, it's instantaneous. This is not only useful or noticeable with large files, but in general all images and other static content now loads MUCH faster with the dev server. Previously I could literally see even small files (60KB images) being progressively drawn to the page. Now it's all instant. Much nicer during development, especially on image heavy sites.

I also found that django.http.utils.conditional_content_removal() was accessing response.content directly. I've updated that to work with regular and streaming responses. I couldn't find *any* tests for this function (or any in django.http.utils, so I added a new regression tests package, http_utils.

comment:34 by Anssi Kääriäinen, 12 years ago

Quick skimming, and looks good. Could you create a pull request for easier reviewing?

I will try to see if we can get this into 1.5. Unfortunately HTTP protocol isn't my strongest area, so help is welcome for reviewing...

comment:35 by Tai Lee, 12 years ago

Needs documentation: unset
Patch needs improvement: unset

I've added docs (patch should be complete, now), and opened a pull request.

https://github.com/django/django/pull/407

comment:36 by Aymeric Augustin, 12 years ago

There's a lot of history on this ticket; I reviewed as much as possible but couldn't read everything. Here's the current status of my review.

I'm still at the design decision level; I haven't looked at the code yet.

Hard requirements

  • The behavior of responses instantiated with a string must not change.
    • Below, I'm only considering responses instantiated with an iterator (often a generator).
  • Accessing request.content in a regular HttpResponse must exhaust the iterator and switch to a string-based response, which can be repeatedly accessed.

Requirements

  • Discussions on CSRF middleware are moot for the reason explained by Luke in comment 20.
  • Discussions on the Content-Length header can be ignored; it's optional and should be omitted if it can't be computed efficiently.
  • Discussions about conditionally disabling response middleware are outside of the scope of this ticket.
    • I'm -1 on HttpResponse.disabled_middleware and all similar ideas, especially those requiring new settings — it's already way too difficult to figure out how to make middleware run in a meaningful order.

Personal opinions

  • I consider streaming a large CSV export as a valid use case. Ultimately we should be able to stream from the database to the end user (I know this isn't possible with django.db right now). The two other uses cases brought up in the latest discussion are also interesting.
  • I have absolutely no interest in fixing GzipMiddleware. I'd rather deprecate it, on the grounds that it's prohibited by PEP 3333 (and everyone should be using WSGI these days).

Questions

  • One of the goals is to switch some parts of Django (e.g. django.views.static.serve) to streaming by default. This will be backward incompatible if the StreamingHttpResponse API isn't (at least temporarily) a superset of the HttpResponse API — specifically if accessing StreamingHttpResponse.content raises an exception.
  • In the long run accessing StreamingHttpResponse.content should raise an exception, in order to guarantee the streaming behavior isn't accidentally broken by middleware.
    • This is why StreamingHttpResponse provides the content in a different attribute: streaming_content
  • We could use a deprecation path to solve both issues at the same time: StreamingHttpResponse.content would work initially, but be deprecated and removed in 1.7.
  • Middleware must be able to provide different implementation for regular and streamed responses.
    • The suggested idiom is if hasattr(response, 'streaming_content'): .... This is more pythonic than the alternative if isinstance(response, StreamingHttpResponse): ... but it still looks wrong.
    • I'm considering adding get_streaming_response to the response middleware API. If a response is a streaming response (which would be determined by duck-typing: hasattr(response, 'streaming_content')) and the middleware provides a get_streaming_response method, that method would be called. Otherwise get_response would be called. Middleware authors who want to support streaming responses will have to implement that new method.

comment:37 by Tai Lee, 12 years ago

I assume you meant process_response() and process_streaming_response() methods for middleware classes? I think this would be much nicer to work with, and easier to explain to middleware authors than asking them to do hasattr(response, 'streaming_content'). But it would mean a little more work to refactor some of the bundled middleware classes.

One thing you didn't mention is automatically closing file-like objects passed as content to regular and streaming responses.

For streaming responses, Django should close them automatically, because users won't be able to. I think that we should do the same for regular responses as well for consistency and to avoid accidentally leaving files open.

I'm not sure if this would be a backwards incompatible change, or if it would need a deprecation path, or if a deprecation path would be possible? I guess this could be implemented separately, if it decided to go ahead.

comment:38 by Aymeric Augustin, 12 years ago

Yes I meant process_response.

I didn't look at the closing issue yet, but it shouldn't be a problem.

comment:39 by Aymeric Augustin, 12 years ago

Design decisions

  • I'm not going to refactor HttpResponse right now; there's another ticket about that.
  • I'm going to introduce a subclass of StreamingHttpResponse that provides a deprecated content attribute, and use that class in the static serve view, to preserve backwards compatibility.
  • I'm going to introduce a response.streaming boolean flag to avoid the hasattr(response, 'streaming_content') pattern.
  • I'm not going to change the middleware API.
Version 0, edited 12 years ago by Aymeric Augustin (next)

comment:40 by Aymeric Augustin, 12 years ago

I've finished working on a patch. Barring unexpected objections, I'll commit it soon.

To keep things manageable, I have made the minimum changes required to support streaming responses and enable them in the static serve view. I've taken care not to change the behavior of HttpResponse in any way. (I've saved that for #18796.)

I've kept the tests written by mrmachine (with the exception of those that tested changes to HttpResponse). I've rewritten the code to make fewer changes, because I didn't trust myself enough to push such a large and sensitive patch in a single commit.

by Aymeric Augustin, 12 years ago

Attachment: 7581-20121020.patch added

comment:41 by Aymeric Augustin, 12 years ago

mrmachine's patch contained code to call close() on the response object when the generator terminates.

This isn't necessary, because the WSGI server is already required to call that method when the response is finished.

I removed it.

comment:42 by Aymeric Augustin <aymeric.augustin@…>, 12 years ago

Resolution: fixed
Status: newclosed

In 4b27813198ae31892f1159d437e492f7745761a0:

Fixed #7581 -- Added streaming responses.

Thanks mrmachine and everyone else involved on this long-standing ticket.

comment:43 by Tai Lee, 12 years ago

Excellent. Very happy to see this committed. I have a couple of comments that may or may not be important.

  1. We're not checking if the close attribute on assigned content is callable before adding it to _closable_objects? Probably an edge case not worth worrying about, though.
  1. Should we pop closable objects off the list before closing them (or just empty the list after closing all items), so that multiple calls to close() don't re-attempt to close objects already closed? Graham's blog seems to indicate that multiple WSGI middleware may each call close() on the response? http://blog.dscpl.com.au/2012/10/obligations-for-calling-close-on.html
  1. Returning self._iterator directly in HttpStreamingResponse.streaming_content, instead of a new iterator that yields bytestrings means that middleware authors can't expect consistent bytestring chunks when wrapping streaming_content in a new iterator? They may have to type check or coerce each chunk (potentially reproducing something like make_bytes()? For comparison, HttpResponse.content always returns a bytestring.
  1. There may still be a case for consuming iterators assigned to HttpResponse.content immediately, rather than on first access. Maybe after a deprecation period, though? If HttpResponse._container is an iterator, the response can't be pickled by cache middleware. It would also allow additional content to be written to the response. This is probably an issue for another ticket, anyway.
  1. We should add an ETag header in the serve() view, if etags are enabled in settings. ETags are normally added by CommonMiddleware, but middleware can't calculate an ETag for a streaming response, and we do have access to the raw file in serve() so it should be done there (like we do for Last-Modified and Content-Length).
  1. The note in the docs about iterators being closed after the response is iterated as a point of difference from regular responses is not true anymore. We rely on WSGI to call close() on both regular and streaming responses.
  1. The http_utils test wasn't included?

in reply to:  43 comment:44 by Aymeric Augustin, 12 years ago

Replying to mrmachine:

  1. We're not checking if the close attribute on assigned content is callable before adding it to _closable_objects? Probably an edge case not worth worrying about, though.

Yes, primarily because callable is gone in Python 3.

If someone has a custom file-like / iterator object that has a close attribute that isn't callable, that'll teach him a bit of API design principles :)

  1. Should we pop closable objects off the list before closing them (or just empty the list after closing all items), so that multiple calls to close() don't re-attempt to close objects already closed? Graham's blog seems to indicate that multiple WSGI middleware may each call close() on the response? http://blog.dscpl.com.au/2012/10/obligations-for-calling-close-on.html

I'm not sure which part of that blog post you're referring to. This sentence indicates that close() should be called once:

The intent of that statement, at least from the perspective of the WSGI server, is that close() only be called once all content has been consumed from the iterable and that content has been sent to the client.

Furthermore, close() can usually be called on objects that are already closed; it just does nothing.

  1. Returning self._iterator directly in HttpStreamingResponse.streaming_content, instead of a new iterator that yields bytestrings means that middleware authors can't expect consistent bytestring chunks when wrapping streaming_content in a new iterator? They may have to type check or coerce each chunk (potentially reproducing something like make_bytes()? For comparison, HttpResponse.content always returns a bytestring.

That's what #18796 is about: normalizing the conversion of the content to bytes (which is moot because content is already bytes in 99.99% of responses).

I'll copy your remark over there.

  1. There may still be a case for consuming iterators assigned to HttpResponse.content immediately, rather than on first access. Maybe after a deprecation period, though? If HttpResponse._container is an iterator, the response can't be pickled by cache middleware.

If that's true, it's a bug of the cache middleware, which can easily be fixed by making it access response.content. If that bug wasn't reported yet, could you create a new ticket?

It would also allow additional content to be written to the response. This is probably an issue for another ticket, anyway.

Allowing writes to a regular response instantiated with an iterator is #6527. I've uploaded a patch extracted from your work over there.

  1. We should add an ETag header in the serve() view, if etags are enabled in settings. ETags are normally added by CommonMiddleware, but middleware can't calculate an ETag for a streaming response, and we do have access to the raw file in serve() so it should be done there (like we do for Last-Modified and Content-Length).

This never came into the discussion until now, could you create a new ticket for this feature request?

  1. The note in the docs about iterators being closed after the response is iterated as a point of difference from regular responses is not true anymore. We rely on WSGI to call close() on both regular and streaming responses.

I'll fix that.

  1. The http_utils test wasn't included?

Oops, my git-fu must have failed me. I'll commet them separately.

comment:45 by Aymeric Augustin <aymeric.augustin@…>, 12 years ago

In 5e629a015e96a5564a0b0a273e1374b9651ce8e9:

Added tests for conditional_content_removal.

Refs #7581. Thanks mrmachine.

comment:46 by Aymeric Augustin <aymeric.augustin@…>, 12 years ago

In cd914175c8209c5f366e87d94f8f6d050347757d:

[1.5.x] Prevented caching of streaming responses.

The test introduced in 4b278131 accidentally passed because of a
limitation of Python < 3.3.

Refs #17758, #7581.

Backport of 1c8be95 from master.

comment:47 by Aymeric Augustin <aymeric.augustin@…>, 12 years ago

In 1c8be95a864540d416602577d1aa03d58ba33168:

Prevented caching of streaming responses.

The test introduced in 4b278131 accidentally passed because of a
limitation of Python < 3.3.

Refs #17758, #7581.

comment:48 by anonymous, 12 years ago

It should be noted that the Apache2 webserver by default (at least on Ubuntu Linux) enables mod_deflate that suffers from the same issue and thus effectively prevents streaming. With mod_deflate disabled, streaming with Django works very nice.

comment:49 by Aymeric Augustin <aymeric.augustin@…>, 11 years ago

In 8b9b8d3bda09eb1b447631182d06c6c5e51425f6:

Removed compatibility code for streaming responses.

This code provided a deprecation path for old-style streaming responses.

Refs #6527, #7581.

Note: See TracTickets for help on using tickets.
Back to Top