#7581 closed New feature (fixed)
Middleware accessing HttpResponse.content breaks streaming HttpResponse objects.
Reported by: | Tai Lee | Owned by: | ccahoon |
---|---|---|---|
Component: | Core (Other) | Version: | dev |
Severity: | Normal | Keywords: | stream HttpResponse Content-Length |
Cc: | real.human@…, mmitar@…, mindsocket | Triage Stage: | Accepted |
Has patch: | yes | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
ConditionalGetMiddleware
, GZipMiddleware
, CommonMiddleware
, and CsrfMiddleware
all access response.content directly or indirectly. This prevents a streaming response from being initiated until any generator passed to the HttpResponse
as content is consumed, which can cause a timeout when trying to stream large dynamically generated content.
I've already put together a patch based on the assumption that HttpResponse._is_string
being False
indicates content has been passed in as a dynamic generator and thus we shouldn't delay streaming a response to the browser while consuming the entire generator.
The patch implements the following:
- Allow middleware to assign a new generator to
HttpResponse.content
without consuming it (e.g.GZipMiddleware
)
- Compress chunks of content yielded by
HttpResponse._container
progressively inGZipMiddleware
to allow streaming GZipped content
- Only generate an ETag in
CommonMiddleware
fromHttpResponse.content
ifHttpResponse._is_string
isTrue
- Only check that the length of
HttpResponse.content
is less than 200 inGZipMiddleware
ifHttpResponse._is_string
isTrue
- Only set the Content-Length header in
GZipMiddleware
andConditionalGetMiddleware
ifHttpResponse._is_string
isTrue
With CommonMiddleware
enabled by default and breaking the streaming response functionality if ETags are enabled in settings.py
, I'd consider this a bug. It can be worked around by manually specifying a bogus ETag before returning the response, which doesn't seem ideal.
With this patch, users still have the option of consuming a generator before passing it to the HttpResponse
in order to enable ETag and Content-Length headers, and conditional GZipping when the content length is less than 200.
With CsrfMiddleware
, the generator is only consumed when the Content-Type header is text/html
or application/xhtml+xml
, which may be an acceptable compromise - no streaming response for HTML content if you choose to use django.contrib.csrf
.
This patch accesses HttpResponse._is_string
and HttpResponse._container
from external middleware classes. Perhaps these properties could be made public and/or renamed to be more logical in this context?
Attachments (8)
Change History (57)
comment:1 by , 16 years ago
milestone: | 1.0 → post-1.0 |
---|---|
Triage Stage: | Unreviewed → Design decision needed |
by , 16 years ago
Attachment: | streaming_response.diff added |
---|
add HttpResponse.content_generator, consume generators only once, set Content-Length in WSGIHander.
comment:2 by , 16 years ago
From reading the patch I'm not sure why _get_content_generator is needed. You can always iterate over HttpResponse by calling iter() on it. What am I missing?
follow-up: 4 comment:3 by , 16 years ago
content_generator
(and thus _get_content_generator
) are needed so middleware have a public attribute to check for the existence of and return the actual content generator that was passed to HttpResponse
. We could change _get_content_generator
to return iter(self)
instead, but we'd still need _get_content_generator
to return a generator only when one was passed to HttpResponse
.
I think there might also be a problem if we return iter(self)
to middleware. Because HttpResponse.next
encodes each chunk as a bytestring, this could potentially encode each chunk multiple times if middleware sets a new content generator which yields unicode (or other) data, e.g. response.content = (something(e) for e in iter(response))
if something
returned unicode. We'd also be passing a bytestring to something
instead of the original content chunk. Even if something
did not return unicode, it would at least call str
on each chunk multiple times.
Isn't it better to pass the original content generator to middleware, and only call HttpResponse.next
on each chunk once when HttpResponse.content
is finally accessed, which can only happen once as the generator is replaced with string content when that happens?
by , 16 years ago
Attachment: | streaming_response.2.diff added |
---|
fix django.contrib.csrf to work with content generators.
follow-up: 5 comment:4 by , 16 years ago
Replying to Tai Lee <real.human@mrmachine.net>:
content_generator
(and thus_get_content_generator
) are needed so middleware have a public attribute to check for the existence of and return the actual content generator that was passed toHttpResponse
. We could change_get_content_generator
to returniter(self)
instead, but we'd still need_get_content_generator
to return a generator only when one was passed toHttpResponse
.
Sorry, I wasn't clear. By "why" I meant a deeper meaning: why a middleware would ever want to know if a response has a generator as a source or something else. From the middleware's point of view content of a response can be obtained either wholly with response.content
or as a stream with iter(response)
. After discussion on the list I agree with Malcolm that middleware already knows if it wants the whole content or can stream it.
I think there might also be a problem if we return
iter(self)
to middleware. BecauseHttpResponse.next
encodes each chunk as a bytestring, this could potentially encode each chunk multiple times if middleware sets a new content generator which yields unicode
I have as many as 3 points to say about it, please don't fall asleep! :-)
- As matters stand now, only byte strings should appear on the output side of the response. Which means that every middleware that does something to a response should work on byte strings and return byte strings. Actually it might be useful to have this conversion as late as possible -- in the WSGIHandler -- but it's another thing anyway.
- When HttpResponse iterates over itself the first time it should just save already encoded stream to avoid encoding multiple times as well as iterating. I mean instead of
self._container = [''.join(self._container)]
it should be likeself._container = [''.join([s for s in self])]
- More importantly, I think every piece of code that can access response in a streaming fashion should never exhaust it entirely. Not because it wouldn't work but because it's not efficient. As an example GzipMiddleware can replace response's generator with its own but shouldn't read a single chunk from it.
BTW I now understand why you're exposing content_generator publicly: a middleware would want to replace it leaving other aspects of a response intact, right? If yes I think it could be done in a more functional fashion: provide a helper response wrapper that will have its own generator wrapper but will delegate everything else to a wrapped response. Like this:
class HttpResponseWrapper(object): def __init__(self, response, iterable): self.response, self.iterable = response, iterable def __iter__(self): return iter(self.iterable) def __getattr__(self, name): return getattr(self.response, name) response = HttpResponseWrapper(response, gzip_content_generator(response))
comment:5 by , 16 years ago
Replying to Ivan Sagalaev <Maniac@SoftwareManiacs.Org>:
Sorry, I wasn't clear. By "why" I meant a deeper meaning: why a middleware would ever want to know if a response has a generator as a source or something else. From the middleware's point of view content of a response can be obtained either wholly with
response.content
or as a stream withiter(response)
. After discussion on the list I agree with Malcolm that middleware already knows if it wants the whole content or can stream it.
I think that middleware needs to be able to modify the response in different ways (or not at all) if the content is derived from a generator, so there needs to be a public attribute we can check to determine if the content is derived from a generator or not. An example is that CommonMiddleware
should not generate an ETag or ConditionalGetMiddleware
should not set Content-Length if content is derived from a generator. Another is that GZipMiddleware
can compress the entire content in one operation if content is not derived from a generator, which may be more efficient or result in higher compression then compressing each chunk sequentially?
- As matters stand now, only byte strings should appear on the output side of the response. Which means that every middleware that does something to a response should work on byte strings and return byte strings. Actually it might be useful to have this conversion as late as possible -- in the WSGIHandler -- but it's another thing anyway.
By giving middleware access to the original content generator we can delay the conversion to byte string until HttpResponse.content
is accessed, either by a middleware that requires the entire content or by WSGIHandler, and be sure that it occurs only once.
- When HttpResponse iterates over itself the first time it should just save already encoded stream to avoid encoding multiple times as well as iterating. I mean instead of
self._container = [''.join(self._container)]
it should be likeself._container = [''.join([s for s in self])]
If encoding should be performed in next
, then should _get_content
be altered to simply return ''.join(self._container)
after consuming the generator with self._container = [''.join(self)]
? And should next
behave like _get_content
does now and not encode if Content-Encoding is set, otherwise use smart_str
?
- More importantly, I think every piece of code that can access response in a streaming fashion should never exhaust it entirely. Not because it wouldn't work but because it's not efficient. As an example GzipMiddleware can replace response's generator with its own but shouldn't read a single chunk from it.
Indeed, those middleware that can operate on chunks of content should not consume the content generator, but should only replace it with a new generator. This is the problem solved by this patch where middleware that are not aware of a content generator consume it and break streaming.
BTW I now understand why you're exposing content_generator publicly: a middleware would want to replace it leaving other aspects of a response intact, right? If yes I think it could be done in a more functional fashion: provide a helper response wrapper that will have its own generator wrapper but will delegate everything else to a wrapped response. Like this:
Middleware can already replace the content generator with response.content = (something(chunk) for chunk in response.content_generator)
, because _set_content
will check if the value is an iterable or a string.
As mentioned above, I think we need to expose content_generator for two purposes. The first is to indicate to middleware that the response is derived from a generator so that the middleware can skip incompatible functionality (e.g. ETag generation) or replace the content generator instead of replacing the content directly.
The second is to provide access to the original content generator. Currently next
will encode any value that is unicode. If a middleware sets a new content generator which yields a unicode value (e.g. response.content = (unicode(chunk) for chunk in response)
, the encoding will occur twice.
comment:6 by , 16 years ago
The more I think about this, the more I'm -1 on it. The complications of trying to force lazily-evaluated responses are far too numerous (ranging all the way from resource leaks on the Django side out to things like WSGI forbidding useful HTTP features which help with this) and the existence of specialized solutions for doing long-lived HTTP and/or asynchronous stuff are too plentiful to justify Django special-casing enough stuff to make this work.
comment:7 by , 16 years ago
I don't understand the resistence. Streaming HttpResponse functionality has been in trunk for around 2 years now. All I'm suggesting is that we fix certain middleware that ships with Django to work with the streaming functionality that's already present in trunk where possible and appropriate - e.g. GZip, ETag, etc., and to add a public attribute to HttpResponse
to aid the conditional execution of middleware code.
If there are certain middleware that cannot work with dynamic content generators, they continue to work as they do now. If there are certain handlers (WSGI) which don't at present work without a Content-Length header, they continue to work as they do now. Middleware authors can continue to develop middleware without even being aware that anything has changed. Existing 3rd party middleware will continue to work as it does now. This should be backwards compatible.
I don't see how what's been added (essentially just skipping certain middleware functionality if a generator was used, or being able to replace an existing generator with a new one) would cause any of the numerous complications you mention that aren't already and haven't already been present for the past 2 years.
comment:8 by , 16 years ago
I was thinking about this some time and changed my mind a little...
I agree with Tai Lee that this ticket is not about some abstract ultimate way for keeping streaming responses remain streaming after middleware. It's about fixing some points that can be fixed. However I now think that current patch does it the wrong way for two reasons.
- It relies on how the response was constructed which doesn't always correlate with the need to apply certain middleware. For example one may still want to count ETag or gzip content of small files which will be treated as generators.
- Gzipping a stream just can't work because WSGI wants Content-length to be set for response and it's impossible to count without gzipping the whole contents.
So may be what's really needed here is a simple way for a view to explicitly flag the response with "don't gzip it" or "don't etag it".
comment:9 by , 16 years ago
I agree that it would also be useful to be able to disable certain middleware per HttpResponse
. You are correct in saying that the way a response is created won't always indicate which middleware we want to disable. However, if a certain middleware CAN run without breaking a content generator (e.g. GZip when not using WSGI), shouldn't it do so?
How would such a feature be implemented? Pass a list of middleware (e.g. ['django.middleware.common.CommonMiddleware', 'django.middleware.gzip.GZipMiddleware']
)? That could be prone to error, requiring people to remember which middleware are required to be removed in order to preserve streaming functionality.
What about setting a boolean attribute consume_generator
on HttpResponse
which defaults to True
. If this is True
, then the generator is consumed immediately and replaced with the resulting content string. If it's False
, then either any middleware or partial functionality within some middleware that can't work without consuming the content generator are skipped?
comment:10 by , 16 years ago
Updated patch to add a stream_content
argument (defaults to False
) to specify whether or not content generator should be consumed immediately. Middleware can then assume that if content generator exists, user intended it to be preserved. Default behaviour will be to consume content, and if middleware authors want to (and can) work with streaming content, all they need to do is set response.content
to a new generator (and we make response.content_generator
) available publicly so they can alter the original generator if one is set.
comment:11 by , 16 years ago
if a certain middleware CAN run without breaking a content generator (e.g. GZip when not using WSGI), shouldn't it do so?
I don't think so. It would break feauture parity between mod_python and WSGI and thus will make it harder for users to switch deployment scheme. Furthermore WSGI requiring Content-Length is a good thing because otherwise browsers won't show download progress. I think it's a bad side effect.
What about setting a boolean attribute consume_generator on HttpResponse which defaults to True.
The point of my previous comment was that there shouldn't be a generic attribute at all. Because generically it doesn't make any sense. I really meant just this:
def my_view(request): response = HttpResponse(open('somefile')) if os.path.getsize('somefile') > 1024: response.dont_gzip = True class GzipMiddleware: def process_response(self, request, response): if getattr(response, 'dont_gzip'): return
And similar attribute for CommonMiddleware about etag.
In other words, I think that the fact that certain two middleware can skip their action in a very certain conditions should remain their own detail and shouldn't affect other parts of the framework.
comment:12 by , 16 years ago
The WSGI specification does NOT require that Content-Length be specified for response. If you are making assumptions based on that belief, they will be wrong.
comment:14 by , 16 years ago
Keywords: | Content-Length added |
---|
PEP333 may be relevant here, and seems to indicate that Content-Length may be empty or absent.
comment:16 by , 16 years ago
There is a bug in the function utils.text.compress_sequence provided in the patch. Because it never closes the zfile, it misses a few trailing bytes of output, causing invalid output. Some decompressors (e.g. many browsers) seem tolerant of this, but many (e.g. Googlebot, gunzip) are not. The fix is straightforward; patch to text.py attached. This also has a warning (feel free to exclude if you wish!) for users of the function that its compression performance is not as good as compress_string (though obviously it has other advantages). As a final note, in our local installation we use the Z_FULL_FLUSH option on zfile.flush(), with some penalty in compression performance, so that browsers can start rendering the streaming output immediately, without having to wait for further bytes to complete decompression. This may not be an appropriate change for all installations, so I didn't include it here, though it might be useful as an option.
Hope this is helpful-
by , 16 years ago
Attachment: | text.py.diff added |
---|
Bug fix for invalid compression output in compress_sequence
comment:18 by , 15 years ago
I just updated the patch to work with trunk r11381 and also rolled in the compress_sequence
fix. I also removed the set_content_length
parts for WSGI, since it appears to be valid in HTTP 1.0/1.1 and WSGI to omit this header when the content length is not known. The full test suite passes. I've been using this patch in my local branch of Django for over a year now, first with Apache/mod_python, and now with Apache/mod_wsgi, and have not noticed any ill-effects.
I've discussed this with ccahoon recently who is working on the http-wsgi-improvements branch, and I cannot see a way for any middleware to ever work with a streaming response (when possible and appropriate, e.g. gzip) without having a hook back into the HttpResponse
object where it can alter the existing generator in-place. I think this is a fairly useful ability for middleware, so I'd still like to see any generator passed in as content to HttpResponse
exposed for middleware to inspect and replace, or automatically disable any functionality that is irrelevant to generator content (e.g. etags).
I don't believe that this is the complete solution. There will still be cases where middleware *must* consume the content in order to function (e.g. csrf, which is conditionally applied to HTML content types). For these cases there needs to be a way for users to disable specific middleware for a specific response, so that streaming can still function, and this may also be useful for other purposes.
I suggest that we simply add an attribute HttpResponse.disabled_middleware
(list or tuple). Any installed middleware specified there should be bypassed when applying response middleware. Django could then ship with an HttpStreamingResponse
class that disables any included middleware that is known not to work with generator content, and this could easily be subclassed and updated by users who need to disable any 3rd party middleware to create a streaming response.
My own use-case for streaming responses is that I often need to export large sets of model data in CSV format, sometimes manipulating the data as I go which can result in a significant wait before the entire content is consumed. Besides the also significant memory usage, the delay before content delivery can start often causes browser timeouts. I'm hesitant to simply increase the timeout because it's not really solving the problem, and users will often think that something is wrong if a download doesn't start quickly.
My only alternative without streaming responses is to look into an offline job queue solution, which seems further outside of the scope of django, and introduces a requirement for notifying users when their download is complete.
by , 15 years ago
Attachment: | 7581-Streaming-HttpResponse-r11381.diff added |
---|
Updated for trunk r11381.
comment:19 by , 15 years ago
Changeset [11449] on the branch sco2009/http-wsgi-improvements has my attempt at solving this problem, using a new response class (http.HttpResponseStreaming) and class attributes for streaming-safe response middleware. It has tests and docs. I am not sure how I feel about the test coverage, but it does show that at least it operates correctly as an HttpResponse, and that the content doesn't get consumed on initialization of the response. I used code and ideas from the existing patches here with some additional changes.
comment:20 by , 15 years ago
FYI - the use of HttpResponse.content
in the CSRF middleware should hopefully go away completely in Django 1.4. Before then (Django 1.2 hopefully), it will be isolated in a deprecated middleware, replaced by a streaming friendly method. See CsrfProtection for more.
comment:21 by , 15 years ago
For people with problems with file streaming you can use a simple workaround http://www.djangosnippets.org/snippets/1782/
comment:22 by , 15 years ago
milestone: | 1.2 |
---|
Bumping from 1.2: there's still no agreement about the correct approach here.
comment:23 by , 14 years ago
Severity: | → Normal |
---|---|
Type: | → New feature |
by , 13 years ago
Attachment: | 7581-Streaming-HttpResponse-r16445.diff added |
---|
Update for trunk r16445.
comment:24 by , 13 years ago
Easy pickings: | unset |
---|---|
UI/UX: | unset |
Just updated the patch to apply cleanly against trunk again. Removed (now) redundant changes to CSRF middleware (it no longer breaks streaming). Also made _get_content_generator()
consistent with _get_content()
in that it now calls smart_str()
on each chunk. This was causing me problems when passing a generator that yielded unicode strings instead of bytecode strings, which was breaking compress_sequence()
.
comment:25 by , 13 years ago
Patch needs improvement: | set |
---|
I'm afraid this patch broke again with r16829. I think that change should make this patch quite a bit cleaner in the long run.
comment:26 by , 13 years ago
Triage Stage: | Design decision needed → Accepted |
---|
It's clear to me that we have to do *something* about this (though *what* is far from obvious).
by , 13 years ago
Attachment: | 7581-Streaming-HttpResponse-1.3.1.diff added |
---|
Patch for Django 1.3.1 (uses md5_constructor instead of hashlib.md5)
comment:27 by , 13 years ago
It doesn't advance the discussion on what to do next, but the patch I just added should let people like me who wanted to painlessly apply mrmachine's patch to Django 1.3.1, do so.
comment:28 by , 13 years ago
Cc: | added |
---|
comment:29 by , 12 years ago
I've updated this patch on a branch at GitHub.
https://github.com/thirstydigital/django/tree/tickets/7581-streaming-response-middleware
If somebody can provide some concrete direction on the desired approach to resolve this (BDFL decision?), I'll try to add docs and tests as well.
FYI, I've been using this patch in various incarnations for 4 years in production environments without any problems.
My vote is still for allowing middleware to ask the response if it is streaming (generator as content) or not, and act accordingly. Either by changing their behaviour (to be compatible with streaming responses), doing nothing (to avoid breaking a streaming response), or forcing the response to consume the content (breaking the streaming response, but ensuring that the middleware always runs).
comment:30 by , 12 years ago
Updated this patch with some minor tweaks arising from IRC discussion with akaariai. Consume generator content on access, not assignment, so that generator content can still stream if no middleware has to access response.content
. Raise PendingDeprecationWarning
if you explicitly create an HttpResponse
with streaming_content=True
and then you (or some middleware) accesses response.content
.
Also see recent google developers discussion at https://groups.google.com/d/topic/django-developers/RrNPfuJxnlM/discussion
comment:31 by , 12 years ago
Cc: | added |
---|
comment:32 by , 12 years ago
Needs documentation: | set |
---|
I've updated the patch on my branch after further feedback from Aymeric Augustin and Anssi Kääriäinen on google groups discussion.
The new patch refactors HttpResponse
into HttpResponseBase
and HttpResponse
classes, and adds a new HttpStreamingResponse
class.
The HttpResponse
class is simplified, and will consume iterator content whenever it is assigned, so that content can be safely accessed multiple times.
The HttpStreamingResponse
class exposes a new attribute, streaming_content
. If you assign an iterator as streaming content, it will not be consumed until the response is iterated. The streaming content can be wrapped with a new generator (e.g. GZipMiddleware). The class prevents accidental assignment to the content
attribute, to avoid confusing middleware that checks for a content
attribute on response objects.
Both classes use HttpResponseBase.make_bytes()
to yield bytestrings when iterated, or a single bytestring when HttpResponse.content
is accessed. This has simplified the implementation a little and removed some duplication.
The close()
method will now close all "file-like" content objects that were assigned to content
or streaming_content
, even if they are later replaced by different content. I think the old code would have left unclosed objects in this case.
As an added bonus, you can now write()
to both regular and streaming responses, when iterable or non-iterable content is assigned. For streaming responses, this just wraps the old streaming content with a new generator that yields an additional chunk at the end. For regular responses, assigned content is always normalised to a list now, so we can always append additional content.
Middleware have been updated for compatibility with streaming responses. They now check responses for content
or streaming_content
before attempting to read or replace those attributes. New and 3rd party middleware should follow this example.
We now have tests, too, and the full test suite passes here (2.7, OS X, SQLite). If this approach is accepted, I will add docs as well.
comment:33 by , 12 years ago
I've updated the branch again slightly, to use HttpStreamingResponse
in the django.views.static.serve
view. This is MASSIVELY faster on my local machine. It was slightly less of an improvement with GZipMiddleware
enabled, but still a huge improvement compared to regular responses.
I noticed when I was testing a site that hosts a 16MB iPad app. When I tried to install the app on the iPad from my local dev/test site, it was *extremely* slow. So I tried to download the file directly with Firefox on the same machine. Firefox was pulling it down at 60-70KB/s from localhost.
When using HttpStreamingResponse
instead, it's instantaneous. This is not only useful or noticeable with large files, but in general all images and other static content now loads MUCH faster with the dev server. Previously I could literally see even small files (60KB images) being progressively drawn to the page. Now it's all instant. Much nicer during development, especially on image heavy sites.
I also found that django.http.utils.conditional_content_removal()
was accessing response.content
directly. I've updated that to work with regular and streaming responses. I couldn't find *any* tests for this function (or any in django.http.utils
, so I added a new regression tests package, http_utils
.
comment:34 by , 12 years ago
Quick skimming, and looks good. Could you create a pull request for easier reviewing?
I will try to see if we can get this into 1.5. Unfortunately HTTP protocol isn't my strongest area, so help is welcome for reviewing...
comment:35 by , 12 years ago
Needs documentation: | unset |
---|---|
Patch needs improvement: | unset |
I've added docs (patch should be complete, now), and opened a pull request.
comment:36 by , 12 years ago
There's a lot of history on this ticket; I reviewed as much as possible but couldn't read everything. Here's the current status of my review.
I'm still at the design decision level; I haven't looked at the code yet.
Hard requirements
- The behavior of responses instantiated with a string must not change.
- Below, I'm only considering responses instantiated with an iterator (often a generator).
- Accessing
request.content
in a regularHttpResponse
must exhaust the iterator and switch to a string-based response, which can be repeatedly accessed.
Requirements
- Discussions on CSRF middleware are moot for the reason explained by Luke in comment 20.
- Discussions on the
Content-Length
header can be ignored; it's optional and should be omitted if it can't be computed efficiently. - Discussions about conditionally disabling response middleware are outside of the scope of this ticket.
- I'm -1 on
HttpResponse.disabled_middleware
and all similar ideas, especially those requiring new settings — it's already way too difficult to figure out how to make middleware run in a meaningful order.
- I'm -1 on
Personal opinions
- I consider streaming a large CSV export as a valid use case. Ultimately we should be able to stream from the database to the end user (I know this isn't possible with
django.db
right now). The two other uses cases brought up in the latest discussion are also interesting. - I have absolutely no interest in fixing
GzipMiddleware
. I'd rather deprecate it, on the grounds that it's prohibited by PEP 3333 (and everyone should be using WSGI these days).
Questions
- One of the goals is to switch some parts of Django (e.g.
django.views.static.serve
) to streaming by default. This will be backward incompatible if theStreamingHttpResponse
API isn't (at least temporarily) a superset of theHttpResponse
API — specifically if accessingStreamingHttpResponse.content
raises an exception. - In the long run accessing
StreamingHttpResponse.content
should raise an exception, in order to guarantee the streaming behavior isn't accidentally broken by middleware.- This is why
StreamingHttpResponse
provides the content in a different attribute:streaming_content
- This is why
- We could use a deprecation path to solve both issues at the same time:
StreamingHttpResponse.content
would work initially, but be deprecated and removed in 1.7. - Middleware must be able to provide different implementation for regular and streamed responses.
- The suggested idiom is
if hasattr(response, 'streaming_content'): ...
. This is more pythonic than the alternativeif isinstance(response, StreamingHttpResponse): ...
but it still looks wrong. - I'm considering adding
get_streaming_response
to the response middleware API. If a response is a streaming response (which would be determined by duck-typing:hasattr(response, 'streaming_content')
) and the middleware provides aget_streaming_response
method, that method would be called. Otherwiseget_response
would be called. Middleware authors who want to support streaming responses will have to implement that new method.
- The suggested idiom is
comment:37 by , 12 years ago
I assume you meant process_response()
and process_streaming_response()
methods for middleware classes? I think this would be much nicer to work with, and easier to explain to middleware authors than asking them to do hasattr(response, 'streaming_content')
. But it would mean a little more work to refactor some of the bundled middleware classes.
One thing you didn't mention is automatically closing file-like objects passed as content to regular and streaming responses.
For streaming responses, Django should close them automatically, because users won't be able to. I think that we should do the same for regular responses as well for consistency and to avoid accidentally leaving files open.
I'm not sure if this would be a backwards incompatible change, or if it would need a deprecation path, or if a deprecation path would be possible? I guess this could be implemented separately, if it decided to go ahead.
comment:38 by , 12 years ago
Yes I meant process_response
.
I didn't look at the closing issue yet, but it shouldn't be a problem.
comment:39 by , 12 years ago
Design decisions
- I'm not going to refactor
HttpResponse
right now; there's another ticket about that. - I'm going to introduce a subclass of
StreamingHttpResponse
that provides a deprecated content attribute, and use that class in the static serve view, to preserve backwards compatibility. - I'm going to deal with closing iterators; it's required to avoid leaking file descriptors in the static serve view.
- I'm going to introduce a
response.streaming
boolean flag to avoid thehasattr(response, 'streaming_content')
pattern. - I'm not going to change the middleware API.
comment:40 by , 12 years ago
I've finished working on a patch. Barring unexpected objections, I'll commit it soon.
To keep things manageable, I have made the minimum changes required to support streaming responses and enable them in the static serve view. I've taken care not to change the behavior of HttpResponse
in any way. (I've saved that for #18796.)
I've kept the tests written by mrmachine (with the exception of those that tested changes to HttpResponse
). I've rewritten the code to make fewer changes, because I didn't trust myself enough to push such a large and sensitive patch in a single commit.
by , 12 years ago
Attachment: | 7581-20121020.patch added |
---|
comment:41 by , 12 years ago
mrmachine's patch contained code to call close()
on the response object when the generator terminates.
This isn't necessary, because the WSGI server is already required to call that method when the response is finished.
I removed it.
comment:42 by , 12 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
follow-up: 44 comment:43 by , 12 years ago
Excellent. Very happy to see this committed. I have a couple of comments that may or may not be important.
- We're not checking if the
close
attribute on assigned content is callable before adding it to_closable_objects
? Probably an edge case not worth worrying about, though.
- Should we pop closable objects off the list before closing them (or just empty the list after closing all items), so that multiple calls to
close()
don't re-attempt to close objects already closed? Graham's blog seems to indicate that multiple WSGI middleware may each callclose()
on the response? http://blog.dscpl.com.au/2012/10/obligations-for-calling-close-on.html
- Returning
self._iterator
directly inHttpStreamingResponse.streaming_content
, instead of a new iterator that yields bytestrings means that middleware authors can't expect consistent bytestring chunks when wrapping streaming_content in a new iterator? They may have to type check or coerce each chunk (potentially reproducing something likemake_bytes()
? For comparison,HttpResponse.content
always returns a bytestring.
- There may still be a case for consuming iterators assigned to
HttpResponse.content
immediately, rather than on first access. Maybe after a deprecation period, though? IfHttpResponse._container
is an iterator, the response can't be pickled by cache middleware. It would also allow additional content to be written to the response. This is probably an issue for another ticket, anyway.
- We should add an
ETag
header in theserve()
view, if etags are enabled in settings. ETags are normally added byCommonMiddleware
, but middleware can't calculate an ETag for a streaming response, and we do have access to the raw file inserve()
so it should be done there (like we do forLast-Modified
andContent-Length
).
- The note in the docs about iterators being closed after the response is iterated as a point of difference from regular responses is not true anymore. We rely on WSGI to call
close()
on both regular and streaming responses.
- The
http_utils
test wasn't included?
comment:44 by , 12 years ago
Replying to mrmachine:
- We're not checking if the
close
attribute on assigned content is callable before adding it to_closable_objects
? Probably an edge case not worth worrying about, though.
Yes, primarily because callable
is gone in Python 3.
If someone has a custom file-like / iterator object that has a close
attribute that isn't callable, that'll teach him a bit of API design principles :)
- Should we pop closable objects off the list before closing them (or just empty the list after closing all items), so that multiple calls to
close()
don't re-attempt to close objects already closed? Graham's blog seems to indicate that multiple WSGI middleware may each callclose()
on the response? http://blog.dscpl.com.au/2012/10/obligations-for-calling-close-on.html
I'm not sure which part of that blog post you're referring to. This sentence indicates that close()
should be called once:
The intent of that statement, at least from the perspective of the WSGI server, is that close() only be called once all content has been consumed from the iterable and that content has been sent to the client.
Furthermore, close()
can usually be called on objects that are already closed; it just does nothing.
- Returning
self._iterator
directly inHttpStreamingResponse.streaming_content
, instead of a new iterator that yields bytestrings means that middleware authors can't expect consistent bytestring chunks when wrapping streaming_content in a new iterator? They may have to type check or coerce each chunk (potentially reproducing something likemake_bytes()
? For comparison,HttpResponse.content
always returns a bytestring.
That's what #18796 is about: normalizing the conversion of the content to bytes
(which is moot because content is already bytes
in 99.99% of responses).
I'll copy your remark over there.
- There may still be a case for consuming iterators assigned to
HttpResponse.content
immediately, rather than on first access. Maybe after a deprecation period, though? IfHttpResponse._container
is an iterator, the response can't be pickled by cache middleware.
If that's true, it's a bug of the cache middleware, which can easily be fixed by making it access response.content
. If that bug wasn't reported yet, could you create a new ticket?
It would also allow additional content to be written to the response. This is probably an issue for another ticket, anyway.
Allowing writes to a regular response instantiated with an iterator is #6527. I've uploaded a patch extracted from your work over there.
- We should add an
ETag
header in theserve()
view, if etags are enabled in settings. ETags are normally added byCommonMiddleware
, but middleware can't calculate an ETag for a streaming response, and we do have access to the raw file inserve()
so it should be done there (like we do forLast-Modified
andContent-Length
).
This never came into the discussion until now, could you create a new ticket for this feature request?
- The note in the docs about iterators being closed after the response is iterated as a point of difference from regular responses is not true anymore. We rely on WSGI to call
close()
on both regular and streaming responses.
I'll fix that.
- The
http_utils
test wasn't included?
Oops, my git-fu must have failed me. I'll commet them separately.
comment:48 by , 12 years ago
It should be noted that the Apache2 webserver by default (at least on Ubuntu Linux) enables mod_deflate that suffers from the same issue and thus effectively prevents streaming. With mod_deflate disabled, streaming with Django works very nice.
Changed to DDN after comments from Malcolm and Ivan on google groups.
The arguments against are that: 3rd party middleware authors will be unable to access the content for an
HttpResponse
because it's content may be an un-consumable generator; middleware will repetitively be required to replaceHttpResponse.content
with a string if a generator is consumed;Content-Length
*must* always be set with WSGI.The argument for is that 3rd party middleware should have the *option* of working with
HttpResponse
objects which have an un-consumable generator (e.g. GZipMiddleware). If middleware authors fail or choose not to account for this possibility, behaviour will remain as it is today, as it would be if generators were consumed immediately when instantiating anHttpResponse
object.A possible alternative raised by Malcolm is to add a way to bypass
process_response
in middleware (or bypass all middleware) for specificHttpResponse
objects. However, this would disallow potentially useful middleware that *can* work without consuming the content generator (e.g.GZipMiddleware
).I've updated the patch with a few improvements:
HttpResponse.content_generator
which will return either a generator if one is present. Middleware authors can use this to check for the existence of, or actually get the content generator. We no longer need to access the private attributesHttpResponse._is_string
orHttpResponse._container
from middleware. We can still assign a string or generator toHttpResponse.content
in the case where we need to replace the content or generator in middleware.HttpResponse.content
will now consume any content generator that exists and replaceHttpResponse._container
with the resulting string, ensuring that the generator is only consumed once.set_content_length
method toWSGIHandler.response_fixes
to ensure thatContent-Length
is *always* set with WSGI. Although I have found some discussion to support the removal ofContent-Length
by WSGI servers with specific examples to GZipped content.Requiring
Content-Length
on every single response which has content with WSGI will make streaming a response impossible under any circumstances. This is quite useful functionality (exporting large sets of data), so if it turns out to be acceptable to stream content through WSGI without aContent-Length
header (as opposed to an incorrectContent-Length
header), or the WSGI spec changes to allow for this, I'd love to see theWSGIHandler.set_content_length
method removed.FYI, I have tested this without
HttpResponse.set_content_length
and withGZipMiddleware
enabled on Apache with mod_wsgi and it does work (stream) as intended.Personally I don't feel that it is an unreasonable restriction to place on middleware authors by asking them to make use of
HttpResponse.content_generator
*if* they want or need their middleware to work with streaming content. It's also not unreasonable for developers to check that the middleware they are using will work with streaming content, if they require streaming content. After all, some middleware by their very nature will not work with streaming content.