Opened 16 years ago

Closed 11 years ago

Last modified 11 years ago

#6527 closed Bug (fixed)

A bug in HttpResponse with iterators

Reported by: daonb <bennydaon@…> Owned by: Aymeric Augustin
Component: HTTP handling Version: dev
Severity: Normal Keywords: http iterators
Cc: mikolo2@…, jim@…, robert.coup@…, real.human@…, Forest Bond, daemianmack@…, net147 Triage Stage: Accepted
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: yes
Easy pickings: no UI/UX: no

Description

>>> from django.http import HttpResponse
>>> response = HttpResponse(file('helloworld.txt','r'))
>>> print response
Content-Type: text/html; charset=utf-8

helloworld

>>> print response
Content-Type: text/html; charset=utf-8


>>> 

Seems to me django.http.HttpResponse has to be modified so that when its init is given an iterator, it will go through the iterator only once.

I'd be happy to deliver a patch if this is something that should be fixed.

Attachments (4)

6527.patch (31.5 KB ) - added by John DeRosa 16 years ago.
Patch with the fix, and regression tests.
6527-2.diff (3.0 KB ) - added by daonb <bennydaon@…> 16 years ago.
Less code here, includes original regression tests
6527 V2.patch (5.8 KB ) - added by John DeRosa 16 years ago.
Updated patch
6527-v3.patch (4.0 KB ) - added by Aymeric Augustin 11 years ago.

Download all attachments as: .zip

Change History (40)

comment:1 by Chris Beaven, 16 years ago

Component: UncategorizedHTTP handling
Triage Stage: UnreviewedDesign decision needed

I'm not sure that this is an actual bug. One of the benefits in passing an iterator is so that you don't get the whole response saved in memory, isn't it?

comment:2 by daonb <bennydaon@…>, 16 years ago

It came up because I downloaded django analytics from google code and added their middleware. What it does is to scan the reponse for the end body tag and insert the analytics tracker code there. I think it's very clean and it worked great.

Unrelated, I have a piece of code that acts as a proxy for the flickr service - I have to keep my key safe so my server has to act as a go between. My view ends with:

	flickr_resp = urlopen(flickr_api_url, params)
	return HttpResponse(''.join(flickr_resp), mimetype='application/json')

and it worked well until I added the middleware.

It's one of the nasties bags I found - The first thing I did was to add "response = ..." and then print it so I can use the local server to debug. The response seemed fine, after a lot of searching and wandering about I decided to print it twice.

It's definitely not the result I expected. When you have a framework that acts like that, it loses my trust.

comment:3 by Malcolm Tredinnick, 16 years ago

Triage Stage: Design decision neededAccepted

The HttpResponse class should turn the iterator into a list and store that result. Might as well do it in the __init__ method.

Response classes cannot work reliably with iterators that aren't read in their entirety at the moment because it's too easy (and common) for middleware to access response.content, which would chew up an iterator. You don't have to look back very far in the archives to see how bad things get when we tried to make things truly iterator-aware at that level. It just doesn't integrate well with middleware.

comment:4 by John DeRosa, 16 years ago

Owner: changed from nobody to John DeRosa
Status: newassigned

Taken during PyCon 2008 sprint.

by John DeRosa, 16 years ago

Attachment: 6527.patch added

Patch with the fix, and regression tests.

comment:5 by John DeRosa, 16 years ago

Has patch: set

Fixed in the attached patch file. I added regression tests to cover this case.

Note, svn thinks the entire AUTHORS file changed due to bug #6545, which refers to a line-ending problem with pool text files.

comment:6 by Malcolm Tredinnick, 16 years ago

If you're going to read the contents from the iterator, there's no need to create another attribute on the model. Just put the contents into self._contents. You don't need to store the iterator.

by daonb <bennydaon@…>, 16 years ago

Attachment: 6527-2.diff added

Less code here, includes original regression tests

by John DeRosa, 16 years ago

Attachment: 6527 V2.patch added

Updated patch

comment:7 by John DeRosa, 16 years ago

Roger. There is no self._contents, so I'm guessing you meant to write self._container.

The entire class got simpler, as a result. Excellent!

The patch is 6527 V2.patch.

The guideline, "If an HttpResponse has been initialized with an iterator as its content, you can’t use the HttpResponse instance as a file-like object. Doing so will raise Exception," is no longer required, so its change is now in the patch. Svn thinks every line changed due to bug #6545.

comment:8 by John DeRosa, 16 years ago

p.s. The full regression test suite passes.

comment:9 by daonb <bennydaon@…>, 16 years ago

nice job stugots! my patch 6527-2.diff is not as good. Thanks for contributing.

in reply to:  9 comment:10 by John DeRosa, 16 years ago

Replying to daonb <bennydaon@gmail.com>:

nice job stugots! my patch 6527-2.diff is not as good. Thanks for contributing.

Thank you for your kind words. I wish I could do more, but I'm just getting up to speed on the pool. It's thrilling to be able to contribute in a small way to Django's success!

comment:11 by Jim Garrison, 16 years ago

Cc: jim@… added

comment:12 by Robert Coup, 15 years ago

Cc: robert.coup@… added

comment:13 by Tai Lee, 15 years ago

Cc: real.human@… added

comment:14 by ccahoon, 15 years ago

(In [11263]) [soc2009/http-wsgi-improvements] Add docs that I missed from the patch and reformat HttpResponse.str. refs #6527

This and the previous revision on this branch appear to complete all the changes from #6527.

comment:15 by ccahoon, 15 years ago

Owner: changed from John DeRosa to ccahoon
Status: assignednew

comment:16 by ccahoon, 15 years ago

@stugots Hopefully we can get this patch into trunk at the end of SoC2009. I changed the ownership so I don't forget to address it when I am reviewing my changes later on.

comment:17 by ccahoon, 15 years ago

Triage Stage: AcceptedFixed on a branch

Fixed on branches/soc2009/http-wsgi-improvements.

comment:18 by Tai Lee, 15 years ago

So if/when this hits trunk, we will lose the benefit of not having the entire iterator (potentially large and time consuming) saved in memory, which could also potentially cause browser timeouts? If HttpResponse will have no real support for iterators, why not simply make HttpResponse require a string argument? Users wanting to pass an iterator can simply consume it then.

comment:19 by tomevans222, 14 years ago

Triage Stage: Fixed on a branchDesign decision needed

The suggested fix, and the changes committed to soc2009/http-wsgi-improvements cannot possibly be committed to trunk. This severely breaks previously documented behaviour because, as mrmachine points out, you will no longer be able to iteratively generate large responses, and in particular, will not be able to iteratively deliver chunks of content to the client.

This will break any page that takes a long time to fully generate, or about 10% of one of my sites.

I think that any change will need to consider the different types of middleware that operate on a response's contents.
Some types of middleware will only examine the generated response, eg CacheMiddleware will not modify the response, only store a version in the cache when complete.
Others will want to replace the content and modify headers, eg GzipMiddleware.

The current 'fix' only satisfies the requirements of the modifying middleware.

I fixed the limitation of the cache middleware to cope with iterator based responses with code like so:

def buffer_and_cache_response(response, cache_key, timeout):
  from copy import copy
  def worker(rsp):
    from cStringIO import StringIO
    buf = StringIO()
    # we need to copy the response before we iterate through it
    # if we copy it after, then the response will have a generator object
    # on self._iterator that will not be picklable
    buffered_response = copy(rsp)
    for chunk in rsp:
      buf.write(chunk)
      yield chunk
    buffered_response.content = buf.getvalue()
    buf.close()
    rsp.close()
    cache.set(cache_key, buffered_response, timeout)
  new_resp = copy(response)
  new_resp._container = worker(response)
  new_resp._is_string = False
  return new_resp

Our cache middleware (virtually copy/paste of UpdateCacheMiddleware) determines if this is an iterator derived response by checking response._is_string (hacky, I know), and returns the rv of this function. This is just an example of a more intelligent, correct way to fix this issue.

comment:20 by Forest Bond, 14 years ago

Cc: Forest Bond added

comment:21 by daemianmack, 14 years ago

Cc: daemianmack@… added

comment:22 by FunkyBob, 13 years ago

I ended up using the exact same "response._is_string" check in my patch on #10627.

If the documentation for "Passing Iterators" also mentioned "generators", it would cover one side of the behavior issue.

If the documentation on writing middleware warned of the potential of responses being iterators/generators, and thus to no indirectly consume them, along with mentioning the _is_string [or equivalent 'correct'] test, it would at least cover that base.

comment:23 by John DeRosa, 13 years ago

I worked on this three years ago. And it's still not resolved. Amazing.

comment:24 by Julien Phalip, 13 years ago

Type: Bug

comment:25 by Julien Phalip, 13 years ago

Severity: Normal

comment:26 by net147, 13 years ago

Cc: net147 added
Easy pickings: unset
UI/UX: unset

comment:27 by Carl Meyer, 13 years ago

Alex and I looked at this at the sprint and thought there had been extensive discussion of iterable responses and middleware during the TemplateResponse discussions, but weren't sure what the conclusions of that discussion were. It doesn't seem that the relevant HttpResponse code has changed.

comment:28 by Jacob, 13 years ago

Triage Stage: Design decision neededAccepted

This shouldn't be DDN -- it's still a bug -- but it's far from clear how to solve this problem. We've got to figure out a consistant way of handling lazy vs. non-lazy responses, and we haven't yet. That's a bigger discussion and probably needs a "master ticket" or an extended discussion on django-dev.

comment:29 by Claude Paroz, 12 years ago

Patch needs improvement: set

comment:30 by Aymeric Augustin, 11 years ago

Owner: changed from ccahoon to Aymeric Augustin

in reply to:  3 comment:31 by Aymeric Augustin, 11 years ago

Replying to mtredinnick:

The HttpResponse class should turn the iterator into a list and store that result.

Yes, it should.

Might as well do it in the __init__ method.

Today, it seems that the consensus it to keep the streaming behavior in the (rare) cases where it works. This means turning the iterator into a list only when it's going to be iterated and the content loaded in memory anyway — typically when middleware reads response.content.

Response classes cannot work reliably with iterators that aren't read in their entirety at the moment because it's too easy (and common) for middleware to access response.content, which would chew up an iterator. You don't have to look back very far in the archives to see how bad things get when we tried to make things truly iterator-aware at that level. It just doesn't integrate well with middleware.

A separate StreamingHttpResponse was just introduced for this purpose in #7581. In fact, the last pull request also included the changes required to fix this ticket. I didn't commit them because they weren't strictly necessary for #7581. I've extracted them into a separate patch, which I'll be attaching shortly.

This is a logical follow-up for #7581 and I'll commit it quickly if no one objects.

by Aymeric Augustin, 11 years ago

Attachment: 6527-v3.patch added

comment:32 by Tai Lee, 11 years ago

I think that the desire to keep the partial streaming behaviour as-is for HttpResponse was primarily to minimise any disruption or edge-case backwards incompatibility when introducing explicit support for streaming responses. I think that eventually, the goal here should be a more explicit and consistent HttpResponse class that makes user intent clear. Did they pass an iterator as content because they want to stream the response, or as a convenience?

I would like to keep the existing behaviour for now and raise a PendingDeprecationWarning when an iterator is consumed by accessing .content, that directs people to use the new streaming response class. When the deprecation cycle is finished, HttpResponse should consume iterators on assignment. This should give people enough time to switch over to the new streaming response class in those cases when they actually want a streaming response.

An HttpResponse class that always stores its content in a list internally can be pickled (e.g. by cache middleware and possibly other 3rd party code) without first having to access the content, and will remove any ambiguity in situations where a response currently might or might not stream.

in reply to:  32 comment:33 by Aymeric Augustin, 11 years ago

Replying to mrmachine:

I think that the desire to keep the partial streaming behaviour as-is for HttpResponse was primarily to minimise any disruption or edge-case backwards incompatibility when introducing explicit support for streaming responses. I think that eventually, the goal here should be a more explicit and consistent HttpResponse class that makes user intent clear. Did they pass an iterator as content because they want to stream the response, or as a convenience?

Yes, I understand that.

I would like to keep the existing behaviour for now and raise a PendingDeprecationWarning when an iterator is consumed by accessing .content, that directs people to use the new streaming response class. When the deprecation cycle is finished, HttpResponse should consume iterators on assignment. This should give people enough time to switch over to the new streaming response class in those cases when they actually want a streaming response.

Actually we would have to raise the warning when the iterator is *not* consumed by accessing .content, but rather by the WSGI server iterating the response.

An HttpResponse class that always stores its content in a list internally can be pickled (e.g. by cache middleware and possibly other 3rd party code) without first having to access the content, and will remove any ambiguity in situations where a response currently might or might not stream.

Did you mean "stores its content in a string"?

comment:34 by Tai Lee, 11 years ago

No. I meant that when ._container is a list the response can be pickled. When it is an iterator/generator, it cannot. You are right about raising the warning when an HttpResponse is implicitly streamed, rather than when a user consumes the iterator content (accidentally or not). That makes more sense.

comment:35 by Aymeric Augustin <aymeric.augustin@…>, 11 years ago

Resolution: fixed
Status: newclosed

In 495a8b8107dbd4fb511954bcd2322d125addd94e:

Fixed #6527 -- Provided repeatable content access

in HttpResponses instantiated with iterators.

comment:36 by Aymeric Augustin <aymeric.augustin@…>, 11 years ago

In 8b9b8d3bda09eb1b447631182d06c6c5e51425f6:

Removed compatibility code for streaming responses.

This code provided a deprecation path for old-style streaming responses.

Refs #6527, #7581.

Note: See TracTickets for help on using tickets.
Back to Top