Opened 7 years ago

Closed 3 years ago

Last modified 2 years ago

#6527 closed Bug (fixed)

A bug in HttpResponse with iterators

Reported by: daonb <bennydaon@…> Owned by: aaugustin
Component: HTTP handling Version: master
Severity: Normal Keywords: http iterators
Cc: mikolo2@…, jim@…, robert.coup@…, real.human@…, forest, daemianmack@…, net147 Triage Stage: Accepted
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: yes
Easy pickings: no UI/UX: no

Description

>>> from django.http import HttpResponse
>>> response = HttpResponse(file('helloworld.txt','r'))
>>> print response
Content-Type: text/html; charset=utf-8

helloworld

>>> print response
Content-Type: text/html; charset=utf-8


>>> 

Seems to me django.http.HttpResponse has to be modified so that when its init is given an iterator, it will go through the iterator only once.

I'd be happy to deliver a patch if this is something that should be fixed.

Attachments (4)

6527.patch (31.5 KB) - added by stugots 7 years ago.
Patch with the fix, and regression tests.
6527-2.diff (3.0 KB) - added by daonb <bennydaon@…> 7 years ago.
Less code here, includes original regression tests
6527 V2.patch (5.8 KB) - added by stugots 7 years ago.
Updated patch
6527-v3.patch (4.0 KB) - added by aaugustin 3 years ago.

Download all attachments as: .zip

Change History (40)

comment:1 Changed 7 years ago by SmileyChris

  • Component changed from Uncategorized to HTTP handling
  • Needs documentation unset
  • Needs tests unset
  • Patch needs improvement unset
  • Triage Stage changed from Unreviewed to Design decision needed

I'm not sure that this is an actual bug. One of the benefits in passing an iterator is so that you don't get the whole response saved in memory, isn't it?

comment:2 Changed 7 years ago by daonb <bennydaon@…>

It came up because I downloaded django analytics from google code and added their middleware. What it does is to scan the reponse for the end body tag and insert the analytics tracker code there. I think it's very clean and it worked great.

Unrelated, I have a piece of code that acts as a proxy for the flickr service - I have to keep my key safe so my server has to act as a go between. My view ends with:

	flickr_resp = urlopen(flickr_api_url, params)
	return HttpResponse(''.join(flickr_resp), mimetype='application/json')

and it worked well until I added the middleware.

It's one of the nasties bags I found - The first thing I did was to add "response = ..." and then print it so I can use the local server to debug. The response seemed fine, after a lot of searching and wandering about I decided to print it twice.

It's definitely not the result I expected. When you have a framework that acts like that, it loses my trust.

comment:3 follow-up: Changed 7 years ago by mtredinnick

  • Triage Stage changed from Design decision needed to Accepted

The HttpResponse class should turn the iterator into a list and store that result. Might as well do it in the __init__ method.

Response classes cannot work reliably with iterators that aren't read in their entirety at the moment because it's too easy (and common) for middleware to access response.content, which would chew up an iterator. You don't have to look back very far in the archives to see how bad things get when we tried to make things truly iterator-aware at that level. It just doesn't integrate well with middleware.

comment:4 Changed 7 years ago by stugots

  • Owner changed from nobody to stugots
  • Status changed from new to assigned

Taken during PyCon 2008 sprint.

Changed 7 years ago by stugots

Patch with the fix, and regression tests.

comment:5 Changed 7 years ago by stugots

  • Has patch set

Fixed in the attached patch file. I added regression tests to cover this case.

Note, svn thinks the entire AUTHORS file changed due to bug #6545, which refers to a line-ending problem with pool text files.

comment:6 Changed 7 years ago by mtredinnick

If you're going to read the contents from the iterator, there's no need to create another attribute on the model. Just put the contents into self._contents. You don't need to store the iterator.

Changed 7 years ago by daonb <bennydaon@…>

Less code here, includes original regression tests

Changed 7 years ago by stugots

Updated patch

comment:7 Changed 7 years ago by stugots

Roger. There is no self._contents, so I'm guessing you meant to write self._container.

The entire class got simpler, as a result. Excellent!

The patch is 6527 V2.patch.

The guideline, "If an HttpResponse has been initialized with an iterator as its content, you can’t use the HttpResponse instance as a file-like object. Doing so will raise Exception," is no longer required, so its change is now in the patch. Svn thinks every line changed due to bug #6545.

comment:8 Changed 7 years ago by stugots

p.s. The full regression test suite passes.

comment:9 follow-up: Changed 7 years ago by daonb <bennydaon@…>

nice job stugots! my patch 6527-2.diff is not as good. Thanks for contributing.

comment:10 in reply to: ↑ 9 Changed 7 years ago by stugots

Replying to daonb <bennydaon@gmail.com>:

nice job stugots! my patch 6527-2.diff is not as good. Thanks for contributing.

Thank you for your kind words. I wish I could do more, but I'm just getting up to speed on the pool. It's thrilling to be able to contribute in a small way to Django's success!

comment:11 Changed 7 years ago by garrison

  • Cc jim@… added

comment:12 Changed 7 years ago by rcoup

  • Cc robert.coup@… added

comment:13 Changed 6 years ago by mrmachine

  • Cc real.human@… added

comment:14 Changed 6 years ago by ccahoon

(In [11263]) [soc2009/http-wsgi-improvements] Add docs that I missed from the patch and reformat HttpResponse.str. refs #6527

This and the previous revision on this branch appear to complete all the changes from #6527.

comment:15 Changed 6 years ago by ccahoon

  • Owner changed from stugots to ccahoon
  • Status changed from assigned to new

comment:16 Changed 6 years ago by ccahoon

@stugots Hopefully we can get this patch into trunk at the end of SoC2009. I changed the ownership so I don't forget to address it when I am reviewing my changes later on.

comment:17 Changed 6 years ago by ccahoon

  • Triage Stage changed from Accepted to Fixed on a branch

Fixed on branches/soc2009/http-wsgi-improvements.

comment:18 Changed 6 years ago by mrmachine

So if/when this hits trunk, we will lose the benefit of not having the entire iterator (potentially large and time consuming) saved in memory, which could also potentially cause browser timeouts? If HttpResponse will have no real support for iterators, why not simply make HttpResponse require a string argument? Users wanting to pass an iterator can simply consume it then.

comment:19 Changed 6 years ago by tomevans222

  • Triage Stage changed from Fixed on a branch to Design decision needed

The suggested fix, and the changes committed to soc2009/http-wsgi-improvements cannot possibly be committed to trunk. This severely breaks previously documented behaviour because, as mrmachine points out, you will no longer be able to iteratively generate large responses, and in particular, will not be able to iteratively deliver chunks of content to the client.

This will break any page that takes a long time to fully generate, or about 10% of one of my sites.

I think that any change will need to consider the different types of middleware that operate on a response's contents.
Some types of middleware will only examine the generated response, eg CacheMiddleware will not modify the response, only store a version in the cache when complete.
Others will want to replace the content and modify headers, eg GzipMiddleware.

The current 'fix' only satisfies the requirements of the modifying middleware.

I fixed the limitation of the cache middleware to cope with iterator based responses with code like so:

def buffer_and_cache_response(response, cache_key, timeout):
  from copy import copy
  def worker(rsp):
    from cStringIO import StringIO
    buf = StringIO()
    # we need to copy the response before we iterate through it
    # if we copy it after, then the response will have a generator object
    # on self._iterator that will not be picklable
    buffered_response = copy(rsp)
    for chunk in rsp:
      buf.write(chunk)
      yield chunk
    buffered_response.content = buf.getvalue()
    buf.close()
    rsp.close()
    cache.set(cache_key, buffered_response, timeout)
  new_resp = copy(response)
  new_resp._container = worker(response)
  new_resp._is_string = False
  return new_resp

Our cache middleware (virtually copy/paste of UpdateCacheMiddleware) determines if this is an iterator derived response by checking response._is_string (hacky, I know), and returns the rv of this function. This is just an example of a more intelligent, correct way to fix this issue.

comment:20 Changed 5 years ago by forest

  • Cc forest added

comment:21 Changed 5 years ago by daemianmack

  • Cc daemianmack@… added

comment:22 Changed 4 years ago by FunkyBob

I ended up using the exact same "response._is_string" check in my patch on #10627.

If the documentation for "Passing Iterators" also mentioned "generators", it would cover one side of the behavior issue.

If the documentation on writing middleware warned of the potential of responses being iterators/generators, and thus to no indirectly consume them, along with mentioning the _is_string [or equivalent 'correct'] test, it would at least cover that base.

comment:23 Changed 4 years ago by stugots

I worked on this three years ago. And it's still not resolved. Amazing.

comment:24 Changed 4 years ago by julien

  • Type set to Bug

comment:25 Changed 4 years ago by julien

  • Severity set to Normal

comment:26 Changed 4 years ago by net147

  • Cc net147 added
  • Easy pickings unset
  • UI/UX unset

comment:27 Changed 4 years ago by carljm

Alex and I looked at this at the sprint and thought there had been extensive discussion of iterable responses and middleware during the TemplateResponse discussions, but weren't sure what the conclusions of that discussion were. It doesn't seem that the relevant HttpResponse code has changed.

comment:28 Changed 4 years ago by jacob

  • Triage Stage changed from Design decision needed to Accepted

This shouldn't be DDN -- it's still a bug -- but it's far from clear how to solve this problem. We've got to figure out a consistant way of handling lazy vs. non-lazy responses, and we haven't yet. That's a bigger discussion and probably needs a "master ticket" or an extended discussion on django-dev.

comment:29 Changed 3 years ago by claudep

  • Patch needs improvement set

comment:30 Changed 3 years ago by aaugustin

  • Owner changed from ccahoon to aaugustin

comment:31 in reply to: ↑ 3 Changed 3 years ago by aaugustin

Replying to mtredinnick:

The HttpResponse class should turn the iterator into a list and store that result.

Yes, it should.

Might as well do it in the __init__ method.

Today, it seems that the consensus it to keep the streaming behavior in the (rare) cases where it works. This means turning the iterator into a list only when it's going to be iterated and the content loaded in memory anyway — typically when middleware reads response.content.

Response classes cannot work reliably with iterators that aren't read in their entirety at the moment because it's too easy (and common) for middleware to access response.content, which would chew up an iterator. You don't have to look back very far in the archives to see how bad things get when we tried to make things truly iterator-aware at that level. It just doesn't integrate well with middleware.

A separate StreamingHttpResponse was just introduced for this purpose in #7581. In fact, the last pull request also included the changes required to fix this ticket. I didn't commit them because they weren't strictly necessary for #7581. I've extracted them into a separate patch, which I'll be attaching shortly.

This is a logical follow-up for #7581 and I'll commit it quickly if no one objects.

Changed 3 years ago by aaugustin

comment:32 follow-up: Changed 3 years ago by mrmachine

I think that the desire to keep the partial streaming behaviour as-is for HttpResponse was primarily to minimise any disruption or edge-case backwards incompatibility when introducing explicit support for streaming responses. I think that eventually, the goal here should be a more explicit and consistent HttpResponse class that makes user intent clear. Did they pass an iterator as content because they want to stream the response, or as a convenience?

I would like to keep the existing behaviour for now and raise a PendingDeprecationWarning when an iterator is consumed by accessing .content, that directs people to use the new streaming response class. When the deprecation cycle is finished, HttpResponse should consume iterators on assignment. This should give people enough time to switch over to the new streaming response class in those cases when they actually want a streaming response.

An HttpResponse class that always stores its content in a list internally can be pickled (e.g. by cache middleware and possibly other 3rd party code) without first having to access the content, and will remove any ambiguity in situations where a response currently might or might not stream.

comment:33 in reply to: ↑ 32 Changed 3 years ago by aaugustin

Replying to mrmachine:

I think that the desire to keep the partial streaming behaviour as-is for HttpResponse was primarily to minimise any disruption or edge-case backwards incompatibility when introducing explicit support for streaming responses. I think that eventually, the goal here should be a more explicit and consistent HttpResponse class that makes user intent clear. Did they pass an iterator as content because they want to stream the response, or as a convenience?

Yes, I understand that.

I would like to keep the existing behaviour for now and raise a PendingDeprecationWarning when an iterator is consumed by accessing .content, that directs people to use the new streaming response class. When the deprecation cycle is finished, HttpResponse should consume iterators on assignment. This should give people enough time to switch over to the new streaming response class in those cases when they actually want a streaming response.

Actually we would have to raise the warning when the iterator is *not* consumed by accessing .content, but rather by the WSGI server iterating the response.

An HttpResponse class that always stores its content in a list internally can be pickled (e.g. by cache middleware and possibly other 3rd party code) without first having to access the content, and will remove any ambiguity in situations where a response currently might or might not stream.

Did you mean "stores its content in a string"?

comment:34 Changed 3 years ago by mrmachine

No. I meant that when ._container is a list the response can be pickled. When it is an iterator/generator, it cannot. You are right about raising the warning when an HttpResponse is implicitly streamed, rather than when a user consumes the iterator content (accidentally or not). That makes more sense.

comment:35 Changed 3 years ago by Aymeric Augustin <aymeric.augustin@…>

  • Resolution set to fixed
  • Status changed from new to closed

In 495a8b8107dbd4fb511954bcd2322d125addd94e:

Fixed #6527 -- Provided repeatable content access

in HttpResponses instantiated with iterators.

comment:36 Changed 2 years ago by Aymeric Augustin <aymeric.augustin@…>

In 8b9b8d3bda09eb1b447631182d06c6c5e51425f6:

Removed compatibility code for streaming responses.

This code provided a deprecation path for old-style streaming responses.

Refs #6527, #7581.

Note: See TracTickets for help on using tickets.
Back to Top