Opened 4 years ago

Closed 4 years ago

Last modified 4 years ago

#24240 closed Bug (fixed)

Setting Content-Encoding on a non-ASCII StreamingHttpResponse raises UnicodeEncodeError

Reported by: Matthew Somerville Owned by: nobody
Component: HTTP handling Version: 1.7
Severity: Normal Keywords:
Cc: Triage Stage: Ready for checkin
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

HttpResponseBase's make_bytes() assumes that if the Content-Encoding header is set, then everything has already been dealt with bytes-wise, but in a streaming situation this is not necessarily the case as the header has been set before make_bytes() will be called on the original content.

For example, if you provide an iterable including a non-ASCII Unicode string to a StreamingHttpResponse and are using the GZip middleware, the response will die with a UnicodeEncodeError when it arrives at the first non-ASCII Unicode chunk.

I have added two tests at https://github.com/dracos/django/compare/non-ascii-compressed-streaming (let me know if you'd prefer a PR), firstly a working one to show that supplying non-ASCII Unicode strings to a StreamingHttpResponse works fine normally; then a failing one to show that passing such a response through the GZip middleware dies with a UnicodeEncodeError.

This appears to be happening because as we're working with an iterable, make_bytes() is only called when necessary, but by the point that happens and the iterable is consumed, the middleware has added the Content-Encoding header and thus make_bytes() tries to call bytes(value) (and dies). If it had been a normal HttpResponse, make_bytes() would have been called when the content was set, well before the middleware set the Content-Encoding header.

I'm not aware of other intricacies of HttpResponse to suggest an obvious fix, sorry. The start of this short circuit was back in #4969 but lots has changed since then; I don't know if this special case when Content-Encoding is set is still necessary in the same way.

Change History (5)

comment:1 Changed 4 years ago by Claude Paroz

Please create a PR including your tests and the fix in make_bytes, and we'll see what the test suite think about that!

comment:2 Changed 4 years ago by Matthew Somerville

Has patch: set

Sure, created at https://github.com/django/django/pull/4011 :)
I did remove one test that would clearly fail, because this change will allow you to supply Unicode strings to a Response with a Content-Encoding and it will be encoded as normal. I thought that was acceptable given it fixes the issue with any StreamingHttpResponse. If it isn't, then the change in response.py could check if it was a StreamingHttpResponse, but that would then be different behaviour between HttpResponse and StreamingHttpResponse, in that Streaming would be able to set Content-Encoding and put Unicode strings in the iterator.

comment:3 Changed 4 years ago by Claude Paroz

Triage Stage: UnreviewedReady for checkin

comment:4 Changed 4 years ago by Claude Paroz <claude@…>

Resolution: fixed
Status: newclosed

In 250aa7c39b0025ef3ae508884fda8d6e43c9518f:

Fixed #24240 -- Allowed GZipping a Unicode StreamingHttpResponse

make_bytes() assumed that if the Content-Encoding header is set, then
everything had already been dealt with bytes-wise, but in a streaming
situation this was not necessarily the case.

make_bytes() is only called when necessary when working with a
StreamingHttpResponse iterable, but by that point the middleware has
added the Content-Encoding header and thus make_bytes() tried to call
bytes(value) (and dies). If it had been a normal HttpResponse,
make_bytes() would have been called when the content was set, well
before the middleware set the Content-Encoding header.

This commit removes the special casing when Content-Encoding is set,
allowing unicode strings to be encoded during the iteration before they
are e.g. gzipped. This behaviour was added a long time ago for #4969 and
it doesn't appear to be necessary any more, as everything is correctly
made into bytes at the appropriate places.

Two new tests, to show that supplying non-ASCII characters to a
StreamingHttpResponse works fine normally, and when passed through the
GZip middleware (the latter dies without the change to make_bytes()).
Removes the test with a nonsense Content-Encoding and Unicode input - if
this were to happen, it can still be encoded as bytes fine.

comment:5 Changed 4 years ago by Claude Paroz <claude@…>

In d88c24f436ae23ff5aec4ea06c0c27499a71e074:

[1.8.x] Fixed #24240 -- Allowed GZipping a Unicode StreamingHttpResponse

make_bytes() assumed that if the Content-Encoding header is set, then
everything had already been dealt with bytes-wise, but in a streaming
situation this was not necessarily the case.

make_bytes() is only called when necessary when working with a
StreamingHttpResponse iterable, but by that point the middleware has
added the Content-Encoding header and thus make_bytes() tried to call
bytes(value) (and dies). If it had been a normal HttpResponse,
make_bytes() would have been called when the content was set, well
before the middleware set the Content-Encoding header.

This commit removes the special casing when Content-Encoding is set,
allowing unicode strings to be encoded during the iteration before they
are e.g. gzipped. This behaviour was added a long time ago for #4969 and
it doesn't appear to be necessary any more, as everything is correctly
made into bytes at the appropriate places.

Two new tests, to show that supplying non-ASCII characters to a
StreamingHttpResponse works fine normally, and when passed through the
GZip middleware (the latter dies without the change to make_bytes()).
Removes the test with a nonsense Content-Encoding and Unicode input - if
this were to happen, it can still be encoded as bytes fine.

Backport of 250aa7c39b from master.

Note: See TracTickets for help on using tickets.
Back to Top