Opened 10 years ago

Closed 10 years ago

Last modified 10 years ago

#24242 closed Cleanup/optimization (fixed)

compress_sequence creates a larger content than no compression

Reported by: Matthew Somerville
Owned by: nobody
Component: HTTP handling
Version: 1.7
Severity: Normal
Keywords:
Cc:
Triage Stage: Ready for checkin
Has patch: yes
Needs documentation: no
Needs tests: no
Patch needs improvement: no
Easy pickings: no
UI/UX: no

Description

I have a view that is 825157 bytes without gzipping, 35751 bytes gzipped as an HttpResponse, but 1010920 bytes gzipped as a StreamingHttpResponse. The output of the script given below with some noddy data is:

Normal string: 38890
compress_string: 18539
compress_sequence: 89567
compress_sequence, no flush: 18539

Noddy content perhaps, but in real use I very much want to use StreamingHttpResponse for very large JSON responses (it then uses 200MB of memory with iterables throughout, as opposed to 2GB with more standard code/HttpResponse), and the Python json package flushes after each key, value, and piece of punctuation in between. Having the gzip middleware flush just as often produces output much larger than with no gzipping at all, as the figures at the top show. It would seem that many uses of StreamingHttpResponse will similarly be flushing regularly at the content level. #7581 does mention "some penalty in compression performance", but a result worse than no compression at all seems a bit much :)
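
For illustration, here is a minimal sketch (not from the original report) assuming the JSON is produced with the standard library's json.JSONEncoder().iterencode(), a common way to feed a StreamingHttpResponse; it shows the one-tiny-chunk-per-token pattern described above:

import json

# Sketch only: iterencode() yields a separate small chunk for roughly
# every key, value and piece of punctuation, so a consumer that flushes
# after every chunk pays the flush overhead thousands of times.
data = {"rows": [{"id": i, "name": "row %d" % i} for i in range(1000)]}
chunks = list(json.JSONEncoder().iterencode(data))
print("chunks: %d" % len(chunks))
print("first few: %r" % (chunks[:8],))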

Should compress_sequence bunch up flushes to provide at least some level of compression? Or if it's a StreamingHttpResponse, should it not bother gzipping?

# Python 2 script; the star import provides GzipFile and StreamingBuffer
# as well as compress_string and compress_sequence used below.
from django.utils.text import *
from django.utils.six.moves import map

# Identical to django.utils.text.compress_sequence
# but with the flush line commented out
def compress_sequence_without_flush(sequence):
    buf = StreamingBuffer()
    zfile = GzipFile(mode='wb', compresslevel=6, fileobj=buf)
    # Output headers...
    yield buf.read()
    for item in sequence:
        zfile.write(item)
        # zfile.flush()
        yield buf.read()
    zfile.close()
    yield buf.read()

class Example(object):
    def __iter__(self):
        return map(str, xrange(10000))

e = Example()
print 'Normal string:', len(b''.join(e))
print 'compress_string:', len(compress_string(b''.join(e)))
print 'compress_sequence:', len(b''.join(compress_sequence(e)))
print 'compress_sequence, no flush:', len(b''.join(compress_sequence_without_flush(e)))
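
The overhead itself can be reproduced with zlib alone, outside Django; GzipFile.flush() performs a Z_SYNC_FLUSH, which emits an empty stored block of roughly five bytes each time. A minimal sketch (not part of the original report; chunk sizes are arbitrary):

import zlib

CHUNKS = [str(i).encode() for i in range(10000)]

def gzip_flushing_each_chunk(chunks):
    # wbits=31 selects a gzip-framed stream, mirroring what GzipFile writes.
    c = zlib.compressobj(6, zlib.DEFLATED, 31)
    out = [c.compress(chunk) + c.flush(zlib.Z_SYNC_FLUSH) for chunk in chunks]
    out.append(c.flush())
    return b"".join(out)

def gzip_one_shot(chunks):
    c = zlib.compressobj(6, zlib.DEFLATED, 31)
    return c.compress(b"".join(chunks)) + c.flush()

print("per-chunk flush: %d" % len(gzip_flushing_each_chunk(CHUNKS)))
print("one shot: %d" % len(gzip_one_shot(CHUNKS)))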

Change History (6)

comment:1 by Matthew Somerville, 10 years ago

Has patch: set

With the flush() removed, the output does appear to 'bunch' itself into groups of about 17k, and it ends up the same size as if the content had been gzipped as a single string. I have made this change and added a test of some JSON output at https://github.com/django/django/pull/4010; hope that's of interest.

comment:2 by Tim Graham, 10 years ago

Component: Uncategorized → HTTP handling
Triage Stage: Unreviewed → Accepted
Type: Uncategorized → Cleanup/optimization

comment:3 by Tim Graham, 10 years ago

Patch needs improvement: set

Test doesn't pass on Python 3.

comment:4 by Tim Graham, 10 years ago

Patch needs improvement: unset
Triage Stage: Accepted → Ready for checkin

comment:5 by Tim Graham <timograham@…>, 10 years ago

Resolution: fixed
Status: new → closed

In caa3562d5bec1196502352a715a539bdb0f73c2d:

Fixed #24242 -- Improved efficiency of utils.text.compress_sequence()

The function no longer flushes zfile after each write as doing so can
lead to the gzipped streamed content being larger than the original
content; each flush adds a 5/6 byte type 0 block. Removing this means
buf.read() may return nothing, so only yield if that has some data.
Testing shows without the flush() the buffer is being flushed every 17k
or so and compresses the same as if it had been done as a whole string.
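
Based on that description, the patched function keeps the shape of the original compress_sequence shown in the ticket's script, dropping the flush() and only yielding when the buffer actually holds data. A sketch of the fix (not the verbatim committed code):

from gzip import GzipFile

from django.utils.text import StreamingBuffer

def compress_sequence(sequence):
    buf = StreamingBuffer()
    zfile = GzipFile(mode='wb', compresslevel=6, fileobj=buf)
    # The gzip header is written on construction, so emit it first.
    yield buf.read()
    for item in sequence:
        zfile.write(item)
        # No flush() here any more; zlib buffers until it has a full block,
        # so only yield when the buffer actually contains data.
        data = buf.read()
        if data:
            yield data
    zfile.close()
    yield buf.read()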

comment:6 by Tim Graham <timograham@…>, 10 years ago

In 2a55301f9fe1c3b62ad4e79c3109bec77b57b317:

[1.8.x] Fixed #24242 -- Improved efficiency of utils.text.compress_sequence()

The function no longer flushes zfile after each write as doing so can
lead to the gzipped streamed content being larger than the original
content; each flush adds a 5/6 byte type 0 block. Removing this means
buf.read() may return nothing, so only yield if that has some data.
Testing shows without the flush() the buffer is being flushed every 17k
or so and compresses the same as if it had been done as a whole string.

Backport of caa3562d5bec1196502352a715a539bdb0f73c2d from master
