Opened 17 years ago

Closed 12 years ago

#5611 closed Cleanup/optimization (fixed)

HTTPRequest should check content-type before parsing post body

Reported by: Paul Egan Owned by: Claude Paroz <claude@…>
Component: HTTP handling Version: dev
Severity: Normal Keywords:
Cc: jesse.lovelace@…, slacy@… Triage Stage: Ready for checkin
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

The modpython handler currently tries to parse the request body as application/x-www-form-urlencoded regardless of the actual content type. This leads to an exception being thrown if the content is not parseable. See attached patch for suggested strict checking of request content-type.

Example

echo -e '\0' | curl -v --data-binary @- http://localhost/

Throws:

Traceback (most recent call last):

  File "/usr/lib64/python2.5/site-packages/mod_python/importer.py", line 1537, in HandlerDispatch
    default=default_handler, arg=req, silent=hlist.silent)

  File "/usr/lib64/python2.5/site-packages/mod_python/importer.py", line 1229, in _process_target
    result = _execute_target(config, req, object, arg)

  File "/usr/lib64/python2.5/site-packages/mod_python/importer.py", line 1128, in _execute_target
    result = object(arg)

  File "/usr/lib/python2.5/site-packages/django/core/handlers/modpython.py", line 181, in handler
    return ModPythonHandler()(req)

  File "/usr/lib/python2.5/site-packages/django/core/handlers/modpython.py", line 154, in __call__
    response = self.get_response(request)

  File "/usr/lib/python2.5/site-packages/django/core/handlers/base.py", line 53, in get_response
    response = self._real_get_response(request)

  File "/usr/lib/python2.5/site-packages/django/core/handlers/base.py", line 63, in _real_get_response
    response = middleware_method(request)

  File "/home/paul/work/site/trunk/bydesign/middleware/hotshot_profiler.py", line 13, in process_request
    if settings.DEBUG and request.has_key('prof'):

  File "/usr/lib/python2.5/site-packages/django/http/__init__.py", line 43, in has_key
    return key in self.GET or key in self.POST

  File "/usr/lib/python2.5/site-packages/django/core/handlers/modpython.py", line 73, in _get_post
    self._load_post_and_files()

  File "/usr/lib/python2.5/site-packages/django/core/handlers/modpython.py", line 56, in _load_post_and_files
    self._post, self._files = http.QueryDict(self.raw_post_data, encoding=self._encoding), datastructures.MultiValueDict()

  File "/usr/lib/python2.5/site-packages/django/http/__init__.py", line 148, in __init__
    for key, value in parse_qsl((query_string or ''), True): # keep_blank_values=True

TypeError: argument 1 must be string without null bytes, not str

Attachments (6)

parsesql.diff (1.4 KB ) - added by Martin Kanerva 16 years ago.
Another workaround
modpython.diff (1.6 KB ) - added by Paul Egan 16 years ago.
Patch to check content-type before parsing post body - updated to r9747
5611.diff (1.4 KB ) - added by Claude Paroz 12 years ago.
Patch updated to current trunk
5611-4.diff (5.2 KB ) - added by Claude Paroz 12 years ago.
Patch including test
5611-5.diff (7.5 KB ) - added by Claude Paroz 12 years ago.
Patch including docs
5611-6.diff (8.4 KB ) - added by Claude Paroz 12 years ago.
CONTENT_TYPE defaulting to the empty string

Download all attachments as: .zip

Change History (30)

comment:1 by tdterry, 17 years ago

This ticket it no longer an issue. Accessing request.POST will still try to parse the content using QueryDict, but QueryDict doesn't throw an exception for non-form-encoded data. You might get a bogus POST, but to access the raw content, you should use request.raw_post_data

comment:2 by Paul Egan, 17 years ago

Hm, is it fixed on a branch? On the trunk (r6844) the QueryDict constructor still throws an exception. A view which expects non url-encoded data will most likely use request.raw_post_data but any intervening code (e.g. middleware) that uses request.POST, request.REQUEST, request.has_key(), etc, will trigger this exception.

Just as it makes sense to check if the request content type is multipart for "file uploads", it seems sensible to check that the type is application/x-www-form-urlencoded before parsing the data as such. The only reason I can think of against such a check would be if there are user-agents out there which don't correctly set the content type for forms - but I don't believe that's the case.

comment:3 by Jacob, 17 years ago

Resolution: invalid
Status: newclosed

I can't reproduce on trunk [7168]:

[1] >>> from django.http import QueryDict

[2] >>> QueryDict("\0")
[2]   : <QueryDict: {u'\x00': [u'']}>

As you can see you get a nonsense querydict, but you don't get an exception.

comment:4 by Paul Egan, 17 years ago

Resolution: invalid
Status: closedreopened

As per ticket title & patch, this is an issue when running django with mod_python. django.http first tries to load mod_python's C implementation of parse_qsl (source:django/trunk/django/http/__init__.py@7168#L13), this will fail outside the mod_python context and so the test above is not valid.

Even if an exception wasn't thrown, I can't think of any good reason for an objection to checking the content type before parsing. Not only is it the right thing to do, it is what other frameworks do. For example:

comment:5 by Jacob, 17 years ago

Triage Stage: UnreviewedAccepted

Ah, I'd forgotten that we use Apache's parse_qsl. I wonder how much faster that really is... seems to confuse things a bit.

comment:6 by mentat, 16 years ago

Cc: jesse.lovelace@… added

I can verify this bug in r8921--wrapping parse_qsl in a try...except is a simple workaround we are using.

comment:7 by Paul Egan, 16 years ago

Wrapping the parser in a try block hides the error, but it is still terribly inefficient to attempt to parse potentially megabytes of data (perhaps POSTing image/jpeg) when you know beforehand from the content type that it is not application/x-www-form-urlencoded.

by Martin Kanerva, 16 years ago

Attachment: parsesql.diff added

Another workaround

comment:8 by Martin Kanerva, 16 years ago

Above workaround should apply cleanly to 1.0.

by Paul Egan, 16 years ago

Attachment: modpython.diff added

Patch to check content-type before parsing post body - updated to r9747

comment:9 by Gabriel Hurley, 13 years ago

Severity: Normal
Type: Cleanup/optimization

comment:10 by patchhammer, 13 years ago

Easy pickings: unset
Patch needs improvement: set

modpython.diff fails to apply cleanly on to trunk

comment:11 by Aymeric Augustin, 13 years ago

UI/UX: unset

Change UI/UX from NULL to False.

by Claude Paroz, 12 years ago

Attachment: 5611.diff added

Patch updated to current trunk

comment:12 by Claude Paroz, 12 years ago

Keywords: modpython removed
Needs tests: set
Patch needs improvement: unset
Summary: modpython handler should check content-type before parsing post bodyHTTPRequest should check content-type before parsing post body

No special opinion on the patch, just cleaned patch and description.

comment:13 by Steve Lacy, 12 years ago

Cc: slacy@… added

by Claude Paroz, 12 years ago

Attachment: 5611-4.diff added

Patch including test

comment:14 by Claude Paroz, 12 years ago

Needs tests: unset
Owner: nobody removed
Status: reopenednew

comment:15 by Claude Paroz, 12 years ago

#19124 was a duplicate with an interesting use case. Let's assume the possible backwards incompatibilities and do the right thing™!

comment:16 by Preston Holmes, 12 years ago

So I've given this some review - and it seems absolutely the right thing to do.

It seems fair to make a backwards incompatible change for anyone who is specifically abusing request.POST for non form data in this case.

Draft release note:

HttpRequest.POST will no longer include data posted via HTTP requests with non form-specific content-types in the header. In prior versions, data posted with content-types other than multipart/form-data or application/x-www-form-urlencoded would still end up represented in the request.POST attribute. Developers wishing to access the raw POST data for these cases, should use the request.body attribute instead.

We will also need a version changed note here

https://docs.djangoproject.com/en/dev/ref/request-response/#django.http.HttpRequest.POST

Clearly our current body attribute docs do a better job of explaining the situation, and can be borrowed from for an update to the POST attribute docs.

https://docs.djangoproject.com/en/dev/ref/request-response/#django.http.HttpRequest.body

by Claude Paroz, 12 years ago

Attachment: 5611-5.diff added

Patch including docs

comment:17 by Claude Paroz, 12 years ago

Thanks Preston for the review. I uploaded a new patch with documentation.

comment:18 by Claude Paroz, 12 years ago

I'm realizing that currently, we are defaulting no content-type to 'text/plain' for wsgi requests (see django/core/servers/basehttp.py, WSGIRequestHandler.get_environ). So there are two issues here:

  • what content-type to default to when the request doesn't have one (see #17474 for example). The spec seems to indicate 'application/octet-stream' and we are doing 'text/plain'
  • how do we treat non form data content types. This is the main issue of this ticket.

If we stick to the current patch (without changing the default content-type), I will have to remove the paragraph about default 'application/octet-stream'.

comment:19 by Preston Holmes, 12 years ago

The default to 'text/plain' you only added recently - so there is no BC commitment there: https://github.com/django/django/commit/91727c76cda3f8bf286ee69652bc3625f150f6e7

I think we should be defaulting to '' and retain essentially the absence of content-type - and continue to make sure the key exists but without burdening ourselves with a None check.

A default of '' seems consistent from what I can tell from https://code.djangoproject.com/ticket/17474#comment:2

It will be up to anyone reading the request to follow the spec and assume octet-stream - I don't think we need to materialize that assumption, otherwise its impossible for end user code to distinguish between an implicit and explicit content-type=application/octet-stream.

As far as the reference to octet-stream in the release note - I don't think that is a big deal either way. Currently the patch behavior does *conform* to that expectation - but that doesn't mean much in this case, as we don't process the post data for octet stream. I'd say its more accurate to say that we only process the post data when the content-type indicates form or multipart - otherwise we just leave it in the body attribute - which is what the test on the patch verifies.

by Claude Paroz, 12 years ago

Attachment: 5611-6.diff added

CONTENT_TYPE defaulting to the empty string

in reply to:  19 comment:20 by Claude Paroz, 12 years ago

Replying to ptone:

The default to 'text/plain' you only added recently - so there is no BC commitment there: https://github.com/django/django/commit/91727c76cda3f8bf286ee69652bc3625f150f6e7

Wow, didn't even realized *I* committed this :-/

I think we should be defaulting to '' and retain essentially the absence of content-type - and continue to make sure the key exists but without burdening ourselves with a None check.

Completely agree (fixed in new uploaded patch).

comment:21 by Preston Holmes, 12 years ago

Triage Stage: AcceptedReady for checkin

A couple doc suggestions - but then RFC IMO

https://code.djangoproject.com/attachment/ticket/5611/5611-6.diff#L62
"did also contain" replace with "contained"

Take the note starting with "Now..." and rephrase under the permanent section for attr POST ie:

...request contains form data. See the :class:QueryDict documentation below. If you are you need to access raw or non-form data posted in the request, access this through the :attr:HttpRequest.body attribute instead.

The point being that the version changed note is transient - but the point should be made concretely in the POST docs.

Version 0, edited 12 years ago by Preston Holmes (next)

comment:22 by Claude Paroz, 12 years ago

Now I understand why we added 'text-plain' in [91727c76cda3f8]. It appears to be the default behaviour of original Python WSGIRequestHandler (which calls self.headers.get_content_type(), which itself conforms to http://tools.ietf.org/html/rfc2045.html#section-5.2).
http://hg.python.org/cpython/file/2d2f206d040e/Lib/wsgiref/simple_server.py#l92

Should we diverge from Python? Should we report a bug? Or maybe this doesn't matter so much, because with the new behaviour, the content will still be left as is in the body attribute.

comment:23 by Claude Paroz, 12 years ago

After [681550ca6d], my latest question is solved. We will use Python default behaviour, that is defaulting to 'text/plain'.

Last edited 12 years ago by Claude Paroz (previous) (diff)

comment:24 by Claude Paroz <claude@…>, 12 years ago

Owner: set to Claude Paroz <claude@…>
Resolution: fixed
Status: newclosed

In dfd4a7175119ddb422d8426dcc15902265d5a428:

Fixed #5611 -- Restricted accepted content types in parsing POST data

Thanks paulegan for the report and Preston Holmes for the review.

Note: See TracTickets for help on using tickets.
Back to Top