Opened 17 years ago

Closed 11 years ago

Last modified 11 years ago

#5076 closed Cleanup/optimization (fixed)

Request object should try to determine encoding

Reported by: dbr <daniel@…> Owned by: Claude Paroz <claude@…>
Component: HTTP handling Version: dev
Severity: Normal Keywords: unicode conversion charset POST request
Cc: Triage Stage: Ready for checkin
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

(As discussed here on django-developers.)

When recieving POST-data it is not certain that the charset is declared
in the Content-Type header. This leads to a situation where the POST data, when it is converted to
unicode, will have all non-ascii characters translated to \ufffd (the
codepoint for an unknown character). Probably desireable if the original data
was, for instance, latin-1 encoded.

Attachments (3)

django_request_charset.diff (2.6 KB ) - added by dbr <daniel@…> 17 years ago.
A first try at a fix
django_request_charset.2.diff (2.6 KB ) - added by dbr <daniel@…> 17 years ago.
5076-3.diff (2.9 KB ) - added by Claude Paroz 11 years ago.
Get request encoding from content-type header

Download all attachments as: .zip

Change History (19)

by dbr <daniel@…>, 17 years ago

Attachment: django_request_charset.diff added

A first try at a fix

comment:1 by dbr <daniel@…>, 17 years ago

This was a great deal more challenging than I thought it would be, mostly due to the fact that I don't have much knowledge of django's internals. Please feel free to poke fun at my code (or at least suggest a better solution).

A few things to note for this patch:

  1. Since I don't use mod_python, I have not made any changes to the mod_python request object since I would not be able to test it
  2. I'd rather see the detect_charset() code executed by init() in the base HttpRequest-class, but since you override it i figured you do that for a reason.
  3. If the encoding is determined in the QuerySet class (the try-except-clause in the patch), the request object will never now. Not sure this matters though.

.. there is probably more issues with my approach, so let's continue the discussion..

comment:2 by Simon G. <dev@…>, 17 years ago

Patch needs improvement: set
Triage Stage: UnreviewedAccepted

by dbr <daniel@…>, 17 years ago

comment:3 by dbr <daniel@…>, 17 years ago

Just added one line to the previous diff.

I was thinking maybe to skip the detect_charset() code entirely and just rely on setting the encoding when the QuerySet is created, not sure what makes most sense here..

comment:4 by Piotr Lewandowski <django@…>, 16 years ago

Component: UncategorizedHTTP handling

comment:5 by ahlongxp, 15 years ago

milestone: post-1.0
force_unicode(key, encoding, errors='replace')

is depressing.
It'd better if we can choose whether to throw exceptions

comment:6 by (none), 15 years ago

milestone: post-1.0

Milestone post-1.0 deleted

comment:7 by ccahoon, 15 years ago

Owner: changed from nobody to ccahoon

comment:8 by anonymous, 15 years ago

Owner: ccahoon removed

comment:9 by Gabriel Hurley, 13 years ago

Severity: Normal
Type: Cleanup/optimization

comment:10 by Aymeric Augustin, 12 years ago

UI/UX: unset

Change UI/UX from NULL to False.

comment:11 by Aymeric Augustin, 12 years ago

Easy pickings: unset

Change Easy pickings from NULL to False.

comment:12 by Aymeric Augustin, 11 years ago

The patch for #19101 includes the fix for this ticket. However, I'd prefer to fix these issues separately, so I'm going to comment here.

I'd like to see a more robust media type parsing. The grammar is simple enough to implement correctly. For example:

content_type = self.META.get('CONTENT_TYPE', '')
content_type, _, parameters = content_type.partition(';')
content_params = {}
for parameter in parameters.split(';'):
    k, _, v = parameter.strip().partition('=')
    content_params[k] = v

It could make sense to move this to a utility function or at least as a method of the HttpRequest object, as in the patches above.

Then we just do:

if 'charset' in content_params:
    ...

by Claude Paroz, 11 years ago

Attachment: 5076-3.diff added

Get request encoding from content-type header

comment:13 by Claude Paroz, 11 years ago

Patch needs improvement: unset

Patch separated from the one in #19101 and adapted following comment:12.

comment:14 by Aymeric Augustin, 11 years ago

Triage Stage: AcceptedReady for checkin

The patch looks good to me.

Could you just mention the source for the algorithm in the docstring?

def _parse_content_type(self, ctype):
    """Media Types parsing according to RFC 2616, section 3.7.

    Returns the data type and parameters. For example:
    Input: "text/plain; charset=iso-8859-1"
    Output: ('text/plain', {'charset': 'iso-8859-1'})
    """

comment:15 by Claude Paroz <claude@…>, 11 years ago

Owner: set to Claude Paroz <claude@…>
Resolution: fixed
Status: newclosed

In 6de6988f9990b4b53f5a20bfc3811b3cc49291c5:

Fixed #5076 -- Properly decode POSTs with non-utf-8 payload encoding

Thanks daniel at blogg.se for the report and Aymeric Augustin for
his assistance on the patch.

comment:16 by Claude Paroz <claude@…>, 11 years ago

In 3f3076edbf8d6fb204984a1a7fddbde408d5b104:

[1.5.x] Fixed #5076 -- Properly decode POSTs with non-utf-8 payload encoding

Thanks daniel at blogg.se for the report and Aymeric Augustin for
his assistance on the patch.

Backport of 6de6988f9 from master.

Note: See TracTickets for help on using tickets.
Back to Top