Opened 8 years ago

Closed 3 years ago

Last modified 3 years ago

#5076 closed Cleanup/optimization (fixed)

Request object should try to determine encoding

Reported by: dbr <daniel@…> Owned by: Claude Paroz <claude@…>
Component: HTTP handling Version: master
Severity: Normal Keywords: unicode conversion charset POST request
Cc: Triage Stage: Ready for checkin
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

(As discussed here on django-developers.)

When recieving POST-data it is not certain that the charset is declared
in the Content-Type header. This leads to a situation where the POST data, when it is converted to
unicode, will have all non-ascii characters translated to \ufffd (the
codepoint for an unknown character). Probably desireable if the original data
was, for instance, latin-1 encoded.

Attachments (3)

django_request_charset.diff (2.6 KB) - added by dbr <daniel@…> 8 years ago.
A first try at a fix
django_request_charset.2.diff (2.6 KB) - added by dbr <daniel@…> 8 years ago.
5076-3.diff (2.9 KB) - added by claudep 3 years ago.
Get request encoding from content-type header

Download all attachments as: .zip

Change History (19)

Changed 8 years ago by dbr <daniel@…>

A first try at a fix

comment:1 Changed 8 years ago by dbr <daniel@…>

  • Needs documentation unset
  • Needs tests unset
  • Patch needs improvement unset

This was a great deal more challenging than I thought it would be, mostly due to the fact that I don't have much knowledge of django's internals. Please feel free to poke fun at my code (or at least suggest a better solution).

A few things to note for this patch:

  1. Since I don't use mod_python, I have not made any changes to the mod_python request object since I would not be able to test it
  2. I'd rather see the detect_charset() code executed by init() in the base HttpRequest-class, but since you override it i figured you do that for a reason.
  3. If the encoding is determined in the QuerySet class (the try-except-clause in the patch), the request object will never now. Not sure this matters though.

.. there is probably more issues with my approach, so let's continue the discussion..

comment:2 Changed 8 years ago by Simon G. <dev@…>

  • Patch needs improvement set
  • Triage Stage changed from Unreviewed to Accepted

Changed 8 years ago by dbr <daniel@…>

comment:3 Changed 8 years ago by dbr <daniel@…>

Just added one line to the previous diff.

I was thinking maybe to skip the detect_charset() code entirely and just rely on setting the encoding when the QuerySet is created, not sure what makes most sense here..

comment:4 Changed 7 years ago by Piotr Lewandowski <django@…>

  • Component changed from Uncategorized to HTTP handling

comment:5 Changed 7 years ago by ahlongxp

  • milestone set to post-1.0
force_unicode(key, encoding, errors='replace')

is depressing.
It'd better if we can choose whether to throw exceptions

comment:6 Changed 7 years ago by anonymous

  • milestone post-1.0 deleted

Milestone post-1.0 deleted

comment:7 Changed 6 years ago by ccahoon

  • Owner changed from nobody to ccahoon

comment:8 Changed 6 years ago by anonymous

  • Owner ccahoon deleted

comment:9 Changed 4 years ago by gabrielhurley

  • Severity set to Normal
  • Type set to Cleanup/optimization

comment:10 Changed 4 years ago by aaugustin

  • UI/UX unset

Change UI/UX from NULL to False.

comment:11 Changed 4 years ago by aaugustin

  • Easy pickings unset

Change Easy pickings from NULL to False.

comment:12 Changed 3 years ago by aaugustin

The patch for #19101 includes the fix for this ticket. However, I'd prefer to fix these issues separately, so I'm going to comment here.

I'd like to see a more robust media type parsing. The grammar is simple enough to implement correctly. For example:

content_type = self.META.get('CONTENT_TYPE', '')
content_type, _, parameters = content_type.partition(';')
content_params = {}
for parameter in parameters.split(';'):
    k, _, v = parameter.strip().partition('=')
    content_params[k] = v

It could make sense to move this to a utility function or at least as a method of the HttpRequest object, as in the patches above.

Then we just do:

if 'charset' in content_params:
    ...

Changed 3 years ago by claudep

Get request encoding from content-type header

comment:13 Changed 3 years ago by claudep

  • Patch needs improvement unset

Patch separated from the one in #19101 and adapted following comment:12.

comment:14 Changed 3 years ago by aaugustin

  • Triage Stage changed from Accepted to Ready for checkin

The patch looks good to me.

Could you just mention the source for the algorithm in the docstring?

def _parse_content_type(self, ctype):
    """Media Types parsing according to RFC 2616, section 3.7.

    Returns the data type and parameters. For example:
    Input: "text/plain; charset=iso-8859-1"
    Output: ('text/plain', {'charset': 'iso-8859-1'})
    """

comment:15 Changed 3 years ago by Claude Paroz <claude@…>

  • Owner set to Claude Paroz <claude@…>
  • Resolution set to fixed
  • Status changed from new to closed

In 6de6988f9990b4b53f5a20bfc3811b3cc49291c5:

Fixed #5076 -- Properly decode POSTs with non-utf-8 payload encoding

Thanks daniel at blogg.se for the report and Aymeric Augustin for
his assistance on the patch.

comment:16 Changed 3 years ago by Claude Paroz <claude@…>

In 3f3076edbf8d6fb204984a1a7fddbde408d5b104:

[1.5.x] Fixed #5076 -- Properly decode POSTs with non-utf-8 payload encoding

Thanks daniel at blogg.se for the report and Aymeric Augustin for
his assistance on the patch.

Backport of 6de6988f9 from master.

Note: See TracTickets for help on using tickets.
Back to Top