#5076 closed Cleanup/optimization (fixed)
Request object should try to determine encoding
| Reported by: |  | Owned by: |  |
| --- | --- | --- | --- |
| Component: | HTTP handling | Version: | dev |
| Severity: | Normal | Keywords: | unicode conversion charset POST request |
| Cc: |  | Triage Stage: | Ready for checkin |
| Has patch: | yes | Needs documentation: | no |
| Needs tests: | no | Patch needs improvement: | no |
| Easy pickings: | no | UI/UX: | no |
Description
(As discussed here on django-developers.)
When receiving POST data it is not certain that the charset is declared in the Content-Type header. This leads to a situation where the POST data, when it is converted to unicode, will have all non-ASCII characters translated to \ufffd (the codepoint for an unknown character). This is probably not desirable if the original data was, for instance, latin-1 encoded.
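As a quick, hypothetical illustration of the problem (not taken from the ticket's patches): latin-1 form data decoded as UTF-8 with errors='replace' silently loses every non-ASCII character, while decoding with the correct charset round-trips cleanly.

```python
# Hypothetical illustration of the problem described above.
raw = 'räksmörgås'.encode('latin-1')   # what a latin-1 browser would submit

print(raw.decode('utf-8', 'replace'))  # 'r\ufffdksm\ufffdrg\ufffds' -- U+FFFD replacement characters
print(raw.decode('latin-1'))           # 'räksmörgås' -- lossless with the declared charset
```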
Attachments (3)
Change History (19)
by , 17 years ago
Attachment: django_request_charset.diff added ("A first try at a fix")
comment:1 by , 17 years ago
This was a great deal more challenging than I thought it would be, mostly because I don't have much knowledge of Django's internals. Please feel free to poke fun at my code (or at least suggest a better solution).
A few things to note for this patch:
- Since I don't use mod_python, I have not made any changes to the mod_python request object, as I would not be able to test them.
- I'd rather see the detect_charset() code executed by __init__() in the base HttpRequest class (a rough sketch of the idea follows below), but since you override it I figured you do that for a reason.
- If the encoding is determined in the QueryDict class (the try-except clause in the patch), the request object will never know. Not sure this matters, though.

... there are probably more issues with my approach, so let's continue the discussion.
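The attached patch is not reproduced here, but as a rough, hypothetical sketch of the idea discussed above (the name detect_charset comes from the comment; the body is an assumption), the helper could look something like this:

```python
from django.conf import settings

def detect_charset(request):
    """Hypothetical sketch: return the charset declared in the request's
    Content-Type header, falling back to the site-wide default."""
    content_type = request.META.get('CONTENT_TYPE', '')
    for parameter in content_type.split(';')[1:]:
        key, _, value = parameter.strip().partition('=')
        if key.lower() == 'charset' and value:
            return value.strip('"')
    # No charset parameter was declared in the header.
    return settings.DEFAULT_CHARSET
```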
comment:2 by , 17 years ago
Patch needs improvement: set
Triage Stage: Unreviewed → Accepted
by , 17 years ago
Attachment: django_request_charset.2.diff added
comment:3 by , 17 years ago
Just added one line to the previous diff.
I was thinking maybe we should skip the detect_charset() code entirely and just rely on setting the encoding when the QueryDict is created; I'm not sure what makes the most sense here.
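For reference, QueryDict does take an encoding argument, so the alternative mentioned here roughly amounts to constructing the POST data with an explicitly chosen charset (hypothetical example, not the patch itself):

```python
from django.http import QueryDict

# Hypothetical example: parse form data with an explicit charset instead of
# relying on settings.DEFAULT_CHARSET.
post = QueryDict('name=r%E4ksm%F6rg%E5s', encoding='iso-8859-1')
print(post['name'])  # 'räksmörgås'
```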
comment:4 by , 16 years ago
Component: Uncategorized → HTTP handling
comment:5 by , 16 years ago
Milestone: → post-1.0
force_unicode(key, encoding, errors='replace') is depressing. It'd be better if we could choose whether to throw exceptions.
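For context, force_unicode() (in django.utils.encoding at the time, later renamed force_text) already exposes that choice through its errors argument; the complaint is presumably that Django's form-data handling hard-codes 'replace'. Illustrative only:

```python
from django.utils.encoding import force_unicode  # force_text in later Django versions

raw = u'räksmörgås'.encode('latin-1')

force_unicode(raw, 'utf-8', errors='replace')  # u'r\ufffdksm\ufffdrg\ufffds', silently lossy
force_unicode(raw, 'utf-8', errors='strict')   # raises DjangoUnicodeDecodeError instead
```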
comment:7 by , 15 years ago
Owner: changed
comment:8 by , 15 years ago
Owner: removed
comment:9 by , 14 years ago
Severity: → Normal
Type: → Cleanup/optimization
comment:12 by , 12 years ago
The patch for #19101 includes the fix for this ticket. However, I'd prefer to fix these issues separately, so I'm going to comment here.
I'd like to see more robust media type parsing. The grammar is simple enough to implement correctly. For example:
    content_type = self.META.get('CONTENT_TYPE', '')
    content_type, _, parameters = content_type.partition(';')
    content_params = {}
    for parameter in parameters.split(';'):
        k, _, v = parameter.strip().partition('=')
        content_params[k] = v
It could make sense to move this to a utility function, or at least make it a method of the HttpRequest object, as in the patches above.
Then we just do:
    if 'charset' in content_params: ...
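As a rough illustration of how the parsed parameters would then be used (not necessarily what the committed fix does), a declared charset could simply be assigned to request.encoding, the documented attribute that makes Django re-decode GET/POST with a new charset:

```python
# Hypothetical usage of the content_params dict built above, e.g. for
# "application/x-www-form-urlencoded; charset=iso-8859-1".
if 'charset' in content_params:
    request.encoding = content_params['charset']
# Otherwise Django keeps using settings.DEFAULT_CHARSET.
```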
comment:13 by , 12 years ago
Patch needs improvement: unset
Patch separated from the one in #19101 and adapted following comment:12.
comment:14 by , 12 years ago
Triage Stage: Accepted → Ready for checkin
The patch looks good to me.
Could you just mention the source for the algorithm in the docstring?
    def _parse_content_type(self, ctype):
        """Media Types parsing according to RFC 2616, section 3.7.

        Returns the data type and parameters. For example:
        Input: "text/plain; charset=iso-8859-1"
        Output: ('text/plain', {'charset': 'iso-8859-1'})
        """
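Combining that docstring with the parsing loop from comment:12, a minimal sketch of the method might look like the following (illustrative only; the committed patch may differ in details such as quoted parameter values):

```python
def _parse_content_type(self, ctype):
    # Docstring as quoted above, omitted here for brevity.
    content_type, _, params = ctype.partition(';')
    content_params = {}
    for parameter in params.split(';'):
        k, _, v = parameter.strip().partition('=')
        if k:
            content_params[k.lower()] = v.strip('"')
    return content_type.strip().lower(), content_params
```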
comment:15 by , 12 years ago
Owner: set
Resolution: → fixed
Status: new → closed