#5076 closed Cleanup/optimization (fixed)
Request object should try to determine encoding
| Reported by: | Owned by: | ||
|---|---|---|---|
| Component: | HTTP handling | Version: | dev |
| Severity: | Normal | Keywords: | unicode conversion charset POST request |
| Cc: | Triage Stage: | Ready for checkin | |
| Has patch: | yes | Needs documentation: | no |
| Needs tests: | no | Patch needs improvement: | no |
| Easy pickings: | no | UI/UX: | no |
Description
(As discussed here on django-developers.)
When recieving POST-data it is not certain that the charset is declared
in the Content-Type header. This leads to a situation where the POST data, when it is converted to
unicode, will have all non-ascii characters translated to \ufffd (the
codepoint for an unknown character). Probably desireable if the original data
was, for instance, latin-1 encoded.
Attachments (3)
Change History (19)
by , 18 years ago
| Attachment: | django_request_charset.diff added |
|---|
comment:1 by , 18 years ago
This was a great deal more challenging than I thought it would be, mostly due to the fact that I don't have much knowledge of django's internals. Please feel free to poke fun at my code (or at least suggest a better solution).
A few things to note for this patch:
- Since I don't use mod_python, I have not made any changes to the mod_python request object since I would not be able to test it
- I'd rather see the detect_charset() code executed by init() in the base HttpRequest-class, but since you override it i figured you do that for a reason.
- If the encoding is determined in the QuerySet class (the try-except-clause in the patch), the request object will never now. Not sure this matters though.
.. there is probably more issues with my approach, so let's continue the discussion..
comment:2 by , 18 years ago
| Patch needs improvement: | set |
|---|---|
| Triage Stage: | Unreviewed → Accepted |
by , 18 years ago
| Attachment: | django_request_charset.2.diff added |
|---|
comment:3 by , 18 years ago
Just added one line to the previous diff.
I was thinking maybe to skip the detect_charset() code entirely and just rely on setting the encoding when the QuerySet is created, not sure what makes most sense here..
comment:4 by , 17 years ago
| Component: | Uncategorized → HTTP handling |
|---|
comment:5 by , 17 years ago
| milestone: | → post-1.0 |
|---|
force_unicode(key, encoding, errors='replace')
is depressing.
It'd better if we can choose whether to throw exceptions
comment:7 by , 16 years ago
| Owner: | changed from to |
|---|
comment:8 by , 16 years ago
| Owner: | removed |
|---|
comment:9 by , 15 years ago
| Severity: | → Normal |
|---|---|
| Type: | → Cleanup/optimization |
comment:12 by , 13 years ago
The patch for #19101 includes the fix for this ticket. However, I'd prefer to fix these issues separately, so I'm going to comment here.
I'd like to see a more robust media type parsing. The grammar is simple enough to implement correctly. For example:
content_type = self.META.get('CONTENT_TYPE', '')
content_type, _, parameters = content_type.partition(';')
content_params = {}
for parameter in parameters.split(';'):
k, _, v = parameter.strip().partition('=')
content_params[k] = v
It could make sense to move this to a utility function or at least as a method of the HttpRequest object, as in the patches above.
Then we just do:
if 'charset' in content_params:
...
comment:13 by , 13 years ago
| Patch needs improvement: | unset |
|---|
Patch separated from the one in #19101 and adapted following comment:12.
comment:14 by , 13 years ago
| Triage Stage: | Accepted → Ready for checkin |
|---|
The patch looks good to me.
Could you just mention the source for the algorithm in the docstring?
def _parse_content_type(self, ctype):
"""Media Types parsing according to RFC 2616, section 3.7.
Returns the data type and parameters. For example:
Input: "text/plain; charset=iso-8859-1"
Output: ('text/plain', {'charset': 'iso-8859-1'})
"""
comment:15 by , 13 years ago
| Owner: | set to |
|---|---|
| Resolution: | → fixed |
| Status: | new → closed |
A first try at a fix