Opened 16 years ago

Closed 10 years ago

#10190 closed New feature (fixed)

Charset should be customizable with HttpResponse

Reported by: Wonlay Owned by: Tim Graham <timograham@…>
Component: HTTP handling Version: dev
Severity: Normal Keywords: HttpResponse, charset, runtime
Cc: unai@… Triage Stage: Accepted
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

When our web site is talking to other application or client, the charset of web response may vary from each other.

Currently only a settings.DEFAULT_CHARSET can be set with a unique value, we cannot change the HttpResponse charset at runtime properly.

I think we should add a charset parameter to HttpResponse.init.

Attachments (2)

http_response_charset.diff (1.2 KB ) - added by Wonlay 16 years ago.
charset_without_mimetype.diff (1.0 KB ) - added by ccahoon 15 years ago.
with changes suggested by mtreddinick (first draft)

Download all attachments as: .zip

Change History (34)

by Wonlay, 16 years ago

Attachment: http_response_charset.diff added

comment:1 by Malcolm Tredinnick, 16 years ago

Patch needs improvement: set

This patch should probably be just a one-liner: just setting self._charset. Don't mess around with the mimetype stuff, since that's not related. However, there's something more that should be worked out here to avoid all problems.

If the value of self._charset is always going to be valid in the HTTP content-type header, we should just extract it from the content_type parameter. That would only be valid if all Python codec values are also valid HTTP header values.

If that isn't possible, then self._charset shouldn't be used to construct the content-type header, as it will use an invalid value. The current "default" is only for really old code (and/or people using defaults likes utf-8). For anything non-standard, content_type should be being used with a proper character set definition.

So, improvements needed here are:

  1. Remove the changes to mimetype setting. Make the smallest change possible and that isn't needed.
  2. Find out whether Python codec names are all valid as HTTP charset names. If so, parse the content_type to see if a special charset encoding is needed (no need for an extra __init__ parameter).
  3. If the previous item is not applicable, don't use _charset in the content_type setting, since it could lead to invalid value (and we'll need a documentation update to clarify the requirements there).

comment:2 by Jacob, 16 years ago

milestone: 1.1
Triage Stage: UnreviewedAccepted

comment:3 by Jacob, 16 years ago

milestone: 1.11.2

in reply to:  1 comment:4 by ccahoon, 15 years ago

Owner: changed from nobody to ccahoon
  1. Find out whether Python codec names are all valid as HTTP charset names. If so, parse the content_type to see if a special charset encoding is needed (no need for an extra __init__ parameter).

This is not the case. http://docs.python.org/library/codecs.html#standard-encodings ("A number of codecs are specific to Python, so their codec names have no meaning outside Python.") and http://www.cs.tut.fi/~jkorpela/chars/sorted.html which lacks a bunch of codecs listed in the first link.

  1. If the previous item is not applicable, don't use _charset in the content_type setting, since it could lead to invalid value (and we'll need a documentation update to clarify the requirements there).

Such as a list of valid codecs?

by ccahoon, 15 years ago

with changes suggested by mtreddinick (first draft)

comment:5 by ccahoon, 15 years ago

http://www.iana.org/assignments/character-sets appears to be the canonical list of valid character sets.

comment:6 by ccahoon, 15 years ago

Support for Accepted-Charset in an HttpRequest, charset=.. in the content_type, and manually setting the encoding all have implementations on http://code.djangoproject.com/browser/django/branches/soc2009/http-wsgi-improvements

As far as I understand, this code will not appear on the trunk until the end of Google Summer of Code, to be merged with the rest of the work on that branch. However, if someone wants a patch for trunk, ask, and I will cook one up.

comment:7 by ccahoon, 15 years ago

(In [11199]) [soc2009/http-wsgi-improvements] Change HttpResponse.status_code to a property, additional test coverage. Refs #10190.

Improve charset coverage. Change HttpResponse.status_code to property, which checks for a 406 situation. This also required changes to all HttpResponse subclasses, so that their default status_code is set by _status_code, now.

Passes regression test suite, except ones related to sendfile.

comment:8 by ccahoon, 15 years ago

Patch needs improvement: unset
Triage Stage: AcceptedFixed on a branch

comment:9 by ccahoon, 15 years ago

Fixed on branches/soc2009/http-wsgi-improvements.

comment:10 by ccahoon, 15 years ago

(In [11370]) [soc2009/http-wsgi-improvements] Fixes for HttpResponse.charsets docs and HttpResponse._charset. Refs #10190.

comment:11 by ccahoon, 15 years ago

(In [11378]) [soc2009/http-wsgi-improvements] Update the render_to_response shortcut to pass request and content_type to HttpResponse, to support Accept-Charset. Refs #10190.

Also updates docs.

comment:12 by Adam Nelson, 15 years ago

I may be way off base, but could this be the cause of a problem I'm having uploading an HTML block to an S3 backend using django-storages? The file 'somehow' gets encoded as UTF-32LE which seems to confuse Safari4. People will note that Safari renders this URL as text and Firefox understands it's HTML:

http://s3.amazonaws.com/renderadamperformlinecom/performline_page_source/dc0be20f5d3d8cba0b43deb85e2a054368e7410b.html

Here is the code from my views.py file:

    file_data = {u'content':SimpleUploadedFile(filename + '.html', content_file)}
    form = PageDisplayForm(post,file_data) 

(I've tried doing SimpleUploadedFile(filename,content_file,'text/html') and that didn't help

Where content_file is a block of HTML. Here's the relevant chunk from settings.py:

    from S3 import CallingFormat
    DEFAULT_FILE_STORAGE = 'backends.s3.S3Storage'
    AWS_CALLING_FORMAT = CallingFormat.PATH

comment:13 by ccahoon, 15 years ago

(In [11448]) [soc2009/http-wsgi-improvements] Clean up charset-related code in HttpResponse. Refs #10190.

comment:14 by James Bennett, 15 years ago

milestone: 1.2

1.2 is feature-frozen, moving this feature request off the milestone.

comment:15 by Ramiro Morales, 14 years ago

#13391 was closed as duplicate of this and included a patch with an implementation for this issue.

comment:16 by Chris Beaven, 14 years ago

Severity: Normal
Type: New feature

comment:17 by Aymeric Augustin, 13 years ago

UI/UX: unset

Change UI/UX from NULL to False.

comment:18 by Aymeric Augustin, 13 years ago

Easy pickings: unset

Change Easy pickings from NULL to False.

comment:19 by Aymeric Augustin, 12 years ago

Owner: changed from ccahoon to Aymeric Augustin

comment:20 by Aymeric Augustin, 11 years ago

Status: newassigned

comment:21 by Aymeric Augustin, 11 years ago

Owner: Aymeric Augustin removed
Status: assignednew

comment:22 by Unai Zalakain, 11 years ago

Owner: set to Unai Zalakain
Status: newassigned

comment:23 by Unai Zalakain, 11 years ago

Triage Stage: Fixed on a branchAccepted

PR sent: https://github.com/django/django/pull/1928

HttpResponses charset is now depending on its Content-Type header key.
If no charset is found there, it falls back to the DEFAULT_CHARSET setting.

Version 0, edited 11 years ago by Unai Zalakain (next)

comment:24 by Aymeric Augustin, 11 years ago

Needs documentation: set
Patch needs improvement: set

This latest patch doesn't address the fact that Python charsets do not match IANA-registered character sets. What's your take on this issue?

comment:25 by Unai Zalakain, 11 years ago

Needs documentation: unset
Patch needs improvement: unset

For the sake of future readers, this is what we have spoken on IRC (#django-dev):

There are two possible workarounds to this:

  • Creating a dictionary between IANA registered charsets and Python charsets. This would be great but it means having to regularly update and maintain it.
  • Adding a charset initialization parameter and property to HttpResponse. If you have issues with the charset specification, just pass on the charset argument or set the charset property. If not set, it'll fall back to the extraction from the Content-Type response header and if that isn't possible, to the DEFAULT_CHARSET setting.

Finally, we decided for the second solution.

We also spoke about parsing the Accept-Charset request header so that the chain would be:
[charset parameter/property] ---> [content-type extraction] ---> [accept-charset extraction] ---> [default setting]

The problem with it was that the HttpResponse object could not possibly know what the request headers were. It's sad to have this limitation but there isn't any easy workaround.

So, apart of making the changes suggested in the review, I only added the charset parameter and property.

comment:26 by Aymeric Augustin, 11 years ago

Owner: changed from Unai Zalakain to Aymeric Augustin

Tentatively assigning the ticket to myself to review the latest iteration of the patch.

comment:27 by Unai Zalakain, 11 years ago

Any updates so far?

comment:28 by Unai Zalakain, 11 years ago

Cc: unai@… added

comment:29 by Aymeric Augustin, 11 years ago

Owner: Aymeric Augustin removed
Status: assignednew

comment:30 by Tim Graham, 10 years ago

Patch needs improvement: set

I left comments for improvement on the PR. Please uncheck "Patch needs improvement" when you update it, thanks.

comment:31 by Unai Zalakain, 10 years ago

Patch needs improvement: unset

Comments addressed and PR updated ;-)

comment:32 by Tim Graham <timograham@…>, 10 years ago

Owner: set to Tim Graham <timograham@…>
Resolution: fixed
Status: newclosed

In 5f2542f12a90cfcfb7be776424ef2f7b200df006:

Fixed #10190 -- Made HttpResponse charset customizable.

Thanks to Simon Charette, Aymeric Augustin, and Tim Graham
for reviews and contributions.

Note: See TracTickets for help on using tickets.
Back to Top