Opened 6 years ago

Closed 11 months ago

#10190 closed New feature (fixed)

Charset should be customizable with HttpResponse

Reported by: Wonlay Owned by: Tim Graham <timograham@…>
Component: HTTP handling Version: master
Severity: Normal Keywords: HttpResponse, charset, runtime
Cc: unai@… Triage Stage: Accepted
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

When our web site is talking to other application or client, the charset of web response may vary from each other.

Currently only a settings.DEFAULT_CHARSET can be set with a unique value, we cannot change the HttpResponse charset at runtime properly.

I think we should add a charset parameter to HttpResponse.init.

Attachments (2)

http_response_charset.diff (1.2 KB) - added by Wonlay 6 years ago.
charset_without_mimetype.diff (1.0 KB) - added by ccahoon 6 years ago.
with changes suggested by mtreddinick (first draft)

Download all attachments as: .zip

Change History (34)

Changed 6 years ago by Wonlay

comment:1 follow-up: Changed 6 years ago by mtredinnick

  • Needs documentation unset
  • Needs tests unset
  • Patch needs improvement set

This patch should probably be just a one-liner: just setting self._charset. Don't mess around with the mimetype stuff, since that's not related. However, there's something more that should be worked out here to avoid all problems.

If the value of self._charset is always going to be valid in the HTTP content-type header, we should just extract it from the content_type parameter. That would only be valid if all Python codec values are also valid HTTP header values.

If that isn't possible, then self._charset shouldn't be used to construct the content-type header, as it will use an invalid value. The current "default" is only for really old code (and/or people using defaults likes utf-8). For anything non-standard, content_type should be being used with a proper character set definition.

So, improvements needed here are:

  1. Remove the changes to mimetype setting. Make the smallest change possible and that isn't needed.
  2. Find out whether Python codec names are all valid as HTTP charset names. If so, parse the content_type to see if a special charset encoding is needed (no need for an extra __init__ parameter).
  3. If the previous item is not applicable, don't use _charset in the content_type setting, since it could lead to invalid value (and we'll need a documentation update to clarify the requirements there).

comment:2 Changed 6 years ago by jacob

  • milestone set to 1.1
  • Triage Stage changed from Unreviewed to Accepted

comment:3 Changed 6 years ago by jacob

  • milestone changed from 1.1 to 1.2

comment:4 in reply to: ↑ 1 Changed 6 years ago by ccahoon

  • Owner changed from nobody to ccahoon
  1. Find out whether Python codec names are all valid as HTTP charset names. If so, parse the content_type to see if a special charset encoding is needed (no need for an extra __init__ parameter).

This is not the case. http://docs.python.org/library/codecs.html#standard-encodings ("A number of codecs are specific to Python, so their codec names have no meaning outside Python.") and http://www.cs.tut.fi/~jkorpela/chars/sorted.html which lacks a bunch of codecs listed in the first link.

  1. If the previous item is not applicable, don't use _charset in the content_type setting, since it could lead to invalid value (and we'll need a documentation update to clarify the requirements there).

Such as a list of valid codecs?

Changed 6 years ago by ccahoon

with changes suggested by mtreddinick (first draft)

comment:5 Changed 6 years ago by ccahoon

http://www.iana.org/assignments/character-sets appears to be the canonical list of valid character sets.

comment:6 Changed 6 years ago by ccahoon

Support for Accepted-Charset in an HttpRequest, charset=.. in the content_type, and manually setting the encoding all have implementations on http://code.djangoproject.com/browser/django/branches/soc2009/http-wsgi-improvements

As far as I understand, this code will not appear on the trunk until the end of Google Summer of Code, to be merged with the rest of the work on that branch. However, if someone wants a patch for trunk, ask, and I will cook one up.

comment:7 Changed 6 years ago by ccahoon

(In [11199]) [soc2009/http-wsgi-improvements] Change HttpResponse.status_code to a property, additional test coverage. Refs #10190.

Improve charset coverage. Change HttpResponse.status_code to property, which checks for a 406 situation. This also required changes to all HttpResponse subclasses, so that their default status_code is set by _status_code, now.

Passes regression test suite, except ones related to sendfile.

comment:8 Changed 6 years ago by ccahoon

  • Patch needs improvement unset
  • Triage Stage changed from Accepted to Fixed on a branch

comment:9 Changed 6 years ago by ccahoon

Fixed on branches/soc2009/http-wsgi-improvements.

comment:10 Changed 6 years ago by ccahoon

(In [11370]) [soc2009/http-wsgi-improvements] Fixes for HttpResponse.charsets docs and HttpResponse._charset. Refs #10190.

comment:11 Changed 6 years ago by ccahoon

(In [11378]) [soc2009/http-wsgi-improvements] Update the render_to_response shortcut to pass request and content_type to HttpResponse, to support Accept-Charset. Refs #10190.

Also updates docs.

comment:12 Changed 6 years ago by adamnelson

I may be way off base, but could this be the cause of a problem I'm having uploading an HTML block to an S3 backend using django-storages? The file 'somehow' gets encoded as UTF-32LE which seems to confuse Safari4. People will note that Safari renders this URL as text and Firefox understands it's HTML:

http://s3.amazonaws.com/renderadamperformlinecom/performline_page_source/dc0be20f5d3d8cba0b43deb85e2a054368e7410b.html

Here is the code from my views.py file:

    file_data = {u'content':SimpleUploadedFile(filename + '.html', content_file)}
    form = PageDisplayForm(post,file_data) 

(I've tried doing SimpleUploadedFile(filename,content_file,'text/html') and that didn't help

Where content_file is a block of HTML. Here's the relevant chunk from settings.py:

    from S3 import CallingFormat
    DEFAULT_FILE_STORAGE = 'backends.s3.S3Storage'
    AWS_CALLING_FORMAT = CallingFormat.PATH

comment:13 Changed 6 years ago by ccahoon

(In [11448]) [soc2009/http-wsgi-improvements] Clean up charset-related code in HttpResponse. Refs #10190.

comment:14 Changed 5 years ago by ubernostrum

  • milestone 1.2 deleted

1.2 is feature-frozen, moving this feature request off the milestone.

comment:15 Changed 4 years ago by ramiro

#13391 was closed as duplicate of this and included a patch with an implementation for this issue.

comment:16 Changed 4 years ago by SmileyChris

  • Severity set to Normal
  • Type set to New feature

comment:17 Changed 3 years ago by aaugustin

  • UI/UX unset

Change UI/UX from NULL to False.

comment:18 Changed 3 years ago by aaugustin

  • Easy pickings unset

Change Easy pickings from NULL to False.

comment:19 Changed 3 years ago by aaugustin

  • Owner changed from ccahoon to aaugustin

comment:20 Changed 22 months ago by aaugustin

  • Status changed from new to assigned

comment:21 Changed 22 months ago by aaugustin

  • Owner aaugustin deleted
  • Status changed from assigned to new

comment:22 Changed 20 months ago by unaizalakain

  • Owner set to unaizalakain
  • Status changed from new to assigned

comment:23 Changed 20 months ago by unaizalakain

  • Triage Stage changed from Fixed on a branch to Accepted

PR sent: https://github.com/django/django/pull/1928

HttpResponses charset is now depending on its Content-Type header key.
If no charset is found there, it falls back to the DEFAULT_CHARSET setting.

Last edited 20 months ago by unaizalakain (previous) (diff)

comment:24 Changed 19 months ago by aaugustin

  • Needs documentation set
  • Patch needs improvement set

This latest patch doesn't address the fact that Python charsets do not match IANA-registered character sets. What's your take on this issue?

comment:25 Changed 19 months ago by unaizalakain

  • Needs documentation unset
  • Patch needs improvement unset

For the sake of future readers, this is what we have spoken on IRC (#django-dev):

There are two possible workarounds to this:

  • Creating a dictionary between IANA registered charsets and Python charsets. This would be great but it means having to regularly update and maintain it.
  • Adding a charset initialization parameter and property to HttpResponse. If you have issues with the charset specification, just pass on the charset argument or set the charset property. If not set, it'll fall back to the extraction from the Content-Type response header and if that isn't possible, to the DEFAULT_CHARSET setting.

Finally, we decided for the second solution.

We also spoke about parsing the Accept-Charset request header so that the chain would be:
[charset parameter/property] ---> [content-type extraction] ---> [accept-charset extraction] ---> [default setting]

The problem with it was that the HttpResponse object could not possibly know what the request headers were. It's sad to have this limitation but there isn't any easy workaround.

So, apart of making the changes suggested in the review, I only added the charset parameter and property.

comment:26 Changed 19 months ago by aaugustin

  • Owner changed from unaizalakain to aaugustin

Tentatively assigning the ticket to myself to review the latest iteration of the patch.

comment:27 Changed 19 months ago by unaizalakain

Any updates so far?

comment:28 Changed 19 months ago by unaizalakain

  • Cc unai@… added

comment:29 Changed 14 months ago by aaugustin

  • Owner aaugustin deleted
  • Status changed from assigned to new

comment:30 Changed 11 months ago by timgraham

  • Patch needs improvement set

I left comments for improvement on the PR. Please uncheck "Patch needs improvement" when you update it, thanks.

comment:31 Changed 11 months ago by unaizalakain

  • Patch needs improvement unset

Comments addressed and PR updated ;-)

comment:32 Changed 11 months ago by Tim Graham <timograham@…>

  • Owner set to Tim Graham <timograham@…>
  • Resolution set to fixed
  • Status changed from new to closed

In 5f2542f12a90cfcfb7be776424ef2f7b200df006:

Fixed #10190 -- Made HttpResponse charset customizable.

Thanks to Simon Charette, Aymeric Augustin, and Tim Graham
for reviews and contributions.

Note: See TracTickets for help on using tickets.
Back to Top