Opened 12 years ago
Closed 12 years ago
#18909 closed Bug (invalid)
QueryString with non-ASCII characters under Windows (Python 2.7) may be decoded improperly
Reported by: | | Owned by: | nobody |
---|---|---|---|
Component: | Core (URLs) | Version: | 1.4 |
Severity: | Normal | Keywords: | |
Cc: | | Triage Stage: | Unreviewed |
Has patch: | no | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
Under Python 2.7, the return type of cgi.parse_qsl depends on the type of its input argument. Take the following query string as an example:
    >>> cgi.parse_qsl('q=%E4%BD%A0%E5%A5%BD')
    [('q', '\xe4\xbd\xa0\xe5\xa5\xbd')]
    >>> cgi.parse_qsl(u'q=%E4%BD%A0%E5%A5%BD')
    [(u'q', u'\xe4\xbd\xa0\xe5\xa5\xbd')]
The URL-encoded string "%E4%BD%A0%E5%A5%BD" is the UTF-8 encoding of two Chinese characters (你好). On its own this causes little trouble, but when the result is used together with django.http.QueryDict, something bad happens.
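For readers on Python 3, where cgi.parse_qsl has been removed, the two results above can be approximated with urllib.parse. This is only a sketch: decoding as Latin-1 is an assumption used here to imitate what Python 2.7 returned for a unicode argument, not the code path Python 2.7 actually ran.

```python
from urllib.parse import parse_qsl, unquote_to_bytes

QUERY = 'q=%E4%BD%A0%E5%A5%BD'

# Percent-decoding yields raw bytes; what they mean depends on the encoding.
raw = unquote_to_bytes('%E4%BD%A0%E5%A5%BD')

# Decoded as UTF-8, the six bytes form the two Chinese characters.
as_utf8 = parse_qsl(QUERY, encoding='utf-8')

# Decoded as Latin-1, every byte becomes its own code point, which matches
# the u'\xe4\xbd\xa0\xe5\xa5\xbd' value shown above for the unicode input.
as_latin1 = parse_qsl(QUERY, encoding='latin-1')

print(raw)
print(ascii(as_utf8))
print(ascii(as_latin1))
```

The first result holds two characters, the second holds six, even though both come from the same percent-encoded bytes.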
    def __init__(self, query_string, mutable=False, encoding=None):
        super(QueryDict, self).__init__()
        if not encoding:
            encoding = settings.DEFAULT_CHARSET
        self.encoding = encoding
        #if (isinstance(query_string, unicode)):  # These two lines were added by me.
        #    query_string = query_string.encode('utf-8')
        if six.PY3:
            for key, value in parse_qsl(query_string or '',
                                        keep_blank_values=True,
                                        encoding=encoding):
                self.appendlist(key, value)
        else:
            for key, value in parse_qsl(query_string or '',
                                        keep_blank_values=True):
                self.appendlist(force_text(key, encoding, errors='replace'),
                                force_text(value, encoding, errors='replace'))
        self._mutable = mutable
Although the keys and values are meant to be converted to Unicode, the conversion never gets a chance to happen: if query_string is already a unicode object, parse_qsl returns unicode values, and force_text leaves those unchanged. The value remains u'\xe4\xbd\xa0\xe5\xa5\xbd', which is certainly not a valid UTF-16 string. I can't say whether this bug is silently fixed somewhere else in Django. What's more, I could not reproduce the bug under Linux (UTF-8), but under Windows it really exists.
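Why the conversion never takes effect can be sketched without Django. The helper below, force_text_sketch, is a hypothetical stand-in for force_text, not Django's implementation. Like the real function, it decodes byte strings but returns text unchanged, so a value that is already text passes through untouched, however wrongly it was decoded. The sketch is written for Python 3, and the mis-decoded value is built by hand (via Latin-1, an assumption) to imitate the Python 2.7 result.

```python
def force_text_sketch(value, encoding='utf-8', errors='strict'):
    """Hypothetical stand-in for force_text: decode bytes, pass text through."""
    if isinstance(value, str):
        return value  # already text: no decoding happens
    if isinstance(value, bytes):
        return value.decode(encoding, errors)
    return str(value)


raw_bytes = b'\xe4\xbd\xa0\xe5\xa5\xbd'    # result of the byte-string call
mis_decoded = raw_bytes.decode('latin-1')  # imitates the unicode-input result

good = force_text_sketch(raw_bytes, 'utf-8', errors='replace')
bad = force_text_sketch(mis_decoded, 'utf-8', errors='replace')

# Latin-1 maps bytes to code points one to one, so the damage is reversible.
repaired = bad.encode('latin-1').decode('utf-8')

print(len(good), len(bad), repaired == good)
```

The two commented-out lines in the snippet above take the other route: encoding the unicode query string to bytes before parsing, so that the decoding step runs on bytes.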
I am uploading my Django application along with this ticket. It is a very small application that just displays everything the client sent to the server. You can start it with manage.py runserver and open "http://localhost:8000/?q=%E4%BD%A0%E5%A5%BD" to see whether the bug occurs.
Note: I discovered this under Windows XP SP3 (Simplified Chinese) with system encoding = gbk (although Django is set to utf-8).
Attachments (1)
Change History (3)
comment:2 by , 12 years ago
Resolution: | → invalid |
---|---|
Status: | new → closed |
I think this is invalid.
If you want Django to interpret incoming data from the client in a specific encoding, you need to set DEFAULT_CHARSET to that encoding or set request.encoding, and your project does neither. Unfortunately, due to an oversight in HTTP, there is no automatic way for Django to know which encoding the client is using. Normally this works correctly most of the time because you declare the encoding on your web page, and the client (browser) then uses the same encoding when submitting GET/POST data.
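The point about matching encodings can be illustrated with the standard library alone. This is a minimal sketch in Python 3, and it assumes urllib.parse is an acceptable stand-in for Django's request handling. The same characters decode correctly only when the server assumes the encoding the client actually used.

```python
from urllib.parse import parse_qsl, quote

text = '\u4f60\u597d'  # the two characters from the report

# What a client would send if it encoded the form data as UTF-8 or as GBK.
utf8_query = 'q=' + quote(text, encoding='utf-8')
gbk_query = 'q=' + quote(text, encoding='gbk')

# Server and client agree on the encoding: the text round-trips.
matched_utf8 = parse_qsl(utf8_query, encoding='utf-8')
matched_gbk = parse_qsl(gbk_query, encoding='gbk')

# Server assumes GBK but the client sent UTF-8: the result is garbled.
mismatched = parse_qsl(utf8_query, encoding='gbk', errors='replace')

print(utf8_query, gbk_query)
print(matched_utf8 == matched_gbk, mismatched == matched_utf8)
```

Nothing in the query string itself says which of the two encodings was used, which is why the server-side setting has to be chosen explicitly.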
I think there is also some confusion here about the difference between 'Unicode' and 'UTF-16'. UTF-16 is not Unicode itself; it is the form in which most Windows APIs pass Unicode around, but Python does not work that way. You seem to be expecting something to get automatically translated to UTF-16, which isn't going to happen. There is also some confusion about Python byte strings and Python unicode strings.
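The distinction between Unicode text and its encodings can be made concrete with a short sketch (Python 3, where str is the text type and bytes the byte-string type). The same two code points produce different byte sequences under UTF-8, UTF-16 and GBK, and none of those byte sequences is "the Unicode string" itself.

```python
text = '\u4f60\u597d'  # two Unicode code points

code_points = [hex(ord(ch)) for ch in text]

# Three different byte representations of the same text.
utf8 = text.encode('utf-8')
utf16 = text.encode('utf-16-le')
gbk = text.encode('gbk')

print(code_points)
print(len(utf8), len(utf16), len(gbk))

# Each encoding decodes back to the same text when the right codec is used.
assert utf8.decode('utf-8') == utf16.decode('utf-16-le') == gbk.decode('gbk')
```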
The Django 1.4 application that demonstrates the bug (the bug only exists under Windows, encoding = GBK)