Opened 4 years ago

Closed 4 years ago

#18909 closed Bug (invalid)

QueryString with non-ASCII characters under Windows (Python 2.7) may be decoded improperly

Reported by: public@…
Owned by: nobody
Component: Core (URLs)
Version: 1.4
Severity: Normal
Keywords:
Cc:
Triage Stage: Unreviewed
Has patch: no
Needs documentation: no
Needs tests: no
Patch needs improvement: no
Easy pickings: no
UI/UX: no


Under Python 2.7, the return type of cgi.parse_qsl depends on the type of its argument: a byte string yields byte strings, and a unicode string yields unicode strings. Take the following query string as an example:

>>> cgi.parse_qsl('q=%E4%BD%A0%E5%A5%BD')
[('q', '\xe4\xbd\xa0\xe5\xa5\xbd')]
>>> cgi.parse_qsl(u'q=%E4%BD%A0%E5%A5%BD')
[(u'q', u'\xe4\xbd\xa0\xe5\xa5\xbd')]
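
The two results above can be approximated on a current interpreter. This is only a sketch under an assumption: it uses Python 3's urllib.parse rather than the Python 2.7 cgi module the ticket is about, and it uses Latin-1 decoding to imitate the byte-to-code-point mapping that Python 2.7 applies to a unicode query string.

```python
from urllib.parse import unquote_to_bytes

# Percent-decoding yields raw bytes; no character encoding is applied yet.
raw = unquote_to_bytes("%E4%BD%A0%E5%A5%BD")

# Interpreting the bytes as UTF-8 gives the two intended characters.
as_utf8 = raw.decode("utf-8")

# Mapping each byte to the code point with the same number (Latin-1)
# imitates the unicode result shown above (an assumption of this sketch).
as_code_points = raw.decode("latin-1")

print(len(raw), len(as_utf8), len(as_code_points))  # 6 2 6
```

The byte-string result still holds the original UTF-8 data, while the unicode result has six unrelated code points in place of two characters.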

The URL-encoded string "%E4%BD%A0%E5%A5%BD" is the UTF-8 encoding of two Chinese characters. On its own this difference is harmless, but it causes a problem when combined with django.http.QueryDict:

    def __init__(self, query_string, mutable=False, encoding=None):
        super(QueryDict, self).__init__()
        if not encoding:
            encoding = settings.DEFAULT_CHARSET
        self.encoding = encoding
        #if (isinstance(query_string, unicode)):     # These two lines were added by me.
        #    query_string = query_string.encode('utf-8')
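        if six.PY3:
            # Python 3: parse_qsl decodes with the requested encoding itself.
            for key, value in parse_qsl(query_string or '',
                                        keep_blank_values=True,
                                        encoding=encoding):
                self.appendlist(key, value)
        else:
            # Python 2: decode each key and value after parsing.
            for key, value in parse_qsl(query_string or '',
                                        keep_blank_values=True):
                self.appendlist(force_text(key, encoding, errors='replace'),
                                force_text(value, encoding, errors='replace'))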
        self._mutable = mutable

Although the keys and values are meant to be converted to Unicode, the conversion never takes place when query_string is already a unicode object: parse_qsl returns unicode strings, and force_text passes them through unchanged. The value remains u'\xe4\xbd\xa0\xe5\xa5\xbd', which is certainly not a valid UTF-16 string. I can't say whether this is silently corrected somewhere else in Django. I could not reproduce the bug under Linux (UTF-8), but under Windows it is definitely present.
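
A minimal sketch of why the decoding step has no effect on an already-decoded value. The helper to_text is hypothetical and only imitates the relevant behaviour of force_text (decode bytes, return text untouched); it is written for Python 3 so that it runs on a current interpreter.

```python
def to_text(value, encoding="utf-8", errors="replace"):
    """Hypothetical stand-in for force_text: decode bytes, pass text through."""
    if isinstance(value, str):
        return value  # already text, so no decoding is attempted
    return value.decode(encoding, errors)

# The value produced when the query string was parsed as a unicode string.
mis_decoded = "\xe4\xbd\xa0\xe5\xa5\xbd"

# Text goes in and the same text comes out; the wrong code points survive.
unchanged = to_text(mis_decoded)

# Bytes, by contrast, are decoded properly.
from_bytes = to_text(b"\xe4\xbd\xa0\xe5\xa5\xbd")

# Going back to bytes first (the idea behind the two commented-out lines
# in __init__) recovers the intended characters.
repaired = mis_decoded.encode("latin-1").decode("utf-8")
```

This suggests the type of query_string, rather than the decoding call itself, decides the outcome.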

I am attaching my Django application to this ticket. It is a very small application that simply displays everything the client sent to the server. Start it with runserver and open "http://localhost:8000/?q=%E4%BD%A0%E5%A5%BD" to see whether the bug occurs.

Note: I discovered this under Windows XP SP3 (Simplified Chinese) with the system encoding set to gbk (although Django is configured for utf-8).

Attachments (1)

myapp.7z (3.4 KB) - added by public@… 4 years ago.
Django 1.4 application that demonstrates the bug (the bug only exists under Windows, encoding = GBK)

Change History (3)

Changed 4 years ago by public@…

Attachment: myapp.7z added

Django 1.4 application that demonstrates the bug (the bug only exists under Windows, encoding = GBK)

comment:1 Changed 4 years ago by Aymeric Augustin

Needs documentation: unset
Needs tests: unset
Patch needs improvement: unset

Maybe setting request.encoding could help?

comment:2 Changed 4 years ago by Luke Plant

Resolution: invalid
Status: new → closed

I think this is invalid.

If you want Django to interpret incoming data from the client in a specific encoding, you need to set DEFAULT_CHARSET to that encoding or set request.encoding, and your project does neither. Unfortunately, due to an oversight in HTTP, there is no automatic way for Django to know which encoding the client is using. In practice this usually works because you declare the encoding on your web page, and the client (browser) then uses the same encoding when submitting GET/POST data.
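
To illustrate why the server has to be told the encoding, here is a rough sketch using only Python 3's urllib.parse (not Django's own request handling). The GBK client is an assumption made for the example: the same percent-encoded bytes decode correctly only when the server assumes the encoding the client actually used.

```python
from urllib.parse import parse_qsl, quote

text = "\u4f60\u597d"  # the two characters from the ticket

# A client whose page encoding is GBK percent-encodes the GBK bytes.
query_from_gbk_client = "q=" + quote(text, encoding="gbk")

# Server assumes GBK: the original characters come back.
right = parse_qsl(query_from_gbk_client, encoding="gbk")

# Server assumes UTF-8: the bytes are invalid and are replaced with U+FFFD.
wrong = parse_qsl(query_from_gbk_client, encoding="utf-8", errors="replace")

print(right[0][1] == text, wrong[0][1] == text)  # True False
```

Nothing in the query string itself says which of the two interpretations is right, which is why the choice has to come from configuration.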

I think there is also confusion here about the difference between 'Unicode' and 'UTF-16'. UTF-16 is not Unicode itself; it is the encoding that most Windows APIs use to pass Unicode text around, but Python does not work that way. You seem to be expecting something to be translated to UTF-16 automatically, which isn't going to happen. There is also some confusion about Python byte strings and Python unicode strings.
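
The distinction can be sketched in a few lines, assuming Python 3, where str is the unicode string type and bytes is the byte string type. A unicode string is a sequence of code points; UTF-8 and UTF-16 are two different byte encodings of that same sequence.

```python
text = "\u4f60\u597d"  # two code points, no encoding involved yet

# Encoding turns code points into bytes; each encoding gives different bytes.
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")

print(utf8_bytes.hex())   # e4bda0e5a5bd
print(utf16_bytes.hex())  # 604f7d59

# Decoding with the matching encoding recovers the same text either way.
same_text = utf8_bytes.decode("utf-8") == utf16_bytes.decode("utf-16-le")
```

The unicode string never becomes "UTF-16" on its own; bytes in a particular encoding appear only when encode is called explicitly.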
