
#18909 closed Bug (invalid)

QueryString with non-ASCII characters under Windows (Python 2.7) may be decoded improperly

Reported by: public@…
Owned by: nobody
Component: Core (URLs)
Version: 1.4
Severity: Normal
Keywords:
Cc:
Triage Stage: Unreviewed
Has patch: no
Needs documentation: no
Needs tests: no
Patch needs improvement: no
Easy pickings: no
UI/UX: no

Description

Under Python 2.7, the return type of cgi.parse_qsl varies with the type of its input argument. Take the following query string as an example:

>>> cgi.parse_qsl('q=%E4%BD%A0%E5%A5%BD')
[('q', '\xe4\xbd\xa0\xe5\xa5\xbd')]
>>> cgi.parse_qsl(u'q=%E4%BD%A0%E5%A5%BD')
[(u'q', u'\xe4\xbd\xa0\xe5\xa5\xbd')]

The URL-encoded string "%E4%BD%A0%E5%A5%BD" carries two Chinese characters encoded as UTF-8. This causes no trouble on its own, but when the result is used together with django.http.QueryDict, something bad happens.

    def __init__(self, query_string, mutable=False, encoding=None):
        super(QueryDict, self).__init__()
        if not encoding:
            encoding = settings.DEFAULT_CHARSET
        self.encoding = encoding
        #if (isinstance(query_string, unicode)):     # These two lines were added by me.
        #    query_string = query_string.encode('utf-8')
        if six.PY3:
            for key, value in parse_qsl(query_string or '',
                                        keep_blank_values=True,
                                        encoding=encoding):
                self.appendlist(key, value)
        else:
            for key, value in parse_qsl(query_string or '',
                                        keep_blank_values=True):
                self.appendlist(force_text(key, encoding, errors='replace'),
                                force_text(value, encoding, errors='replace'))
        self._mutable = mutable

Although the keys and values are meant to be converted to Unicode, the conversion never takes effect: the value is already a unicode object, so it remains u'\xe4\xbd\xa0\xe5\xa5\xbd', which is certainly not a valid UTF-16 string. I can't say whether this is silently fixed later somewhere else in Django. What's more, I could not reproduce the bug under Linux (UTF-8), but under Windows it really exists.
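
The value quoted above is what you get when each UTF-8 byte is mapped to one code point, as a Latin-1 decode does. Below is a minimal sketch of that effect in Python 3, where urllib.parse.parse_qsl replaces cgi.parse_qsl. Using encoding="latin-1" is an assumption made here to imitate the Python 2.7 result for unicode input; it is not what Django or Python 2.7 literally executes.

```python
from urllib.parse import parse_qsl

query = "q=%E4%BD%A0%E5%A5%BD"

# Decoding the percent-escaped bytes as Latin-1 maps each byte to one
# code point, giving the same value as u'\xe4\xbd\xa0\xe5\xa5\xbd'.
mojibake = parse_qsl(query, encoding="latin-1")[0][1]

# Decoding the same bytes as UTF-8 yields the two intended characters.
decoded = parse_qsl(query, encoding="utf-8")[0][1]

# The garbled value can be repaired by undoing the wrong decoding step.
repaired = mojibake.encode("latin-1").decode("utf-8")

print(repr(mojibake), decoded, repaired == decoded)
```

The repair step works only because Latin-1 maps every byte value to a code point, so no information is lost in the wrong decode.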

I am uploading my Django application with this ticket. It is a very small application that just displays everything the client sent to the server. You can start it with manage.py runserver and open "http://localhost:8000/?q=%E4%BD%A0%E5%A5%BD" to see whether the bug occurs.

Note: I discovered this under Windows XP SP3 (Simplified Chinese) with system encoding = gbk (though Django is set to utf-8).

Attachments (1)

myapp.7z (3.4 KB) - added by public@… 22 months ago.
The bug Django 1.4 application (Bug only exists under Windows, Encoding = GBK)


Change History (3)

Changed 22 months ago by public@…

The bug Django 1.4 application (Bug only exists under Windows, Encoding = GBK)

comment:1 Changed 22 months ago by aaugustin

  • Needs documentation unset
  • Needs tests unset
  • Patch needs improvement unset

Maybe setting request.encoding could help?

comment:2 Changed 22 months ago by lukeplant

  • Resolution set to invalid
  • Status changed from new to closed

I think this is invalid.

If you want Django to interpret incoming data from the client as a specific encoding, you need to set DEFAULT_CHARSET to that encoding, or set request.encoding, and your project does neither. There is (unfortunately, due to an oversight in HTTP) no automatic way for Django to know which encoding the client is using. Normally this works correctly most of the time because you declare the encoding on your web page, and the client (browser) then uses the same encoding when submitting GET/POST data.
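
To illustrate why the server has to be told the encoding, here is a small standard-library sketch (Python 3, urllib.parse; no Django involved, and the variable names are made up for the example). The same two characters produce different percent-encoded bytes under GBK and UTF-8, and nothing in the query string itself says which one was used.

```python
from urllib.parse import parse_qsl, quote

text = "\u4f60\u597d"  # the two characters from the report

# A client using GBK and one using UTF-8 send different bytes for the same text.
gbk_query = "q=" + quote(text, encoding="gbk")
utf8_query = "q=" + quote(text, encoding="utf-8")

# The text is recovered only when decoding with the encoding the client used.
from_gbk = parse_qsl(gbk_query, encoding="gbk")[0][1]
from_utf8 = parse_qsl(utf8_query, encoding="utf-8")[0][1]

# Decoding GBK bytes as UTF-8 does not raise with errors="replace",
# but the original text is lost.
wrong = parse_qsl(gbk_query, encoding="utf-8", errors="replace")[0][1]

print(gbk_query, utf8_query, from_gbk == from_utf8, wrong == text)
```

In a Django view, the corresponding knob is the one named above: DEFAULT_CHARSET or request.encoding.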

I think there is also confusion here about the difference between 'Unicode' and 'UTF-16' (which is not unicode - it is the normal way that most Windows APIs pass unicode around, but not in Python). You seem to be expecting something to get automatically translated to UTF-16, which isn't going to happen. There is also some confusion about Python byte strings and Python unicode strings.
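
As a rough illustration of that distinction (a Python 3 sketch, where str plays the role of Python 2's unicode type): a Unicode string is a sequence of code points, while UTF-8 and UTF-16 are two different byte encodings of those code points.

```python
text = "\u4f60\u597d"  # two Unicode code points

# Two different byte serializations of the same code points.
utf8_bytes = text.encode("utf-8")       # 3 bytes per character for this text
utf16_bytes = text.encode("utf-16-le")  # 2 bytes per character for this text

# Both decode back to the same Unicode string; neither byte form "is" Unicode.
same = utf8_bytes.decode("utf-8") == utf16_bytes.decode("utf-16-le") == text

print(len(text), len(utf8_bytes), len(utf16_bytes), same)
```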
