
#19468 closed Bug (fixed)

django doesn't encode request.path correctly in python3

Reported by: aliva Owned by: aaugustin
Component: Python 3 Version: master
Severity: Release blocker Keywords:
Cc: aliva Triage Stage: Accepted
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: yes
Easy pickings: no UI/UX: no

Description

When I visit URLs that contain non-ASCII characters (such as Persian or Arabic),
request.path is not right.

This is my urls.py:

from django.conf.urls import patterns, include, url
from django.http.response import HttpResponse

def view(request):
    print(request.path)
    return HttpResponse(request.path)

urlpatterns = patterns('',
    url(r'^', view),
)

For example, I visit this URL:

http://127.0.0.1:8000/سلام

In the view function, request.path is:

/سلاÙ

but it should be:

/سلام

This problem also happens when I handle URLs with regex groups in urls.py (the characters captured by the groups are wrong too).

I don't have any of these problems on Python 2; everything works fine there.

  • System:
    • debian - sid
    • Python 3.2.3
    • django - master

Attachments (2)

19468-1.diff (2.9 KB) - added by claudep 17 months ago.
19468-2.diff (3.9 KB) - added by claudep 16 months ago.


Change History (22)

comment:1 Changed 17 months ago by claudep

  • Needs documentation unset
  • Needs tests unset
  • Patch needs improvement unset

Unfortunately, this seems to be a Python problem. Someone thought that URLs were always encoded in iso-8859-1, which is probably wrong.
Here's the problem:
http://hg.python.org/cpython/file/cbdd6852a274/Lib/wsgiref/simple_server.py#l85

The issue which led to the commit:
http://bugs.python.org/issue10155

With request.path.encode('iso-8859-1').decode('utf-8'), you recover the original URL path. More research needs to be done on this subject...
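
A minimal sketch of that round trip, using only the standard library. It simulates the server side by decoding the UTF-8 path bytes as ISO-8859-1; the variable names are illustrative, not Django's.

```python
# Simulate what the WSGI server does on Python 3: the raw path bytes
# (UTF-8 on the wire, after URL-decoding) are decoded as ISO-8859-1.
original = "/سلام"
raw_bytes = original.encode("utf-8")
path_info = raw_bytes.decode("iso-8859-1")  # stands in for environ["PATH_INFO"]

# Reversing the ISO-8859-1 step recovers the bytes unchanged;
# decoding those bytes as UTF-8 recovers the original text.
recovered = path_info.encode("iso-8859-1").decode("utf-8")

print(repr(path_info))
print(recovered)
```

The intermediate string has one character per byte, which is why the path looks garbled in the view.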

comment:2 Changed 17 months ago by claudep

  • Triage Stage changed from Unreviewed to Accepted

I've filed a Python bug report: http://bugs.python.org/issue16679

Changed 17 months ago by claudep

comment:3 Changed 17 months ago by claudep

  • Has patch set
  • Severity changed from Normal to Release blocker

Whatever the outcome of the Python bug report, we'll have to cope with this issue with released versions of Python 3. Attached is a possible fix.

Salam aleikhoum :-)

comment:4 Changed 17 months ago by aaugustin

To the best of my understanding, Graham's answer on Python's tracker is correct, and there's no bug in Python.

PEP 3333 says that environ must contain native strings (str objects). When native strings are actually implemented with a unicode-aware type, only code points representable in ISO-8859-1 encoding may be used.

One might disagree with the idea of using native strings for storing data that's really bytes, but it also has advantages and it's the status quo. The point of PEP 3333 is to provide a stable API; it seems extremely unlikely to me that it'll change for years.


Per RFC 3986 2.5:

When a new URI scheme defines a component that represents textual data consisting of characters from the Universal Character Set [UCS], the data should first be encoded as octets according to the UTF-8 character encoding [STD63]

but HTTP is an "old" URI scheme per RFC 3987 6.4:

the HTTP URL scheme does not specify how to encode original characters.

and just below there's an example of a non UTF-8 HTTP URL.

Yes, modern browsers will nicely display utf-8 URLs, but that's just cosmetic. You can write a perfectly correct and RFC-compliant HTTP service that uses another charset in its URLs.

WSGI uses iso-8859-1 because every bytestring can be decoded with this charset. If it assumed utf-8, it would fail to decode some perfectly valid HTTP requests. WSGI wants to be universal and can't make 99%-correct assumptions.
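
The property relied on here can be checked directly. This is only an illustration with the standard library: every possible byte value decodes under ISO-8859-1 and encodes back unchanged, whereas UTF-8 rejects some byte sequences.

```python
# ISO-8859-1 maps each byte 0x00-0xFF to the code point with the same
# number, so decoding never fails and encoding back is lossless.
all_bytes = bytes(range(256))
text = all_bytes.decode("iso-8859-1")
round_trip = text.encode("iso-8859-1")

# UTF-8 is stricter: a lone 0xE9 (ISO-8859-1 "é") is not valid UTF-8.
try:
    b"/caf\xe9/".decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(round_trip == all_bytes, utf8_ok)
```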


So, this has three practical consequences for us:

  • every HTTP request can be unambiguously represented in WSGI, and the WSGI layer need not be aware of the encoding of the URL (and of the rest of the HTTP request);
  • Django can recover the original bytestring of any environ value, including environ['PATH_INFO'] with .encode('iso-8859-1');
  • Django must re-decode data fetched from environ with the appropriate charset.

The next steps are:

  • audit where Django is reading data from environ;
  • determine which charset should be used for decoding.

I find it reasonable to assume that URLs will use the same charset as HTTP responses; that means using .encode('iso-8859-1').decode(settings.DEFAULT_CHARSET).
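
As a rough sketch of that proposal, assuming a hypothetical helper named get_path whose charset argument stands in for settings.DEFAULT_CHARSET (neither the helper nor its signature comes from Django):

```python
def get_path(environ, charset="utf-8"):
    """Hypothetical helper: re-decode PATH_INFO with the given charset."""
    # Step 1: undo the ISO-8859-1 decoding done by the WSGI server.
    raw = environ.get("PATH_INFO", "/").encode("iso-8859-1")
    # Step 2: decode the original bytes with the charset the site uses.
    return raw.decode(charset)


# Fake environ dicts built the way a Python 3 WSGI server would build them.
utf8_environ = {"PATH_INFO": "/café/".encode("utf-8").decode("iso-8859-1")}
sjis_environ = {"PATH_INFO": "/日本/".encode("shift_jis").decode("iso-8859-1")}

print(get_path(utf8_environ))
print(get_path(sjis_environ, charset="shift_jis"))
```

With a fixed charset the result is deterministic: the same bytes always decode the same way, or fail loudly.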

comment:5 Changed 17 months ago by aliva

  • Cc aliva added

comment:6 Changed 16 months ago by aaugustin

We should fix #11111 while we're there.

comment:7 follow-up: Changed 16 months ago by claudep

I admit that due to an unfortunate missing standard in the past, URLs encoded with non-utf-8 encodings are possible and correct RFC-wise. However, all modern browsers do encode the URLs with UTF-8, and that has nothing to do with "nicely displaying" them. They really send utf-8-encoded paths on the wire.

If wsgiref/PEP 3333 chooses to continue to "wrongly" (but safely) decode 98% of URLs, it might be a design choice and it remains to be seen if it is a problem or not. Backwards compatibility is also an issue here.

As far as Django is concerned, I do agree with your next steps. But I'm -1 to using DEFAULT_CHARSET for decoding URLs. Django has absolutely no influence on the encoding of URL paths, that's the user agent's business. So even when you decide you want to serve non UTF-8 responses by setting DEFAULT_CHARSET, you still have no influence on the encoding of the paths you are receiving from clients (also taking into account hand-written URLs in browser address bars). In my opinion, these are orthogonal issues.

Unfortunately, I don't have the public-facing infrastructure to run a Python 3 Django test instance, but it would be nice to have such a test project to see how it goes, and what various user agents are sending to the server with non-ASCII URLs.

comment:8 in reply to: ↑ 7 Changed 16 months ago by aaugustin

Replying to claudep:

I admit that due to an unfortunate missing standard in the past, URLs encoded with non-utf-8 encodings are possible and correct RFC-wise. However, all modern browsers do encode the URLs with UTF-8, and that has nothing to do with "nicely displaying" them.


This sentence is simplifying things a bit, because it ignores URL-encoding.

Here's what browsers really do.

1) When you type non-ASCII characters in a URL bar, say http://example.com/café/, the browser will utf-8-encode and then url-encode it, resulting in http://example.com/caf%C3%A9/.

To try it by yourself, run nc -l 8000 in a console, and go to http://localhost:8000/café/ with a browser. In the console you'll see:

GET /caf%C3%A9/ HTTP/1.1
Host: localhost:8000
...

2) When a URL contains non-ASCII characters (which is illegal: URLs must be ASCII-only), browsers cope with the situation as above.

I tested this in Firefox, Chrome and Safari by creating a file with the following content, and saving it with the iso-8859-1 encoding:

<html>
<head><meta charset="iso-8859-1"><title>Test iso-8859-1 link</title></head>
<body><a href="http://localhost:8000/café/">Café!</a></body>
</html>

Clicking the link gives the same result as above in the console (which is a bit surprising — it would make sense to keep the original charset here).

3) When an URL is properly URL-encoded, browsers transmit it as is. The server can then URL-decode it and interpret it according to whatever charset it wants.

I did the same test, but with a URL-encoded URL:

<html>
<head><meta charset="iso-8859-1"><title>Test iso-8859-1 link</title></head>
<body><a href="http://localhost:8000/caf%e9/">Café!</a></body>
</html>

When clicking the link, the browser sends the original URL (iso-8859-1 encoded, URL-encoded):

GET /caf%e9/ HTTP/1.1
Host: localhost:8000
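
The three cases above can be reproduced on the server side with urllib.parse. This is only an illustration of the encoding steps, not of what any particular browser or Django does internally.

```python
from urllib.parse import quote, unquote, unquote_to_bytes

# Cases 1 and 2: a non-ASCII path is utf-8-encoded, then URL-encoded.
typed = quote("/café/", encoding="utf-8")

# Case 3: an already URL-encoded path is sent as is. URL-decoding it
# yields raw bytes; the charset is the server's choice.
legacy = "/caf%e9/"
raw = unquote_to_bytes(legacy)
as_latin1 = unquote(legacy, encoding="iso-8859-1")
# unquote() defaults to errors="replace", so a wrong charset guess
# produces U+FFFD instead of raising.
as_utf8 = unquote(legacy, encoding="utf-8")

print(typed, raw, as_latin1, repr(as_utf8))
```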

They really send utf-8-encoded paths on the wire.


No, as demonstrated above.

Browsers are notoriously robust to ill-formed inputs. A non-ASCII (ie. non URL-encoded) URL is an invalid input.

Rather than reject it, browsers choose to encode it using utf-8, URL-encode the result, and use that. It's a good choice for error handling; being 99% correct is good enough when you're dealing with invalid content in the first place.

But if a developer wants to write a Django server with Shift-JIS URLs (it may be more compact than utf-8 for Asian languages), he's allowed to. If someone wants to replace a legacy system with ISO-8859-1 URLs with a Django version, she can. The URLs may not display nicely in browsers, but as long as they're properly URL-encoded, they'll work.

Besides, browsers aren't the only consumers of HTTP content on the Internet.


If wsgiref/PEP 3333 chooses to continue to "wrongly" (but safely) decode 98% of URLs, it might be a design choice and it remains to be seen if it is a problem or not. Backwards compatibility is also an issue here.


I still think it's right (given the decision to use native strings in environ). If PEP 3333 decided to URL-decode and utf-8-decode URLs, it would prevent people from using any charset other than utf-8 in their URLs. I've given use cases for non-utf-8 URLs above.


As far as Django is concerned, I do agree with your next steps. But I'm -1 to using DEFAULT_CHARSET for decoding URLs. Django has absolutely no influence on the encoding of URL paths, that's the user agent's business. So even when you decide you want to serve non UTF-8 responses by setting DEFAULT_CHARSET, you still have no influence on the encoding of the paths you are receiving from clients


I disagree. You have total control over the encoding of the paths you are receiving, as long as you <charset>-encode and URL-encode your URLs, like you should. The UA must not perform any decoding or encoding on properly URL-encoded URLs.

(also taking into account hand-written URLs in browser address bars). In my opinion, these are orthogonal issues.


Yes, hand written URLs are the only case where Django doesn't have control.

Obviously, most regular websites will just use utf-8 everywhere, and that guarantees the best compatibility with the Web ecosystem.

My point is to make it possible to use something else if one wants to and is aware of the consequences. That's why Django has a DEFAULT_CHARSET setting.


If you think that Django should give up all pretense to support non-utf-8 environments, that's another discussion!

comment:9 Changed 16 months ago by claudep

First of all, thanks for the detailed explanations above. They are really useful.

To sum up, I think we agree that Django will receive 98% of non-ascii URLs utf-8-encoded, URL-encoded. For those, environ['PATH_INFO'] will be wrongly decoded. That's why we have to re-encode, re-decode them to fix this (as proposed in my patch). The trick is to choose how to decode an input where we cannot be 100% sure about the encoding used. I'd suggest using the same idea I proposed in the Python bug:

a) try to decode with UTF-8
b) if it fails (UnicodeDecodeError), fall back to DEFAULT_CHARSET (if different from utf-8) or iso-8859-1.

It is based on the fact that a non-utf-8-encoded string will very probably fail when decoded with utf-8. Surely, there might be a risk (to be demonstrated) that a non-utf-8-encoded string might not fail when decoded with utf-8. I estimate the risk at 2% of 2% of the non-ascii URLs. I'm not sure we can avoid that. I'd be open to use DEFAULT_CHARSET in a) only if DEFAULT_CHARSET is used to encode URLs in Django (AFAIK it's not the case). I also think that favoring UTF-8 is in agreement with http://hg.python.org/cpython/rev/428d384ed626/
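
A sketch of steps a) and b), with a hypothetical function name and a plain fallback argument in place of DEFAULT_CHARSET. The last call shows the kind of input behind the risk mentioned above: ISO-8859-1 bytes that also happen to be valid UTF-8.

```python
def decode_path(raw, fallback="iso-8859-1"):
    """Hypothetical helper: try UTF-8 first, then the fallback charset."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(fallback)


from_utf8 = decode_path("/café/".encode("utf-8"))
from_latin1 = decode_path("/café/".encode("iso-8859-1"))

# Ambiguous input: "Ã©" in ISO-8859-1 is b"\xc3\xa9", which is also the
# UTF-8 encoding of "é", so the UTF-8 attempt succeeds with another text.
ambiguous = decode_path("/Ã©/".encode("iso-8859-1"))

print(from_utf8, from_latin1, ambiguous)
```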

Changed 16 months ago by claudep

comment:10 Changed 16 months ago by claudep

Patch updated as of comment:9

comment:11 Changed 16 months ago by aaugustin

May I suggest a different look at these percentages?

Django is serving two populations of programmers here:

  • the 99.x% who either use utf-8 everywhere or don't even know what an encoding is. The debate is moot for them because they have DEFAULT_CHARSET = 'utf-8'. And anyone who expects non ASCII URLs typed in a browser bar to work falls in this category.
  • the 0.y% who want non-utf-8 URLs, for whatever reason. These are people who have special needs and who can be assumed to know what they're doing. They're probably not writing consumer websites. Their software may never be accessed by browsers.
    • Forcing utf-8 decoding states upfront that Django doesn't support this use case: it'll fail in 98% of the cases and return wrong results in the other 2%.
    • Trying utf-8 decoding, and falling back to DEFAULT_CHARSET, will work in 98% of the cases and return the wrong result in only 2% of the cases. This may be missed in testing and can't be relied upon in production. It's a trap and it's useless.

If you're bent on only supporting utf-8 URLs, please be upfront about it and don't make it a trap.

I'm getting weary of this debate; if I haven't convinced you, do what you want. For the record, I don't condone non-deterministic decoding, ie. "here's your decoded path — well, maybe, because if it happened to be valid utf-8 Django decoded it with utf-8 instead".

comment:12 Changed 16 months ago by aaugustin

I forgot to mention that utf-8 *is* the default, which is consistent with the Python changeset you're linking to.

comment:13 Changed 16 months ago by aaugustin

I took a stab at writing a patch implementing the solution I described above.

While I was working on it, I noticed a regression in the test client: #19487. Fixing it is a prerequisite for testing this ticket decently.

I've created a pull request fixing both issues: https://github.com/django/django/pull/596

comment:14 follow-up: Changed 16 months ago by aaugustin

I'm following up on Claude's comment on #19487 here, because it's more related to this ticket.

Django will still encode the URL's in UTF-8, so the decoding will probably fail.

That's a good point that I had missed until now. The reverse function (and, as a consequence, the {% url %} tag) uses django.utils.encoding.iri_to_uri, which is hardcoded to use UTF-8: it calls force_bytes without specifying a different encoding.

settings.DEFAULT_CHARSET is really about response encoding, not much else (see #4380).

I beg to differ. settings.DEFAULT_CHARSET is both about request and response encoding. It is used to decode GET and POST data in requests.


To sum up:

  • DEFAULT_CHARSET applies to the request and response bodies
  • it isn't clear whether it's intended to apply to anything else
  • if we use it to decode URLs, we must fix iri_to_uri accordingly
  • if we default to utf-8 to decode URLs, we're making it impossible to use reliably any other charset
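
To illustrate why the encoding and decoding sides have to agree, here is a standard-library sketch. It uses urllib.parse.quote as a stand-in for URL generation with a hardcoded UTF-8 encoding; it does not call Django's iri_to_uri.

```python
from urllib.parse import quote, unquote_to_bytes

path = "/café/"

# Stand-in for URL generation that always encodes with UTF-8.
generated = quote(path, encoding="utf-8")

# The request comes back with the same bytes after URL-decoding.
raw = unquote_to_bytes(generated)

# Decoding with the charset used for encoding round-trips cleanly;
# decoding with a different charset silently yields the wrong text.
decoded_utf8 = raw.decode("utf-8")
decoded_latin1 = raw.decode("iso-8859-1")

print(generated, decoded_utf8, decoded_latin1)
```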
Last edited 16 months ago by aaugustin (previous) (diff)

comment:15 Changed 16 months ago by aaugustin

  • Patch needs improvement set

comment:16 in reply to: ↑ 14 Changed 16 months ago by claudep

Replying to aaugustin:

settings.DEFAULT_CHARSET is really about response encoding, not much else (see #4380).

I beg to differ. settings.DEFAULT_CHARSET is both about request and response encoding. It is used to decode GET and POST data in requests.

OK, then it might have evolved since #4380. I didn't make an extensive audit of its current use.

To sum up:

  • DEFAULT_CHARSET applies to the request and response bodies
  • it isn't clear whether it's intended to apply to anything else
  • if we use it to decode URLs, we must fix iri_to_uri accordingly

Yes, good point.

  • if we default to utf-8 to decode URLs, we're making it impossible to use reliably any other charset

utf-8 rulez!

comment:17 Changed 16 months ago by aaugustin

I think we've fully described the context :)

I'm going to shoot an email to -developers and try to get other people involved in the discussion.

comment:18 Changed 16 months ago by aaugustin

  • Owner changed from nobody to aaugustin

comment:19 Changed 16 months ago by aaugustin

I looked into encoding URLs with DEFAULT_CHARSET, but I ran into a semantic problem.

The unicode branch introduced iri_to_uri and used it to convert arbitrary pieces of text into something suitable for inclusion in a URL. This function implements section 3.1 of RFC 3987, which mandates utf-8. Its name implies that Django generates URIs, which are encoded in utf-8 by definition.

Simply using DEFAULT_CHARSET instead of utf-8 will make its name misleading. Changing the name is backwards incompatible because it's documented.


I don't have any interest in this besides making Django's behavior as correct and unsurprising as possible, and this is proving more complex than what I'm willing to deal with.

Since the discussion on django-developers didn't attract any interest, let's just hardcode utf-8 and accept the two drawbacks when DEFAULT_CHARSET != 'utf-8':

  • URLs generated by Django contain a mix of utf-8 (path) and non-utf-8 (query string) — but we can pretend that the query-string is opaque application data :)
  • it isn't possible to serve arbitrary URLs with Django — but I have a better proposal for this in #19508.

comment:20 Changed 16 months ago by Aymeric Augustin <aymeric.augustin@…>

  • Resolution set to fixed
  • Status changed from new to closed

In 1e4a27d08790c96f657d2e960c8142d1ca69aede:

Fixed #19468 -- Decoded request.path correctly on Python 3.

Thanks aliva for the report and claudep for the feedback.
