Opened 12 years ago
Closed 12 years ago
#19468 closed Bug (fixed)
django doesn't encode request.path correctly in python3
Reported by: | aliva | Owned by: | Aymeric Augustin |
---|---|---|---|
Component: | Python 3 | Version: | dev |
Severity: | Release blocker | Keywords: | |
Cc: | aliva | Triage Stage: | Accepted |
Has patch: | yes | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | yes |
Easy pickings: | no | UI/UX: | no |
Description
When I visit URLs that contain non-ASCII characters (like Persian or Arabic),
request.path is not right.
This is my urls.py:
from django.conf.urls import patterns, include, url
from django.http.response import HttpResponse

def view(request):
    print (request.path)
    return HttpResponse(request.path)

urlpatterns = patterns('',
    url(r'^', view),
)
for example I visit this url
http://127.0.0.1:8000/سلام
in the view function request.path is:
/سÙاÙ
but it should be:
/سلام
Also, this problem happens when I want to handle URLs with regex groups in urls.py (the character groups are wrong too).
I don't have any of those problems on python2 - every thing works ok there!
- System:
- debian - sid
- Python 3.2.3
- django - master
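The garbled path in the description is consistent with the UTF-8 bytes of the path being decoded as ISO-8859-1. A minimal sketch in plain Python 3, with no Django involved and illustrative variable names:

```python
# Illustration only: reproduce the mis-decoding outside Django.
path = "/سلام"

# The bytes that arrive once the request path has been URL-decoded,
# assuming the client used UTF-8.
raw = path.encode("utf-8")

# The same bytes read as ISO-8859-1: one character per byte.
mojibake = raw.decode("iso-8859-1")

print(len(path), len(raw), len(mojibake))
print(mojibake == path)
```

Each of the four non-ASCII letters takes two bytes in UTF-8, so the mis-decoded string is longer than the original and no longer matches URL patterns written against the real text.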
Attachments (2)
Change History (22)
comment:1 by , 12 years ago
comment:2 by , 12 years ago
Triage Stage: | Unreviewed → Accepted |
---|
I've filed a Python bug report: http://bugs.python.org/issue16679
by , 12 years ago
Attachment: | 19468-1.diff added |
---|
comment:3 by , 12 years ago
Has patch: | set |
---|---|
Severity: | Normal → Release blocker |
Whatever the outcome of the Python bug report, we'll have to cope with this issue with released versions of Python 3. Attached is a possible fix.
Salam aleikhoum :-)
comment:4 by , 12 years ago
To the best of my understanding, Graham's answer on Python's tracker is correct, and there's no bug in Python.
PEP 3333 says that environ must contain native strings (str objects). When native strings are actually implemented with a unicode-aware type, only code points representable in ISO-8859-1 encoding may be used.
One might disagree with the idea of using native strings for storing data that's really bytes, but it also has advantages and it's the status quo. The point of PEP 3333 is to provide a stable API; it seems extremely unlikely to me that it'll change for years.
Per RFC 3986 2.5:
When a new URI scheme defines a component that represents textual data consisting of characters from the Universal Character Set [UCS], the data should first be encoded as octets according to the UTF-8 character encoding [STD63]
but HTTP is an "old" URI scheme per RFC 3987 6.4:
the HTTP URL scheme does not specify how to encode original characters.
and just below there's an example of a non UTF-8 HTTP URL.
Yes, modern browsers will nicely display utf-8 URLs, but that's just cosmetic. You can write a perfectly correct and RFC-compliant HTTP service that uses another charset in its URLs.
WSGI uses iso-8859-1 because every bytestring can be decoded with this charset. If it assumed utf-8, it would fail to decode some perfectly valid HTTP requests. WSGI wants to be universal and can't make 99%-correct assumptions.
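To illustrate the point about iso-8859-1 being able to decode any bytestring, here is a small sketch in plain Python 3. The sample path is an assumption chosen for the example:

```python
# Every byte value maps to exactly one code point in ISO-8859-1, so decoding
# never fails and re-encoding gives the original bytes back.
all_bytes = bytes(range(256))
as_text = all_bytes.decode("iso-8859-1")
round_trip = as_text.encode("iso-8859-1")

# UTF-8 has no such guarantee: the ISO-8859-1 encoding of "/café/"
# is not a valid UTF-8 sequence.
legacy_path = "/café/".encode("iso-8859-1")
try:
    legacy_path.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(round_trip == all_bytes, utf8_ok)
```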
So, this has three practical consequences for us:
- every HTTP request can be unambiguously represented in WSGI, and the WSGI layer need not be aware of the encoding of the URL (and of the rest of the HTTP request);
- Django can recover the original bytestring of any environ value, including environ['PATH_INFO'], with .encode('iso-8859-1');
- Django must re-decode data fetched from environ with the appropriate charset.
The next steps are:
- audit where Django is reading data from environ;
- determine which charset should be used for decoding.
I find it reasonable to assume that URLs will use the same charset as HTTP responses; that means using .encode('iso-8859-1').decode(settings.DEFAULT_CHARSET).
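The encode-then-decode step can be sketched as a small helper. This is not Django code: the function name `get_path_info` and its `charset` parameter are hypothetical, and the charset is passed explicitly here instead of being read from settings.

```python
def get_path_info(environ, charset="utf-8"):
    """Hypothetical helper: re-decode PATH_INFO with the chosen charset."""
    value = environ.get("PATH_INFO", "/")
    # Step 1: undo the ISO-8859-1 decoding done by the WSGI server.
    raw = value.encode("iso-8859-1")
    # Step 2: decode the original bytes with the charset the application expects.
    return raw.decode(charset)


# Simulated environ, as a PEP 3333 server would build it on Python 3
# for a request path of /café/ sent as UTF-8.
environ = {"PATH_INFO": "/café/".encode("utf-8").decode("iso-8859-1")}
print(get_path_info(environ))
```

A strict decode raises UnicodeDecodeError when the bytes don't match the charset; how to handle that case is a separate decision.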
comment:5 by , 12 years ago
Cc: | added |
---|
follow-up: 8 comment:7 by , 12 years ago
I admit that due to an unfortunate missing standard in the past, URLs encoded with non-utf-8 encodings are possible and correct RFC-wise. However, all modern browsers do encode the URLs with UTF-8, and that has nothing to do with "nicely displaying" them. They really send utf-8-encoded paths on the wire.
If wsgiref/PEP 3333 chooses to continue to "wrongly" (but safely) decoding 98% of URLs, it might be a design choice and it remains to be seen if it is a problem or not. Backwards compatibility is also an issue here.
As far as Django is concerned, I do agree with your next steps. But I'm -1 to using DEFAULT_CHARSET for decoding URLs. Django has absolutely no influence on the encoding of URL paths, that's the user agent's business. So even when you decide you want to serve non UTF-8 responses by setting DEFAULT_CHARSET, you still have no influence on the encoding of the paths you are receiving from clients (also taking into account hand-written URLs in browser address bars). In my opinion, these are orthogonal issues.
Unfortunately, I don't have the public-facing infrastructure to run a Python 3 Django test instance, but it would be nice to have such a test project to see how it goes, and what various user agents send to the server for non-ASCII URLs.
comment:8 by , 12 years ago
Replying to claudep:
I admit that due to an unfortunate missing standard in the past, URLs encoded with non-utf-8 encodings are possible and correct RFC-wise. However, all modern browsers do encode the URLs with UTF-8, and that has nothing to do with "nicely displaying" them.
This sentence is simplifying things a bit, because it ignores URL-encoding.
Here's what browsers really do.
1) When you type non-ASCII characters in a URL bar, say http://example.com/café/, the browser will utf-8-encode and then url-encode it, resulting in http://example.com/caf%C3%A9/.
To try it by yourself, run nc -l 8000 in a console, and go to http://localhost:8000/café/ with a browser. In the console you'll see:
GET /caf%C3%A9/ HTTP/1.1
Host: localhost:8000
...
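The same two steps (utf-8-encode, then url-encode) can be reproduced with the standard library. This is only a sketch of the transformation, not of what any particular browser runs internally:

```python
from urllib.parse import quote

typed = "/café/"

# quote() encodes non-ASCII characters as UTF-8 by default, then
# percent-encodes each resulting byte; "/" is left alone because it is
# in the default `safe` set.
on_the_wire = quote(typed)

print(on_the_wire)  # /caf%C3%A9/
```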
2) When an URL contains non-ASCII characters (which is illegal — URLs must be ASCII-only), browsers cope with the situation as above.
I tested this in Firefox, Chrome and Safari by creating a file with the following content, and saving it with the iso-8859-1 encoding:
<html>
<head><meta charset="iso-8859-1"><title>Test iso-8859-1 link</title></head>
<body><a href="http://localhost:8000/café/">Café!</a></body>
</html>
Clicking the link gives the same result as above in the console (which is a bit surprising — it would make sense to keep the original charset here).
3) When an URL is properly URL-encoded, browsers transmit it as is. The server can then URL-decode it and interpret it according to whatever charset it wants.
I did the same test, but with a URL-encoded URL:
<html>
<head><meta charset="iso-8859-1"><title>Test iso-8859-1 link</title></head>
<body><a href="http://localhost:8000/caf%e9/">Café!</a></body>
</html>
When clicking the link, the browser sends the original URL (iso-8859-1 encoded, URL-encoded):
GET /caf%e9/ HTTP/1.1
Host: localhost:8000
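On the receiving side, the server is free to pick the charset when it URL-decodes such a path. A sketch with the standard library, where `request_path` stands in for the path taken from the request line:

```python
from urllib.parse import unquote

request_path = "/caf%e9/"

# Interpreting the percent-decoded bytes as ISO-8859-1 recovers the text.
as_latin1 = unquote(request_path, encoding="iso-8859-1")

# Interpreting them as UTF-8 does not: unquote() defaults to
# errors="replace", so the invalid byte becomes U+FFFD.
as_utf8 = unquote(request_path, encoding="utf-8")

print(as_latin1)
print(repr(as_utf8))
```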
They really send utf-8-encoded paths on the wire.
No, as demonstrated above.
Browsers are notoriously robust to ill-formed inputs. A non-ASCII (ie. non URL-encoded) URL is an invalid input.
Rather than reject it, browsers choose to encode it using utf-8, URL-encode the result, and use that. It's a good choice for error handling; being 99% correct is good enough when you're dealing with invalid content in the first place.
But if a developer wants to write a Django server with Shift-JIS URLs (it may be more compact than utf-8 for Asian languages) he's allowed to. If someone wants to replace a legacy system with ISO-8859-1 URLs with a Django version, she can. The URLs may not display nicely in browsers, but as long as they're properly URL-encoded, they'll work.
Besides, browsers aren't the only consumers of HTTP content on the Internet.
If wsgiref/PEP 3333 chooses to continue to "wrongly" (but safely) decoding 98% of URLs, it might be a design choice and it remains to be seen if it is a problem or not. Backwards compatibility is also an issue here.
I still think it's right (given the decision to use native strings in environ). If PEP 3333 decided to URL-decode and utf-8-decode URLs, it would prevent people from using any charset other than utf-8 in their URLs. I've given use cases for non-utf-8 URLs above.
As far as Django is concerned, I do agree with your next steps. But I'm -1 to using DEFAULT_CHARSET for decoding URLs. Django has absolutely no influence on the encoding of URL paths, that's the user agent's business. So even when you decide you want to serve non UTF-8 responses by setting DEFAULT_CHARSET, you still have no influence on the encoding of the paths you are receiving from clients
I disagree. You have total control on the encoding of the paths you are receiving, as long as you <charset>-encode and URL-encode your URLs, like you should. The UA must not perform any decoding or encoding on properly URL-encoded URLs.
(also taking into account hand-written URLs in browser address bars). In my opinion, these are orthogonal issues.
Yes, hand written URLs are the only case where Django doesn't have control.
Obviously, most regular websites will just use utf-8 everywhere, and that guarantees the best compatibility with the Web ecosystem.
My point is to make it possible to use something else if one wants to and is aware of the consequences. That's why Django has a DEFAULT_CHARSET setting.
If you think that Django should give up all pretense to support non-utf-8 environments, that's another discussion!
comment:9 by , 12 years ago
First of all, thanks for the detailed explanations above. They are really useful.
To sum up, I think we agree that Django will receive 98% of non-ascii URLs utf-8-encoded, URL-encoded. For those, environ['PATH_INFO'] will be wrongly decoded. That's why we have to re-encode, re-decode them to fix this (as proposed in my patch). The trick is to choose how to decode an input where we cannot be 100% sure about the encoding used. I'd suggest using the same idea I proposed in the Python bug:
a) try to decode with UTF-8
b) if it fails (UnicodeDecodeError), fall back to DEFAULT_CHARSET (if different from utf-8) or iso-8859-1.
It is based on the fact that a non-utf-8-encoded string will very probably fail when decoded with utf-8. Admittedly, there might be a risk (to be demonstrated) that a non-utf-8-encoded string might not fail when decoded with utf-8. I estimate the risk at 2% of 2% of the non-ascii URLs. I'm not sure we can avoid that. I'd be open to using DEFAULT_CHARSET in a) only if DEFAULT_CHARSET is used to encode URLs in Django (AFAIK it's not the case). I also think that favoring UTF-8 is in agreement with http://hg.python.org/cpython/rev/428d384ed626/
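The a)/b) strategy above might look roughly like this. The function name `decode_path` and its `fallback` parameter are hypothetical, and this is not the code from the attached patch:

```python
def decode_path(raw, fallback="iso-8859-1"):
    """Hypothetical sketch: try UTF-8 first, then fall back to another charset."""
    try:
        # a) try to decode with UTF-8
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # b) fall back to the configured charset
        return raw.decode(fallback)


print(decode_path("/café/".encode("utf-8")))
print(decode_path("/café/".encode("iso-8859-1")))
```

Both calls return the same text for this input; the open question in the comment is how often a non-UTF-8 bytestring happens to pass step a).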
by , 12 years ago
Attachment: | 19468-2.diff added |
---|
comment:11 by , 12 years ago
May I suggest a different look at these percentages?
Django is serving two populations of programmers here:
- the 99.x% who either use utf-8 everywhere or don't even know what an encoding is. The debate is moot for them because they have DEFAULT_CHARSET = 'utf-8'. And anyone who expects non ASCII URLs typed in a browser bar to work falls in this category.
- the 0.y% who want non-utf-8 URLs, for whatever reason. These are people who have special needs and who can be assumed to know what they're doing. They're probably not writing consumer websites. Their software may never be accessed by browsers.
  - Forcing utf-8 decoding states upfront that Django doesn't support this use case: it'll fail in 98% of the cases and return wrong results in the other 2%.
  - Trying utf-8 decoding, and falling back to DEFAULT_CHARSET, will work in 98% of the cases and return the wrong result in only 2% of the cases. This may be missed in testing and can't be relied upon in production. It's a trap and it's useless.
If you're bent on only supporting utf-8 URLs, please be upfront about it and don't make it a trap.
I'm getting weary of this debate; if I haven't convinced you, do what you want. For the record, I don't condone non-deterministic decoding, ie. "here's your decoded path — well, maybe, because if it happened to be valid utf-8 Django decoded it with utf-8 instead".
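The non-deterministic case can be shown with a contrived example. The sketch below assumes a site whose URLs are ISO-8859-1 and a hypothetical `decode_with_fallback` helper implementing "UTF-8 first, then fallback"; the second path is chosen specifically because its ISO-8859-1 bytes also form valid UTF-8.

```python
def decode_with_fallback(raw, fallback):
    """Hypothetical helper: UTF-8 first, then the site's charset."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(fallback)


site_charset = "iso-8859-1"
results = {}
for intended in ("/café/", "/Ã©/"):
    raw = intended.encode(site_charset)
    results[intended] = decode_with_fallback(raw, site_charset)
    print(repr(intended), "->", repr(results[intended]))
```

The first path survives via the fallback, while the second is silently decoded as UTF-8 into different text, with no error raised anywhere.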
comment:12 by , 12 years ago
I forgot to mention that utf-8 *is* the default, which is consistent with the Python changeset you're linking to.
comment:13 by , 12 years ago
I took a stab at writing a patch implementing the solution I described above.
While I was working on it, I noticed a regression in the test client: #19487. Fixing it is a prerequisite for testing this ticket decently.
I've created a pull request fixing both issues: https://github.com/django/django/pull/596
follow-up: 16 comment:14 by , 12 years ago
I'm following up on Claude's comment on #19487 here, because it's more related to this ticket.
Django will still encode the URL's in UTF-8, so the decoding will probably fail.
That's a good point that I had missed until now. The reverse function (and, as a consequence, the {% url %} tag) use django.utils.encoding.iri_to_uri, which is hardcoded to use UTF-8: it calls force_bytes without specifying a different encoding.
settings.DEFAULT_CHARSET is really about response encoding, not much else (see #4380).
I beg to differ. settings.DEFAULT_CHARSET is both about request and response encoding. It is used to decode GET and POST data in requests.
To sum up:
- DEFAULT_CHARSET applies to the request and response bodies
- it isn't clear whether it's intended to apply to anything else
- if we use it to decode URLs, we must fix iri_to_uri accordingly
- if we default to utf-8 to decode URLs, we're making it impossible to reliably use any other charset
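The mismatch between UTF-8 URL generation and non-UTF-8 decoding can be sketched without Django. Here `urllib.parse.quote` is only a stand-in for a URL generator that always uses UTF-8, and ISO-8859-1 stands in for a non-UTF-8 charset used on the decoding side:

```python
from urllib.parse import quote, unquote

# Stand-in for URL generation that is hardcoded to UTF-8.
generated = quote("/café/", encoding="utf-8")

# Stand-in for request handling that decodes paths with another charset.
received = unquote(generated, encoding="iso-8859-1")

print(generated)
print(received, received == "/café/")
```

The application would fail to recognize its own links, which is why decoding with a configurable charset only works if encoding uses the same one.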
comment:15 by , 12 years ago
Patch needs improvement: | set |
---|
comment:16 by , 12 years ago
Replying to aaugustin:
settings.DEFAULT_CHARSET is really about response encoding, not much else (see #4380).
I beg to differ. settings.DEFAULT_CHARSET is both about request and response encoding. It is used to decode GET and POST data in requests.
OK, then it might have evolved since #4380. I didn't make an extensive audit of its current use.
To sum up:
- DEFAULT_CHARSET applies to the request and response bodies
- it isn't clear whether it's intended to apply to anything else
- if we use it to decode URLs, we must fix iri_to_uri accordingly
Yes, good point.
- if we default to utf-8 to decode URLs, we're making it impossible to use reliably any other charset
utf-8 rulez!
comment:17 by , 12 years ago
I think we've fully described the context :)
I'm going to shoot an email to -developers and try to get other people involved in the discussion.
comment:18 by , 12 years ago
Owner: | changed from | to
---|
comment:19 by , 12 years ago
I looked into encoding URLs with DEFAULT_CHARSET, but I ran into a semantic problem.
The unicode branch introduced iri_to_uri and used it to convert arbitrary pieces of text into something suitable for inclusion in an URL. This function implements section 3.1 of RFC 3987, which mandates utf-8. Its name implies that Django generates URIs, which are encoded in utf-8 by definition.
Simply using DEFAULT_CHARSET instead of utf-8 will make its name misleading. Changing the name is backwards incompatible because it's documented.
I don't have any interest in this besides making Django's behavior as correct and unsurprising as possible, and this is proving more complex than what I'm willing to deal with.
Since the discussion on django-developers didn't attract any interest, let's just hardcode utf-8 and accept the two drawbacks when DEFAULT_CHARSET != 'utf-8':
- URLs generated by Django contain a mix of utf-8 (path) and non-utf-8 (query string) — but we can pretend that the query-string is opaque application data :)
- it isn't possible to serve arbitrary URLs with Django — but I have a better proposal for this in #19508.
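The first drawback, a URL mixing two encodings, can be sketched with the standard library. This does not call Django: the `charset` variable stands in for a non-UTF-8 DEFAULT_CHARSET, and the path and query values are made up for the example.

```python
from urllib.parse import quote, urlencode

charset = "iso-8859-1"  # stand-in for a non-UTF-8 DEFAULT_CHARSET

# Path: always UTF-8, then percent-encoded.
path = quote("/café/", encoding="utf-8")

# Query string: encoded with the site charset, then percent-encoded.
query = urlencode({"q": "café"}, encoding=charset)

url = path + "?" + query
print(url)
```

The same character ends up as two bytes in the path and one byte in the query string, which is harmless as long as each part is decoded with the charset that produced it.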
comment:20 by , 12 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
Unfortunately, this seems to be a Python problem. Someone thought that URLs were always encoded in iso-8859-1, which is probably wrong.
Here's the problem:
http://hg.python.org/cpython/file/cbdd6852a274/Lib/wsgiref/simple_server.py#l85
The issue which led to the commit:
http://bugs.python.org/issue10155
With request.path.encode('iso-8859-1').decode('utf-8'), you will recover the original URL path. More research needs to be done on this subject...