Opened 12 years ago
Closed 10 years ago
#19508 closed Cleanup/optimization (fixed)
Implement URL decoding according to RFC 3987
Reported by: | Aymeric Augustin | Owned by: | Loic Bistuer |
---|---|---|---|
Component: | HTTP handling | Version: | dev |
Severity: | Normal | Keywords: | |
Cc: | anubhav9042@… | Triage Stage: | Ready for checkin |
Has patch: | yes | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
Since #5738, when Django fails to decode an URL because it isn't valid UTF-8, it returns a HTTP 400 error with no content.
It's a bit sloppy to use a protocol-level message for an application-level requirement — in other words, to reply to a well-formed HTTP request with 400 Bad Request. Section 3.2 of RFC 3987 proposes a solution to this problem. Basically, non-ASCII bytes that do not create a valid utf-8 sequence should remain URL-encoded. This may not be trivial to implement, but it provide better error handling and it's normalized.
With this change, non-existing but well-formed URLs will return a 404 instead of a 400. That's what people expect, as shown in the comments of #5738 and #16541.
Django builds URLs according to section 3.1 of RFC 3987. With this change, URLs will round trip cleanly through the reversing / resolving (that's one of the guarantees of RFC 3987) and Django will be able to deal with legacy, non-utf-8 URLs. I pursued these goals in #19468 with a more primitive technique (depending only on the encoding) and that didn't work out.
Change History (11)
comment:1 by , 12 years ago
Triage Stage: | Unreviewed → Accepted |
---|
comment:3 by , 11 years ago
Cc: | added |
---|
comment:4 by , 10 years ago
Owner: | changed from | to
---|---|
Status: | new → assigned |
Will be working on this in my GSoC project
comment:5 by , 10 years ago
Loic and I had a discussion on this today.
Few things worth mentioning:
- When having urls like
/test/~%A9
, we get400
when project is deployed asWSGIHandler
has fallback for400
in case ofUnicodeDecodeError
. - In development we get
500
becauseStaticFilesHandler
is used and it does not have that fallback. - It cannot be reproduced in tests because the
ClientHandler
again raisesUnicodeDecodeError
inget_path_info()
. - The problem arises when in get_path_info(), we decode the url in
utf-8
. - When we pass a url, it passes through
unquote()
inurllib
where it converts all percent encodings, even those which should remain url-encoded.
I am working on this now and will report again soon.
comment:8 by , 10 years ago
Owner: | changed from | to
---|
This ticket was recently mentioned in a ML thread: https://groups.google.com/d/topic/django-developers/mS9-HXI4ljw/discussion
I amended the patch from Anubhav to make uri_to_iri()
return unicode rather than UTF-8 (Option 2 in PR comment https://github.com/django/django/pull/2932/files#r15440287). This is consistent with Werkzeug's implementation (Refs http://werkzeug.pocoo.org/docs/0.9/utils/#werkzeug.urls.uri_to_iri)
Made a new PR - https://github.com/django/django/pull/3350 - feedback welcome.
comment:9 by , 10 years ago
I reworked the implementation of the "repercent" step. Anyone experienced with unicode to double check?
I still want to review our usage of the various quote/unquote functions and their respective quirks in term of input/return values (and their discrepancies between PY2 and PY3).
comment:11 by , 10 years ago
Resolution: | → fixed |
---|---|
Status: | assigned → closed |
I am thinking to work on this.....
I have some idea of what to do, by visiting the links provided in the summary.