Opened 13 years ago
Closed 11 years ago
#19508 closed Cleanup/optimization (fixed)
Implement URL decoding according to RFC 3987
| Reported by: | Aymeric Augustin | Owned by: | Loic Bistuer |
|---|---|---|---|
| Component: | HTTP handling | Version: | dev |
| Severity: | Normal | Keywords: | |
| Cc: | anubhav9042@… | Triage Stage: | Ready for checkin |
| Has patch: | yes | Needs documentation: | no |
| Needs tests: | no | Patch needs improvement: | no |
| Easy pickings: | no | UI/UX: | no |
Description
Since #5738, when Django fails to decode an URL because it isn't valid UTF-8, it returns a HTTP 400 error with no content.
It's a bit sloppy to use a protocol-level message for an application-level requirement — in other words, to reply to a well-formed HTTP request with 400 Bad Request. Section 3.2 of RFC 3987 proposes a solution to this problem. Basically, non-ASCII bytes that do not create a valid utf-8 sequence should remain URL-encoded. This may not be trivial to implement, but it provide better error handling and it's normalized.
With this change, non-existing but well-formed URLs will return a 404 instead of a 400. That's what people expect, as shown in the comments of #5738 and #16541.
Django builds URLs according to section 3.1 of RFC 3987. With this change, URLs will round trip cleanly through the reversing / resolving (that's one of the guarantees of RFC 3987) and Django will be able to deal with legacy, non-utf-8 URLs. I pursued these goals in #19468 with a more primitive technique (depending only on the encoding) and that didn't work out.
Change History (11)
comment:1 by , 13 years ago
| Triage Stage: | Unreviewed → Accepted |
|---|
comment:3 by , 12 years ago
| Cc: | added |
|---|
comment:4 by , 12 years ago
| Owner: | changed from to |
|---|---|
| Status: | new → assigned |
Will be working on this in my GSoC project
comment:5 by , 11 years ago
Loic and I had a discussion on this today.
Few things worth mentioning:
- When having urls like
/test/~%A9, we get400when project is deployed asWSGIHandlerhas fallback for400in case ofUnicodeDecodeError. - In development we get
500becauseStaticFilesHandleris used and it does not have that fallback. - It cannot be reproduced in tests because the
ClientHandleragain raisesUnicodeDecodeErroringet_path_info(). - The problem arises when in get_path_info(), we decode the url in
utf-8. - When we pass a url, it passes through
unquote()inurllibwhere it converts all percent encodings, even those which should remain url-encoded.
I am working on this now and will report again soon.
comment:8 by , 11 years ago
| Owner: | changed from to |
|---|
This ticket was recently mentioned in a ML thread: https://groups.google.com/d/topic/django-developers/mS9-HXI4ljw/discussion
I amended the patch from Anubhav to make uri_to_iri() return unicode rather than UTF-8 (Option 2 in PR comment https://github.com/django/django/pull/2932/files#r15440287). This is consistent with Werkzeug's implementation (Refs http://werkzeug.pocoo.org/docs/0.9/utils/#werkzeug.urls.uri_to_iri)
Made a new PR - https://github.com/django/django/pull/3350 - feedback welcome.
comment:9 by , 11 years ago
I reworked the implementation of the "repercent" step. Anyone experienced with unicode to double check?
I still want to review our usage of the various quote/unquote functions and their respective quirks in term of input/return values (and their discrepancies between PY2 and PY3).
comment:11 by , 11 years ago
| Resolution: | → fixed |
|---|---|
| Status: | assigned → closed |
I am thinking to work on this.....
I have some idea of what to do, by visiting the links provided in the summary.