Opened 6 years ago

Closed 4 years ago

#19508 closed Cleanup/optimization (fixed)

Implement URL decoding according to RFC 3987

Reported by: Aymeric Augustin Owned by: Loic Bistuer
Component: HTTP handling Version: master
Severity: Normal Keywords:
Cc: anubhav9042@… Triage Stage: Ready for checkin
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no


Since #5738, when Django fails to decode an URL because it isn't valid UTF-8, it returns a HTTP 400 error with no content.

It's a bit sloppy to use a protocol-level message for an application-level requirement — in other words, to reply to a well-formed HTTP request with 400 Bad Request. Section 3.2 of RFC 3987 proposes a solution to this problem. Basically, non-ASCII bytes that do not create a valid utf-8 sequence should remain URL-encoded. This may not be trivial to implement, but it provide better error handling and it's normalized.

With this change, non-existing but well-formed URLs will return a 404 instead of a 400. That's what people expect, as shown in the comments of #5738 and #16541.

Django builds URLs according to section 3.1 of RFC 3987. With this change, URLs will round trip cleanly through the reversing / resolving (that's one of the guarantees of RFC 3987) and Django will be able to deal with legacy, non-utf-8 URLs. I pursued these goals in #19468 with a more primitive technique (depending only on the encoding) and that didn't work out.

Change History (11)

comment:1 Changed 6 years ago by Claude Paroz

Triage Stage: UnreviewedAccepted

comment:2 Changed 5 years ago by ANUBHAV JOSHI

I am thinking to work on this.....
I have some idea of what to do, by visiting the links provided in the summary.

Can anyone give me some ideas to begin with...??

Version 0, edited 5 years ago by ANUBHAV JOSHI (next)

comment:3 Changed 5 years ago by ANUBHAV JOSHI

Cc: anubhav9042@… added

comment:4 Changed 5 years ago by ANUBHAV JOSHI

Owner: changed from nobody to ANUBHAV JOSHI
Status: newassigned

Will be working on this in my GSoC project

comment:5 Changed 4 years ago by ANUBHAV JOSHI

Loic and I had a discussion on this today.
Few things worth mentioning:

  • When having urls like /test/~%A9, we get 400 when project is deployed as WSGIHandler has fallback for 400 in case of UnicodeDecodeError.
  • In development we get 500 because StaticFilesHandler is used and it does not have that fallback.
  • It cannot be reproduced in tests because the ClientHandler again raises UnicodeDecodeError in get_path_info().
  • The problem arises when in get_path_info(), we decode the url in utf-8.
  • When we pass a url, it passes through unquote() in urllib where it converts all percent encodings, even those which should remain url-encoded.

I am working on this now and will report again soon.

comment:6 Changed 4 years ago by ANUBHAV JOSHI

Has patch: set

comment:7 Changed 4 years ago by Tim Graham

Alternate PR from Anubhav.

comment:8 Changed 4 years ago by Loic Bistuer

Owner: changed from ANUBHAV JOSHI to Loic Bistuer

This ticket was recently mentioned in a ML thread:

I amended the patch from Anubhav to make uri_to_iri() return unicode rather than UTF-8 (Option 2 in PR comment This is consistent with Werkzeug's implementation (Refs

Made a new PR - - feedback welcome.

comment:9 Changed 4 years ago by Loic Bistuer

I reworked the implementation of the "repercent" step. Anyone experienced with unicode to double check?

I still want to review our usage of the various quote/unquote functions and their respective quirks in term of input/return values (and their discrepancies between PY2 and PY3).

comment:10 Changed 4 years ago by Claude Paroz

Triage Stage: AcceptedReady for checkin

Looks good, thanks!

comment:11 Changed 4 years ago by Loic Bistuer <loic.bistuer@…>

Resolution: fixed
Status: assignedclosed

In 10b17a22bec2eaf44c3315614aea87c127caee46:

Fixed #19508 -- Implemented uri_to_iri as per RFC.

Thanks Loic Bistuer for helping in shaping the patch and Claude Paroz
for the review.

Note: See TracTickets for help on using tickets.
Back to Top