Opened 11 years ago

Closed 10 years ago

#19508 closed Cleanup/optimization (fixed)

Implement URL decoding according to RFC 3987

Reported by: Aymeric Augustin Owned by: Loic Bistuer
Component: HTTP handling Version: dev
Severity: Normal Keywords:
Cc: anubhav9042@… Triage Stage: Ready for checkin
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

Since #5738, when Django fails to decode an URL because it isn't valid UTF-8, it returns a HTTP 400 error with no content.

It's a bit sloppy to use a protocol-level message for an application-level requirement — in other words, to reply to a well-formed HTTP request with 400 Bad Request. Section 3.2 of RFC 3987 proposes a solution to this problem. Basically, non-ASCII bytes that do not create a valid utf-8 sequence should remain URL-encoded. This may not be trivial to implement, but it provide better error handling and it's normalized.

With this change, non-existing but well-formed URLs will return a 404 instead of a 400. That's what people expect, as shown in the comments of #5738 and #16541.

Django builds URLs according to section 3.1 of RFC 3987. With this change, URLs will round trip cleanly through the reversing / resolving (that's one of the guarantees of RFC 3987) and Django will be able to deal with legacy, non-utf-8 URLs. I pursued these goals in #19468 with a more primitive technique (depending only on the encoding) and that didn't work out.

Change History (11)

comment:1 by Claude Paroz, 11 years ago

Triage Stage: UnreviewedAccepted

comment:2 by ANUBHAV JOSHI, 10 years ago

I am thinking to work on this.....
I have some idea of what to do, by visiting the links provided in the summary.

Can anyone give me some ideas to begin with...??

Version 0, edited 10 years ago by ANUBHAV JOSHI (next)

comment:3 by ANUBHAV JOSHI, 10 years ago

Cc: anubhav9042@… added

comment:4 by ANUBHAV JOSHI, 10 years ago

Owner: changed from nobody to ANUBHAV JOSHI
Status: newassigned

Will be working on this in my GSoC project

comment:5 by ANUBHAV JOSHI, 10 years ago

Loic and I had a discussion on this today.
Few things worth mentioning:

  • When having urls like /test/~%A9, we get 400 when project is deployed as WSGIHandler has fallback for 400 in case of UnicodeDecodeError.
  • In development we get 500 because StaticFilesHandler is used and it does not have that fallback.
  • It cannot be reproduced in tests because the ClientHandler again raises UnicodeDecodeError in get_path_info().
  • The problem arises when in get_path_info(), we decode the url in utf-8.
  • When we pass a url, it passes through unquote() in urllib where it converts all percent encodings, even those which should remain url-encoded.

I am working on this now and will report again soon.

comment:7 by Tim Graham, 10 years ago

Alternate PR from Anubhav.

comment:8 by Loic Bistuer, 10 years ago

Owner: changed from ANUBHAV JOSHI to Loic Bistuer

This ticket was recently mentioned in a ML thread: https://groups.google.com/d/topic/django-developers/mS9-HXI4ljw/discussion

I amended the patch from Anubhav to make uri_to_iri() return unicode rather than UTF-8 (Option 2 in PR comment https://github.com/django/django/pull/2932/files#r15440287). This is consistent with Werkzeug's implementation (Refs http://werkzeug.pocoo.org/docs/0.9/utils/#werkzeug.urls.uri_to_iri)

Made a new PR - https://github.com/django/django/pull/3350 - feedback welcome.

comment:9 by Loic Bistuer, 10 years ago

I reworked the implementation of the "repercent" step. Anyone experienced with unicode to double check?

I still want to review our usage of the various quote/unquote functions and their respective quirks in term of input/return values (and their discrepancies between PY2 and PY3).

comment:10 by Claude Paroz, 10 years ago

Triage Stage: AcceptedReady for checkin

Looks good, thanks!

comment:11 by Loic Bistuer <loic.bistuer@…>, 10 years ago

Resolution: fixed
Status: assignedclosed

In 10b17a22bec2eaf44c3315614aea87c127caee46:

Fixed #19508 -- Implemented uri_to_iri as per RFC.

Thanks Loic Bistuer for helping in shaping the patch and Claude Paroz
for the review.

Note: See TracTickets for help on using tickets.
Back to Top