Opened 3 years ago

Closed 11 months ago

#19508 closed Cleanup/optimization (fixed)

Implement URL decoding according to RFC 3987

Reported by: aaugustin Owned by: loic
Component: HTTP handling Version: master
Severity: Normal Keywords:
Cc: anubhav9042@… Triage Stage: Ready for checkin
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

Since #5738, when Django fails to decode an URL because it isn't valid UTF-8, it returns a HTTP 400 error with no content.

It's a bit sloppy to use a protocol-level message for an application-level requirement — in other words, to reply to a well-formed HTTP request with 400 Bad Request. Section 3.2 of RFC 3987 proposes a solution to this problem. Basically, non-ASCII bytes that do not create a valid utf-8 sequence should remain URL-encoded. This may not be trivial to implement, but it provide better error handling and it's normalized.

With this change, non-existing but well-formed URLs will return a 404 instead of a 400. That's what people expect, as shown in the comments of #5738 and #16541.

Django builds URLs according to section 3.1 of RFC 3987. With this change, URLs will round trip cleanly through the reversing / resolving (that's one of the guarantees of RFC 3987) and Django will be able to deal with legacy, non-utf-8 URLs. I pursued these goals in #19468 with a more primitive technique (depending only on the encoding) and that didn't work out.

Change History (11)

comment:1 Changed 3 years ago by claudep

  • Triage Stage changed from Unreviewed to Accepted

comment:2 Changed 18 months ago by anubhav9042

I am thinking to work on this.....
I have some idea of what to do, by visiting the links provided in the summary.

Last edited 14 months ago by anubhav9042 (previous) (diff)

comment:3 Changed 18 months ago by anubhav9042

  • Cc anubhav9042@… added

comment:4 Changed 16 months ago by anubhav9042

  • Owner changed from nobody to anubhav9042
  • Status changed from new to assigned

Will be working on this in my GSoC project

comment:5 Changed 14 months ago by anubhav9042

Loic and I had a discussion on this today.
Few things worth mentioning:

  • When having urls like /test/~%A9, we get 400 when project is deployed as WSGIHandler has fallback for 400 in case of UnicodeDecodeError.
  • In development we get 500 because StaticFilesHandler is used and it does not have that fallback.
  • It cannot be reproduced in tests because the ClientHandler again raises UnicodeDecodeError in get_path_info().
  • The problem arises when in get_path_info(), we decode the url in utf-8.
  • When we pass a url, it passes through unquote() in urllib where it converts all percent encodings, even those which should remain url-encoded.

I am working on this now and will report again soon.

comment:7 Changed 13 months ago by timgraham

Alternate PR from Anubhav.

comment:8 Changed 11 months ago by loic

  • Owner changed from anubhav9042 to loic

This ticket was recently mentioned in a ML thread: https://groups.google.com/d/topic/django-developers/mS9-HXI4ljw/discussion

I amended the patch from Anubhav to make uri_to_iri() return unicode rather than UTF-8 (Option 2 in PR comment https://github.com/django/django/pull/2932/files#r15440287). This is consistent with Werkzeug's implementation (Refs http://werkzeug.pocoo.org/docs/0.9/utils/#werkzeug.urls.uri_to_iri)

Made a new PR - https://github.com/django/django/pull/3350 - feedback welcome.

comment:9 Changed 11 months ago by loic

I reworked the implementation of the "repercent" step. Anyone experienced with unicode to double check?

I still want to review our usage of the various quote/unquote functions and their respective quirks in term of input/return values (and their discrepancies between PY2 and PY3).

comment:10 Changed 11 months ago by claudep

  • Triage Stage changed from Accepted to Ready for checkin

Looks good, thanks!

comment:11 Changed 11 months ago by Loic Bistuer <loic.bistuer@…>

  • Resolution set to fixed
  • Status changed from assigned to closed

In 10b17a22bec2eaf44c3315614aea87c127caee46:

Fixed #19508 -- Implemented uri_to_iri as per RFC.

Thanks Loic Bistuer for helping in shaping the patch and Claude Paroz
for the review.

Note: See TracTickets for help on using tickets.
Back to Top