#22223 closed Bug (fixed)

reverse() escapes unreserved characters

Reported by: erik.van.zijst@… Owned by: nobody
Component: Core (URLs) Version: 1.6
Severity: Normal Keywords:
Cc: denilsonsa@…, apollo13, jorgecarleitao@… Triage Stage: Ready for checkin
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

Django's django.core.urlresolvers.reverse() seems to have changed its behavior in 1.6. It now runs the arguments through urlquote(), without specifying the safe characters for path components. As a result:

on 1.4.10:

In [2]: reverse('test', args=['foo:bar'])

Out[2]: '/foo:bar'

but on 1.6.2:

In [2]: reverse('test', args=['foo:bar'])
Out[2]: '/foo%3Abar'

It would seem to me that this is a regression, as ":@-._~!$&'()*+,;=" are all allowed unescaped in path segments AFAIK.

I'm bringing this up because it breaks certain OAuth 1 clients against Bitbucket.

In some places we redirect to URLs whose path segment contains a ":". Prior to us upgrading to 1.6 the response's location header preserved that colon, but now it gets escaped, changing the URL (e.g. https://api.bitbucket.org/2.0/repositories/david/django-storages/pullrequests/51/diff redirecting to https://api.bitbucket.org/2.0/repositories/david/django-storages/diff/regadas/django-storages%3A069fd1d01fbf..f153a70ba254)

In OAuth 1, requests are signed, including the request URL, but the RFC-5849 does not mandate any pre-processing of the URL. For several OAuth clients (including requests-oauthlib and python-oauth2) that means they compute the signature over a string that contains "%3A" instead of ":".

On the server however, the request path automatically gets unquoted before it hits the middlewares and views. As our OAuth layer is a middleware that reconstructs the signature, it ends up computing over ":", yielding a different signature than the client, breaking authentication.

This might be addressable by changing these OAuth clients to perform unquoting on the path segment, but a better solution would seem to make urlresolvers.py:RegexURLResolver respect the reserved characters for path segments and not escape what does not need to be escaped.

I'll follow up with a pull request, unless there are strong feelings, or unwanted consequences of that approach.

Change History (13)

comment:1 Changed 17 months ago by erik.van.zijst@…

  • Needs documentation unset
  • Needs tests unset
  • Patch needs improvement unset

comment:2 Changed 17 months ago by aaugustin

This change is indeed documented in the release notes. It's a consequence of #13260. See my analysis for details.

The change suggested here would still preserve the requirements of #13260, which was primarily concerned with % characters in variable parts of URLs.

Now the question is -- what characters do we consider safe? By default urllib.quote preserves A-Za-z0-9_.- and characters defined as safe, which default to /.

Based on RFC 1738:

  1. <>"#%{}|\^~[]` are unsafe and must be encoded (that list includes the SPACE character).
  2. ;/?:@=& are reserved and must be encoded unless they are used for their special meaning.
  3. $-_.+!*'(), are safe and need not be encoded.

We can certainly put the third set of characters in the safe list.

If characters from the second set end up unencoded in URLs generated by Django, we start relying on user-agent quirks to re-encode them properly in HTTP request lines. However, / is part of this list and considered safe by the stdlib by default (which may not mean much; the stdlib contains many unfortunate API choices).

Sinc the path segment always starts with a slash following the host and ends at the end of the URL or with one of ?, ; or # (which is always unsafe), we may choose to preserve /:@=&, that is, all of the second set except for ? and ;. If we want to be more careful, we may choose to preserve only / and : because / is safe by default and : is only used to separate the protocol from the remainder of the URL. That would resolve your problem.

Can you clarify how you came up with /:@&=+$,? If you're including some characters from the third set above, you should probably include all of them.

The fix should be backported to 1.6.x since it's a regression.

comment:3 Changed 17 months ago by aaugustin

Related bug filed against urllib: http://bugs.python.org/issue2637

See also django.utils.encoding.iri_to_uri whose docstring discusses RFC 3986, with different character sets from those I mentioned :(

See also django.utils.html.smart_urlquote which uses safe=b'!*\'();:@&=+$,/?#[]~'. I'm not sure where that list comes from.

comment:4 Changed 17 months ago by claudep

  • Has patch set
  • Patch needs improvement set
  • Triage Stage changed from Unreviewed to Accepted

comment:5 Changed 17 months ago by erik.van.zijst

Can you clarify how you came up with /:@&=+$,?

I took that from paragraph 3.3 of RFC 2396:

3.3. Path Component

The path component contains data, specific to the authority (or the
scheme if there is no authority component), identifying the resource
within the scope of that scheme and authority.

path = [ abs_path | opaque_part ]

path_segments = segment *( "/" segment )
segment = *pchar *( ";" param )
param = *pchar

pchar = unreserved | escaped |

":" | "@" | "&" | "=" | "+" | "$" | ","

Also, looking more closely at RFC 1738, it seems to me that the only reserved characters in HTTP path segments are: "/;?". I'm not entirely sure where you found those other character sets.

Btw, when I backport the change to 1.6.x, do you want me to create another pull request against master, or are patches against older branches automatically merged up?

comment:6 Changed 17 months ago by aaugustin

My comment is based on notes I made 5 or 6 years ago working on a very similar problem.

We have too many RFCs...

Backporting is usually handled by the committer. This patch should apply cleanly to 1.6.

comment:7 Changed 17 months ago by erik.van.zijst

Anything I need to do to the pull request at this point?

comment:8 Changed 15 months ago by denilsonsa

  • Cc denilsonsa@… added

FYI, I had submitted a ticket asking for a way to customize the quoting behavior. In my case, I wanted to quote /.
https://code.djangoproject.com/ticket/22589

Given, the different needs, maybe the list of safe/unsafe characters could be supplied in urls.py, per-argument per-url? Just a quick idea:

url(r'^something/(?P<title>[^/]*)/something/else$', views.something, name='something', safechars={ 'title': ':@=$-_.+!*'(),' }),

comment:9 Changed 15 months ago by apollo13

  • Cc apollo13 added

@denilsonsa Something like that is certainly not going to happen.

comment:10 Changed 14 months ago by jorgecarleitao

  • Cc jorgecarleitao@… added

I've reviewed the PR, and added a comment on github.

It seems that 2396 is obsolete by RFC 3986 (Jan. 2005) [1]:

In particular

reserved = gen-delims / sub-delims

gen-delims = ":" / "/" / "?" / "#" / "/?" / "@"

sub-delims = "!" / "$" / "&" / "'" / "(" / ")"

/ "*" / "+" / "," / ";" / "="

[...]

2.3. Unreserved Characters

Characters that are allowed in a URI but do not have a reserved
purpose are called unreserved. These include uppercase and lowercase
letters, decimal digits, hyphen, period, underscore, and tilde.

unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"

They also have an appendix with a compiled information about this http://tools.ietf.org/html/rfc3986#appendix-A

I'm no expert in this, just bringing your attention to it.

[1] http://tools.ietf.org/html/rfc3986

Last edited 14 months ago by jorgecarleitao (previous) (diff)

comment:11 Changed 13 months ago by claudep

  • Patch needs improvement unset

I think that we should indeed follow RFC 3986. See also http://bugs.python.org/issue16285.

I've pushed a patch with my proposal, based on Erik's patch.
https://github.com/django/django/pull/2859

comment:12 Changed 13 months ago by timo

  • Triage Stage changed from Accepted to Ready for checkin

comment:13 Changed 13 months ago by Claude Paroz <claude@…>

  • Resolution set to fixed
  • Status changed from new to closed

In e167e96cfea670422ca75d0b35fe7c4195f25b63:

Fixed #22223 -- Prevented over-escaping URLs in reverse()

And follow more closely the class of characters defined in the
RFC 3986.
Thanks Erik van Zijst for the report and the initial patch, and
Tim Graham for the review.

Note: See TracTickets for help on using tickets.
Back to Top