Opened 11 years ago
Closed 10 years ago
#22223 closed Bug (fixed)
reverse() escapes unreserved characters
Reported by: | Owned by: | nobody | |
---|---|---|---|
Component: | Core (URLs) | Version: | 1.6 |
Severity: | Normal | Keywords: | |
Cc: | denilsonsa@…, Florian Apolloner, jorgecarleitao@… | Triage Stage: | Ready for checkin |
Has patch: | yes | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
Django's django.core.urlresolvers.reverse()
seems to have changed its behavior in 1.6. It now runs the arguments through urlquote()
, without specifying the safe characters for path components. As a result:
on 1.4.10:
In [2]: reverse('test', args=['foo:bar']) Out[2]: '/foo:bar'
but on 1.6.2:
In [2]: reverse('test', args=['foo:bar']) Out[2]: '/foo%3Abar'
It would seem to me that this is a regression, as ":@-._~!$&'()*+,;=" are all allowed unescaped in path segments AFAIK.
I'm bringing this up because it breaks certain OAuth 1 clients against Bitbucket.
In some places we redirect to URLs whose path segment contains a ":". Prior to us upgrading to 1.6 the response's location header preserved that colon, but now it gets escaped, changing the URL (e.g. https://api.bitbucket.org/2.0/repositories/david/django-storages/pullrequests/51/diff redirecting to https://api.bitbucket.org/2.0/repositories/david/django-storages/diff/regadas/django-storages%3A069fd1d01fbf..f153a70ba254)
In OAuth 1, requests are signed, including the request URL, but the RFC-5849 does not mandate any pre-processing of the URL. For several OAuth clients (including requests-oauthlib and python-oauth2) that means they compute the signature over a string that contains "%3A" instead of ":".
On the server however, the request path automatically gets unquoted before it hits the middlewares and views. As our OAuth layer is a middleware that reconstructs the signature, it ends up computing over ":", yielding a different signature than the client, breaking authentication.
This might be addressable by changing these OAuth clients to perform unquoting on the path segment, but a better solution would seem to make urlresolvers.py:RegexURLResolver
respect the reserved characters for path segments and not escape what does not need to be escaped.
I'll follow up with a pull request, unless there are strong feelings, or unwanted consequences of that approach.
Change History (13)
comment:1 by , 11 years ago
comment:2 by , 11 years ago
This change is indeed documented in the release notes. It's a consequence of #13260. See my analysis for details.
The change suggested here would still preserve the requirements of #13260, which was primarily concerned with % characters in variable parts of URLs.
Now the question is -- what characters do we consider safe? By default urllib.quote preserves A-Za-z0-9_.-
and characters defined as safe, which default to /
.
Based on RFC 1738:
<>"#%{}|\^~[]`
are unsafe and must be encoded (that list includes the SPACE character).;/?:@=&
are reserved and must be encoded unless they are used for their special meaning.$-_.+!*'(),
are safe and need not be encoded.
We can certainly put the third set of characters in the safe list.
If characters from the second set end up unencoded in URLs generated by Django, we start relying on user-agent quirks to re-encode them properly in HTTP request lines. However, /
is part of this list and considered safe by the stdlib by default (which may not mean much; the stdlib contains many unfortunate API choices).
Sinc the path segment always starts with a slash following the host and ends at the end of the URL or with one of ?
, ;
or #
(which is always unsafe), we may choose to preserve /:@=&
, that is, all of the second set except for ?
and ;
. If we want to be more careful, we may choose to preserve only /
and :
because /
is safe by default and :
is only used to separate the protocol from the remainder of the URL. That would resolve your problem.
Can you clarify how you came up with /:@&=+$,
? If you're including some characters from the third set above, you should probably include all of them.
The fix should be backported to 1.6.x since it's a regression.
comment:3 by , 11 years ago
Related bug filed against urllib
: http://bugs.python.org/issue2637
See also django.utils.encoding.iri_to_uri
whose docstring discusses RFC 3986, with different character sets from those I mentioned :(
See also django.utils.html.smart_urlquote
which uses safe=b'!*\'();:@&=+$,/?#[]~'
. I'm not sure where that list comes from.
comment:4 by , 11 years ago
Has patch: | set |
---|---|
Patch needs improvement: | set |
Triage Stage: | Unreviewed → Accepted |
comment:5 by , 11 years ago
Can you clarify how you came up with /:@&=+$,?
I took that from paragraph 3.3 of RFC 2396:
3.3. Path Component
The path component contains data, specific to the authority (or the
scheme if there is no authority component), identifying the resource
within the scope of that scheme and authority.
path = [ abs_path | opaque_part ]
path_segments = segment *( "/" segment )
segment = *pchar *( ";" param )
param = *pchar
pchar = unreserved | escaped |
":" | "@" | "&" | "=" | "+" | "$" | ","
Also, looking more closely at RFC 1738, it seems to me that the only reserved characters in HTTP path segments are: "/;?". I'm not entirely sure where you found those other character sets.
Btw, when I backport the change to 1.6.x, do you want me to create another pull request against master, or are patches against older branches automatically merged up?
comment:6 by , 11 years ago
My comment is based on notes I made 5 or 6 years ago working on a very similar problem.
We have too many RFCs...
Backporting is usually handled by the committer. This patch should apply cleanly to 1.6.
comment:8 by , 10 years ago
Cc: | added |
---|
FYI, I had submitted a ticket asking for a way to customize the quoting behavior. In my case, I wanted to quote /
.
https://code.djangoproject.com/ticket/22589
Given, the different needs, maybe the list of safe/unsafe characters could be supplied in urls.py, per-argument per-url? Just a quick idea:
url(r'^something/(?P<title>[^/]*)/something/else$', views.something, name='something', safechars={ 'title': ':@=$-_.+!*'(),' }),
comment:9 by , 10 years ago
Cc: | added |
---|
@denilsonsa Something like that is certainly not going to happen.
comment:10 by , 10 years ago
Cc: | added |
---|
I've reviewed the PR, and added a comment on github.
It seems that 2396 is obsolete by RFC 3986 (Jan. 2005) [1]:
In particular
reserved = gen-delims / sub-delims
gen-delims = ":" / "/" / "?" / "#" / "/" / "@"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
[...]
2.3. Unreserved Characters
Characters that are allowed in a URI but do not have a reserved
purpose are called unreserved. These include uppercase and lowercase
letters, decimal digits, hyphen, period, underscore, and tilde.
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
They also have an appendix with a compiled information about this http://tools.ietf.org/html/rfc3986#appendix-A
I'm no expert in this, just bringing your attention to it.
comment:11 by , 10 years ago
Patch needs improvement: | unset |
---|
I think that we should indeed follow RFC 3986. See also http://bugs.python.org/issue16285.
I've pushed a patch with my proposal, based on Erik's patch.
https://github.com/django/django/pull/2859
comment:12 by , 10 years ago
Triage Stage: | Accepted → Ready for checkin |
---|
comment:13 by , 10 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
Pull request: https://github.com/django/django/pull/2408