#22267 closed Bug (fixed)
django.utils.html.smart_urlquote() is incorrectly unquoting the url
Reported by: | Owned by: | Md. Enzam Hossain | |
---|---|---|---|
Component: | Utilities | Version: | dev |
Severity: | Normal | Keywords: | |
Cc: | meenzam@…, Kevin Brown | Triage Stage: | Ready for checkin |
Has patch: | yes | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
When we give django.utils.html.smart_urlquote() an url with an already embedded query as a query paramater, it is unquoting it which it should not do.
For example,
smart_urlquote('http://example.com/?q=http%3A%2F%2Fexample.com%2F%3Fx%3D1%26q%3Ddjango')
returns
'http://example.com/?q=http://example.com/?x=1&q=django'
which has the query parameters {'q': ['http://example.com/?x=1', 'django']}
where it should have returned the exact url back
'http://example.com/?q=http%3A%2F%2Fexample.com%2F%3Fx%3D1%26q%3Ddjango'
which has the query parameters {'q': ['http://example.com/?x=1&q=django']}
I have written a test here - https://github.com/ienzam/django/blob/master/tests/utils_tests/test_html.py#L203
# Ensure that embedded quoted url does not get unquoted self.assertEqual( quote('http://example.com/?q=http%3A%2F%2Fexample.com%2F%3Fx%3D1%26q%3Ddjango'), 'http://example.com/?q=http%3A%2F%2Fexample.com%2F%3Fx%3D1%26q%3Ddjango')
Change History (30)
comment:1 by , 11 years ago
Triage Stage: | Unreviewed → Accepted |
---|---|
Version: | 1.5 → master |
comment:2 by , 11 years ago
This is the commit which introduced unquote https://github.com/django/django/commit/b70c371fc1f18ea0c43b503122df3f311afc7105
comment:3 by , 11 years ago
Cc: | added |
---|---|
Has patch: | set |
The simple fix will be to unquote and quote only the path of the url, I have
https://github.com/ienzam/django/commit/692ac89689dd9ddf5895bfaf0181434c5daf81e9
This passes the current unit tests.
comment:4 by , 11 years ago
Patch needs improvement: | set |
---|
In that patch however, the conditions are altered: quoting is not performed if any exception is raised and caught. This does not appear to be part of the tests, so they still all succeed.
comment:5 by , 11 years ago
Has patch: | unset |
---|
Hmm, yes the commit is wrong, urlize() will not always escape.
Can you provide the intended behaviour of urlize()? It will then be easier to fix when we have a set goal. Currently it is not totally clear what is the correct behaviour of urlize() and smart_urlquote().
comment:6 by , 11 years ago
The formal description is whatever is documented. However, developers may depend on other details of its behaviour, which we should not change if it is not necessary.
comment:7 by , 11 years ago
By which I don't mean that this particular issue should not be fixed anymore, in case I was ambiguous - it is definitely a bug.
comment:8 by , 11 years ago
Then I will suggest to return None from smart_urlquote()
in case of any exception, then urlize()
will not think the string as an url and will show text only.
Also not only path, but also query parameter and fragment should be quoted (because they may contain ").
For query string, we may first parse_qs()
and then individually encode each key value pairs and remake the query string.
For fragment, We can do the same as path (unquoting and quoting again). Not perfect solution though because some sites store query parameters in fragments.
comment:9 by , 11 years ago
comment:10 by , 11 years ago
Component: | Template system → Utilities |
---|
That commit is built upon your previous patch, so it's very hard for me to review. At first sight, I don't think that inline function is really necessary, given how simple it is. But the rest I can't really review in this way. Also, it would be good to have some tests for how urlize handles the invalid URLs for which you now want to return None, as that behaviour should not change.
comment:11 by , 11 years ago
You might want to have a look at #22223 and http://bugs.python.org/issue2637 regarding which characters should be safe for which segments of the URL.
comment:12 by , 11 years ago
Read the bug and ticket.
From RFC 2396 (http://www.ietf.org/rfc/rfc2396.txt), valid characters for different parts of the url are following:
path => Alphanumeric and /;-_.!~*'():@&=+$,
query and fragment => Alphanumeric and /;-_.!~*'():@&=+$,?
(only differs from path with an extra ?
)
Can we not unquote and quote again because it can introduce some corner cases.
We can check if the url is already quoted or not (which was done till v1.4).
comment:13 by , 11 years ago
Has patch: | set |
---|---|
Needs tests: | set |
Owner: | changed from | to
Status: | new → assigned |
Can you look at my patch? Merged all the changes into this one - https://github.com/ienzam/django/commit/be378e404bedd169bb48ab945e5fd90e90a00d1e
comment:15 by , 11 years ago
I'm just curious, is there a precedent for returning None when there's an error? Normally if there's an exception, it should bubble up, or a more specific one should be raised.
comment:16 by , 11 years ago
If None is returned then
https://github.com/ienzam/django/blob/be378e404bedd169bb48ab945e5fd90e90a00d1e/django/utils/html.py#L274
"if url:" will be false and the invalid url will be treated as text.
comment:17 by , 11 years ago
Needs tests: | unset |
---|---|
Patch needs improvement: | unset |
I have chosen a slightly different approach, keeping the unquote/quote process but on the separate parts as ienzam proposed.
https://github.com/django/django/pull/2902
comment:19 by , 10 years ago
Patch needs improvement: | unset |
---|
comment:20 by , 10 years ago
Triage Stage: | Accepted → Ready for checkin |
---|
@meenzam, perhaps you'd like to review this as well before we merge it.
comment:21 by , 10 years ago
I will test the with the tests from here tomorrow - https://github.com/ienzam/django/commit/be378e404bedd169bb48ab945e5fd90e90a00d1e
comment:22 by , 10 years ago
Thank you for doing this. This patch fixes all of the problems I was facing.
One concern is though about the url: http://example.com/%C3%B6/ä/
The last part of the url ä
implies that this url is not quoted so the full url should be quoted - http://example.com/%25C3%25B6/%C3%A4/
Currently it becomes: http://example.com/%C3%B6/%C3%A4/
But maybe this is not a real life scenario (hopefully).
comment:23 by , 10 years ago
Interesting case, but I think this should be a separate issue. This ticket is fixing the query string problem.
Your example might be solved by short-circuiting the unquote
part if the URL contains any non-safe character.
comment:24 by , 10 years ago
Cc: | added |
---|
comment:25 by , 10 years ago
Resolution: | → fixed |
---|---|
Status: | assigned → closed |
It would seem the purpose of its current implementation is that a URL which is already quoted is not quoted again, but an unquoted URL will be quoted. The context in which this is applied, is django.utils.html.urlize. This was added in #9655 (https://github.com/django/django/commit/e3a7bfccbb83712caf0645e4e33f5c03d9dc462b), although at that time with a little more code than it's current version.
The characters you are expecting to be escaped are currently marked as safe. Otherwise, even
http://example.com?q=
would already have the URL encoding applied to it. At first sight, a solution seems to be to makesmart_urlquote
even smarter, and only consider the currently listed characters safe if they are not within a URL parameter themselves. I agree that this is a bug after all.