Opened 11 years ago

Closed 10 years ago

Last modified 10 years ago

#22267 closed Bug (fixed)

django.utils.html.smart_urlquote() is incorrectly unquoting the url

Reported by: meenzam@… Owned by: Md. Enzam Hossain
Component: Utilities Version: dev
Severity: Normal Keywords:
Cc: meenzam@…, Kevin Brown Triage Stage: Ready for checkin
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

When we give django.utils.html.smart_urlquote() an url with an already embedded query as a query paramater, it is unquoting it which it should not do.

For example,
smart_urlquote('http://example.com/?q=http%3A%2F%2Fexample.com%2F%3Fx%3D1%26q%3Ddjango')
returns
'http://example.com/?q=http://example.com/?x=1&q=django'
which has the query parameters {'q': ['http://example.com/?x=1', 'django']}

where it should have returned the exact url back
'http://example.com/?q=http%3A%2F%2Fexample.com%2F%3Fx%3D1%26q%3Ddjango'
which has the query parameters {'q': ['http://example.com/?x=1&q=django']}

I have written a test here - https://github.com/ienzam/django/blob/master/tests/utils_tests/test_html.py#L203

        # Ensure that embedded quoted url does not get unquoted
        self.assertEqual(
            quote('http://example.com/?q=http%3A%2F%2Fexample.com%2F%3Fx%3D1%26q%3Ddjango'),
            'http://example.com/?q=http%3A%2F%2Fexample.com%2F%3Fx%3D1%26q%3Ddjango')

Change History (30)

comment:1 by Sasha Romijn, 11 years ago

Triage Stage: UnreviewedAccepted
Version: 1.5master

It would seem the purpose of its current implementation is that a URL which is already quoted is not quoted again, but an unquoted URL will be quoted. The context in which this is applied, is django.utils.html.urlize. This was added in #9655 (https://github.com/django/django/commit/e3a7bfccbb83712caf0645e4e33f5c03d9dc462b), although at that time with a little more code than it's current version.

The characters you are expecting to be escaped are currently marked as safe. Otherwise, even http://example.com?q= would already have the URL encoding applied to it. At first sight, a solution seems to be to make smart_urlquote even smarter, and only consider the currently listed characters safe if they are not within a URL parameter themselves. I agree that this is a bug after all.

comment:3 by meenzam@…, 11 years ago

Cc: meenzam@… added
Has patch: set

The simple fix will be to unquote and quote only the path of the url, I have

https://github.com/ienzam/django/commit/692ac89689dd9ddf5895bfaf0181434c5daf81e9

This passes the current unit tests.

comment:4 by Sasha Romijn, 11 years ago

Patch needs improvement: set

In that patch however, the conditions are altered: quoting is not performed if any exception is raised and caught. This does not appear to be part of the tests, so they still all succeed.

comment:5 by meenzam@…, 11 years ago

Has patch: unset

Hmm, yes the commit is wrong, urlize() will not always escape.

Can you provide the intended behaviour of urlize()? It will then be easier to fix when we have a set goal. Currently it is not totally clear what is the correct behaviour of urlize() and smart_urlquote().

comment:6 by Sasha Romijn, 11 years ago

The formal description is whatever is documented. However, developers may depend on other details of its behaviour, which we should not change if it is not necessary.

comment:7 by Sasha Romijn, 11 years ago

By which I don't mean that this particular issue should not be fixed anymore, in case I was ambiguous - it is definitely a bug.

comment:8 by meenzam@…, 11 years ago

Then I will suggest to return None from smart_urlquote() in case of any exception, then urlize() will not think the string as an url and will show text only.
Also not only path, but also query parameter and fragment should be quoted (because they may contain ").

For query string, we may first parse_qs() and then individually encode each key value pairs and remake the query string.
For fragment, We can do the same as path (unquoting and quoting again). Not perfect solution though because some sites store query parameters in fragments.

comment:10 by Sasha Romijn, 11 years ago

Component: Template systemUtilities

That commit is built upon your previous patch, so it's very hard for me to review. At first sight, I don't think that inline function is really necessary, given how simple it is. But the rest I can't really review in this way. Also, it would be good to have some tests for how urlize handles the invalid URLs for which you now want to return None, as that behaviour should not change.

comment:11 by Sasha Romijn, 11 years ago

You might want to have a look at #22223 and http://bugs.python.org/issue2637 regarding which characters should be safe for which segments of the URL.

comment:12 by Md. Enzam Hossain, 11 years ago

Read the bug and ticket.
From RFC 2396 (http://www.ietf.org/rfc/rfc2396.txt), valid characters for different parts of the url are following:
path => Alphanumeric and /;-_.!~*'():@&=+$,
query and fragment => Alphanumeric and /;-_.!~*'():@&=+$,? (only differs from path with an extra ?)

Can we not unquote and quote again because it can introduce some corner cases.
We can check if the url is already quoted or not (which was done till v1.4).

comment:13 by Md. Enzam Hossain, 11 years ago

Has patch: set
Needs tests: set
Owner: changed from nobody to Md. Enzam Hossain
Status: newassigned

Can you look at my patch? Merged all the changes into this one - https://github.com/ienzam/django/commit/be378e404bedd169bb48ab945e5fd90e90a00d1e

comment:14 by meenzam@…, 11 years ago

Any update?

comment:15 by Ben Davis, 11 years ago

I'm just curious, is there a precedent for returning None when there's an error? Normally if there's an exception, it should bubble up, or a more specific one should be raised.

comment:16 by Md. Enzam Hossain, 10 years ago

If None is returned then
https://github.com/ienzam/django/blob/be378e404bedd169bb48ab945e5fd90e90a00d1e/django/utils/html.py#L274
"if url:" will be false and the invalid url will be treated as text.

comment:17 by Claude Paroz, 10 years ago

Needs tests: unset
Patch needs improvement: unset

I have chosen a slightly different approach, keeping the unquote/quote process but on the separate parts as ienzam proposed.
https://github.com/django/django/pull/2902

comment:18 by Tim Graham, 10 years ago

Patch needs improvement: set

PR has some test failures.

comment:19 by Claude Paroz, 10 years ago

Patch needs improvement: unset

comment:20 by Tim Graham, 10 years ago

Triage Stage: AcceptedReady for checkin

@meenzam, perhaps you'd like to review this as well before we merge it.

comment:21 by Md. Enzam Hossain, 10 years ago

comment:22 by Md. Enzam Hossain, 10 years ago

Thank you for doing this. This patch fixes all of the problems I was facing.

One concern is though about the url: http://example.com/%C3%B6/ä/
The last part of the url ä implies that this url is not quoted so the full url should be quoted - http://example.com/%25C3%25B6/%C3%A4/
Currently it becomes: http://example.com/%C3%B6/%C3%A4/
But maybe this is not a real life scenario (hopefully).

comment:23 by Claude Paroz, 10 years ago

Interesting case, but I think this should be a separate issue. This ticket is fixing the query string problem.

Your example might be solved by short-circuiting the unquote part if the URL contains any non-safe character.

comment:24 by Kevin Brown, 10 years ago

Cc: Kevin Brown added

comment:25 by Claude Paroz <claude@…>, 10 years ago

Resolution: fixed
Status: assignedclosed

In 4b8a1d2c0d1a8c5107f3aef01597db78d2a2a5ce:

Fixed #22267 -- Fixed unquote/quote in smart_urlquote

Thanks Md. Enzam Hossain for the report and initial patch, and
Tim Graham for the review.

comment:26 by Claude Paroz <claude@…>, 10 years ago

In b9d9287f59eb5c33dd8bc81179b4cf197fd54456:

Fixed urlize after smart_urlquote rewrite

Refs #22267.

comment:27 by Claude Paroz <claude@…>, 10 years ago

In ec808e807ad711b993f199f6b4165ac6d0e1125b:

Fixed urlize regression with entities in query strings

Refs #22267.
Thanks Shai Berger for spotting the issue and Tim Graham for the
initial patch.

comment:28 by Claude Paroz <claude@…>, 10 years ago

In ac07890f959c467b3fc9c6dd6d36aafc2eff1fcc:

[1.8.x] Fixed urlize regression with entities in query strings

Refs #22267.
Thanks Shai Berger for spotting the issue and Tim Graham for the
initial patch.
Backport of ec808e807 from master.

comment:29 by Tim Graham <timograham@…>, 10 years ago

In 7b1a67cce52e5c191fbfa1bca501c6f0222db019:

Fixed escaping regression in urlize filter.

Now that the URL is always unescaped as of refs #22267,
we should re-escape it before inserting it into the anchor.

comment:30 by Tim Graham <timograham@…>, 10 years ago

In aba74d6f1e221c10be2798f55971d710521cb404:

[1.8.x] Fixed escaping regression in urlize filter.

Now that the URL is always unescaped as of refs #22267,
we should re-escape it before inserting it into the anchor.

Backport of 7b1a67cce52e5c191fbfa1bca501c6f0222db019 from master

Note: See TracTickets for help on using tickets.
Back to Top