Opened 15 months ago

Closed 9 months ago

Last modified 2 months ago

#22267 closed Bug (fixed)

django.utils.html.smart_urlquote() is incorrectly unquoting the url

Reported by: meenzam@… Owned by: ienzam
Component: Utilities Version: master
Severity: Normal Keywords:
Cc: meenzam@…, kevin-brown Triage Stage: Ready for checkin
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

When we give django.utils.html.smart_urlquote() an url with an already embedded query as a query paramater, it is unquoting it which it should not do.

For example,
smart_urlquote('http://example.com/?q=http%3A%2F%2Fexample.com%2F%3Fx%3D1%26q%3Ddjango')
returns
'http://example.com/?q=http://example.com/?x=1&q=django'
which has the query parameters {'q': ['http://example.com/?x=1', 'django']}

where it should have returned the exact url back
'http://example.com/?q=http%3A%2F%2Fexample.com%2F%3Fx%3D1%26q%3Ddjango'
which has the query parameters {'q': ['http://example.com/?x=1&q=django']}

I have written a test here - https://github.com/ienzam/django/blob/master/tests/utils_tests/test_html.py#L203

        # Ensure that embedded quoted url does not get unquoted
        self.assertEqual(
            quote('http://example.com/?q=http%3A%2F%2Fexample.com%2F%3Fx%3D1%26q%3Ddjango'),
            'http://example.com/?q=http%3A%2F%2Fexample.com%2F%3Fx%3D1%26q%3Ddjango')

Change History (30)

comment:1 Changed 15 months ago by erikr

  • Needs documentation unset
  • Needs tests unset
  • Patch needs improvement unset
  • Triage Stage changed from Unreviewed to Accepted
  • Version changed from 1.5 to master

It would seem the purpose of its current implementation is that a URL which is already quoted is not quoted again, but an unquoted URL will be quoted. The context in which this is applied, is django.utils.html.urlize. This was added in #9655 (https://github.com/django/django/commit/e3a7bfccbb83712caf0645e4e33f5c03d9dc462b), although at that time with a little more code than it's current version.

The characters you are expecting to be escaped are currently marked as safe. Otherwise, even http://example.com?q= would already have the URL encoding applied to it. At first sight, a solution seems to be to make smart_urlquote even smarter, and only consider the currently listed characters safe if they are not within a URL parameter themselves. I agree that this is a bug after all.

comment:3 Changed 15 months ago by meenzam@…

  • Cc meenzam@… added
  • Has patch set

The simple fix will be to unquote and quote only the path of the url, I have

https://github.com/ienzam/django/commit/692ac89689dd9ddf5895bfaf0181434c5daf81e9

This passes the current unit tests.

comment:4 Changed 15 months ago by erikr

  • Patch needs improvement set

In that patch however, the conditions are altered: quoting is not performed if any exception is raised and caught. This does not appear to be part of the tests, so they still all succeed.

comment:5 Changed 15 months ago by meenzam@…

  • Has patch unset

Hmm, yes the commit is wrong, urlize() will not always escape.

Can you provide the intended behaviour of urlize()? It will then be easier to fix when we have a set goal. Currently it is not totally clear what is the correct behaviour of urlize() and smart_urlquote().

comment:6 Changed 15 months ago by erikr

The formal description is whatever is documented. However, developers may depend on other details of its behaviour, which we should not change if it is not necessary.

comment:7 Changed 15 months ago by erikr

By which I don't mean that this particular issue should not be fixed anymore, in case I was ambiguous - it is definitely a bug.

comment:8 Changed 15 months ago by meenzam@…

Then I will suggest to return None from smart_urlquote() in case of any exception, then urlize() will not think the string as an url and will show text only.
Also not only path, but also query parameter and fragment should be quoted (because they may contain ").

For query string, we may first parse_qs() and then individually encode each key value pairs and remake the query string.
For fragment, We can do the same as path (unquoting and quoting again). Not perfect solution though because some sites store query parameters in fragments.

comment:10 Changed 14 months ago by erikr

  • Component changed from Template system to Utilities

That commit is built upon your previous patch, so it's very hard for me to review. At first sight, I don't think that inline function is really necessary, given how simple it is. But the rest I can't really review in this way. Also, it would be good to have some tests for how urlize handles the invalid URLs for which you now want to return None, as that behaviour should not change.

comment:11 Changed 14 months ago by erikr

You might want to have a look at #22223 and http://bugs.python.org/issue2637 regarding which characters should be safe for which segments of the URL.

comment:12 Changed 14 months ago by ienzam

Read the bug and ticket.
From RFC 2396 (http://www.ietf.org/rfc/rfc2396.txt), valid characters for different parts of the url are following:
path => Alphanumeric and /;-_.!~*'():@&=+$,
query and fragment => Alphanumeric and /;-_.!~*'():@&=+$,? (only differs from path with an extra ?)

Can we not unquote and quote again because it can introduce some corner cases.
We can check if the url is already quoted or not (which was done till v1.4).

comment:13 Changed 14 months ago by ienzam

  • Has patch set
  • Needs tests set
  • Owner changed from nobody to ienzam
  • Status changed from new to assigned

Can you look at my patch? Merged all the changes into this one - https://github.com/ienzam/django/commit/be378e404bedd169bb48ab945e5fd90e90a00d1e

comment:14 Changed 13 months ago by meenzam@…

Any update?

comment:15 Changed 11 months ago by bendavis78

I'm just curious, is there a precedent for returning None when there's an error? Normally if there's an exception, it should bubble up, or a more specific one should be raised.

comment:16 Changed 11 months ago by ienzam

If None is returned then
https://github.com/ienzam/django/blob/be378e404bedd169bb48ab945e5fd90e90a00d1e/django/utils/html.py#L274
"if url:" will be false and the invalid url will be treated as text.

comment:17 Changed 11 months ago by claudep

  • Needs tests unset
  • Patch needs improvement unset

I have chosen a slightly different approach, keeping the unquote/quote process but on the separate parts as ienzam proposed.
https://github.com/django/django/pull/2902

comment:18 Changed 10 months ago by timo

  • Patch needs improvement set

PR has some test failures.

comment:19 Changed 10 months ago by claudep

  • Patch needs improvement unset

comment:20 Changed 9 months ago by timgraham

  • Triage Stage changed from Accepted to Ready for checkin

@meenzam, perhaps you'd like to review this as well before we merge it.

comment:21 Changed 9 months ago by ienzam

comment:22 Changed 9 months ago by ienzam

Thank you for doing this. This patch fixes all of the problems I was facing.

One concern is though about the url: http://example.com/%C3%B6/ä/
The last part of the url ä implies that this url is not quoted so the full url should be quoted - http://example.com/%25C3%25B6/%C3%A4/
Currently it becomes: http://example.com/%C3%B6/%C3%A4/
But maybe this is not a real life scenario (hopefully).

comment:23 Changed 9 months ago by claudep

Interesting case, but I think this should be a separate issue. This ticket is fixing the query string problem.

Your example might be solved by short-circuiting the unquote part if the URL contains any non-safe character.

comment:24 Changed 9 months ago by kevin-brown

  • Cc kevin-brown added

comment:25 Changed 9 months ago by Claude Paroz <claude@…>

  • Resolution set to fixed
  • Status changed from assigned to closed

In 4b8a1d2c0d1a8c5107f3aef01597db78d2a2a5ce:

Fixed #22267 -- Fixed unquote/quote in smart_urlquote

Thanks Md. Enzam Hossain for the report and initial patch, and
Tim Graham for the review.

comment:26 Changed 9 months ago by Claude Paroz <claude@…>

In b9d9287f59eb5c33dd8bc81179b4cf197fd54456:

Fixed urlize after smart_urlquote rewrite

Refs #22267.

comment:27 Changed 3 months ago by Claude Paroz <claude@…>

In ec808e807ad711b993f199f6b4165ac6d0e1125b:

Fixed urlize regression with entities in query strings

Refs #22267.
Thanks Shai Berger for spotting the issue and Tim Graham for the
initial patch.

comment:28 Changed 3 months ago by Claude Paroz <claude@…>

In ac07890f959c467b3fc9c6dd6d36aafc2eff1fcc:

[1.8.x] Fixed urlize regression with entities in query strings

Refs #22267.
Thanks Shai Berger for spotting the issue and Tim Graham for the
initial patch.
Backport of ec808e807 from master.

comment:29 Changed 2 months ago by Tim Graham <timograham@…>

In 7b1a67cce52e5c191fbfa1bca501c6f0222db019:

Fixed escaping regression in urlize filter.

Now that the URL is always unescaped as of refs #22267,
we should re-escape it before inserting it into the anchor.

comment:30 Changed 2 months ago by Tim Graham <timograham@…>

In aba74d6f1e221c10be2798f55971d710521cb404:

[1.8.x] Fixed escaping regression in urlize filter.

Now that the URL is always unescaped as of refs #22267,
we should re-escape it before inserting it into the anchor.

Backport of 7b1a67cce52e5c191fbfa1bca501c6f0222db019 from master

Note: See TracTickets for help on using tickets.
Back to Top