Code

Opened 6 years ago

Closed 3 years ago

#9655 closed Bug (fixed)

urlize should not quote already quoted urls.

Reported by: flytwokites Owned by: aaugustin
Component: Template system Version: 1.0
Severity: Normal Keywords: urlize
Cc: towjzhou@…, cmlenz@…, lamotte85@… Triage Stage: Accepted
Has patch: yes Needs documentation: yes
Needs tests: yes Patch needs improvement: no
Easy pickings: yes UI/UX: no

Description

This template:

{{ user.homepage:urlize }}

When user.homepage is "http://hi.baidu.com/%D6%D8%D0%C2%BF%AA%CA%BC",
the output will be "http://hi.baidu.com/%25D6%25D8%25D0%25C2%25BF%25AA%25CA%25BC",
It quoted '%' char in source url and outputs an invalid url.

Attachments (2)

9655.patch (2.1 KB) - added by aaugustin 3 years ago.
9655-2.diff (3.2 KB) - added by claudep 3 years ago.
Same patch with test

Download all attachments as: .zip

Change History (22)

comment:1 Changed 6 years ago by kmtracey

  • Needs documentation unset
  • Needs tests unset
  • Patch needs improvement unset
  • Resolution set to invalid
  • Status changed from new to closed

urlize works with plain text:

http://docs.djangoproject.com/en/dev/ref/templates/builtins/#urlize

You are not passing plain text here, you are passing something that has already been percent-encoded. There's no way for urlize to know that, so it percent-encodes the link value, which it must do in order to generate correct urls for data that has not already been percent-encoded. If you want to use urlize, you need to pass it a not-percent-quoted, plain text value to work with.

comment:2 Changed 6 years ago by mtredinnick

  • Resolution invalid deleted
  • Status changed from closed to reopened

With all due respect to Karen's logic, I'm going to reopen this. I can't really reconcile the current behaviour as logical. We already treat a few particular characters as "safe" in URLs -- those marked specifically with "http://" or "https://" prefixes. It's not unreasonable to say that those particular strings (with those prefixes) are already correctly written and all we are doing is converting them to links. It looks like just an oversight that we don't have "%" in that special set.

comment:3 Changed 6 years ago by kmtracey

Hmm...so you are saying if the input value already has a http[s]: prefix that the value should be treated as if it were already percent-encoded? Thus urlize for something like Big savings available when you visit www.mystore.com/30%OffCoupons! would percent-encode the percent, but Big savings available when you visit http://www.mystore.com/30%OffCoupons! would not?

comment:4 Changed 6 years ago by mtredinnick

  • Triage Stage changed from Unreviewed to Design decision needed

I'm not sure what to do in your first case. Maybe it should be treated the same way, too. Needs thought. This is certainly a bit of an edge-case, but I can see the use-case where a URL is cut-and-pasted into an email or other text and that text is run through the urlize filter before being displayed. It's more likely than not that it will already be encoded at that point.

A URL of "www.mystore.com/30%OffCoupons" where the "%" is meant to be treated as unencoded is actually invalid (the URL spec includes the encoding requirement). So we need to have a guess at where the balance of use-cases goes: are people going to be using "real" URLs or things they hashed together themselves? We'll need some coins and a bit of time to toss them (although obviously I'm leaning slightly towards favouring the cut-and-paste scenario at the moment).

I just don't see this as an immediate "definitely don't change" case, which is my logic for reopening. We might ultimately decide it's not worth changing, but we at least should work out how people are meant to operate with such URLs. I will move it to "design decision needed", though, which I should have done yesterday, so we remember to actually make a decision if it slips under the radar.

comment:6 Changed 5 years ago by Christopher Lenz <cmlenz@…>

  • Cc cmlenz@… added

I just came across this bug in a Django app of ours. Users copy and paste URLs from the location bar or from elsewhere into a text field, and when those URLs are already encoded (which is very likely as far as I can tell), urlize is often going to break the link (bad). I've had to copy and modify the whole urlize function and filter to get rid of the quoting.

Why not just leave the URL alone, and HTML-escape it (both in the href attribute and the link text) to fend off potential XSS problems. If the original link did indeed require encoding, the user agent will take care of that, and the link will at least work. The author can still go in and paste a properly encoded link if she's a nitpicking-validator-using power user.

At the very least, it should be possible to turn off the quoting (another keyword arg on django.utils.html.urlize). But I still think the current behavior does not make sense and should be changed.

comment:7 Changed 5 years ago by ptarjan

What I really want is to have this in my template array :

http://www.wowarmory.com/character-achievements.xml?r=Mal%27Ganis&n=Vosk

And then have this rendered by urlize :

<a href="http://www.wowarmory.com/character-sheet.xml?r=Mal%27Ganis&amp;cn=Vosk" rel="nofollow">http://www.wowarmory.com/character-sheet.xml?r=Mal&#39;Ganis&amp;cn=Vosk</a>

Are there currently any options to urlize to make this use case possible? Or at least a custom filter that I can use?

comment:8 Changed 5 years ago by ptarjan

I think something like this custom filter does what I want, but I would prefer if urlize did this by default (because then I get mailto: linking and url detection for free) :

from django import template
from django.utils.safestring import mark_safe
import urllib
register = template.Library()

@register.filter(name="myurlize")
def myurlize(url) :

r = '<a href="%s">%s</a>' % (url, urllib.unquote(url))
return mark_safe(r)

comment:9 Changed 5 years ago by ptarjan

bad formatting, sorry

from django import template
from django.utils.safestring import mark_safe
import urllib
register = template.Library()

@register.filter(name="myurlize")
def myurlize(url) :
    r = '<a href="%s">%s</a>' % (url, urllib.unquote(url))
    return mark_safe(r)                       

comment:10 Changed 5 years ago by bugmenot

I agree with the people who want this changed. Beeing able to copy-paste an url containing % from the address bar is more important then beeing able to manually write an url containing % which is an uncommon use case.

The fix seems to be replacing safe='/&=:;#?+*' with safe='/%&=:;#?+*' in django.utils.html.py.

comment:11 Changed 4 years ago by myneur

Please, think in the way of most common usecase instead of "theoretical correctness" or what's your point(?).

Most common usecase is that user copy-paste address from his browser and these links needs to be translated.

  • in urlize, this case doesn't work at all


  • there is absolutely no way how the current urlize can work with a space in link (common in linked google documents) (plain text space breaks detection, and encoded space %20 is replaced into %2520).


  • there is no way why to use this function if it doesn't work:(

What is usecase you have built urlize for?
Does anybody do anything else than copypasting links?

Thank you for rethinking this.

comment:12 Changed 3 years ago by lukeplant

  • Severity set to Normal
  • Type set to Bug

comment:13 Changed 3 years ago by julien

  • Easy pickings unset

#16020 is a dupe and has a more recent discussion and patch.

comment:14 Changed 3 years ago by Dan LaMotte <lamotte85@…>

  • Cc lamotte85@… added
  • UI/UX unset

Seems like this has already been re-considered, but I'd like to add my $0.02. I also came across this problem where someone copy pasted a URL and %20 was encoded %2520 which makes the URL unusable. My argument against re-encoding is simple:

  • By changing the behavior, the use case: http://www.mystore.com/30%OffCoupons! can be worked around by encoding the URL by hand (stick it in your address bar, then copy/paste it)
  • The current behavior cannot be worked around, it renders URLs entirely useless

I feel that while the other comments lead someone to believe that copy/paste is the majority case as an argument (as it may well be), that more importantly, the current behavior cannot be worked around without code change. Given that, will the current behavior be changed?

comment:15 Changed 3 years ago by aaugustin

  • Owner changed from nobody to aaugustin
  • Status changed from reopened to new
  • Triage Stage changed from Design decision needed to Accepted

We have the following options here:

  • keep the current behavior: escape everything
  • keep URLs as is: escape nothing
  • be smart: try to detect if the URL is already URL-encoded, and if it is, don't re-encode it.

Currently, URL escaping is performed with urlquote(url, safe='/&=:;#?+*') (where urlquote from django.utils.http is basically a lazy + smart version of urllib.quote).

Instead, we could test if the URL contains an URL escape (%xx with xx in [0-9A-F]); if it does, assume it's already escaped; otherwise, escape with urllib.quote.

Obviously, the problem is ill-defined. We have no way of knowing if the input is or isn't URL-encoded. A quick test shows that browsers do something akin to the above. If I type google.com/search?q=foo bar in the URL bar, it gets encoded to google.com/search?q=foo%20bar. If I type google.com/search?q=foo%20bar, it doesn't get encoded.

So, even though being smart often isn't a good idea, in this case I believe it's the best solution.


PS: I've researched the topic of URL escaping quite thoroughly a few years ago and here's what I had concluded:

Based on RFC 1738:
   [ <>"#%{}|\^~[]`] are unsafe and must be encoded
   [;/?:@=&]         are reserved and must be encoded
                     unless they are used for their special meaning
                     we assume that they are encoded in input if necessary
   [$-_.+!*'(),]     are safe and need not be encoded

In comparison, Django considers /&=:;#?+* as safe characters — in addition to _.- which Python always considers safe. I think the list of safe characters should be extended to ;/?:@=&$-_.+!*'(), while we're working on this.

comment:16 Changed 3 years ago by aaugustin

For the record, the escaping was introduced at r7079 to fix #6279 and #6514. However, those tickets were about HTML-escaping (<>&'"), not URL-encoding.

comment:17 Changed 3 years ago by aaugustin

Changed 3 years ago by aaugustin

comment:19 Changed 3 years ago by aaugustin

  • Easy pickings set
  • Has patch set
  • Needs documentation set
  • Needs tests set

Changed 3 years ago by claudep

Same patch with test

comment:20 Changed 3 years ago by aaugustin

  • Resolution set to fixed
  • Status changed from new to closed

In [17347]:

Fixed #9655 -- Prevented the urlize template filter from double-quoting URLs. Thanks Claude Paroz for writing the tests.

Add Comment

Modify Ticket

Change Properties
<Author field>
Action
as closed
as The resolution will be set. Next status will be 'closed'
The resolution will be deleted. Next status will be 'new'
Author


E-mail address and user name can be saved in the Preferences.

 
Note: See TracTickets for help on using tickets.