Opened 16 years ago
Closed 13 years ago
#9655 closed Bug (fixed)
urlize should not quote already quoted urls.
Reported by: | flytwokites | Owned by: | Aymeric Augustin |
---|---|---|---|
Component: | Template system | Version: | 1.0 |
Severity: | Normal | Keywords: | urlize |
Cc: | towjzhou@…, cmlenz@…, lamotte85@… | Triage Stage: | Accepted |
Has patch: | yes | Needs documentation: | yes |
Needs tests: | yes | Patch needs improvement: | no |
Easy pickings: | yes | UI/UX: | no |
Description
This template:
{{ user.homepage:urlize }}
When user.homepage is "http://hi.baidu.com/%D6%D8%D0%C2%BF%AA%CA%BC",
the output will be "http://hi.baidu.com/%25D6%25D8%25D0%25C2%25BF%25AA%25CA%25BC",
It quoted '%' char in source url and outputs an invalid url.
Attachments (2)
Change History (22)
comment:1 by , 16 years ago
Resolution: | → invalid |
---|---|
Status: | new → closed |
comment:2 by , 16 years ago
Resolution: | invalid |
---|---|
Status: | closed → reopened |
With all due respect to Karen's logic, I'm going to reopen this. I can't really reconcile the current behaviour as logical. We already treat a few particular characters as "safe" in URLs -- those marked specifically with "http://" or "https://" prefixes. It's not unreasonable to say that those particular strings (with those prefixes) are already correctly written and all we are doing is converting them to links. It looks like just an oversight that we don't have "%" in that special set.
comment:3 by , 16 years ago
Hmm...so you are saying if the input value already has a http[s]: prefix that the value should be treated as if it were already percent-encoded? Thus urlize for something like Big savings available when you visit www.mystore.com/30%OffCoupons!
would percent-encode the percent, but Big savings available when you visit http://www.mystore.com/30%OffCoupons!
would not?
comment:4 by , 16 years ago
Triage Stage: | Unreviewed → Design decision needed |
---|
I'm not sure what to do in your first case. Maybe it should be treated the same way, too. Needs thought. This is certainly a bit of an edge-case, but I can see the use-case where a URL is cut-and-pasted into an email or other text and that text is run through the urlize filter before being displayed. It's more likely than not that it will already be encoded at that point.
A URL of "www.mystore.com/30%OffCoupons" where the "%" is meant to be treated as unencoded is actually invalid (the URL spec includes the encoding requirement). So we need to have a guess at where the balance of use-cases goes: are people going to be using "real" URLs or things they hashed together themselves? We'll need some coins and a bit of time to toss them (although obviously I'm leaning slightly towards favouring the cut-and-paste scenario at the moment).
I just don't see this as an immediate "definitely don't change" case, which is my logic for reopening. We might ultimately decide it's not worth changing, but we at least should work out how people are meant to operate with such URLs. I will move it to "design decision needed", though, which I should have done yesterday, so we remember to actually make a decision if it slips under the radar.
comment:5 by , 16 years ago
A case for changing the current behavior and treating specially aready quoted urls: currently if urlize is applied to a text, it is impossible to write correct links to some Google Chart API charts. For example, http://chart.apis.google.com/chart?cht=p&chtt=Percentage%20of%20Google%20Chart%20Which%20Resembles%20Pac-man&chs=550x250&chd=t:10,80,10&chco=FAFAFA,FFFF00,FAFAFA&chl=Does%20not%20resemble%20Pac-man|Resembles%20Pac-man will be always linked as http://chart.apis.google.com/chart?cht=p&chtt=Percentage%2520of%2520Google%2520Chart%2520Which%2520Resembles%2520Pac-man&chs=550x250&chd=t:10%2C80%2C10&chco=FAFAFA%2CFFFF00%2CFAFAFA&chl=Does%2520not%2520resemble%2520Pac-man%7CResembles%2520Pac-man
comment:6 by , 16 years ago
Cc: | added |
---|
I just came across this bug in a Django app of ours. Users copy and paste URLs from the location bar or from elsewhere into a text field, and when those URLs are already encoded (which is very likely as far as I can tell), urlize
is often going to break the link (bad). I've had to copy and modify the whole urlize
function and filter to get rid of the quoting.
Why not just leave the URL alone, and HTML-escape it (both in the href
attribute and the link text) to fend off potential XSS problems. If the original link did indeed require encoding, the user agent will take care of that, and the link will at least work. The author can still go in and paste a properly encoded link if she's a nitpicking-validator-using power user.
At the very least, it should be possible to turn off the quoting (another keyword arg on django.utils.html.urlize
). But I still think the current behavior does not make sense and should be changed.
comment:7 by , 16 years ago
What I really want is to have this in my template array :
http://www.wowarmory.com/character-achievements.xml?r=Mal%27Ganis&n=Vosk
And then have this rendered by urlize :
<a href="http://www.wowarmory.com/character-sheet.xml?r=Mal%27Ganis&cn=Vosk" rel="nofollow">http://www.wowarmory.com/character-sheet.xml?r=Mal'Ganis&cn=Vosk</a>
Are there currently any options to urlize to make this use case possible? Or at least a custom filter that I can use?
comment:8 by , 16 years ago
I think something like this custom filter does what I want, but I would prefer if urlize did this by default (because then I get mailto: linking and url detection for free) :
from django import template
from django.utils.safestring import mark_safe
import urllib
register = template.Library()
@register.filter(name="myurlize")
def myurlize(url) :
r = '<a href="%s">%s</a>' % (url, urllib.unquote(url))
return mark_safe(r)
comment:9 by , 16 years ago
bad formatting, sorry
from django import template from django.utils.safestring import mark_safe import urllib register = template.Library() @register.filter(name="myurlize") def myurlize(url) : r = '<a href="%s">%s</a>' % (url, urllib.unquote(url)) return mark_safe(r)
comment:10 by , 15 years ago
I agree with the people who want this changed. Beeing able to copy-paste an url containing % from the address bar is more important then beeing able to manually write an url containing % which is an uncommon use case.
The fix seems to be replacing safe='/&=:;#?+*'
with safe='/%&=:;#?+*'
in django.utils.html.py
.
comment:11 by , 15 years ago
Please, think in the way of most common usecase instead of "theoretical correctness" or what's your point(?).
Most common usecase is that user copy-paste address from his browser and these links needs to be translated.
- in urlize, this case doesn't work at all
- there is absolutely no way how the current urlize can work with a space in link (common in linked google documents) (plain text space breaks detection, and encoded space %20 is replaced into %2520).
- there is no way why to use this function if it doesn't work:(
What is usecase you have built urlize for?
Does anybody do anything else than copypasting links?
Thank you for rethinking this.
comment:12 by , 14 years ago
Severity: | → Normal |
---|---|
Type: | → Bug |
comment:13 by , 14 years ago
Easy pickings: | unset |
---|
#16020 is a dupe and has a more recent discussion and patch.
comment:14 by , 13 years ago
Cc: | added |
---|---|
UI/UX: | unset |
Seems like this has already been re-considered, but I'd like to add my $0.02. I also came across this problem where someone copy pasted a URL and %20 was encoded %2520 which makes the URL unusable. My argument against re-encoding is simple:
- By changing the behavior, the use case: http://www.mystore.com/30%OffCoupons! can be worked around by encoding the URL by hand (stick it in your address bar, then copy/paste it)
- The current behavior cannot be worked around, it renders URLs entirely useless
I feel that while the other comments lead someone to believe that copy/paste is the majority case as an argument (as it may well be), that more importantly, the current behavior cannot be worked around without code change. Given that, will the current behavior be changed?
comment:15 by , 13 years ago
Owner: | changed from | to
---|---|
Status: | reopened → new |
Triage Stage: | Design decision needed → Accepted |
We have the following options here:
- keep the current behavior: escape everything
- keep URLs as is: escape nothing
- be smart: try to detect if the URL is already URL-encoded, and if it is, don't re-encode it.
Currently, URL escaping is performed with urlquote(url, safe='/&=:;#?+*')
(where urlquote
from django.utils.http
is basically a lazy + smart version of urllib.quote
).
Instead, we could test if the URL contains an URL escape (%xx with xx in [0-9A-F]); if it does, assume it's already escaped; otherwise, escape with urllib.quote
.
Obviously, the problem is ill-defined. We have no way of knowing if the input is or isn't URL-encoded. A quick test shows that browsers do something akin to the above. If I type google.com/search?q=foo bar
in the URL bar, it gets encoded to google.com/search?q=foo%20bar
. If I type google.com/search?q=foo%20bar
, it doesn't get encoded.
So, even though being smart often isn't a good idea, in this case I believe it's the best solution.
PS: I've researched the topic of URL escaping quite thoroughly a few years ago and here's what I had concluded:
Based on RFC 1738: [ <>"#%{}|\^~[]`] are unsafe and must be encoded [;/?:@=&] are reserved and must be encoded unless they are used for their special meaning we assume that they are encoded in input if necessary [$-_.+!*'(),] are safe and need not be encoded
In comparison, Django considers /&=:;#?+*
as safe characters — in addition to _.-
which Python always considers safe. I think the list of safe characters should be extended to ;/?:@=&$-_.+!*'(),
while we're working on this.
comment:16 by , 13 years ago
comment:17 by , 13 years ago
The latest reference is http://tools.ietf.org/html/rfc3986#section-2.1
by , 13 years ago
Attachment: | 9655.patch added |
---|
comment:19 by , 13 years ago
Easy pickings: | set |
---|---|
Has patch: | set |
Needs documentation: | set |
Needs tests: | set |
urlize
works with plain text:http://docs.djangoproject.com/en/dev/ref/templates/builtins/#urlize
You are not passing plain text here, you are passing something that has already been percent-encoded. There's no way for
urlize
to know that, so it percent-encodes the link value, which it must do in order to generate correct urls for data that has not already been percent-encoded. If you want to useurlize
, you need to pass it a not-percent-quoted, plain text value to work with.