Opened 7 years ago

Closed 7 years ago

Last modified 3 years ago

#7355 closed (fixed)

Urlize function in django.utils.html does not properly work on https:// links

Reported by: clint Owned by: nobody
Component: Core (Other) Version: master
Severity: Keywords:
Cc: Triage Stage: Ready for checkin
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: UI/UX:

Description (last modified by clint)

Just tested this out on the version of utils/html.py in [7569]

urlize() incorrectly adds a http:// to the beginning of a https link

To replicate:

>>> from django.utils.html import urlize
>>> words = "Hi there https://www.google.com"
>>> urlize(words)
u'Hi there <a href="http://https://www.google.com">http://https://www.google.com</a>'

I did a search in trac for "urlize https" and came up with no hits. As far as I can tell this bug has been around for a while, not sure why no one would've caught this.

Patch:
===================================================================
--- html.py	(revision 7569)
+++ html.py	(working copy)
@@ -99,7 +99,7 @@
             lead, middle, trail = match.groups()
             if safe_input:
                 middle = mark_safe(middle)
-            if middle.startswith('www.') or ('@' not in middle and not middle.startswith('http://') and \
+            if middle.startswith('www.') or ('@' not in middle and not (middle.startswith('http://') or middle.startswith('https://')) and \
                     len(middle) > 0 and middle[0] in string.ascii_letters + string.digits and \
                     (middle.endswith('.org') or middle.endswith('.net') or middle.endswith('.com'))):
                 middle = 'http://%s' % middle

And a test to check for this

Index: tests.py
===================================================================
--- tests.py	(revision 7569)
+++ tests.py	(working copy)
@@ -166,6 +166,11 @@
 >>> urlizetrunc(uri, 2)
 u'<a href="http://31characteruri.com/test/" rel="nofollow">...</a>'
 
+# Check normal urlize
+>>> url = 'https://google.com'
+>>> urlize(url)
+u'<a href="https://google.com" rel="nofollow">https://google.com</a>'
+
 >>> wordcount('')
 0

Attachments (3)

html.py.diff (813 bytes) - added by clint 7 years ago.
The patch for html.py
tests.py.diff (535 bytes) - added by clint 7 years ago.
Update for tests
urlize-https-fix-tests.diff (1.3 KB) - added by clint 7 years ago.

Download all attachments as: .zip

Change History (14)

comment:1 Changed 7 years ago by clint

  • Description modified (diff)
  • Needs documentation unset
  • Needs tests unset
  • Patch needs improvement unset

Uh.. sorry for the horrible formatting...

v

comment:2 Changed 7 years ago by cgrady

although it only seems to do this if you leave off the trailing slash

>>> urlize("https://www.google.com/")                                                                                                                                                     
u'<a href="https://www.google.com/" rel="nofollow">https://www.google.com/</a>'
>>> urlize("https://www.google.com")                                                                                                                                                                     
u'<a href="http://https://www.google.com" rel="nofollow">http://https://www.google.com</a>'

comment:3 Changed 7 years ago by clint

Could this mean that the real root of the issue is that the regexp that splits urls into lead,middle,and trailing doesn't work right if there's no trailing slash?

comment:4 Changed 7 years ago by clint

Actually it looks like if you have a trailing slash on your url, then:

middle = 'https://google.com/'

and the check done on line 104-106 fails:

            if middle.startswith('www.') or ('@' not in middle and not middle.startswith('http://') and \
                    len(middle) > 0 and middle[0] in string.ascii_letters + string.digits and \
                    (middle.endswith('.org') or middle.endswith('.net') or middle.endswith('.com'))):

And it then matches the next set of conditions on line 108 and is handled properly:

if middle.startswith('http://') or middle.startswith('https://'):

However if you URL has no slash at the end of the URL:

middle = 'https://google.com'

Then you do not fail the first chunk (which it should, but does not because it only checks for http://), and an http:// is appended to the front of the valid https:// URL. So it would appear that my patch looks like it probably is the correct solution.

comment:5 Changed 7 years ago by Simon Greenhill

  • Triage Stage changed from Unreviewed to Accepted

Clint, this looks good - thanks for tracking it down. Can you attach the patches as diffs (svn diff) and promote to ready for checkin?

Changed 7 years ago by clint

The patch for html.py

Changed 7 years ago by clint

Update for tests

comment:6 Changed 7 years ago by clint

  • Triage Stage changed from Accepted to Ready for checkin

I've added them and I'm moving this to rdy for checkin

comment:7 Changed 7 years ago by telenieko

You post a single diff with everything in it.

Changed 7 years ago by clint

comment:8 Changed 7 years ago by clint

Doh. Sorry about that. I cat'd them together. I hope that's similar to the format you need. (I don't have the tests fix deployed on the version of Django I'd updated, so I can't just pop out a single svn diff right now).

comment:9 Changed 7 years ago by telenieko

  • milestone set to 1.0

Thanks that makes things easier for commiters later as they only have to review and apply the latest patch in the ticket ;)

Marking 1.0 as this is a bug.

comment:10 Changed 7 years ago by russellm

  • Resolution set to fixed
  • Status changed from new to closed

(In [7701]) Fixed #7355 -- Modified urlize utility to handle https:// addresses. Thanks for the report and patch, clint.

comment:11 Changed 3 years ago by jacob

  • milestone 1.0 deleted

Milestone 1.0 deleted

Note: See TracTickets for help on using tickets.
Back to Top