Django

Code

Ticket #7355 (closed: fixed)

Opened 3 months ago

Last modified 3 months ago

Urlize function in django.utils.html does not properly work on https:// links

Reported by: clint Assigned to: nobody
Milestone: 1.0 Component: Core framework
Version: SVN Keywords:
Cc: Triage Stage: Ready for checkin
Has patch: 1 Needs documentation: 0
Needs tests: 0 Patch needs improvement: 0

Description (Last modified by clint)

Just tested this out on the version of utils/html.py in [7569]

urlize() incorrectly adds a http:// to the beginning of a https link

To replicate:

>>> from django.utils.html import urlize
>>> words = "Hi there https://www.google.com"
>>> urlize(words)
u'Hi there <a href="http://https://www.google.com">http://https://www.google.com</a>'

I did a search in trac for "urlize https" and came up with no hits. As far as I can tell this bug has been around for a while, not sure why no one would've caught this.

Patch:
===================================================================
--- html.py	(revision 7569)
+++ html.py	(working copy)
@@ -99,7 +99,7 @@
             lead, middle, trail = match.groups()
             if safe_input:
                 middle = mark_safe(middle)
-            if middle.startswith('www.') or ('@' not in middle and not middle.startswith('http://') and \
+            if middle.startswith('www.') or ('@' not in middle and not (middle.startswith('http://') or middle.startswith('https://')) and \
                     len(middle) > 0 and middle[0] in string.ascii_letters + string.digits and \
                     (middle.endswith('.org') or middle.endswith('.net') or middle.endswith('.com'))):
                 middle = 'http://%s' % middle

And a test to check for this

Index: tests.py
===================================================================
--- tests.py	(revision 7569)
+++ tests.py	(working copy)
@@ -166,6 +166,11 @@
 >>> urlizetrunc(uri, 2)
 u'<a href="http://31characteruri.com/test/" rel="nofollow">...</a>'
 
+# Check normal urlize
+>>> url = 'https://google.com'
+>>> urlize(url)
+u'<a href="https://google.com" rel="nofollow">https://google.com</a>'
+
 >>> wordcount('')
 0

Attachments

html.py.diff (0.8 kB) - added by clint on 06/18/08 15:37:51.
The patch for html.py
tests.py.diff (0.5 kB) - added by clint on 06/18/08 15:38:11.
Update for tests
urlize-https-fix-tests.diff (1.3 kB) - added by clint on 06/18/08 18:31:17.

Change History

06/02/08 22:03:45 changed by clint

  • needs_better_patch changed.
  • description changed.
  • needs_tests changed.
  • needs_docs changed.

Uh.. sorry for the horrible formatting...

v

06/02/08 22:52:26 changed by cgrady

although it only seems to do this if you leave off the trailing slash

>>> urlize("https://www.google.com/")                                                                                                                                                     
u'<a href="https://www.google.com/" rel="nofollow">https://www.google.com/</a>'
>>> urlize("https://www.google.com")                                                                                                                                                                     
u'<a href="http://https://www.google.com" rel="nofollow">http://https://www.google.com</a>'

06/02/08 22:55:44 changed by clint

Could this mean that the real root of the issue is that the regexp that splits urls into lead,middle,and trailing doesn't work right if there's no trailing slash?

06/02/08 23:03:10 changed by clint

Actually it looks like if you have a trailing slash on your url, then:

middle = 'https://google.com/'

and the check done on line 104-106 fails:

            if middle.startswith('www.') or ('@' not in middle and not middle.startswith('http://') and \
                    len(middle) > 0 and middle[0] in string.ascii_letters + string.digits and \
                    (middle.endswith('.org') or middle.endswith('.net') or middle.endswith('.com'))):

And it then matches the next set of conditions on line 108 and is handled properly:

if middle.startswith('http://') or middle.startswith('https://'):

However if you URL has no slash at the end of the URL:

middle = 'https://google.com'

Then you do not fail the first chunk (which it should, but does not because it only checks for http://), and an http:// is appended to the front of the valid https:// URL. So it would appear that my patch looks like it probably is the correct solution.

06/14/08 02:24:26 changed by Simon Greenhill

  • stage changed from Unreviewed to Accepted.

Clint, this looks good - thanks for tracking it down. Can you attach the patches as diffs (svn diff) and promote to ready for checkin?

06/18/08 15:37:51 changed by clint

  • attachment html.py.diff added.

The patch for html.py

06/18/08 15:38:11 changed by clint

  • attachment tests.py.diff added.

Update for tests

06/18/08 15:39:10 changed by clint

  • stage changed from Accepted to Ready for checkin.

I've added them and I'm moving this to rdy for checkin

06/18/08 15:42:37 changed by telenieko

You post a single diff with everything in it.

06/18/08 18:31:17 changed by clint

  • attachment urlize-https-fix-tests.diff added.

06/18/08 18:32:28 changed by clint

Doh. Sorry about that. I cat'd them together. I hope that's similar to the format you need. (I don't have the tests fix deployed on the version of Django I'd updated, so I can't just pop out a single svn diff right now).

06/19/08 02:47:02 changed by telenieko

  • milestone set to 1.0.

Thanks that makes things easier for commiters later as they only have to review and apply the latest patch in the ticket ;)

Marking 1.0 as this is a bug.

06/19/08 07:05:40 changed by russellm

  • status changed from new to closed.
  • resolution set to fixed.

(In [7701]) Fixed #7355 -- Modified urlize utility to handle https:// addresses. Thanks for the report and patch, clint.


Add/Change #7355 (Urlize function in django.utils.html does not properly work on https:// links)




Change Properties
Action