Opened 16 years ago

Closed 15 years ago

#11198 closed (fixed)

Forms URLfield regex takes infinite to validate a long field

Reported by: marcob Owned by: nobody
Component: Forms Version: dev
Severity: Keywords:
Cc: Triage Stage: Design decision needed
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

With a long value, the validation takes forever.

You can test it event from manage.py shell

>>> from django.forms.fields import url_re
>>> url_re.search(u'http://asdqwsdqqwd.qwdqwdqd.wefwefewasdqwsdqqwd.qwdqwd')
<_sre.SRE_Match object at 0x02521C60>
>>> url_re.search(u'http://asdqwsdqqwd.qwdqwdqd.wefwefewasdqwsdqqwd.qwdqwdsdsdss')

This takes forever....

Attachments (1)

django-ticket-11198.patch (1.7 KB ) - added by wam 15 years ago.
patch to address catastrophic backtracking issue is django.forms.fields.url_re

Download all attachments as: .zip

Change History (11)

comment:1 by Alex Gaynor, 16 years ago

Resolution: worksforme
Status: newclosed

I'm unable to reproduce:

>>> %time url_re.search("http:///asdasdas.asdsa.sadasdsa")
CPU times: user 0.00 s, sys: 0.00 s, total: 0.00 s
Wall time: 0.00 s

>>> %timeit url_re.search("http:///asdasdas.asdsa.sadasdsa")
1000000 loops, best of 3: 1.54 µs per loop

>>> %timeit url_re.search("http:///asdasdas.asdsa.sadasdsa.adssadsa.dsadsadsadsa.sadsa")
100000 loops, best of 3: 2.18 µs per loop

>>> %timeit url_re.search("http:///asdasdas.asdsa.sadasdsa.adssadsa.dsadsadsadsa.sadsa.dsadffgdsaf.fdsafadsfwq.fdsafdsafas")
100000 loops, best of 3: 2.93 µs per loop

>>> %timeit url_re.search("http:///asdasdas.asdsa.sadasdsa.adssadsa.dsadsadsadsa.sadsa.dsadffgdsaf.fdsafadsfwq.fdsafdsafas.fdafdsafdafdsa.fdsafdsafsafdsafsa.gwretewfewqfewqewfdsfdaf")
100000 loops, best of 3: 4.04 µs per loop

In any event this wouldn't be a django bug, it would be one with the python reguar expression parser/bytecode interpreter.

comment:2 by Alex Gaynor, 16 years ago

Err I lied about not being able to reproduce (darned extra /), however it's still not a bug in Django, it's a bug in Python.

comment:3 by marcob, 16 years ago

Resolution: worksforme
Status: closedreopened

I know it's a bug in Python re implementation (and lot of other re implementations), but I prefer to not have a regex that exploits a Python bug in a form field. Imagine to have a URLfield on a public form on a big site... :)

This version (without the "unnamed" group) works and doesn't "broke" my form:

url_re = re.compile(
    r'^https?://' # http:// or https://
    r'(?:[A-Z0-9]+(?:-*[A-Z0-9]+)*\.)+[A-Z]{2,6}|' #domain...
    r'localhost|' #localhost...
    r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}' # ...or ip
    r'(?::\d+)?' # optional port
    r'(?:/?|/\S+)$', re.IGNORECASE)

comment:4 by dc, 16 years ago

Python regex engine works as expected, this is a Django bug. Or more specifically an url_re bug - it leads to the catastrophic backtracking and takes around 208910 steps (according to the RegexBuddy) to complete.

If you will change this line in url_re:

    r'(?:(?:[A-Z0-9]+(?:-*[A-Z0-9]+)*\.)+[A-Z]{2,6}|' #domain...

to this:

    r'(?:(?:[A-Z0-9](?:-*[A-Z0-9]+)*\.)+[A-Z]{2,6}|' #domain...

it will work as before in normal cases, but in edge cases it will complete much quicker (around 71 step, for the given url)

comment:5 by dc, 16 years ago

Has patch: set
Needs tests: set

comment:6 by dc, 16 years ago

This version (without the "unnamed" group) works

No, your version is invalid. It does not match "http://localhost" "http://127.0.0.1:80" etc. Please run Django test suite.

comment:7 by Alex Gaynor, 15 years ago

Triage Stage: UnreviewedDesign decision needed

comment:8 by wam, 15 years ago

Component: UncategorizedForms
Needs tests: unset

I believe the updated regex submitted by dc also has the same kind of backtracking problem that the current version has (although less severe). I wrote a
simple test case (which hangs the testcase runner rather than erroring, but it at least demonstrates the problem) and all submissions so far still "fail" (hangs) the test:

diff --git a/tests/regressiontests/forms/fields.py b/tests/regressiontests/forms/fields.py
index 9d9d722..a4980b0 100644
--- a/tests/regressiontests/forms/fields.py
+++ b/tests/regressiontests/forms/fields.py
@@ -971,8 +971,12 @@ ValidationError: [u'Enter a valid URL.']
 >>> f.clean('http://example.')
 Traceback (most recent call last):
 ...
 ValidationError: [u'Enter a valid URL.']
+>>> f.clean('http://%s' % ("X"*200,))
+Traceback (most recent call last):
+...
+ValidationError: [u'Enter a valid URL.']
 >>> f.clean('http://.com')
 Traceback (most recent call last):
 ...
 ValidationError: [u'Enter a valid URL.']

I've got an alternative regex to match domains which seems to much better behaved with long invalid URLs and which passes all existing testcases for django.forms.URLField.clean() as well as the pathological one that I list above.

url_re = re.compile(
    r'^https?://' # http:// or https://
    r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+[A-Z]{2,6}|' #domain...
    r'localhost|' #localhost...
    r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})' # ...or ip
    r'(?::\d+)?' # optional port
    r'(?:/?|/\S+)$', re.IGNORECASE)

This version has two notable changes from the current regex and the prior submissions to this ticket:

  • The number of places where a regex backtrack can occur is cut from 4 in the original (and in dc's proposal) down to 2 (with one of the backtracks being length limited).
  • The updated version enforces RFC 1035 (page7: sect. 2.3.1) and RFC 2181 (page 12: sect. 11) restrictions on domain label sizes (between 1 and 63 octects/characters) for even tighter validation

by wam, 15 years ago

Attachment: django-ticket-11198.patch added

patch to address catastrophic backtracking issue is django.forms.fields.url_re

comment:9 by wam, 15 years ago

I've updated the patch to also support URLs that terminate in a dot (e.g. http://example.com./) with a few more testcases for this additional functionality. Those URLs are annoying, but domain names are allowed to end in a dot, thus URLs are allowed to include the trailing dot as well.

comment:10 by Jacob, 15 years ago

Resolution: fixed
Status: reopenedclosed

Fixed in [11603]. [11604], [11605].

Note: See TracTickets for help on using tickets.
Back to Top