Opened 6 years ago

Closed 6 years ago

#11198 closed (fixed)

Forms URLfield regex takes infinite to validate a long field

Reported by: marcob Owned by: nobody
Component: Forms Version: master
Severity: Keywords:
Cc: Triage Stage: Design decision needed
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: UI/UX:

Description

With a long value, the validation takes forever.

You can test it event from manage.py shell

>>> from django.forms.fields import url_re
>>> url_re.search(u'http://asdqwsdqqwd.qwdqwdqd.wefwefewasdqwsdqqwd.qwdqwd')
<_sre.SRE_Match object at 0x02521C60>
>>> url_re.search(u'http://asdqwsdqqwd.qwdqwdqd.wefwefewasdqwsdqqwd.qwdqwdsdsdss')

This takes forever....

Attachments (1)

django-ticket-11198.patch (1.7 KB) - added by wam 6 years ago.
patch to address catastrophic backtracking issue is django.forms.fields.url_re

Download all attachments as: .zip

Change History (11)

comment:1 Changed 6 years ago by Alex

  • Needs documentation unset
  • Needs tests unset
  • Patch needs improvement unset
  • Resolution set to worksforme
  • Status changed from new to closed

I'm unable to reproduce:

>>> %time url_re.search("http:///asdasdas.asdsa.sadasdsa")
CPU times: user 0.00 s, sys: 0.00 s, total: 0.00 s
Wall time: 0.00 s

>>> %timeit url_re.search("http:///asdasdas.asdsa.sadasdsa")
1000000 loops, best of 3: 1.54 µs per loop

>>> %timeit url_re.search("http:///asdasdas.asdsa.sadasdsa.adssadsa.dsadsadsadsa.sadsa")
100000 loops, best of 3: 2.18 µs per loop

>>> %timeit url_re.search("http:///asdasdas.asdsa.sadasdsa.adssadsa.dsadsadsadsa.sadsa.dsadffgdsaf.fdsafadsfwq.fdsafdsafas")
100000 loops, best of 3: 2.93 µs per loop

>>> %timeit url_re.search("http:///asdasdas.asdsa.sadasdsa.adssadsa.dsadsadsadsa.sadsa.dsadffgdsaf.fdsafadsfwq.fdsafdsafas.fdafdsafdafdsa.fdsafdsafsafdsafsa.gwretewfewqfewqewfdsfdaf")
100000 loops, best of 3: 4.04 µs per loop

In any event this wouldn't be a django bug, it would be one with the python reguar expression parser/bytecode interpreter.

comment:2 Changed 6 years ago by Alex

Err I lied about not being able to reproduce (darned extra /), however it's still not a bug in Django, it's a bug in Python.

comment:3 Changed 6 years ago by marcob

  • Resolution worksforme deleted
  • Status changed from closed to reopened

I know it's a bug in Python re implementation (and lot of other re implementations), but I prefer to not have a regex that exploits a Python bug in a form field. Imagine to have a URLfield on a public form on a big site... :)

This version (without the "unnamed" group) works and doesn't "broke" my form:

url_re = re.compile(
    r'^https?://' # http:// or https://
    r'(?:[A-Z0-9]+(?:-*[A-Z0-9]+)*\.)+[A-Z]{2,6}|' #domain...
    r'localhost|' #localhost...
    r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}' # ...or ip
    r'(?::\d+)?' # optional port
    r'(?:/?|/\S+)$', re.IGNORECASE)

comment:4 Changed 6 years ago by dc

Python regex engine works as expected, this is a Django bug. Or more specifically an url_re bug - it leads to the catastrophic backtracking and takes around 208910 steps (according to the RegexBuddy) to complete.

If you will change this line in url_re:

    r'(?:(?:[A-Z0-9]+(?:-*[A-Z0-9]+)*\.)+[A-Z]{2,6}|' #domain...

to this:

    r'(?:(?:[A-Z0-9](?:-*[A-Z0-9]+)*\.)+[A-Z]{2,6}|' #domain...

it will work as before in normal cases, but in edge cases it will complete much quicker (around 71 step, for the given url)

comment:5 Changed 6 years ago by dc

  • Has patch set
  • Needs tests set

comment:6 Changed 6 years ago by dc

This version (without the "unnamed" group) works

No, your version is invalid. It does not match "http://localhost" "http://127.0.0.1:80" etc. Please run Django test suite.

comment:7 Changed 6 years ago by Alex

  • Triage Stage changed from Unreviewed to Design decision needed

comment:8 Changed 6 years ago by wam

  • Component changed from Uncategorized to Forms
  • Needs tests unset

I believe the updated regex submitted by dc also has the same kind of backtracking problem that the current version has (although less severe). I wrote a
simple test case (which hangs the testcase runner rather than erroring, but it at least demonstrates the problem) and all submissions so far still "fail" (hangs) the test:

diff --git a/tests/regressiontests/forms/fields.py b/tests/regressiontests/forms/fields.py
index 9d9d722..a4980b0 100644
--- a/tests/regressiontests/forms/fields.py
+++ b/tests/regressiontests/forms/fields.py
@@ -971,8 +971,12 @@ ValidationError: [u'Enter a valid URL.']
 >>> f.clean('http://example.')
 Traceback (most recent call last):
 ...
 ValidationError: [u'Enter a valid URL.']
+>>> f.clean('http://%s' % ("X"*200,))
+Traceback (most recent call last):
+...
+ValidationError: [u'Enter a valid URL.']
 >>> f.clean('http://.com')
 Traceback (most recent call last):
 ...
 ValidationError: [u'Enter a valid URL.']

I've got an alternative regex to match domains which seems to much better behaved with long invalid URLs and which passes all existing testcases for django.forms.URLField.clean() as well as the pathological one that I list above.

url_re = re.compile(
    r'^https?://' # http:// or https://
    r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+[A-Z]{2,6}|' #domain...
    r'localhost|' #localhost...
    r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})' # ...or ip
    r'(?::\d+)?' # optional port
    r'(?:/?|/\S+)$', re.IGNORECASE)

This version has two notable changes from the current regex and the prior submissions to this ticket:

  • The number of places where a regex backtrack can occur is cut from 4 in the original (and in dc's proposal) down to 2 (with one of the backtracks being length limited).
  • The updated version enforces RFC 1035 (page7: sect. 2.3.1) and RFC 2181 (page 12: sect. 11) restrictions on domain label sizes (between 1 and 63 octects/characters) for even tighter validation

Changed 6 years ago by wam

patch to address catastrophic backtracking issue is django.forms.fields.url_re

comment:9 Changed 6 years ago by wam

I've updated the patch to also support URLs that terminate in a dot (e.g. http://example.com./) with a few more testcases for this additional functionality. Those URLs are annoying, but domain names are allowed to end in a dot, thus URLs are allowed to include the trailing dot as well.

comment:10 Changed 6 years ago by jacob

  • Resolution set to fixed
  • Status changed from reopened to closed

Fixed in [11603]. [11604], [11605].

Note: See TracTickets for help on using tickets.
Back to Top