Opened 16 years ago
Closed 15 years ago
#11198 closed (fixed)
Forms URLfield regex takes infinite to validate a long field
Reported by: | marcob | Owned by: | nobody |
---|---|---|---|
Component: | Forms | Version: | dev |
Severity: | Keywords: | ||
Cc: | Triage Stage: | Design decision needed | |
Has patch: | yes | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
With a long value, the validation takes forever.
You can test it event from manage.py shell
>>> from django.forms.fields import url_re >>> url_re.search(u'http://asdqwsdqqwd.qwdqwdqd.wefwefewasdqwsdqqwd.qwdqwd') <_sre.SRE_Match object at 0x02521C60> >>> url_re.search(u'http://asdqwsdqqwd.qwdqwdqd.wefwefewasdqwsdqqwd.qwdqwdsdsdss')
This takes forever....
Attachments (1)
Change History (11)
comment:1 by , 16 years ago
Resolution: | → worksforme |
---|---|
Status: | new → closed |
comment:2 by , 16 years ago
Err I lied about not being able to reproduce (darned extra /), however it's still not a bug in Django, it's a bug in Python.
comment:3 by , 16 years ago
Resolution: | worksforme |
---|---|
Status: | closed → reopened |
I know it's a bug in Python re implementation (and lot of other re implementations), but I prefer to not have a regex that exploits a Python bug in a form field. Imagine to have a URLfield on a public form on a big site... :)
This version (without the "unnamed" group) works and doesn't "broke" my form:
url_re = re.compile( r'^https?://' # http:// or https:// r'(?:[A-Z0-9]+(?:-*[A-Z0-9]+)*\.)+[A-Z]{2,6}|' #domain... r'localhost|' #localhost... r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}' # ...or ip r'(?::\d+)?' # optional port r'(?:/?|/\S+)$', re.IGNORECASE)
comment:4 by , 16 years ago
Python regex engine works as expected, this is a Django bug. Or more specifically an url_re bug - it leads to the catastrophic backtracking and takes around 208910 steps (according to the RegexBuddy) to complete.
If you will change this line in url_re:
r'(?:(?:[A-Z0-9]+(?:-*[A-Z0-9]+)*\.)+[A-Z]{2,6}|' #domain...
to this:
r'(?:(?:[A-Z0-9](?:-*[A-Z0-9]+)*\.)+[A-Z]{2,6}|' #domain...
it will work as before in normal cases, but in edge cases it will complete much quicker (around 71 step, for the given url)
comment:5 by , 16 years ago
Has patch: | set |
---|---|
Needs tests: | set |
comment:6 by , 16 years ago
This version (without the "unnamed" group) works
No, your version is invalid. It does not match "http://localhost" "http://127.0.0.1:80" etc. Please run Django test suite.
comment:7 by , 15 years ago
Triage Stage: | Unreviewed → Design decision needed |
---|
comment:8 by , 15 years ago
Component: | Uncategorized → Forms |
---|---|
Needs tests: | unset |
I believe the updated regex submitted by dc also has the same kind of backtracking problem that the current version has (although less severe). I wrote a
simple test case (which hangs the testcase runner rather than erroring, but it at least demonstrates the problem) and all submissions so far still "fail" (hangs) the test:
diff --git a/tests/regressiontests/forms/fields.py b/tests/regressiontests/forms/fields.py index 9d9d722..a4980b0 100644 --- a/tests/regressiontests/forms/fields.py +++ b/tests/regressiontests/forms/fields.py @@ -971,8 +971,12 @@ ValidationError: [u'Enter a valid URL.'] >>> f.clean('http://example.') Traceback (most recent call last): ... ValidationError: [u'Enter a valid URL.'] +>>> f.clean('http://%s' % ("X"*200,)) +Traceback (most recent call last): +... +ValidationError: [u'Enter a valid URL.'] >>> f.clean('http://.com') Traceback (most recent call last): ... ValidationError: [u'Enter a valid URL.']
I've got an alternative regex to match domains which seems to much better behaved with long invalid URLs and which passes all existing testcases for django.forms.URLField.clean() as well as the pathological one that I list above.
url_re = re.compile( r'^https?://' # http:// or https:// r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+[A-Z]{2,6}|' #domain... r'localhost|' #localhost... r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})' # ...or ip r'(?::\d+)?' # optional port r'(?:/?|/\S+)$', re.IGNORECASE)
This version has two notable changes from the current regex and the prior submissions to this ticket:
- The number of places where a regex backtrack can occur is cut from 4 in the original (and in dc's proposal) down to 2 (with one of the backtracks being length limited).
- The updated version enforces RFC 1035 (page7: sect. 2.3.1) and RFC 2181 (page 12: sect. 11) restrictions on domain label sizes (between 1 and 63 octects/characters) for even tighter validation
by , 15 years ago
Attachment: | django-ticket-11198.patch added |
---|
patch to address catastrophic backtracking issue is django.forms.fields.url_re
comment:9 by , 15 years ago
I've updated the patch to also support URLs that terminate in a dot (e.g. http://example.com./) with a few more testcases for this additional functionality. Those URLs are annoying, but domain names are allowed to end in a dot, thus URLs are allowed to include the trailing dot as well.
comment:10 by , 15 years ago
Resolution: | → fixed |
---|---|
Status: | reopened → closed |
I'm unable to reproduce:
In any event this wouldn't be a django bug, it would be one with the python reguar expression parser/bytecode interpreter.