Opened 18 years ago

Closed 16 years ago

Last modified 16 years ago

#2977 closed enhancement (fixed)

Better handling of regular expressions for reverse urlresolver

Reported by: Chris Beaven Owned by: Malcolm Tredinnick
Component: Core (Other) Version:
Severity: normal Keywords: url reverse
Cc: albertpeschar+djangotrac@…, mattimustang@…, dcwatson@…, alexander.solovyov@…, daybreaker12@…, wbyoung@…, django@…, eric@…, beau@… Triage Stage: Accepted
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: yes
Easy pickings: no UI/UX: no

Description

I spent a bit of time looking over the urlresolvers.py today. I put together a patch fixing some problems with it.

Firstly, regular expressions should be set up outside the function so they only have to be compiled once (fix 1).

A TODO: was to make the resolver handle recursive parenthesis. After a bit of thinking (and actually making it work like that) I realised that is actually not what you'd need to do. Recursion is not necessary because the outermost matched parenthesis will be replaced with the given argument anyway.

What was needed was a rethink of choosing between named/unnamed groups. It should be named groups only (if any exist) for each whole regular expression (fix 2).

It does need to handle non-grouping parenthesis recursively though (fix 3).

Other expression extension notations (apart from named groups, of course) should be ignored (fix 4).

Probably rarely needed, but it now also handles pipes in regular expressions (fix 5).

Attachments (6)

reverse_urlresolver.patch (5.3 KB ) - added by Chris Beaven 18 years ago.
reverse_urlresolver.2.patch (5.3 KB ) - added by Chris Beaven 18 years ago.
small fix
new_reverse_urlresolver.patch (10.9 KB ) - added by Chris Beaven 18 years ago.
complete rewrite of most of the reverse code!
new_reverse_urlresolver.2.patch (11.1 KB ) - added by Chris Beaven 18 years ago.
new_reverse_urlresolver.3.patch (12.5 KB ) - added by Chris Beaven 18 years ago.
django-plus-sign-fix.patch (642 bytes ) - added by Andres Riofrio 17 years ago.
Fix for problem with plus signs (see #4883) and dots (tried to fix also dollar signs and '' but didn't work--don't know why). Generated with svn diff.

Download all attachments as: .zip

Change History (36)

by Chris Beaven, 18 years ago

Attachment: reverse_urlresolver.patch added

by Chris Beaven, 18 years ago

Attachment: reverse_urlresolver.2.patch added

small fix

comment:1 by Jacob, 18 years ago

This looks good, but I'm having trouble figuring out exactly how some of it works. Could you perhaps add some tests that illustrate how the current resolver fails? The right place for those is source:django/trunk/tests/urlpatterns_reverse/

comment:2 by Jacob, 18 years ago

Er, that should be: source:django/trunk/tests/regressiontests/urlpatterns_reverse

by Chris Beaven, 18 years ago

complete rewrite of most of the reverse code!

comment:3 by Chris Beaven, 18 years ago

All right, my previous implementation had a few fatal flaws I realised. So I have rewritten everything.

A major change is that the RegexURLPattern objects get a compiled matcher with them, which will greatly speed up reverse lookups.

The rewrite was mostly necessary to correctly handle nested parenthesis, which it now does using a basic tokenizer. There is a small change in logic, which was a bug in the previous one: if a RE has both named and non-named groups, only the named groups should be parsed (this is in line with how the forward lookups work and what documentation says).

A few more tests have been added.

comment:4 by Chris Beaven, 18 years ago

Also, the original code (Adrian's) handled argument passing incorrectly when used with a URLResolver. The resolver is should check both itself and its children using one set of arguments.

comment:5 by django@…, 18 years ago

Hi, i just looked briefly on the patch. You should use a helper function like adrian did, currently the regexp will be parsed, even if it is never used.

comment:6 by Chris Beaven, 18 years ago

Thanks for the advice.

The difference is that the parsed REs inside of RegexURLPatterns are cached, and I'm mostly sure that the RegexURLPatterns themselves are cached, so everyone wins.

comment:7 by Chris Beaven, 18 years ago

Triage Stage: UnreviewedDesign decision needed

comment:8 by Chris Beaven, 18 years ago

Patch needs improvement: set

Should work with a mix of named and unnamed arguments

comment:9 by Chris Beaven, 18 years ago

Patch needs improvement: unset

Actually, it shouldn't. The current behaviour of reverse is a bug since forward URL resolving doesn't accept both positional and named args (it drops the positional ones)

comment:10 by Malcolm Tredinnick, 18 years ago

This ticket is a good example of why separate issues should be in separate bugs. You are trying to do many things at once and it makes it hard to split out the good from the not so good and discuss each issue in isolation. That being said, let's work with what you've done (since it's an impressive effort):

Fixes 1, 3 and 5 are all fine. They're just small cleanups. That can be a single patch.

Fix 2 is a design decision and I think you're making the wrong decision. Allowing keyword and positional args to mix (not nest, but be in the same reg-exp) is fine.

Fix 4, I need to think about some more.

I'll commit portions and make a decision on the rest in the next couple of days. No need to write a new patch, I can see what you're doing in the current code and it will be easy to tease it apart.

comment:11 by Chris Beaven, 18 years ago

Yea, I should have written separate patches. Next time ;)

Re fix 2, I'm probably blowing hot air, you already said it's going stay as-is in the other ticket but I'm still not sure why you think mixing keyword and positional arguments is useful if it doesn't work for a forward resolution.

In the form of a question: Why should a reverse match a RegexURLPattern which won't call its view the same arguments which reverse received?

Fix 4 is because I didn't want to tackle the difficult types of extension notation, notably lookahead/behind assertions and named backreferences. Note, the patch also handles non-grouping parenthesis.

comment:12 by Chris Beaven, 18 years ago

See #4075 for a problem which this ticket would fix

comment:13 by anonymous, 18 years ago

Cc: dcwatson@… added

comment:14 by Todd O'Bryan <toddobryan@…>, 18 years ago

I'm not sure that you're sufficiently sneaky when removing special characters from URLs...

For example, you're using the regexes
{{{re_unused = re.compile(r'(?<\\)[$?*+()]')
re_special = re.compile(r'
([.+*()$])') # Characters from the IETF URL standard, RFC 1738.
}}}

I'm worried that re_unused won't match the $ in r'\\$' even though it should. Also, re_special will match \. in r'\\.' even though it shouldn't.

Also, you don't translate unsafe characters from regexes back to normal strings. That might be the correct behavior--I'm not sure what Django/Apache does with a URL like '/words%20with%20space'. If %20 becomes a space before it enters the url lookup system, then urls with unsafe characters are possible and we should be able to reverse them (making sure that they get encoded before they're actually included in a webpage). If it's kept in its encoded form, then we shouldn't have to worry about regexes with unsafe characters.

I just created ticket #4594 with my patch. Let's figure out how to deal with these corner cases and maybe we can pull out the part of your code that deals with this so Malcolm will finally get around to applying the various parts. :-)

by Chris Beaven, 18 years ago

comment:15 by Chris Beaven, 18 years ago

I caved in and undid fix 2 (but made it so that if there is a named group inside of a non-named group then it'll ignore the non-named group).

The slashes should be much more solid.

Unsafe characters from regex are unencoded nicely.

by Chris Beaven, 18 years ago

comment:16 by Chris Beaven, 18 years ago

Unsafe characters from regex are now only unencoded from sections not being replaced by arguments

I handled the somewhat rare case of non-argument sections containing character sets ([aeiou]). It just chooses the first character in there. Perhaps overkill, but I do know that some people prefer to write their special characters this way rather than escaping them (eg [$] rather than \$).

comment:17 by Malcolm Tredinnick, 18 years ago

Owner: changed from Adrian Holovaty to Malcolm Tredinnick
Summary: [patch] Better handling of regular expressions for reverse urlresolverBetter handling of regular expressions for reverse urlresolver
Triage Stage: Design decision neededAccepted

Okay, this is on my review list now. Maybe avoid making nay major changes for a couple of days so that I have a chance to read through it without changes happening underneath. Will let you know if I find anything disturbing.

comment:18 by Alexander Solovyov <alexander.solovyov@…>, 17 years ago

Cc: alexander.solovyov@… added

comment:19 by anonymous, 17 years ago

Cc: daybreaker12@… added

comment:20 by anonymous, 17 years ago

Cc: wbyoung@… added

comment:21 by James Wheare <django@…>, 17 years ago

Cc: django@… added

comment:22 by Matthew Flanagan <mattimustang@…>, 17 years ago

Cc: mattimustang@… added
Patch needs improvement: set

this patch does not work for me. Even after manually applying it to latest trunk it fails with the following complex regexp:

IP4_RE = r'(25[0-5]|2[0-4]\d|[0-1]?\d?\d)(\.(25[0-5]|2[0-4]\d|[0-1]?\d?\d)){3}'
IP6_HEX_RE = r'(?:[0-9a-fA-F]{1,4}:){7}[0-9a-fA-F]{1,4}'
IP6_HEX_COMPRESSED_RE = r'((?:[0-9A-Fa-f]{1,4}(?::[0-9A-Fa-f]{1,4})*)?)::((?:[0-9A-Fa-f]{1,4}(?::[0-9A-Fa-f]{1,4})*)?)'
IP6_IP4_COMPAT_RE = r'((?:[0-9A-Fa-f]{1,4}:){6,6})(25[0-5]|2[0-4]\d|[0-1]?\d?\d)(\.(25[0-5]|2[0-4]\d|[0-1]?\d?\d)){3}'
IP6_IP4_COMPAT_COMPRESSED_RE = r'((?:[0-9A-Fa-f]{1,4}(?::[0-9A-Fa-f]{1,4})*)?)::((?:[0-9A-Fa-f]{1,4}:)*)(25[0-5]|2[0-4]\d|[0-1]?\d?\d)(\.(25[0-5]|2[0-4]\d|[0-1]?\d?\d)){3}'
IP6_RE = '%s|%s|%s|%s' % (IP6_HEX_RE, IP6_HEX_COMPRESSED_RE,
    IP6_IP4_COMPAT_RE, IP6_IP4_COMPAT_COMPRESSED_RE)
IP6_RE_COMPILED = re.compile('^(%s|%s|%s|%s)$' % (IP6_HEX_RE,
    IP6_HEX_COMPRESSED_RE, IP6_IP4_COMPAT_RE, IP6_IP4_COMPAT_COMPRESSED_RE))
IP_RE = "(%s|%s)" % (IP4_RE, IP6_RE)


urlpatterns = patterns('',
    url(r'^(?P<ipsupernetwork>%s)/$' % IP_RE, ipnetwork_list, IPNETWORK_LIST,
        name='ipnetwork_list'),

)

Running django.core.urlresolvers.reverse results in:

In [6]: reverse('ipnetwork_list', urlconf='nong.ip.urls',args=(), kwargs={'ipsupernetwork': '10.0.0.0'})
---------------------------------------------------------------------------
sre_constants.error                                     Traceback (most recent call last)

/srv/dev/mpf/nong/third-party/django/tests/<ipython console>

/srv/dev/mpf/nong/third-party/django/django/core/urlresolvers.py in reverse(viewname, urlconf, args, kwargs)
    399 def reverse(viewname, urlconf=None, args=None, kwargs=None):
    400     args = args or []
    401     kwargs = kwargs or {}
--> 402     return iri_to_uri(u'/' + get_resolver(urlconf).reverse(viewname, *args, **kwargs))
    403

/srv/dev/mpf/nong/third-party/django/django/core/urlresolvers.py in reverse(self, lookup_view, *args, **kwargs)
    383         except (ImportError, AttributeError):
    384             raise NoReverseMatch
--> 385         if lookup_view in self.reverse_dict:
    386             return u''.join([reverse_helper(part.regex, *args, **kwargs) for part in self.reverse_dict[lookup_view]])
    387         raise NoReverseMatch

/srv/dev/mpf/nong/third-party/django/django/core/urlresolvers.py in _get_reverse_dict(self)
    317
    318     def _get_reverse_dict(self):
--> 319         if not self._reverse_dict and hasattr(self.urlconf_module, 'urlpatterns'):
    320             for pattern in reversed(self.urlconf_module.urlpatterns):
    321                 if isinstance(pattern, RegexURLResolver):

/srv/dev/mpf/nong/third-party/django/django/core/urlresolvers.py in _get_urlconf_module(self)
    353         except AttributeError:
    354             try:
--> 355                 self._urlconf_module = __import__(self.urlconf_name, {}, {}, [''])
    356             except ValueError, e:
    357                 # Invalid urlconf_name, such as "foo.bar." (note trailing period)

/srv/dev/mpf/nong/third-party/django/tests/nong/ip/urls.py
     13     (r'^url_tag/', include('regressiontests.templates.urls')),
     14
     15     # django built-in views
     16     (r'^views/', include('regressiontests.views.urls')),
     17 )

/srv/dev/mpf/nong/third-party/django/django/conf/urls/defaults.py in url(regex, view, kwargs, name, prefix)
     28                 raise ImproperlyConfigured('Empty URL pattern view name not permitted (for pattern %r)' % regex)
     29             if prefix:
     30                 view = prefix + '.' + view
---> 31         return RegexURLPattern(regex, view, kwargs, name)
     32

/srv/dev/mpf/nong/third-party/django/django/core/urlresolvers.py in __init__(self, regex, callback, default_args, name)
    239         # callable object (view).
    240         self.regex = re.compile(regex, re.UNICODE)
--> 241         self.reverse_regex_lookup = ReverseRegexLookup(regex)
    242         if callable(callback):
    243             self._callback = callback

/srv/dev/mpf/nong/third-party/django/django/core/urlresolvers.py in __init__(self, text)
    194 class ReverseRegexLookup(object):
    195     def __init__(self, text):
--> 196         self.tokens, self.minimum_arguments, self.error = tokenize(text)
    197
    198     def check(self, args=[], kwargs={}):

/srv/dev/mpf/nong/third-party/django/django/core/urlresolvers.py in tokenize(text)
    179         if isinstance(bit, list):
    180             # Build the regex here so it only has to be compiled once.
--> 181             bit = re.compile('%s$' % build_re(bit))
    182             count += 1
    183         else:

/usr/local/lib/python2.4/sre.py in compile(pattern, flags)
    178 def compile(pattern, flags=0):
    179     "Compile a regular expression pattern, returning a pattern object."
--> 180     return _compile(pattern, flags)
    181
    182 def purge():

/usr/local/lib/python2.4/sre.py in _compile(*key)
    225         p = sre_compile.compile(pattern, flags)
    226     except error, v:
--> 227         raise error, v # invalid expression
    228     if len(_cache) >= _MAXCACHE:
    229         _cache.clear()

error: multiple repeat

by Andres Riofrio, 17 years ago

Attachment: django-plus-sign-fix.patch added

Fix for problem with plus signs (see #4883) and dots (tried to fix also dollar signs and '' but didn't work--don't know why). Generated with svn diff.

comment:23 by Malcolm Tredinnick, 17 years ago

Owner: changed from nobody to Malcolm Tredinnick

comment:24 by James Bennett, 17 years ago

Keywords: url reverse added

comment:25 by anonymous, 17 years ago

Cc: albertpeschar+djangotrac@… added

comment:26 by Sebastian Noack, 17 years ago

I started to solve this problems, too. See #6934. But this patch seems to be more complete. I would like to see it merged into SVN, but unfortunately I can not apply even the patch against the current HEAD.

comment:27 by Éric St-Jean, 17 years ago

Cc: eric@… added

comment:28 by Beau Gunderson <beau@…>, 17 years ago

Cc: beau@… added

I hit this the other day too; looking forward to this being applied! :)

comment:29 by Malcolm Tredinnick, 16 years ago

Resolution: fixed
Status: newclosed

(In [8760]) A rewrite of the reverse URL parsing: the reverse() call and the "url" template tag.

This is fully backwards compatible, but it fixes a bunch of little bugs. Thanks
to SmileyChris and Ilya Semenov for some early patches in this area that were
incorporated into this change.

Fixed #2977, #4915, #6934, #7206.

comment:30 by Malcolm Tredinnick, 16 years ago

I intentionally didn't take the pipe-handling portion of this patch. That's actually a lot trickier than it looks when the alternatives contain different parameter values. The infrastructure is more or less in place to handle that, but I was having a lot of trouble making it stable, so somebody can address that after 1.0.

Note: See TracTickets for help on using tickets.
Back to Top