Opened 18 years ago

Closed 18 years ago

Last modified 18 years ago

#4594 closed (duplicate)

django.core.urlresolvers.reverse_helper doesn't unescape characters that are escaped in URL regexes

Reported by: Todd O'Bryan <toddobryan@…> Owned by: Jacob
Component: Uncategorized Version: dev
Severity: Keywords: url reverse escape
Cc: Triage Stage: Unreviewed
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

Any backslash-escaped character in a URL doesn't get unescaped by the {% url ...%} tag (and presumably by other methods of view reversing). For example,

urlpatterns = patterns('',
    (r'^prices/less_than_\$(?P<price>\d+)/$', 'cost_less'),                   
    (r'^headlines/(?P<year>/d+)\.(?P<month>\d+)\.(?P<day>\d+)/$', 'daily_headlines'),
    (r'^priests/(?P<name>\w+)\+/$', 'priest_homepage'),
    (r'^windows_path/(?P<drive_name>[A-Z]):\\\\(?P<path>.+)', 'windows_path'),
)

The dollar sign, dot, plus, and backslash in each of the URL patterns match a single character, but don't get converted back to that character by the reverse function.

It seems that there aren't that many of these. Any escape sequence that doesn't match a constant string (i.e. something like \s or \d or \w) had better be part of a pattern so that it can be replaced with the right string to get the URL you're expecting. That leaves the following, I think.

Pattern Replacement
\A '' (equivalent to ^)
\Z '' (equivalent to $)
\b and \B '' (these shouldn't appear in urls, but can only match the empty string)
\., \^, \$, \*, \+, \?, \(, \), \{, \}, \[, \], and \\ the same character, without a backslash

As a first stab, I'd just get rid of \A, \Z, \b, and \B, just as the current code does for ^ and $. This is actually kind of complicated, because you have to make sure that the \ in front isn't part of a pair of backslashes. In other words, \\b should become \b, but \\\b should just become \. Also, the current code removes all ^ and $. That's wrong if they're preceded by a backslash and meant to be the actual character.

There are some gotchas--when you insert values, you have to escape characters that you'll be unescaping later. I do check for character classes that don't map to a single definite character (e.g., \d and \w) and raise an exception if they're still there when we finish (since the reverse lookup can't work). I don't check for things like [a-z] or a{2,3}, but that will almost guarantee the reversing fails, too.

Note that #2977 also addresses this problem, but it does other things, too. Also I think that code may not handle some corner cases correctly. Meanwhile, my patch may be overly agressive and might include handling for characters that will never appear in a URL.

Give SmileyChris and me some time to work this out.

Attachments (2)

reverse_helper.txt (4.8 KB ) - added by Todd O'Bryan <toddobryan@…> 18 years ago.
reverse_helper_cleaned.txt (4.7 KB ) - added by Todd O'Bryan <toddobryan@…> 18 years ago.

Download all attachments as: .zip

Change History (7)

by Todd O'Bryan <toddobryan@…>, 18 years ago

Attachment: reverse_helper.txt added

comment:1 by Malcolm Tredinnick, 18 years ago

A version of the patch in #2977 that only touches the pattern parsing would be welcome. Splits up some of the "multiple features" issues with that ticket into manageable pieces.

Can you or Chris please work out whether to close this ticket or #2977 as a dupe eventually.

by Todd O'Bryan <toddobryan@…>, 18 years ago

Attachment: reverse_helper_cleaned.txt added

comment:2 by Todd O'Bryan <toddobryan@…>, 18 years ago

Ticket #4075 also mentions a possible problem. Say you have this URL

add/?

Should it resolve to add/, add, or raise an exception? For this example, it doesn't really matter, but what about forms like

^profile/((?P<username>[^/]+)/)?

where the whole username pattern is optional? (This makes sense if no username means the current user, for example.)

My inclination would be to tell people to split this up into two patterns with the same view and use the new named URL pattern idea, but it would be nice if this did something reasonable--either raising an exception or only working if the username is supplied or, possibly, working in either case, although at that point we're doing an awful lot of analysis of URLs.

comment:3 by Malcolm Tredinnick, 18 years ago

Do whatever you want for ambiguous reverse patterns -- where "whatever you want" means "whatever leads to the least amount of (computational) work and the simplest code". If somebody is going to write patterns that have ambiguous reversals, we can pick either one. if they want one choice or the other, then they can use named patterns. We can't make your example pattern illegal in any sense, though, since that would restrict forwards resolving unnecessarily.

Either always including or always omitting a truly optional component is probably the right choice.

Be careful, however. Writing a complete reg-exp reverse analysis engine is probably overkill and not worth including. It will be too fragile and difficult to maintain in the long term. This feature is an aid, not a necessity. If somebody is doing really complex things, they can write their own helper function for reversing their complex cases too. It's not worth us carrying the burden. So, by all means add features if it keeps the code simple. However, if there are any tricky design choices like this then (a) think about whether we really need it and (b) start a thread on django-developers, rather than doing design work in the ticket. Supporting most of the simple vocabulary should be possible. Supporting Dr. Evil's worst intentions, maybe not so much.

comment:4 by Chris Beaven, 18 years ago

Resolution: duplicate
Status: newclosed

I'm closing as a dupe. All the functionality is now contained in my patch in #2977. Although it would be nice to split this ticket into multiple features, this is happening inside of the new tokenizer so there's nowhere to "split" it to.

In particular, this specific escaping feature should be happening only inside of regular expression sections of the reverse lookup, not the whole output like this patch does (and my previous one did).

comment:5 by Chris Beaven, 18 years ago

(that should read "...should be happening only outside of replaced regular expression sections")

Note: See TracTickets for help on using tickets.
Back to Top