Opened 3 years ago

Closed 3 years ago

#22378 closed Cleanup/optimization (fixed)

\d in urls examples in documentation is missleading.

Reported by: tomwys Owned by: chris cauley
Component: Documentation Version: 1.6
Severity: Normal Keywords:
Cc: Triage Stage: Accepted
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

\d with re.UNICODE flag allows to use digits like "१". This flag is used when parsing Django urls.

For example, regexp like that (https://docs.djangoproject.com/en/dev/topics/http/urls/#example):

 url(r'^articles/(\d{4})/$', 'news.views.year_archive'),

will accept "articles/११११/". This is missleading for developer. It could also be harmful. If developer will use string parsed like that in other way than parsing to int(), it could lead to error or even security issue.

Change History (8)

comment:1 Changed 3 years ago by Tim Graham

Could you elaborate a bit on the problem and what change you think we should make? When I try your example with something like articles/१/, the value passed to the view is "1". If there are security implications you should report them privately per https://docs.djangoproject.com/en/1.6/internals/security/#reporting-security-issues. Thanks.

comment:2 Changed 3 years ago by Aymeric Augustin

I think the suggestion is simply to use [0-9] which is unambiguous even in an unicode world.

comment:3 in reply to:  2 Changed 3 years ago by tomwys

Yes, using [0-9] is fine. It would be great to change all \d to [0-9] in documentation and tests related to urls.

ie. developer without knowledge about \d and re.UNICODE used in urls can expect that following code ensures that digit is written to /tmp/otuput.

from django.conf.urls import patterns, include, url

from django.contrib import admin
from django.http import HttpResponse

def view(request, arg):
    open("/tmp/output", 'w').write(arg.encode('utf-8'))
    return HttpResponse(arg)

urlpatterns = patterns('',
    url(r'^(\d)/$', view),
)

This is not true for localhost:8000/१/ (tested on cPython 2.7.5 and Django master). If developer assume that file (or other storage like radius, cache etc.) contains digit, but this is not true, it could cause errors (or even security issues in extreme cases) in the future.

Also, even if it is passed to int(...) and converted to integer that way, it creates multiple valid urls for single resource, without developer knowing about it. This is minor, but unwanted behavior.

And it is not easy to discover, most of Django developers are puzzled when I talk with them about this problem.

Last edited 3 years ago by tomwys (previous) (diff)

comment:4 Changed 3 years ago by Tim Graham

Triage Stage: UnreviewedAccepted
Type: UncategorizedCleanup/optimization

comment:5 Changed 3 years ago by Marc Tamlyn

A related, but more complex issue is the matter of \w with re.UNICODE. I can see that some projects may wish to accept characters like è in URLs, others may be unwittingly doing so. I believe we had this discussion recently regarding username validation.

I think a specific note about \w and \d in the URL documentation would be advisable. In particular, the "correct" regex for capturing a slug field should also be mentioned.

This also relates to some of my comments on https://code.djangoproject.com/ticket/22218#comment:2 regarding providing a non-regex based approach to URL resolving in core where novice developers are not required to think about these problems.

comment:6 Changed 3 years ago by tomwys

Hurl approach is interesting, but it is not resolving issues with re.UNICODE (but could after some modifications).

Also [0-9]+ is not solving all problems, because you can add leading zeros "0001". Correct regex would be ([1-9][0-9]*) or ([1-9][0-9]*)|0 if you want to support 0. But no one would like to use regexps like that in all of their urls.

Other approach would be to introduce more advanced parsing mechanism that not only validates data, but also converts it to objects and back to strings (something like to_python in models). There are several examples of such parsers that would make sense:

  • integer
  • float
  • decimal
  • date
  • time
  • datetime

Also some of users may want to add more advanced custom parsers like:

  • date range (where first date must be earlier)
  • integer range (same as above)
  • GPS coordinate
  • list of slugs (ie. for tags)

Advantages of this approach:

  • more robust - it is guaranteed to receive assumed value without complicated regexps,
  • more elastic - you can write your views and templates without assuming in what format date will be stored in url,
  • removes from views repetitive tasks like parsing date from url.

Example syntax:

url(r'^articles/(\d{4})/$', 'news.views.year_archive'),
# becomes:
url([r'^articles/', year, '/$'], 'news.views.year_archive')),

url(r'^(?P<question_id>\d+)/$', views.detail, name='detail'),
# becomes:
url(r'^', integer('question_id'), r'^/$'], views.detail, name='detail'),

url(r'^weeklyreports/(?P<year>\d{4})/(?P<month>\d{2})/(?P<day>\d+)/$', 'weeklyreports', name='weeklyreports'),
# becomes:
url([r'^weeklyreports/', date('report_date, 'YYYY/MM/DD'), r'/$'], 'weeklyreports', name='weeklyreports'),

This could be combined with Hurl approach.

comment:7 Changed 3 years ago by chris cauley

Owner: changed from nobody to chris cauley
Status: newassigned

comment:8 Changed 3 years ago by Tim Graham <timograham@…>

Resolution: fixed
Status: assignedclosed

In 66ec9ee441618894c1ccebdcdd5eb4d7fbf4a6d3:

Fixed #22378 -- Updated \d to [0-9]+ in urlpatterns of docs and tests.

Thanks tomwys for the suggestion.

Note: See TracTickets for help on using tickets.
Back to Top