#22378 closed Cleanup/optimization (fixed)

\d in urls examples in documentation is missleading.

Reported by: tomwys Owned by: chriscauley
Component: Documentation Version: 1.6
Severity: Normal Keywords:
Cc: Triage Stage: Accepted
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

\d with re.UNICODE flag allows to use digits like "१". This flag is used when parsing Django urls.

For example, regexp like that (https://docs.djangoproject.com/en/dev/topics/http/urls/#example):

 url(r'^articles/(\d{4})/$', 'news.views.year_archive'),

will accept "articles/११११/". This is missleading for developer. It could also be harmful. If developer will use string parsed like that in other way than parsing to int(), it could lead to error or even security issue.

Change History (8)

comment:1 Changed 12 months ago by timo

  • Needs documentation unset
  • Needs tests unset
  • Patch needs improvement unset

Could you elaborate a bit on the problem and what change you think we should make? When I try your example with something like articles/१/, the value passed to the view is "1". If there are security implications you should report them privately per https://docs.djangoproject.com/en/1.6/internals/security/#reporting-security-issues. Thanks.

comment:2 follow-up: Changed 12 months ago by aaugustin

I think the suggestion is simply to use [0-9] which is unambiguous even in an unicode world.

comment:3 in reply to: ↑ 2 Changed 12 months ago by tomwys

Yes, using [0-9] is fine. It would be great to change all \d to [0-9] in documentation and tests related to urls.

ie. developer without knowledge about \d and re.UNICODE used in urls can expect that following code ensures that digit is written to /tmp/otuput.

from django.conf.urls import patterns, include, url

from django.contrib import admin
from django.http import HttpResponse

def view(request, arg):
    open("/tmp/output", 'w').write(arg.encode('utf-8'))
    return HttpResponse(arg)

urlpatterns = patterns('',
    url(r'^(\d)/$', view),
)

This is not true for localhost:8000/१/ (tested on cPython 2.7.5 and Django master). If developer assume that file (or other storage like radius, cache etc.) contains digit, but this is not true, it could cause errors (or even security issues in extreme cases) in the future.

Also, even if it is passed to int(...) and converted to integer that way, it creates multiple valid urls for single resource, without developer knowing about it. This is minor, but unwanted behavior.

And it is not easy to discover, most of Django developers are puzzled when I talk with them about this problem.

Last edited 12 months ago by tomwys (previous) (diff)

comment:4 Changed 12 months ago by timo

  • Triage Stage changed from Unreviewed to Accepted
  • Type changed from Uncategorized to Cleanup/optimization

comment:5 Changed 12 months ago by mjtamlyn

A related, but more complex issue is the matter of \w with re.UNICODE. I can see that some projects may wish to accept characters like è in URLs, others may be unwittingly doing so. I believe we had this discussion recently regarding username validation.

I think a specific note about \w and \d in the URL documentation would be advisable. In particular, the "correct" regex for capturing a slug field should also be mentioned.

This also relates to some of my comments on https://code.djangoproject.com/ticket/22218#comment:2 regarding providing a non-regex based approach to URL resolving in core where novice developers are not required to think about these problems.

comment:6 Changed 12 months ago by tomwys

Hurl approach is interesting, but it is not resolving issues with re.UNICODE (but could after some modifications).

Also [0-9]+ is not solving all problems, because you can add leading zeros "0001". Correct regex would be ([1-9][0-9]*) or ([1-9][0-9]*)|0 if you want to support 0. But no one would like to use regexps like that in all of their urls.

Other approach would be to introduce more advanced parsing mechanism that not only validates data, but also converts it to objects and back to strings (something like to_python in models). There are several examples of such parsers that would make sense:

  • integer
  • float
  • decimal
  • date
  • time
  • datetime

Also some of users may want to add more advanced custom parsers like:

  • date range (where first date must be earlier)
  • integer range (same as above)
  • GPS coordinate
  • list of slugs (ie. for tags)

Advantages of this approach:

  • more robust - it is guaranteed to receive assumed value without complicated regexps,
  • more elastic - you can write your views and templates without assuming in what format date will be stored in url,
  • removes from views repetitive tasks like parsing date from url.

Example syntax:

url(r'^articles/(\d{4})/$', 'news.views.year_archive'),
# becomes:
url([r'^articles/', year, '/$'], 'news.views.year_archive')),

url(r'^(?P<question_id>\d+)/$', views.detail, name='detail'),
# becomes:
url(r'^', integer('question_id'), r'^/$'], views.detail, name='detail'),

url(r'^weeklyreports/(?P<year>\d{4})/(?P<month>\d{2})/(?P<day>\d+)/$', 'weeklyreports', name='weeklyreports'),
# becomes:
url([r'^weeklyreports/', date('report_date, 'YYYY/MM/DD'), r'/$'], 'weeklyreports', name='weeklyreports'),

This could be combined with Hurl approach.

comment:7 Changed 12 months ago by chriscauley

  • Owner changed from nobody to chriscauley
  • Status changed from new to assigned

comment:8 Changed 12 months ago by Tim Graham <timograham@…>

  • Resolution set to fixed
  • Status changed from assigned to closed

In 66ec9ee441618894c1ccebdcdd5eb4d7fbf4a6d3:

Fixed #22378 -- Updated \d to [0-9]+ in urlpatterns of docs and tests.

Thanks tomwys for the suggestion.

Note: See TracTickets for help on using tickets.
Back to Top