Opened 18 years ago

Closed 18 years ago

#2543 closed defect (duplicate)

[patch] Django doesn't handle UTF-8 encoded URLs properly

Reported by: Victor Ng <crankycoder@…> Owned by: hugo
Component: Internationalization Version:
Severity: normal Keywords:
Cc: victor.ng@… Triage Stage: Unreviewed
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

RFC 2718 and RFC 3986 specify that URLs for non-english characters should be encoded in UTF-8 and unsafe characters should be encoded in %HH format.

Django should decode paths assuming that they are UTF8 data and make them unicode strings.

Here's a patch including a single line addition to the unit test suite to decode paths into unicode data and to enable the re.UNICODE flag on regular expression matching in the urlpatterns.

Index: django/core/urlresolvers.py
===================================================================
--- django/core/urlresolvers.py	(revision 3582)
+++ django/core/urlresolvers.py	(working copy)
@@ -79,9 +79,10 @@
             test_regex = grouped
         # Note we're using re.match here on purpose because the start of
         # to string needs to match.
-        if not re.match(test_regex + '$', str(value)): # TODO: Unicode?
+        match_objs = re.match("(" + test_regex + ')$', unicode(value), re.UNICODE)
+        if not match_objs:
             raise NoReverseMatch("Value %r didn't match regular expression %r" % (value, test_regex))
-        return str(value) # TODO: Unicode?
+        return match_objs.group(0)
 
 class RegexURLPattern(object):
     def __init__(self, regex, callback, default_args=None):
@@ -89,7 +90,7 @@
         # callback is either a string like 'foo.views.news.stories.story_detail'
         # which represents the path to a module and a view function name, or a
         # callable object (view).
-        self.regex = re.compile(regex)
+        self.regex = re.compile(regex, re.UNICODE)
         if callable(callback):
             self._callback = callback
         else:
@@ -143,7 +144,7 @@
     def __init__(self, regex, urlconf_name, default_kwargs=None):
         # regex is a string representing a regular expression.
         # urlconf_name is a string representing the module containing urlconfs.
-        self.regex = re.compile(regex)
+        self.regex = re.compile(regex, re.UNICODE)
         self.urlconf_name = urlconf_name
         self.callback = None
         self.default_kwargs = default_kwargs or {}
@@ -225,6 +226,7 @@
         from django.conf import settings
         urlconf = settings.ROOT_URLCONF
     resolver = RegexURLResolver(r'^/', urlconf)
+
     return resolver.resolve(path)
 
 def reverse(viewname, urlconf=None, args=None, kwargs=None):
Index: django/core/handlers/base.py
===================================================================
--- django/core/handlers/base.py	(revision 3582)
+++ django/core/handlers/base.py	(working copy)
@@ -60,6 +60,9 @@
             if response:
                 return response
 
+        # URLs are encoded in UTF8 (RFC3986, RFC2718)
+        path = path.decode('utf8')
+
         resolver = urlresolvers.RegexURLResolver(r'^/', settings.ROOT_URLCONF)
         try:
             callback, callback_args, callback_kwargs = resolver.resolve(path)
Index: tests/othertests/urlpatterns_reverse.py
===================================================================
--- tests/othertests/urlpatterns_reverse.py	(revision 3582)
+++ tests/othertests/urlpatterns_reverse.py	(working copy)
@@ -23,6 +23,8 @@
     ('^people/(?P<state>\w\w)/(?P<name>\w+)/$', NoReverseMatch, [], {'name': 'adrian'}),
     ('^people/(?P<state>\w\w)/(\w+)/$', NoReverseMatch, ['il'], {'name': 'adrian'}),
     ('^people/(?P<state>\w\w)/(\w+)/$', 'people/il/adrian/', ['adrian'], {'state': 'il'}),
+    ('^people/(?P<state>\w\w)/(\w+)/$', 'people/il/adrian/', ['adrian'], {'state': 'il'}),
+    ('^people/(?P<state>\w\w)/(?P<name>\w+)/$', u'people/il/adria\xd0/', [], {'state': u'il', 'name': u'adria\xd0'}),
 )
 
 def run_tests(verbosity=0):

Change History (7)

comment:1 by anonymous, 18 years ago

Component: Admin interfaceInternationalization
Owner: changed from Adrian Holovaty to hugo

comment:2 by Victor Ng <crankycoder@…>, 18 years ago

The urlresolvers.py patch can fail in some cases where grouping is more complicated.

Here's a safer version :

Index: django/core/urlresolvers.py
===================================================================
--- django/core/urlresolvers.py (revision 3582)
+++ django/core/urlresolvers.py (working copy)
@@ -79,9 +79,9 @@
             test_regex = grouped
         # Note we're using re.match here on purpose because the start of
         # to string needs to match.
-        if not re.match(test_regex + '$', str(value)): # TODO: Unicode?
+        if not re.match(test_regex + '$', unicode(value), re.UNICODE):
             raise NoReverseMatch("Value %r didn't match regular expression %r" % (value, test_regex))
-        return str(value) # TODO: Unicode?
+        return unicode(value)

 class RegexURLPattern(object):
     def __init__(self, regex, callback, default_args=None):
@@ -89,7 +89,7 @@
         # callback is either a string like 'foo.views.news.stories.story_detail'
         # which represents the path to a module and a view function name, or a
         # callable object (view).
-        self.regex = re.compile(regex)
+        self.regex = re.compile(regex, re.UNICODE)
         if callable(callback):
             self._callback = callback
         else:
@@ -143,7 +143,7 @@
     def __init__(self, regex, urlconf_name, default_kwargs=None):
         # regex is a string representing a regular expression.
         # urlconf_name is a string representing the module containing urlconfs.
-        self.regex = re.compile(regex)
+        self.regex = re.compile(regex, re.UNICODE)
         self.urlconf_name = urlconf_name
         self.callback = None
         self.default_kwargs = default_kwargs or {}
@@ -225,6 +225,7 @@
         from django.conf import settings
         urlconf = settings.ROOT_URLCONF
     resolver = RegexURLResolver(r'^/', urlconf)
+
     return resolver.resolve(path)

 def reverse(viewname, urlconf=None, args=None, kwargs=None):

comment:3 by Malcolm Tredinnick, 18 years ago

I'd like the tests on this to be a little more comprehensive, please. In particular, you are only testing what happens when you send valid UTF-8 encoding. What happens when we send something invalid? Having confidence that we handle that correctly is important for security purposes.

comment:4 by anonymous, 18 years ago

I'm not sure I underatand. How would you go about compromising the security of Django with non-UTF8 data in the URL?

I've added a little more code to trap the UnicodeDecodeError and raise an instance of SuspiciousOperation for non-UTF8 data, but to be honest - I don't see how this is more or less secure than the previous patch. Is this what you're looking for?

I'm also not exactly sure how I'm supposed to excercise the URLresolver for the passing case - there doesn't seem to be a sample test that lets me see how to do this. Can someone fill in the valid case of having UTF8 data in a URL decode properly in to test_unicode.py?

Index: django/core/urlresolvers.py
===================================================================
--- django/core/urlresolvers.py (revision 3582)
+++ django/core/urlresolvers.py (working copy)
@@ -79,9 +79,9 @@
             test_regex = grouped
         # Note we're using re.match here on purpose because the start of
         # to string needs to match.
-        if not re.match(test_regex + '$', str(value)): # TODO: Unicode?
+        if not re.match(test_regex + '$', unicode(value), re.UNICODE):
             raise NoReverseMatch("Value %r didn't match regular expression %r" % (value, test_regex))
-        return str(value) # TODO: Unicode?
+        return unicode(value)

 class RegexURLPattern(object):
     def __init__(self, regex, callback, default_args=None):
@@ -89,7 +89,7 @@
         # callback is either a string like 'foo.views.news.stories.story_detail'
         # which represents the path to a module and a view function name, or a
         # callable object (view).
-        self.regex = re.compile(regex)
+        self.regex = re.compile(regex, re.UNICODE)
         if callable(callback):
             self._callback = callback
         else:
@@ -143,7 +143,7 @@
     def __init__(self, regex, urlconf_name, default_kwargs=None):
         # regex is a string representing a regular expression.
         # urlconf_name is a string representing the module containing urlconfs.
-        self.regex = re.compile(regex)
+        self.regex = re.compile(regex, re.UNICODE)
         self.urlconf_name = urlconf_name
         self.callback = None
         self.default_kwargs = default_kwargs or {}
@@ -225,6 +225,7 @@
         from django.conf import settings
         urlconf = settings.ROOT_URLCONF
     resolver = RegexURLResolver(r'^/', urlconf)
+
     return resolver.resolve(path)

 def reverse(viewname, urlconf=None, args=None, kwargs=None):
Index: django/core/handlers/base.py
===================================================================
--- django/core/handlers/base.py        (revision 3582)
+++ django/core/handlers/base.py        (working copy)
@@ -50,10 +50,18 @@

     def get_response(self, path, request):
         "Returns an HttpResponse object for the given HttpRequest"
+        from exceptions import UnicodeDecodeError
         from django.core import exceptions, urlresolvers
         from django.core.mail import mail_admins
         from django.conf import settings

+        # URLs are encoded in UTF8 (RFC3986, RFC2718)
+
+        try:
+            path = path.decode('utf8')
+        except UnicodeDecodeError, ude:
+            raise exceptions.SuspiciousOperation, "Failed to decode the URL using UTF8.  This is probably a user agent."
+
         # Apply request middleware
         for middleware_method in self._request_middleware:
             response = middleware_method(request)
Index: tests/othertests/test_unicode.py
===================================================================
--- tests/othertests/test_unicode.py    (revision 0)
+++ tests/othertests/test_unicode.py    (revision 0)
@@ -0,0 +1,23 @@
+from django.core.handlers.base import BaseHandler
+from django.core.exceptions import SuspiciousOperation
+from django.conf import settings
+
+# Test that django.core.handlers.base.BaseHandler handles malicious UTF8
+# URLs by raising a SuspiciousOperation
+
+utf16 = u'some_other_encoding/foo/bar/'.encode('utf16')
+
+bh = BaseHandler()
+bh._request_middleware = [] # set an empty list of request middleware
+
+try:
+    bh.get_response(utf16, None)
+except SuspiciousOperation, so:
+    # expected error
+    pass
+else:
+    raise "expected a suspicious operation here"
+
+
+# TODO: Test that django.core.handlers.base.BaseHandler handles 'clean' UTF8 URLs
+# I'm actually not sure how to do this.
Index: tests/othertests/urlpatterns_reverse.py
===================================================================
--- tests/othertests/urlpatterns_reverse.py     (revision 3582)
+++ tests/othertests/urlpatterns_reverse.py     (working copy)
@@ -23,6 +23,8 @@
     ('^people/(?P<state>\w\w)/(?P<name>\w+)/$', NoReverseMatch, [], {'name': 'adrian'}),
     ('^people/(?P<state>\w\w)/(\w+)/$', NoReverseMatch, ['il'], {'name': 'adrian'}),
     ('^people/(?P<state>\w\w)/(\w+)/$', 'people/il/adrian/', ['adrian'], {'state': 'il'}),
+    ('^people/(?P<state>\w\w)/(\w+)/$', 'people/il/adrian/', ['adrian'], {'state': 'il'}),
+    ('^people/(?P<state>\w\w)/(?P<name>\w+)/$', u'people/il/adria\xd0/', [], {'state': u'il', 'name': u'adria\xd0'}),
 )

 def run_tests(verbosity=0):

comment:5 by anonymous, 18 years ago

Cc: victor.ng@… added

comment:6 by Malcolm Tredinnick, 18 years ago

There's possibly no need to explicitly turn decoding errors into a new type of exception (SuspiciousOperation) -- I'm not sure about that yet -- but we do need to be able to handle them. What you've done is a reasonable start, since now we know where to look for the error handling.

As far as why it's a security problem: any malformed input is potentially a security problem, because it can cause unpredictable behaviour, whether it just be a crash (denial of service) or something worse. So we should be trying to handle all sorts of malformed input. You are correct that we still need better "forwards" URL resolving testing. That needs to be written and should probably be done as soon as we can. I'll look over the rest of this when I get a chance (if nobody else gets to it first).

In future, please include patches as attachments to the ticket, it makes downloading, applying and reading them easier.

comment:7 by Victor Ng <victor.ng@…>, 18 years ago

Resolution: duplicate
Status: newclosed

#2588 supercedes this patch

Note: See TracTickets for help on using tickets.
Back to Top