Opened 11 years ago

Closed 11 years ago

#19098 closed Bug (fixed)

UnicodeDecodeError when including URLs in Windows with non-ASCII paths

Reported by: artamoshin
Owned by: nobody
Component: Core (URLs)
Version: 1.4
Severity: Normal
Keywords: non-ascii unicode windows path UnicodeDecodeError url include
Cc:
Triage Stage: Accepted
Has patch: yes
Needs documentation: no
Needs tests: no
Patch needs improvement: no
Easy pickings: no
UI/UX: no

Description

Including app URLs raises UnicodeDecodeError when the project path contains non-ASCII characters (e.g. 'c:\Users\Александр\Documents\Projects').

Traceback (most recent call last):
  File "C:\Python27\lib\wsgiref\handlers.py", line 85, in run
    self.result = application(self.environ, self.start_response)
  File "C:\Python27\lib\site-packages\django\contrib\staticfiles\handlers.py", line 67, in __call__
    return self.application(environ, start_response)
  File "C:\Python27\lib\site-packages\django\core\handlers\wsgi.py", line 241, in __call__
    response = self.get_response(request)
  File "C:\Python27\lib\site-packages\django\core\handlers\base.py", line 146, in get_response
    response = debug.technical_404_response(request, e)
  File "C:\Python27\lib\site-packages\django\views\debug.py", line 443, in technical_404_response
    'reason': smart_str(exception, errors='replace'),
  File "C:\Python27\lib\site-packages\django\utils\encoding.py", line 116, in smart_str
    return str(s)
  File "C:\Python27\lib\site-packages\django\core\urlresolvers.py", line 235, in __repr__
    return smart_str(u'<%s %s (%s:%s) %s>' % (self.__class__.__name__, self.urlconf_name, self.app_name, self.namespace, self.regex.pattern))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 37: ordinal not in range(128)

This happens because the string representation of the module is a bytestring that contains the non-ASCII path, and formatting it with u'%s' % self.urlconf_name fails.

My solution: convert string representation with sys.getfilesystemencoding() to Unicode.
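
A minimal sketch of that idea follows. It is written for Python 3 so that it runs today, whereas the ticket concerns Python 2.7, where str is a bytestring. The helper name to_text and the cp1251 sample bytes are assumptions made for illustration; they are not taken from Django or from the attached patch.

```python
import sys


def to_text(value, encoding=None):
    """Return value as text, decoding bytestrings with the filesystem encoding.

    Hypothetical helper for illustration only; not part of Django.
    """
    if isinstance(value, bytes):
        # Default to the encoding the OS uses for file names.
        encoding = encoding or sys.getfilesystemencoding()
        # errors='replace' keeps a repr from raising on undecodable bytes.
        return value.decode(encoding, errors='replace')
    return str(value)


# Assumed sample: a module repr whose path holds a cp1251-encoded Cyrillic name.
raw = b"<module 'included_urls' from 'C:\\\xd2\xe5\xf1\xf2\\included_urls.pyc'>"
text = to_text(raw, encoding='cp1251')
```

Passing the encoding explicitly, as in the last line, makes the sketch independent of the machine it runs on; in a real Windows setup the default from sys.getfilesystemencoding() would be used instead.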

Attachments (1)

regex-url-resolver_repr_unicode.diff (743 bytes ) - added by artamoshin 11 years ago.


Change History (9)

by artamoshin, 11 years ago

comment:1 by Claude Paroz, 11 years ago

This looks very similar to #17566, which was fixed when I committed the fix for #17892. Would it be possible for you to test on master?

in reply to:  1 comment:2 by artamoshin, 11 years ago

Replying to claudep:

This looks very similar to #17566, which was fixed when I committed the fix for #17892. Would it be possible for you to test on master?

Unfortunately, on master (commit c99ad64) UnicodeDecodeError is raised too, because repr(self.urlconf_name) still contains non-ASCII bytes and fails when formatted with the unicode string '<%s %s (%s:%s) %s>'.

I think there are 3 ways:

  • decode urlconf_name with sys.getfilesystemencoding()
  • use bytestring format string: b'<%s %s (%s:%s) %s>'
  • drop filename and use only module name: self.urlconf_name.__name__
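
The third option could look roughly like the sketch below. ResolverSketch is a hypothetical stand-in written for Python 3, not Django's resolver class; it only carries the attributes that the format string in the traceback uses.

```python
import types


class ResolverSketch:
    """Hypothetical stand-in holding only what __repr__ needs."""

    def __init__(self, urlconf_name, app_name=None, namespace=None, pattern='^'):
        self.urlconf_name = urlconf_name
        self.app_name = app_name
        self.namespace = namespace
        self.pattern = pattern

    def __repr__(self):
        urlconf = self.urlconf_name
        if isinstance(urlconf, types.ModuleType):
            # Use the dotted module name; repr(module) would embed the file path.
            urlconf = urlconf.__name__
        return '<%s %s (%s:%s) %s>' % (
            self.__class__.__name__, urlconf,
            self.app_name, self.namespace, self.pattern,
        )


# A throwaway module object whose __file__ contains non-ASCII characters.
module = types.ModuleType('testproject.included_urls')
module.__file__ = 'C:\\Тест\\testproject\\included_urls.pyc'
shown = repr(ResolverSketch(module))
```

The trade-off is that the file path no longer appears in the output, so the repr is always ASCII-safe for ASCII module names but is less informative when debugging.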

comment:3 by Claude Paroz, 11 years ago

Triage Stage: Unreviewed → Accepted

Still in master, I see another way: do not call repr() on self.urlconf_name (which should be a proper unicode string). Can you test that?

in reply to:  3 comment:4 by artamoshin, 11 years ago

Replying to claudep:

Still in master, I see another way: do not call repr() on self.urlconf_name (which should be a proper unicode string). Can you test that?

Formatting with u'%s' % self.urlconf_name calls repr() implicitly anyway. In Python (at least 2.7), repr(module) always returns a bytestring (not Unicode) like <module 'modulename' from 'non/ascii/path/modulename.py'>, which may contain bytes above 127. The implicit decoding during formatting then uses the 'ascii' codec, which raises UnicodeDecodeError because it cannot handle those bytes.
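
The decoding step can be reproduced in isolation. Python 3 has no implicit bytes-to-text conversion, so this sketch performs explicitly the decode that Python 2 did behind the scenes; the sample bytes assume a cp1251 locale and are illustrative only.

```python
# Assumed sample: what repr(module) might return on Python 2.7 under cp1251.
module_repr = b"<module 'included_urls' from 'C:\\\xd2\xe5\xf1\xf2\\included_urls.pyc'>"

try:
    # Python 2 ran the equivalent of this for u'%s' % bytestring.
    module_repr.decode('ascii')
    failed = False
except UnicodeDecodeError as exc:
    failed = True
    # The exception records the input and the offset of the first bad byte.
    first_bad_byte = exc.object[exc.start]
```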


comment:5 by Claude Paroz, 11 years ago

What exactly is the value of self.urlconf_name at the start of the __repr__ method? If it is a proper unicode string, then including it in the format string should not be a problem (unicode on both sides). Currently, repr(self.urlconf_name) produces some encoded characters, which then cause the UnicodeDecodeError. Sorry if I am missing the point; I am trying to understand the issue, as I cannot reproduce it locally.

in reply to:  5 comment:6 by artamoshin, 11 years ago

Replying to claudep:

What exactly is the value of self.urlconf_name at the start of the __repr__ method? If it is a proper unicode string, then including it in the format string should not be a problem (unicode on both sides). Currently, repr(self.urlconf_name) produces some encoded characters, which then cause the UnicodeDecodeError. Sorry if I am missing the point; I am trying to understand the issue, as I cannot reproduce it locally.

No, __repr__ does NOT return Unicode!

print self.urlconf_name # <module 'testproject.included_urls' from 'C:\Тест\testproject\included_urls.pyc'>

# repr() returns a bytestring that is not ASCII-safe:
print repr(self.urlconf_name)       # <module 'testproject.included_urls' from 'C:\Тест\testproject\included_urls.pyc'>
print type(repr(self.urlconf_name)) # <type 'str'>

print self.urlconf_name.__repr__()       # <module 'testproject.included_urls' from 'C:\Тест\testproject\included_urls.pyc'>
print type(self.urlconf_name.__repr__()) # <type 'str'>

print self.urlconf_name.__file__       # C:\Тест\testproject\included_urls.pyc
print type(self.urlconf_name.__file__) # <type 'str'>


# ASCII-safe:
print repr(repr(self.urlconf_name)) # "<module 'testproject.included_urls' from 'C:\\\xd2\xe5\xf1\xf2\\testproject\\included_urls.pyc'>"


# Formatting:
b'%s' % repr(self.urlconf_name) # OK
u'%s' % repr(self.urlconf_name).decode('mbcs') # OK
u'%s' % repr(self.urlconf_name) # UnicodeDecodeError

You can reproduce this by renaming the project path to use non-Latin characters, so that self.urlconf_name.__file__ (a bytestring) contains non-ASCII bytes.

comment:7 by Claude Paroz, 11 years ago

Sorry, I was wrongly assuming that self.urlconf_name was unicode, which it is not.

Then, when I test with a non-ASCII character in the project path, Django breaks in several places. I am not saying that we should not try to fix it, but currently it is probably not safe to do so...

comment:8 by Claude Paroz, 11 years ago

Resolution: fixed
Status: new → closed

We recently made progress in how we handle non-ascii paths (#19357). I've just tested technical_404_response and it ran fine. Reopen if you can reproduce on recent code.
