Code

Opened 4 years ago

Closed 4 years ago

Last modified 3 years ago

#12849 closed (fixed)

django's development server raises an encoding exception when trying to colorize non-ascii text

Reported by: jype Owned by: nobody
Component: django-admin.py runserver Version: master
Severity: Keywords:
Cc: Triage Stage: Accepted
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: UI/UX:

Description

Utf-8 should not be hardcoded, but I don't know where to read terminal's charset from. I suppose utf-8 is acceptable as a fallback nowadays.

--- django/utils/termcolors.py  (revision 12416)
+++ django/utils/termcolors.py  (working copy)
@@ -38,7 +38,7 @@
         print colorize('and so should this')
         print 'this should not be red'
     """
-    text = str(text)
+    text = unicode(text).encode('utf-8')
     code_list = []
     if text == '' and len(opts) == 1 and opts[0] == 'reset':
         return '\x1b[%sm' % RESET

Attachments (3)

colorize_unicode.diff (2.8 KB) - added by frasern 4 years ago.
no_str.diff (997 bytes) - added by frasern 4 years ago.
combined.diff (3.8 KB) - added by frasern 4 years ago.
This is the combined version of the two existing patches.

Download all attachments as: .zip

Change History (18)

comment:1 Changed 4 years ago by russellm

  • milestone set to 1.2
  • Needs documentation unset
  • Needs tests unset
  • Patch needs improvement unset
  • Triage Stage changed from Unreviewed to Accepted

Changed 4 years ago by frasern

comment:2 Changed 4 years ago by frasern

I've created a patch (including tests) which allows unicode strings to be passed to colorize.

I had started out with the intention of using django.utils.encoding.DEFAULT_LOCALE_ENCODING (derived from locale.getdefaultlocale); however, I think it is more correct to use locale.getpreferredencoding and ended up going with that instead (in case the user has somehow overridden the default).

For reference, the preferred encoding can be quickly and temporarily altered by setting the LC_CTYPE environment variable:

$ python
>>> import locale
>>> locale.getpreferredencoding()
'UTF-8'
>>> from django.utils.termcolors import colorize
>>> colorize(u'\u00a3', fg='red')
'\x1b[31m\xc2\xa3\x1b[0m'
$ LC_CTYPE=en_GB.iso-8859-1 python
>>> import locale
>>> locale.getpreferredencoding()
'ISO-8859-1'
>>> from django.utils.termcolors import colorize
>>> colorize(u'\u00a3', fg='red')
'\x1b[31m\xa3\x1b[0m'

The function will still raise a UnicodeEncodeError if the unicode string cannot be encoded to the preferred encoding, for example:

$ LC_CTYPE=en_GB.iso-8859-1 python
>>> from django.utils.termcolors import colorize
>>> colorize(u'\u0160', fg='red')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "django/utils/termcolors.py", line 49, in colorize
    text = smart_str(text, encoding=encoding)
  File "django/utils/encoding.py", line 117, in smart_str
    return s.encode(encoding, errors)
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u0160' in position 0: ordinal not in range(256)

Note that the terminal will need to be set to use the same encoding in order for the colorized output to be displayed correctly. PuTTY, for example, has a configuration setting for this.

Changed 4 years ago by frasern

comment:3 Changed 4 years ago by frasern

I noticed that a couple of places which called colorize were explicitly converting their input to str.

This should no longer be necessary and the no_str.diff patch tidies up these instances.

comment:4 Changed 4 years ago by ericholscher

Should the tests from the original patch still be applied to the second? Would be good to have those as regressions so that this doesn't come back in the future.

comment:5 Changed 4 years ago by ericholscher

  • Needs tests set

comment:6 Changed 4 years ago by frasern

  • Needs tests unset

Hi Eric,

Not sure what you mean? The second patch is in addition to the first; sorry if that was unclear.

Fraser

comment:7 Changed 4 years ago by jacob

  • Patch needs improvement set

Frasern: please combine these into a single patch.

Changed 4 years ago by frasern

This is the combined version of the two existing patches.

comment:8 Changed 4 years ago by frasern

  • Patch needs improvement unset

That's the combined version generated and attached.

comment:9 Changed 4 years ago by jacob

  • Triage Stage changed from Accepted to Ready for checkin

comment:10 Changed 4 years ago by kmtracey

  • Triage Stage changed from Ready for checkin to Accepted

A traceback from the reported problem would have been helpful here.

I don't understand why the fix involves trying to guess an encoding. sys.stdout/sys.stderr have encodings, if we just pass unicode things should work (assuming the output device encoding can actually encode whatever is being sent). If the traceback from when the problem occurs shows it's being caused by some str() call in Django, then I think just removing the str()s would be sufficient.

comment:11 Changed 4 years ago by russellm

I agree that the encoding-guessing is bad -- plus, on at least one of my test boxes (an Ubuntu virtual server on RackSpace), it doesn't work. On my test box, the preferred encoding is "ANSI_X3.4-1968", which is the same as 'ascii'; attempting to encoding a unicode string as ascii will fail on any exotic characters.

Just outputting unicode strings isn't an option; sys.stdout.write() needs to take a byte string. Calling smart_str() to encode strings is about as good as we're going to get here.

comment:12 Changed 4 years ago by russellm

  • Resolution set to fixed
  • Status changed from new to closed

(In [12803]) Fixed #12849 -- Corrected the way strings are encoded for display by the colorizer so that they work with unicode. Thanks to jype for the report, and frasern for his work on the issue.

comment:13 follow-up: Changed 4 years ago by kmtracey

Huh. The description of file.encoding here: http://docs.python.org/library/stdtypes.html?highlight=write#file.write gave me the impression that Python would automatically use the file's encoding to convert unicode passed to write(), but testing shows it's not doing that. Even if it did, now that I think about it for the colorize case, we're adding control bytes so I expect we need to do the encoding to a bytestring prior to adding those bytes.

I'm not sure the committed fix is the best alternative, though. When printing out management command errors I think it would be better to use sys.stderr.encoding, if it exists, since that is where we are sending the output. Windows for example won't be using utf-8 as the terminal encoding (by default), so the fix as committed will result in unreadable output for non-ASCII exception data on Windows. Better than an exception I suppose but I think it would be better to attempt to use the right encoding, and also specify replace rather than strict for error handling so that if the data can't be encoded in the target charset then we still don't raise an exception.

Also simply removing the text = str(text) call from the top of colorize is probably just going to move the exception to the end of that routine, when return ('\x1b[%sm' % ';'.join(code_list)) + text will implicitly convert a unicode text value to str. Which brings me to: I'd still like to understand when this problem crops up, for the colorize case. Unicode query string parms are percent-encoded in output -- under what circumstances is the server being asked to colorize unicode data containing non-ASCII characters?

(Your test box sounds like the build bots that were all broken by a recent test change. The one I looked at -- it was an Ubuntu server also -- had no language packs installed, so even though the locale it was supposedly configured to use had an encoding of utf-8, python was falling back to that highfalutin-sounding ASCII encoding as "preferred". I believe installing the language pack for the specified locale language fixed it. )

comment:14 in reply to: ↑ 13 Changed 4 years ago by russellm

Replying to kmtracey:

I'm not sure the committed fix is the best alternative, though. When printing out management command errors I think it would be better to use sys.stderr.encoding, if it exists, since that is where we are sending the output. Windows for example won't be using utf-8 as the terminal encoding (by default), so the fix as committed will result in unreadable output for non-ASCII exception data on Windows. Better than an exception I suppose but I think it would be better to attempt to use the right encoding, and also specify replace rather than strict for error handling so that if the data can't be encoded in the target charset then we still don't raise an exception.

There are two issues here.

Firstly, whatever encoding we choose, there are going to be problems. On platforms where stderr is ASCII (or equivalent), there is no reliable way to print non-ASCII characters. So, we need to choose ignore/replace (to fail silently) or strict (which will raise the same errors being reported by this ticket). So on ASCII terminals, we're always going to have problems -- it's just a matter of how we hide (or raise) the errors. That said: For the record, my ANSI_X3.4-1968 test box actually manages to print the special unicode characters correctly as long as the bytestring is encoded utf-8. This is why I checked in the patch I did.

Secondly, if we do anything in this area, we're going to need to audit pretty much all the current management commands. At the moment, there is a certain amount of confusion regarding how and when text is encoded for display; for example, sqlall builds everything in unicode, and encodes to UTF-8 before returning the value for the base command framework to print the value. We would be well served to do a full teardown here and ensure we keep unicode right up to the last moment -- but that's a much bigger patch.

[12849] fixed every observable case that I could generate, on UTF-8 and ASCII terminals. I have no doubt that there are cp1252 or KOI8-R terminals that will still have problems, but we'll need some new test cases (and platforms on which to test them).

Which brings me to: I'd still like to understand when this problem crops up, for the colorize case. Unicode query string parms are percent-encoded in output -- under what circumstances is the server being asked to colorize unicode data containing non-ASCII characters?

The test case I've been using is to have a database model with a field that has db_column=u'hello\u00c2\u00c3'. This is output as the column name under sqlall, which broke when it was passed into str() at the start of colorize.

comment:15 Changed 3 years ago by jacob

  • milestone 1.2 deleted

Milestone 1.2 deleted

Add Comment

Modify Ticket

Change Properties
<Author field>
Action
as closed
as The resolution will be set. Next status will be 'closed'
The resolution will be deleted. Next status will be 'new'
Author


E-mail address and user name can be saved in the Preferences.

 
Note: See TracTickets for help on using tickets.