#12849 closed (fixed)
django's development server raises an encoding exception when trying to colorize non-ascii text
Reported by: | jype | Owned by: | nobody |
---|---|---|---|
Component: | django-admin.py runserver | Version: | dev |
Severity: | Keywords: | ||
Cc: | Triage Stage: | Accepted | |
Has patch: | yes | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
Utf-8 should not be hardcoded, but I don't know where to read terminal's charset from. I suppose utf-8 is acceptable as a fallback nowadays.
--- django/utils/termcolors.py (revision 12416) +++ django/utils/termcolors.py (working copy) @@ -38,7 +38,7 @@ print colorize('and so should this') print 'this should not be red' """ - text = str(text) + text = unicode(text).encode('utf-8') code_list = [] if text == '' and len(opts) == 1 and opts[0] == 'reset': return '\x1b[%sm' % RESET
Attachments (3)
Change History (18)
comment:1 by , 15 years ago
milestone: | → 1.2 |
---|---|
Triage Stage: | Unreviewed → Accepted |
by , 15 years ago
Attachment: | colorize_unicode.diff added |
---|
comment:2 by , 15 years ago
by , 15 years ago
Attachment: | no_str.diff added |
---|
comment:3 by , 15 years ago
I noticed that a couple of places which called colorize
were explicitly converting their input to str
.
This should no longer be necessary and the no_str.diff patch tidies up these instances.
comment:4 by , 15 years ago
Should the tests from the original patch still be applied to the second? Would be good to have those as regressions so that this doesn't come back in the future.
comment:5 by , 15 years ago
Needs tests: | set |
---|
comment:6 by , 15 years ago
Needs tests: | unset |
---|
Hi Eric,
Not sure what you mean? The second patch is in addition to the first; sorry if that was unclear.
Fraser
comment:7 by , 15 years ago
Patch needs improvement: | set |
---|
Frasern: please combine these into a single patch.
by , 15 years ago
Attachment: | combined.diff added |
---|
This is the combined version of the two existing patches.
comment:8 by , 15 years ago
Patch needs improvement: | unset |
---|
That's the combined version generated and attached.
comment:9 by , 15 years ago
Triage Stage: | Accepted → Ready for checkin |
---|
comment:10 by , 15 years ago
Triage Stage: | Ready for checkin → Accepted |
---|
A traceback from the reported problem would have been helpful here.
I don't understand why the fix involves trying to guess an encoding. sys.stdout/sys.stderr have encodings, if we just pass unicode things should work (assuming the output device encoding can actually encode whatever is being sent). If the traceback from when the problem occurs shows it's being caused by some str()
call in Django, then I think just removing the str()
s would be sufficient.
comment:11 by , 15 years ago
I agree that the encoding-guessing is bad -- plus, on at least one of my test boxes (an Ubuntu virtual server on RackSpace), it doesn't work. On my test box, the preferred encoding is "ANSI_X3.4-1968", which is the same as 'ascii'; attempting to encoding a unicode string as ascii will fail on any exotic characters.
Just outputting unicode strings isn't an option; sys.stdout.write() needs to take a byte string. Calling smart_str() to encode strings is about as good as we're going to get here.
comment:12 by , 15 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
follow-up: 14 comment:13 by , 15 years ago
Huh. The description of file.encoding here: http://docs.python.org/library/stdtypes.html?highlight=write#file.write gave me the impression that Python would automatically use the file's encoding to convert unicode passed to write(), but testing shows it's not doing that. Even if it did, now that I think about it for the colorize case, we're adding control bytes so I expect we need to do the encoding to a bytestring prior to adding those bytes.
I'm not sure the committed fix is the best alternative, though. When printing out management command errors I think it would be better to use sys.stderr.encoding, if it exists, since that is where we are sending the output. Windows for example won't be using utf-8 as the terminal encoding (by default), so the fix as committed will result in unreadable output for non-ASCII exception data on Windows. Better than an exception I suppose but I think it would be better to attempt to use the right encoding, and also specify replace rather than strict for error handling so that if the data can't be encoded in the target charset then we still don't raise an exception.
Also simply removing the text = str(text)
call from the top of colorize is probably just going to move the exception to the end of that routine, when return ('\x1b[%sm' % ';'.join(code_list)) + text
will implicitly convert a unicode text value to str. Which brings me to: I'd still like to understand when this problem crops up, for the colorize case. Unicode query string parms are percent-encoded in output -- under what circumstances is the server being asked to colorize unicode data containing non-ASCII characters?
(Your test box sounds like the build bots that were all broken by a recent test change. The one I looked at -- it was an Ubuntu server also -- had no language packs installed, so even though the locale it was supposedly configured to use had an encoding of utf-8, python was falling back to that highfalutin-sounding ASCII encoding as "preferred". I believe installing the language pack for the specified locale language fixed it. )
comment:14 by , 15 years ago
Replying to kmtracey:
I'm not sure the committed fix is the best alternative, though. When printing out management command errors I think it would be better to use sys.stderr.encoding, if it exists, since that is where we are sending the output. Windows for example won't be using utf-8 as the terminal encoding (by default), so the fix as committed will result in unreadable output for non-ASCII exception data on Windows. Better than an exception I suppose but I think it would be better to attempt to use the right encoding, and also specify replace rather than strict for error handling so that if the data can't be encoded in the target charset then we still don't raise an exception.
There are two issues here.
Firstly, whatever encoding we choose, there are going to be problems. On platforms where stderr is ASCII (or equivalent), there is no reliable way to print non-ASCII characters. So, we need to choose ignore/replace (to fail silently) or strict (which will raise the same errors being reported by this ticket). So on ASCII terminals, we're always going to have problems -- it's just a matter of how we hide (or raise) the errors. That said: For the record, my ANSI_X3.4-1968 test box actually manages to print the special unicode characters correctly as long as the bytestring is encoded utf-8. This is why I checked in the patch I did.
Secondly, if we do anything in this area, we're going to need to audit pretty much all the current management commands. At the moment, there is a certain amount of confusion regarding how and when text is encoded for display; for example, sqlall builds everything in unicode, and encodes to UTF-8 before returning the value for the base command framework to print the value. We would be well served to do a full teardown here and ensure we keep unicode right up to the last moment -- but that's a much bigger patch.
[12849] fixed every observable case that I could generate, on UTF-8 and ASCII terminals. I have no doubt that there are cp1252 or KOI8-R terminals that will still have problems, but we'll need some new test cases (and platforms on which to test them).
Which brings me to: I'd still like to understand when this problem crops up, for the colorize case. Unicode query string parms are percent-encoded in output -- under what circumstances is the server being asked to colorize unicode data containing non-ASCII characters?
The test case I've been using is to have a database model with a field that has db_column=u'hello\u00c2\u00c3'. This is output as the column name under sqlall, which broke when it was passed into str() at the start of colorize.
I've created a patch (including tests) which allows unicode strings to be passed to
colorize
.I had started out with the intention of using
django.utils.encoding.DEFAULT_LOCALE_ENCODING
(derived fromlocale.getdefaultlocale
); however, I think it is more correct to uselocale.getpreferredencoding
and ended up going with that instead (in case the user has somehow overridden the default).For reference, the preferred encoding can be quickly and temporarily altered by setting the
LC_CTYPE
environment variable:The function will still raise a
UnicodeEncodeError
if the unicode string cannot be encoded to the preferred encoding, for example:Note that the terminal will need to be set to use the same encoding in order for the colorized output to be displayed correctly. PuTTY, for example, has a configuration setting for this.