#170 closed (invalid)
Unicode field names cause UnicodeEncodeError in main admin handler
Reported by: | Owned by: | Adrian Holovaty | |
---|---|---|---|
Component: | contrib.admin | Version: | 1.0 |
Severity: | major | Keywords: | Unicode |
Cc: | moof@… | Triage Stage: | Unreviewed |
Has patch: | no | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
I'm from China. I can't type words in Chinese; there is a UnicodeEncodeError. The tracebacks are below.
#####################################################
There's been an error:
Traceback (most recent call last):
File "c:\python24\Lib\site-packages\django-1.0.0-py2.4.egg\django\core\handlers\base.py", line 63, in get_response
return callback(request, param_dict)
File "c:\python24\Lib\site-packages\django-1.0.0-py2.4.egg\django\views\admin\main.py", line 424, in change_list
result_repr = strip_tags(str(field_val))
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-9: ordinal not in range(128)
#####################################################
There's been an error:
Traceback (most recent call last):
File "c:\python24\Lib\site-packages\django-1.0.0-py2.4.egg\django\core\handlers\base.py", line 63, in get_response
return callback(request, param_dict)
File "c:\python24\Lib\site-packages\django-1.0.0-py2.4.egg\django\views\admin\main.py", line 956, in change_stage
return HttpResponse(t.render(c))
File "c:\python24\Lib\site-packages\django-1.0.0-py2.4.egg\django\core\template.py", line 116, in render
return self.nodelist.render(context)
File "c:\python24\Lib\site-packages\django-1.0.0-py2.4.egg\django\core\template.py", line 437, in render
bits.append(node.render(context))
File "c:\python24\Lib\site-packages\django-1.0.0-py2.4.egg\django\core\template_loader.py", line 97, in render
return compiled_parent.render(context)
File "c:\python24\Lib\site-packages\django-1.0.0-py2.4.egg\django\core\template.py", line 116, in render
return self.nodelist.render(context)
File "c:\python24\Lib\site-packages\django-1.0.0-py2.4.egg\django\core\template.py", line 437, in render
bits.append(node.render(context))
File "c:\python24\Lib\site-packages\django-1.0.0-py2.4.egg\django\core\template_loader.py", line 97, in render
return compiled_parent.render(context)
File "c:\python24\Lib\site-packages\django-1.0.0-py2.4.egg\django\core\template.py", line 116, in render
return self.nodelist.render(context)
File "c:\python24\Lib\site-packages\django-1.0.0-py2.4.egg\django\core\template.py", line 437, in render
bits.append(node.render(context))
File "c:\python24\Lib\site-packages\django-1.0.0-py2.4.egg\django\core\defaulttags.py", line 171, in render
return self.nodelist_true.render(context)
File "c:\python24\Lib\site-packages\django-1.0.0-py2.4.egg\django\core\template.py", line 437, in render
bits.append(node.render(context))
File "c:\python24\Lib\site-packages\django-1.0.0-py2.4.egg\django\core\template_loader.py", line 54, in render
result = self.nodelist.render(context)
File "c:\python24\Lib\site-packages\django-1.0.0-py2.4.egg\django\core\template.py", line 437, in render
bits.append(node.render(context))
File "c:\python24\Lib\site-packages\django-1.0.0-py2.4.egg\django\core\defaulttags.py", line 171, in render
return self.nodelist_true.render(context)
File "c:\python24\Lib\site-packages\django-1.0.0-py2.4.egg\django\core\template.py", line 437, in render
bits.append(node.render(context))
File "c:\python24\Lib\site-packages\django-1.0.0-py2.4.egg\django\core\template.py", line 467, in render
output = resolve_variable_with_filters(self.var_string, context)
File "c:\python24\Lib\site-packages\django-1.0.0-py2.4.egg\django\core\template.py", line 412, in resolve_variable_with_filters
obj = registered_filters[name][0](obj, arg)
File "c:\python24\Lib\site-packages\django-1.0.0-py2.4.egg\django\core\defaultfilters.py", line 185, in striptags
value = str(value)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-9: ordinal not in range(128)
Attachments (1)
Change History (28)
comment:1 by , 19 years ago
Resolution: | → worksforme |
---|---|
Status: | new → closed |
comment:2 by , 19 years ago
Resolution: | worksforme |
---|---|
Status: | closed → reopened |
There are NO special templates, views, or data.
Following tutorial 2, I ran the sample app, polls. I created a new poll and typed the poll's title in Chinese; then the "UnicodeEncodeError" exception was raised and the polls app crashed.
My English is poor. Sorry.
comment:5 by , 19 years ago
Looks like your browser sends in stuff with Chinese characters as unicode and marks the submitted form data as unicode, too. And the python stuff builds a unicode string out of that correctly. But django uses just str() on strings and so can't cope with stuff outside ISO-8859-1 - it would have to check whether the string is type('') or type(u'') and use stuff.encode('utf-8') in the latter case. Best would be if django would go fully unicode and utf-8, because that would solve all special character issues once and for all ...
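The type check described in this comment can be sketched as follows. This is only an illustration, written for Python 3, where `str` plays the role of Python 2's `unicode` and `bytes` the role of Python 2's `str`. The helper name `to_utf8_bytes` is hypothetical and is not part of Django.

```python
def to_utf8_bytes(value):
    """Return value as UTF-8 encoded bytes (hypothetical helper, not Django API)."""
    if isinstance(value, bytes):
        # Already a byte string: pass it through untouched.
        return value
    if not isinstance(value, str):
        # Non-string objects are converted to text first.
        value = str(value)
    # Encode explicitly instead of relying on an implicit ASCII default.
    return value.encode("utf-8")


title = "中文标题"

# Encoding non-ASCII text with the ASCII codec fails in the same way
# as the str() calls in the tracebacks above.
try:
    title.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

encoded = to_utf8_bytes(title)
```

The point of the sketch is that the encoding is named explicitly at the boundary, so the ASCII default never gets a chance to run.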
comment:6 by , 19 years ago
Severity: | normal → major |
---|
When just trying it with Safari as browser and a django installation on OS X, I got the same problem with just a normal German umlaut (ISO-8859-1 char) - I wanted to add a poll with an umlaut in its text: "erzähl was geht". Got the following traceback:
{{
There's been an error:
Traceback (most recent call last):
File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/site-packages/django/core/handlers/base.py", line 63, in get_response
return callback(request, param_dict)
File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/site-packages/django/views/admin/main.py", line 427, in change_list
result_repr = strip_tags(str(field_val))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 3: ordinal not in range(128)
}}
This really is a problem for many users that don't live in 7-bit-land, so I upped it to "major" ...
comment:7 by , 19 years ago
I just did some more digging around and found that Django outputs XHTML with charset=utf-8 in the admin pages and that of course makes some browsers return stuff as utf-8 in forms. That utf-8 stuff is turned into unicode - I think it's the POST payload parsing that uses the email module if the stuff sent is multipart (and a browser that wants to send in charset declarations has to use multipart encoding, even if it's just sending one part).
I think the correct way would be to make the httpwrapper.py stuff always produce unicode strings, even if the input isn't unicode (just do stringstuff.decode('iso-8859-1')) and then on the output side not to use str() but to use stringstuff.encode('utf-8') to always produce utf-8 encoded strings. But that looks like a _lot_ of work.
The alternative would be to turn all stuff that is passed in via the email module to utf-8 - but then you would have the possibility of different string encodings in simple strings, because stuff that's not parsed by the email module will be essentially iso-8859-1 ...
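The "decode on input, encode on output" idea from this comment might look roughly like the sketch below. It is written for Python 3 (`bytes` in, `str` inside, `bytes` out); the function names `decode_input` and `encode_output` are made up for illustration and do not exist in Django.

```python
def decode_input(raw, charset="utf-8"):
    """Turn raw request bytes into text; the caller has to know the charset."""
    return raw.decode(charset)


def encode_output(text):
    """Turn text back into bytes for the response, always as UTF-8."""
    return text.encode("utf-8")


# The same form value as two different browsers might send it.
latin1_payload = "erzähl was geht".encode("iso-8859-1")
utf8_payload = "erzähl was geht".encode("utf-8")

text_a = decode_input(latin1_payload, charset="iso-8859-1")
text_b = decode_input(utf8_payload, charset="utf-8")

# Once decoded, both inputs are the same text and encode to the same output.
response_body = encode_output(text_a)

# Guessing the wrong charset does not raise here; it silently yields mojibake.
mojibake = decode_input(utf8_payload, charset="iso-8859-1")
```

The last line shows why a hard-coded `decode('iso-8859-1')` is risky: ISO-8859-1 accepts every byte, so a wrong guess produces garbage text rather than an error.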
comment:8 by , 19 years ago
An additional problem is with sqlite: pysqlite2 always returns Unicode strings, at least on OS X with Python 2.3. This poses problems with data that has chars above chr(128), of course, as later code in the system will have the same problems as with the stuff from the form-payload-parsing ...
comment:9 by , 19 years ago
In my quest to find a solution that's easy to do I followed a different approach this time: I changed the code so that all strings that are parsed are turned into utf-8 and not passed on as unicode. This isn't a perfect solution, but I wanted to see how it works out. I used the following diff:
Index: utils/httpwrappers.py
===================================================================
--- utils/httpwrappers.py (revision 335)
+++ utils/httpwrappers.py (working copy)
@@ -52,7 +52,10 @@
                     'content': submessage.get_payload(),
                 })
             else:
-                POST.appendlist(name_dict['name'], submessage.get_payload())
+                pvalue = submessage.get_payload()
+                if type(pvalue) == unicode:
+                    pvalue = pvalue.encode('utf-8')
+                POST.appendlist(name_dict['name'], pvalue)
     return POST, FILES
 
 class QueryDict(datastructures.MultiValueDict):
Then I stumbled over the above problem with sqlite and utf-8/unicode storage. So this might be a possible approach - just turn every input into utf-8 and make sure that every form field is correct utf-8 stuff. The problem: when you use string indexing, your utf-8 stuff won't behave nicely, as a char isn't always a single string index position any more. Additionally the sqlite backend needs to be fixed in that it should return utf-8 encoded strings, too.
Oh, and you shouldn't send out only the meta http-equiv stuff but send out actual HTTP headers - Content-type: text/html; charset=utf-8 - instead, because otherwise there could be problems with default charsets added by Apache that override the http-equiv stuff. And http-equiv is a very bad hack anyway. Better to use real headers.
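To make the "real headers" suggestion concrete, here is a small sketch of a response that declares its charset in the HTTP `Content-Type` header. It uses a bare WSGI-style callable purely for illustration (WSGI is an assumption here, not what Django's handlers looked like at the time), and `fake_start_response` is a hypothetical stand-in for a server.

```python
def app(environ, start_response):
    """Minimal WSGI-style callable that sends the charset as a real header."""
    body = "<p>erzähl was geht</p>".encode("utf-8")
    headers = [
        # The charset travels in the HTTP header, not only in <meta http-equiv>.
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]


captured = {}


def fake_start_response(status, headers):
    # Stand-in for a server: just record what the application sent.
    captured["status"] = status
    captured["headers"] = dict(headers)


chunks = app({}, fake_start_response)
```

Note that `Content-Length` is computed from the encoded bytes, not from the number of characters.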
comment:10 by , 19 years ago
Thanks, hugo!
I use Opera 8 as my browser. The database I use is sqlite3. I don't use Apache, just the django web server.
Maybe you could release an ANSI version of django, and a UNICODE version too.
comment:11 by , 19 years ago
Ok, I dug a bit deeper and checked the two alternatives we have:
1) change all input to unicode and turn everything to utf-8 on the interfaces to databases and the web. This came out rather nasty, because there are loads of situations where this will break - strings that are constructed in ways that won't work directly with unicode, for example the automatic object naming is done with str usage (by just turning objects into strings via '%s' interpolation). And there are major problems with the postgresql driver if that sees unicode strings - the application just dies without traceback.
2) change all input to utf8 bytestrings and keep the rest as is. This did work out nicely, but has problems with sqlite, because that one does return only unicode strings (at least if you run sqlite with unicode support), so the sqlite backend will need additional work to make it always return utf8 bytestrings.
The problem with solution 2 is, it will break when people expect strings to be one-position-is-one-character - that's only true for unicode, not for utf-8. But as long as people use highlevel functions to work on utf-8 strings it should be ok. And if we do know that all bytestrings must be utf-8 encoded, we can just as easily turn them into unicode if we need this for a function. But as soon as somebody mixes iso-8859-1 bytestrings and utf8 bytestrings, this will break in ugly ways ...
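The "one position is one character" caveat can be seen in a few lines. The sketch below uses Python 3 names (`str` for text, `bytes` for the utf-8 bytestring); the variable names are only illustrative.

```python
text = "erzähl"
data = text.encode("utf-8")

char_count = len(text)   # counts characters
byte_count = len(data)   # counts bytes; "ä" takes two of them in UTF-8

# Slicing the byte string by position can cut a character in half.
byte_prefix = data[:4]
try:
    byte_prefix.decode("utf-8")
    split_ok = True
except UnicodeDecodeError:
    split_ok = False

# Decoding first, slicing the text, then re-encoding keeps characters whole.
safe_prefix = data.decode("utf-8")[:4].encode("utf-8")
```

This is why the comment recommends high-level functions: anything that indexes or slices should work on decoded text, not on the encoded bytes.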
My patch does introduce a utf8 helper function in django.utils.text that will return an utf-8 encoded string and does autodiscovery on strings. Of course this only works for pure utf-8 strings - if you mix them up with iso-8859-1 strings, this will return doubly encoded utf-8 chars ...
Here is the patch:
Index: utils/httpwrappers.py
===================================================================
--- utils/httpwrappers.py (revision 335)
+++ utils/httpwrappers.py (working copy)
@@ -1,5 +1,6 @@
 from Cookie import SimpleCookie
 from pprint import pformat
+from django.utils.text import utf8
 import datastructures
 
 DEFAULT_MIME_TYPE = 'text/html'
@@ -52,7 +53,7 @@
                     'content': submessage.get_payload(),
                 })
             else:
-                POST.appendlist(name_dict['name'], submessage.get_payload())
+                POST.appendlist(name_dict['name'], utf8(submessage.get_payload()))
     return POST, FILES
 
 class QueryDict(datastructures.MultiValueDict):
Index: utils/text.py
===================================================================
--- utils/text.py (revision 335)
+++ utils/text.py (working copy)
@@ -1,5 +1,20 @@
 import re
 
+def utf8(ob, onlystr=True):
+    """
+    Turns a given object into an utf-8 string. If onlystr is True, only string
+    types will be handled, other objects will pass unchanged. if onlystr is False,
+    objects will be str()ed and the result will be handled. This function will
+    handle utf-8 and iso-8859-1 autodiscovery automatically.
+    """
+    if onlystr:
+        if type(ob) not in (str, unicode): return ob
+    else: ob = str(ob)
+    if type(ob) == str:
+        try: ob = ob.decode('utf-8')
+        except UnicodeError: ob = ob.decode('iso-8859-1')
+    return ob.encode('utf-8')
+
 def wrap(text, width):
     """
     A word-wrap function that preserves existing line breaks and most spaces in
With this patch I can just use special characters with django, as long as I use the postgresql backend. The sqlite backend still won't work.
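For readers on a current interpreter, the helper from the patch can be approximated in Python 3 as below. This is an adaptation, not the patch itself: `bytes` stands in for Python 2's `str`, `str` for `unicode`, and with `onlystr=False` only non-string objects are passed through `str()`.

```python
def utf8(ob, onlystr=True):
    """Python 3 adaptation of the patch's helper: always return UTF-8 bytes."""
    if not isinstance(ob, (bytes, str)):
        if onlystr:
            # Non-string objects pass through unchanged.
            return ob
        ob = str(ob)
    if isinstance(ob, bytes):
        # "Autodiscovery": try UTF-8 first, fall back to ISO-8859-1.
        try:
            ob = ob.decode("utf-8")
        except UnicodeDecodeError:
            ob = ob.decode("iso-8859-1")
    return ob.encode("utf-8")


from_text = utf8("ä")
from_latin1 = utf8("ä".encode("iso-8859-1"))
from_utf8 = utf8("ä".encode("utf-8"))

# Limitation: these ISO-8859-1 bytes for "Ã¤" are also valid UTF-8 for "ä",
# so the guess silently picks the UTF-8 reading.
ambiguous = "Ã¤".encode("iso-8859-1")
guessed = utf8(ambiguous).decode("utf-8")
```

The last two lines illustrate why the guess is only safe when all incoming bytestrings use one encoding, as the comment warns.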
comment:12 by , 19 years ago
Cc: | added |
---|
It seems a bit strange to me as a person who uses the unicode object that you'd prefer to have everything as UTF-8 string rather than unicode strings. Seriously. The whole point of the unicode string is you no longer need to worry about charset issues when you use it.
More specifically, not all databases will take UTF-8 strings correctly, and some database drivers will happily do the conversion from a unicode string to the correct charset, whereas if they don't know it's unicode, then they won't bother. I'm thinking of mx.ODBC here, specifically, but anything that knows how to cope with unicode will find it annoying to have to convert to and from UTF-8 all the time.
The correct way to deal with unicode in a program is:
- Convert everything to unicode on input. This requires you to know the character set being input.
- This is actually a lot more difficult than it sounds, especially on the web where there is still no sensible way to send charset data along with form gets and posts. The general rule of thumb is if your web page outputs as UTF-8 then all form posting will be done as UTF-8. It's not infallible, but it's the best we got.
- Use Unicode strings for everything inside the application.
- This is, in a way, related to i18n (see #65), in that wrapping everything in _() markers can be done at the same time. Also, it means you don't have to know the character set of the language file, just assume it'll be converted to unicode.
- Unicode strings handle %s sensibly, sort-of. u"%s" % "bar" will return u"bar" having run "bar" through unicode(). However, "%s" % u"áéíóú" will return "áéíóú", or, more commonly, a UnicodeEncodeError, which is what is happening in this case, as it runs the unicode string through str(). Both unicode and str() will use sys.getdefaultencoding(), which is, stupidly, "ascii" unless you've gone and done the right thing and edited sitecustomize.py or site.py. PEP 263 is only a half-useful way of solving this issue.
- In summary: By changing all internal user strings to unicode, you might trigger UnicodeEncodeErrors if someone tries to interpolate a non-ASCII string where it might not have done before, but the solution is simple (add u in front of all strings) and it gives people who do have to cope with this stuff day-in and day-out much less in the way of headaches
- exceptions to this rule include:
- SQL commands and column names. To my knowledge, no database allows non-ASCII identifiers. This does not extend to data values. cursor.execute("INSERT INTO atable (foo, bar) VALUES (?,?)", (u"côsa", u"thïngy")) is normally the correct way to do things for dbapi2-compliant drivers. Also, most DB drivers will return unicode strings if the contents are not ASCII, or can be convinced to do so.
- Python identifiers, which have to be ASCII, amongst a bunch of other restrictions.
- Replace all references to str() to unicode(). Save yourself from the madness.
- Re-encode everything the application outputs.
- In terms of web things this is best done by having an explicitly-stated encoding attribute in the Content-Type header.
- For reasons to do with knowing your input, as expressed above, this should default to "UTF-8", but should be settable by the programmer.
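The parameterized `cursor.execute` call from the list above can be tried with the standard-library `sqlite3` module. This is a sketch in Python 3, where plain string literals are already Unicode; the in-memory database and the `TEXT` column types are assumptions made only so the example is self-contained.

```python
import sqlite3

# In-memory database so the example leaves nothing behind.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Identifiers stay ASCII; only the data values are non-ASCII.
cur.execute("CREATE TABLE atable (foo TEXT, bar TEXT)")

# Values are passed as parameters, never interpolated into the SQL string,
# so the driver handles their encoding.
cur.execute("INSERT INTO atable (foo, bar) VALUES (?, ?)", ("côsa", "thïngy"))
conn.commit()

cur.execute("SELECT foo, bar FROM atable")
row = cur.fetchone()
conn.close()
```

With this driver the values come back as text, so nothing in the application has to guess at an encoding.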
The "solution" advocated above about doing string.decode("ISO-8859-1") is blatantly wrong, because not everybody uses ISO-8859-1. This wouldn't help the original poster who is trying to cope with one of the Chinese alphabets, and might even make it worse for him.
It's not an easy thing to solve, Unicode handling in python sucks on many levels, but not using it sucks even more. I'm willing to put in some legwork towards encoding all strings in unicode, and even try and identify all the inputs and outputs. But it's a massive enough job I'd want to do it in a branch and then merge it. And I'd be willing to write some proper "Coping with Unicode in Django" documentation.
comment:13 by , 19 years ago
Incidentally, the original poster can work around his problem for the moment by placing this in site-packages/sitecustomize.py:
import sys
import locale
encoding = "ascii"
loc = locale.getdefaultlocale()
if loc[1]:
    encoding = loc[1]
sys.setdefaultencoding(encoding)
Or, if he's on a non-locale-aware machine, then by simply placing:
import sys
encoding = "ISO-8859-1" # or whatever encoding you are using
sys.setdefaultencoding(encoding)
in his sitecustomize.py. And no, you can't call it outside sitecustomize.py, which is just one of the many ways python's unicode handling sucks.
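For anyone reading this on Python 3: `sys.setdefaultencoding` no longer exists there, so the workaround above applies to Python 2 only. The snippet below is just a way to inspect what a given interpreter uses; the variable names are illustrative.

```python
import locale
import sys

# Encoding used for implicit str/bytes conversions by the interpreter itself.
default_encoding = sys.getdefaultencoding()

# Encoding suggested by the user's locale (used e.g. as a default for text files).
locale_encoding = locale.getpreferredencoding(False)

# On Python 3 the setter has been removed entirely.
has_setter = hasattr(sys, "setdefaultencoding")
```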
comment:14 by , 19 years ago
For the record: Adrian's patch worked for me, at least in one of two scenarios (the other one will be tested later).
comment:15 by , 19 years ago
I implemented a fix in [340]. Please let me know whether you're still having the problem.
comment:16 by , 19 years ago
With Safari and the sqlite backend I still get the following traceback:
There's been an error:
Traceback (most recent call last):
File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/site-packages/django/core/handlers/base.py", line 63, in get_response
return callback(request, **param_dict)
File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/site-packages/django/views/admin/main.py", line 427, in change_list
result_repr = strip_tags(str(field_val))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xdc' in position 0: ordinal not in range(128)
But Firefox and Safari against the builtin server on Linux and OS X with the PostgreSQL backend (both on Linux and OS X) do work. So it looks like the sqlite-always-returns-unicode-objects behaviour might be the last problem to fix.
comment:17 by , 19 years ago
This can be circumvented by registering converters for varchar types - but the problem is, you need to register converters for "varchar(200)" and not "varchar" alone - sqlite parses the first word, and the "(200)" is part of the word ...
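As a point of comparison, today's standard-library `sqlite3` module offers a coarser switch than per-type converters: `text_factory`. The sketch below uses that instead of the converter approach described above (a deliberately different technique), and the table and column names are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Ask the driver to hand back TEXT values as raw UTF-8 bytes
# instead of decoded strings, for every column at once.
conn.text_factory = bytes

cur = conn.cursor()
cur.execute("CREATE TABLE poll (question varchar(200))")
cur.execute("INSERT INTO poll (question) VALUES (?)", ("Über uns",))

cur.execute("SELECT question FROM poll")
(raw_question,) = cur.fetchone()
conn.close()
```

Because `text_factory` applies to the whole connection, it sidesteps the question of how the declared type "varchar(200)" is matched to a converter name.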
comment:18 by , 19 years ago
a fix for sqlite and unicode is in ticket 227: http://code.djangoproject.com/ticket/227
comment:21 by , 19 years ago
Resolution: | → fixed |
---|---|
Status: | reopened → closed |
Since #277 is closed, I'm going to assume this ticket is fixed as well. Please re-open if I'm incorrect.
comment:22 by , 19 years ago
This issue may not have been fixed. My Portuguese unicode strings in the models file don't work. The error and my model code are pasted below. I may be doing something wrong, of course.
There's been an error:
Traceback (most recent call last):
File "C:\Python24\Lib\site-packages\django-1.0.0-py2.4.egg\django\core\handlers\base.py", line 64, in get_response
response = callback(request, **param_dict)
File "C:\Python24\Lib\site-packages\django-1.0.0-py2.4.egg\django\views\admin\main.py", line 978, in change_stage
return HttpResponse(t.render(c), mimetype='text/html; charset=utf-8')
File "C:\Python24\Lib\site-packages\django-1.0.0-py2.4.egg\django\core\template.py", line 115, in render
return self.nodelist.render(context)
File "C:\Python24\Lib\site-packages\django-1.0.0-py2.4.egg\django\core\template.py", line 436, in render
bits.append(node.render(context))
File "C:\Python24\Lib\site-packages\django-1.0.0-py2.4.egg\django\core\template_loader.py", line 97, in render
return compiled_parent.render(context)
File "C:\Python24\Lib\site-packages\django-1.0.0-py2.4.egg\django\core\template.py", line 115, in render
return self.nodelist.render(context)
File "C:\Python24\Lib\site-packages\django-1.0.0-py2.4.egg\django\core\template.py", line 436, in render
bits.append(node.render(context))
File "C:\Python24\Lib\site-packages\django-1.0.0-py2.4.egg\django\core\template_loader.py", line 97, in render
return compiled_parent.render(context)
File "C:\Python24\Lib\site-packages\django-1.0.0-py2.4.egg\django\core\template.py", line 115, in render
return self.nodelist.render(context)
File "C:\Python24\Lib\site-packages\django-1.0.0-py2.4.egg\django\core\template.py", line 436, in render
bits.append(node.render(context))
File "C:\Python24\Lib\site-packages\django-1.0.0-py2.4.egg\django\core\template_loader.py", line 54, in render
result = self.nodelist.render(context)
File "C:\Python24\Lib\site-packages\django-1.0.0-py2.4.egg\django\core\template.py", line 439, in render
return ''.join(bits)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 206: ordinal not in range(128)
from django.core import meta
# Create your models here.
class Secao(meta.Model):
    fields = (
        meta.CharField('secao', u'seção', maxlength=30),
    )
    def __repr__(self):
        return self.secao
    admin = meta.Admin()
class Cliente(meta.Model):
    fields = (
        meta.CharField('cliente', maxlength=50),
    )
    def __repr__(self):
        return self.cliente
    admin = meta.Admin()
class Comprador(meta.Model):
    fields = (
        meta.ForeignKey(Cliente),
        meta.CharField('nome', maxlength=50),
        meta.CharField('email', maxlength=60),
    )
    def __repr__(self):
        return self.nome
    admin = meta.Admin()
class Ficha(meta.Model):
    fields = (
        meta.ForeignKey(Comprador),
        meta.ForeignKey(Secao),
        meta.DateTimeField('criado_date', u'criado em'),
        meta.CharField('referencia', u'referência', maxlength=20 ),
        meta.TextField('descricao', u'descrição'),
    )
    def __repr__(self):
        return self.referencia
    admin = meta.Admin(
        fields = (
            (None, {'fields': ('criado_date', 'secao_id', 'comprador_id', 'referencia', 'descricao', )}),
        ),
        list_display = ('referencia', 'criado_date')
    )
comment:23 by , 19 years ago
Resolution: | fixed |
---|---|
Status: | closed → reopened |
Summary: | There is a UnicodeEncodeError → Unicode field names cause UnicodeEncodeError in main admin handler |
(reopened; changed summary)
comment:24 by , 19 years ago
This brand new PEP may help us all. There's even a patch, but I don't know how to build a patched Python on my Windows XP. I'm hoping someone will make binaries so I can test it.
comment:25 by , 19 years ago
About the latest error description by perdolfurtado@…: I cannot reproduce it because the model syntax seems to have changed now, but I think the problem is the following:
Django does not support unicode strings.
And Python, when it is forced to do a unicode=>bytestring conversion, uses the system's default charset, which is ascii (you can change it, but that is VERYVERYVERY not recommended).
And that conversion will fail because your string contains non-ascii characters.
You have 2 possibilities:
- Use something like: u'referência'.encode('utf-8')
or
- I think this is the best: simply use 'referência' (as a bytestring, not as a unicode string), and indicate the encoding of the source code at the beginning of the file as defined in PEP 263. For example:
#!/usr/bin/python
# -*- coding: UTF-8 -*-
(P.S.: I think this ticket is a NOTABUG or something like that. What's the policy there? Who can/should close/resolve a ticket?)
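The first option, and the reason the earlier traceback complains about byte 0xc3, can be illustrated as follows. The sketch is written for Python 3, where source files are UTF-8 by default and the coding line is therefore optional; the variable names are made up for the example.

```python
# -*- coding: utf-8 -*-

label_text = "referência"

# Option 1 from the comment: encode the label explicitly to a UTF-8 bytestring.
label_bytes = label_text.encode("utf-8")

# What an implicit ASCII conversion of those bytes runs into:
# "ê" is encoded as 0xc3 0xaa, and 0xc3 is outside the ASCII range.
try:
    label_bytes.decode("ascii")
    decode_error = None
except UnicodeDecodeError as exc:
    decode_error = exc
```

The failing byte is the first byte of the encoded "ê", which matches the 0xc3 reported in the `UnicodeDecodeError` above.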
comment:26 by , 19 years ago
Resolution: | → invalid |
---|---|
Status: | reopened → closed |
As Gabor said, you can solve this with a correct coding declaration. Marking INVALID.
comment:27 by , 19 years ago
Type: | defect |
---|
Please give us more information, including your template, your Python view code and your data -- then reopen the ticket.