Opened 19 years ago

Closed 18 years ago

Last modified 17 years ago

#170 closed (invalid)

Unicode field names cause UnicodeEncodeError in main admin handler

Reported by: alang.yl@…
Owned by: Adrian Holovaty
Component: contrib.admin
Version: 1.0
Severity: major
Keywords: Unicode
Cc: moof@…
Triage Stage: Unreviewed
Has patch: no
Needs documentation: no
Needs tests: no
Patch needs improvement: no
Easy pickings: no
UI/UX: no

Description

I'm from China. I can't type words in Chinese. There is a UnicodeEncodeError. The tracebacks are below.

#####################################################

There's been an error:

Traceback (most recent call last):

  File "c:\python24\Lib\site-packages\django-1.0.0-py2.4.egg\django\core\handlers\base.py", line 63, in get_response
    return callback(request, param_dict)

  File "c:\python24\Lib\site-packages\django-1.0.0-py2.4.egg\django\views\admin\main.py", line 424, in change_list
    result_repr = strip_tags(str(field_val))

UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-9: ordinal not in range(128)

#####################################################

There's been an error:

Traceback (most recent call last):

  File "c:\python24\Lib\site-packages\django-1.0.0-py2.4.egg\django\core\handlers\base.py", line 63, in get_response
    return callback(request, param_dict)

  File "c:\python24\Lib\site-packages\django-1.0.0-py2.4.egg\django\views\admin\main.py", line 956, in change_stage
    return HttpResponse(t.render(c))

  File "c:\python24\Lib\site-packages\django-1.0.0-py2.4.egg\django\core\template.py", line 116, in render
    return self.nodelist.render(context)

  File "c:\python24\Lib\site-packages\django-1.0.0-py2.4.egg\django\core\template.py", line 437, in render
    bits.append(node.render(context))

  File "c:\python24\Lib\site-packages\django-1.0.0-py2.4.egg\django\core\template_loader.py", line 97, in render
    return compiled_parent.render(context)

  File "c:\python24\Lib\site-packages\django-1.0.0-py2.4.egg\django\core\template.py", line 116, in render
    return self.nodelist.render(context)

  File "c:\python24\Lib\site-packages\django-1.0.0-py2.4.egg\django\core\template.py", line 437, in render
    bits.append(node.render(context))

  File "c:\python24\Lib\site-packages\django-1.0.0-py2.4.egg\django\core\template_loader.py", line 97, in render
    return compiled_parent.render(context)

  File "c:\python24\Lib\site-packages\django-1.0.0-py2.4.egg\django\core\template.py", line 116, in render
    return self.nodelist.render(context)

  File "c:\python24\Lib\site-packages\django-1.0.0-py2.4.egg\django\core\template.py", line 437, in render
    bits.append(node.render(context))

  File "c:\python24\Lib\site-packages\django-1.0.0-py2.4.egg\django\core\defaulttags.py", line 171, in render
    return self.nodelist_true.render(context)

  File "c:\python24\Lib\site-packages\django-1.0.0-py2.4.egg\django\core\template.py", line 437, in render
    bits.append(node.render(context))

  File "c:\python24\Lib\site-packages\django-1.0.0-py2.4.egg\django\core\template_loader.py", line 54, in render
    result = self.nodelist.render(context)

  File "c:\python24\Lib\site-packages\django-1.0.0-py2.4.egg\django\core\template.py", line 437, in render
    bits.append(node.render(context))

  File "c:\python24\Lib\site-packages\django-1.0.0-py2.4.egg\django\core\defaulttags.py", line 171, in render
    return self.nodelist_true.render(context)

  File "c:\python24\Lib\site-packages\django-1.0.0-py2.4.egg\django\core\template.py", line 437, in render
    bits.append(node.render(context))

  File "c:\python24\Lib\site-packages\django-1.0.0-py2.4.egg\django\core\template.py", line 467, in render
    output = resolve_variable_with_filters(self.var_string, context)

  File "c:\python24\Lib\site-packages\django-1.0.0-py2.4.egg\django\core\template.py", line 412, in resolve_variable_with_filters
    obj = registered_filters[name][0](obj, arg)

  File "c:\python24\Lib\site-packages\django-1.0.0-py2.4.egg\django\core\defaultfilters.py", line 185, in striptags
    value = str(value)

UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-9: ordinal not in range(128)
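
Both tracebacks end in a str() call on a unicode value, which in Python 2 means an implicit encode with the 'ascii' codec. A minimal sketch of the same failure, written for Python 3 where the encode step has to be spelled out; the sample title and variable names are illustrative and not taken from the report:

```python
# Python 3 sketch: Python 2's str(u"...") behaved like .encode("ascii").
title = "中文投票标题"  # hypothetical poll title, not from the report

try:
    title.encode("ascii")
except UnicodeEncodeError as exc:
    error_message = str(exc)
else:
    error_message = None

# A codec that covers the characters encodes the same text without error.
utf8_bytes = title.encode("utf-8")

print(error_message)
print(len(title), len(utf8_bytes))
```

The point of the sketch is only that the text itself is fine; the failure comes from the choice of codec.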

Attachments (1)

main.py.patch (2.9 KB) - added by Adrian Holovaty 19 years ago.
Experimental patch to django.views.admin.main

Change History (28)

comment:1 by Adrian Holovaty, 19 years ago

Resolution: worksforme
Status: new → closed

Please give us more information, including your template, your Python view code and your data -- then reopen the ticket.

comment:2 by alang.yl@…, 19 years ago

Resolution: worksforme
Status: closed → reopened

There are NO special templates, views or data.

Following tutorial 2, I ran the sample app, polls. I created a new poll and typed the poll's title in Chinese; then the "UnicodeEncodeError" exception was raised and the polls app crashed.

My English is poor. Sorry.

comment:5 by hugo <gb@…>, 19 years ago

Looks like your browser sends in stuff with Chinese characters as unicode and marks the submitted form data as unicode, too. And the python stuff builds a unicode string out of that correctly. But django uses just str() on strings and so can't cope with stuff outside ISO-8859-1 - it would have to check whether the string is type('') or type(u'') and use stuff.encode('utf-8') in the latter case. Best would be if django would go fully unicode and utf-8, because that would solve all special character issues once and for all ...

comment:6 by hugo <gb@…>, 19 years ago

Severity: normal → major

When just trying it with Safari as browser and a django installation on OS X, I got the same problem with just a normal German umlaut (ISO-8859-1 char) - I wanted to add a poll with an umlaut in its text: "erzähl was geht". Got the following traceback:

There's been an error:

Traceback (most recent call last):

  File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/site-packages/django/core/handlers/base.py", line 63, in get_response
    return callback(request, param_dict)

  File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/site-packages/django/views/admin/main.py", line 427, in change_list
    result_repr = strip_tags(str(field_val))

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 3: ordinal not in range(128)

This really is a problem for many users that don't live in 7-bit-land, so I upped it to "major" ...

comment:7 by hugo <gb@…>, 19 years ago

I just did some more digging around and found that Django outputs XHTML with charset=utf-8 in the admin pages and that of course makes some browsers return stuff as utf-8 in forms. That utf-8 stuff is turned into unicode - I think it's the POST payload parsing that uses the email module if the stuff sent is multipart (and a browser that wants to send in charset declarations has to use multipart encoding, even if it's just sending one part).

I think the correct way would be to make the httpwrapper.py stuff always produce unicode strings, even if the input isn't unicode (just do stringstuff.decode('iso-8859-1')) and then on the output side not to use str() but to use stringstuff.encode('utf-8') to always produce utf-8 encoded strings. But that looks like a _lot_ of work.

The alternative would be to turn all stuff that is passed in via the email module to utf-8 - but then you would have the possibility of different string encodings in simple strings, because stuff that's not parsed by the email module will be essentially iso-8859-1 ...
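
A small sketch of what parsing a multipart form body with the email module looks like, using Python 3's standard library rather than the Django code of the time. The boundary, the field name and the per-part charset header are made up for illustration, and the UTF-8 fallback is an assumption:

```python
from email import message_from_bytes

# Hypothetical multipart/form-data request: header line, blank line, body.
raw_request = (
    b"Content-Type: multipart/form-data; boundary=XYZ\r\n"
    b"\r\n"
    b"--XYZ\r\n"
    b'Content-Disposition: form-data; name="question"\r\n'
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"\r\n"
    b"erz\xc3\xa4hl was geht\r\n"
    b"--XYZ--\r\n"
)

message = message_from_bytes(raw_request)
fields = {}
for part in message.get_payload():
    name = part.get_param("name", header="content-disposition")
    charset = part.get_content_charset() or "utf-8"  # assumed fallback
    # decode=True yields the raw bytes; decoding turns them into text.
    fields[name] = part.get_payload(decode=True).decode(charset)

print(fields)
```

Whichever representation is chosen, the decode step has to happen in exactly one place, which is the trade-off discussed in this comment.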

comment:8 by hugo <gb@…>, 19 years ago

An additional problem is with sqlite: pysqlite2 always returns Unicode strings, at least on OS X with Python 2.3. This poses problems with data that has chars above chr(128), of course, as later code in the system will have the same problems as with the stuff from the form-payload-parsing ...
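
For comparison, a sketch with the sqlite3 module shipped in Python 3's standard library (not the pysqlite2 build discussed here). It shows the default text return type and the text_factory switch that asks for raw bytes instead; the table and column names are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE polls (question varchar(200))")
conn.execute("INSERT INTO polls VALUES (?)", ("erzähl was geht",))

# Default: TEXT values come back as decoded text (str in Python 3).
as_text = conn.execute("SELECT question FROM polls").fetchone()[0]

# text_factory = bytes asks the module for the stored UTF-8 bytes instead.
conn.text_factory = bytes
as_bytes = conn.execute("SELECT question FROM polls").fetchone()[0]
conn.close()

print(type(as_text).__name__, type(as_bytes).__name__)
```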

comment:9 by hugo <gb@…>, 19 years ago

In my quest to find a solution that's easy to do I followed a different approach this time: I changed the code so that all strings that are parsed are turned into utf-8 and not passed on as unicode. This isn't a perfect solution, but I wanted to see how it works out. I used the following diff:

Index: utils/httpwrappers.py
===================================================================
--- utils/httpwrappers.py       (revision 335)
+++ utils/httpwrappers.py       (working copy)
@@ -52,7 +52,10 @@
                     'content': submessage.get_payload(),
                 })
             else:
-                POST.appendlist(name_dict['name'], submessage.get_payload())
+                pvalue = submessage.get_payload()
+                if type(pvalue) == unicode:
+                    pvalue = pvalue.encode('utf-8')
+                POST.appendlist(name_dict['name'], pvalue)
     return POST, FILES
 
 class QueryDict(datastructures.MultiValueDict):

Then I stumbled upon the above problem with sqlite and utf-8/unicode storage. So this might be a possible approach - just turn every input into utf-8 and make sure that every form field is correct utf-8 stuff. The problem: when you use string indexing, your utf-8 stuff won't behave nicely, as a char isn't always a single string index position any more. Additionally the sqlite backend needs to be fixed in that it should return utf-8 encoded strings, too.

Oh, and you shouldn't send out only the meta http-equiv stuff but send out actual HTTP headers - Content-type: text/html; charset=utf-8 - instead, because otherwise there could be problems with default charsets added by Apache that override the http-equiv stuff. And http-equiv is a very bad hack anyway. Better to use real headers.
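
The indexing caveat mentioned in this comment can be shown in a few lines. This is a Python 3 sketch in which bytes stands in for the Python 2 bytestring; the sample text is the poll title quoted earlier in the ticket:

```python
text = "erzähl was geht"
encoded = text.encode("utf-8")

# In the text string, one index position is one character.
fourth_char = text[3]

# In UTF-8, "ä" takes two bytes, so lengths and offsets no longer match.
lengths = (len(text), len(encoded))
umlaut_bytes = encoded[3:5]

# Cutting the bytes in the middle of a character leaves invalid UTF-8.
try:
    encoded[:4].decode("utf-8")
    cut_is_valid = True
except UnicodeDecodeError:
    cut_is_valid = False

print(fourth_char, lengths, umlaut_bytes, cut_is_valid)
```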

comment:10 by alang.yl@…, 19 years ago

Thanks, hugo!
I use Opera 8 as my browser. The database I use is sqlite3. I don't use Apache, just the Django web server.

Maybe you can issue an ANSI version of Django, and a Unicode version too.

comment:11 by hugo <gb@…>, 19 years ago

Ok, I dug a bit deeper and checked the two alternatives we have:

1) change all input to unicode and turn everything to utf-8 on the interfaces to databases and the web. This came out rather nasty, because there are loads of situations where this will break - strings that are constructed in ways that won't work directly with unicode, for example the automatic object naming is done with str usage (by just turning objects into strings via '%s' interpolation). And there are major problems with the postgresql driver if that sees unicode strings - the application just dies without traceback.

2) change all input to utf8 bytestrings and keep the rest as is. This did work out nicely, but has problems with sqlite, because that one does return only unicode strings (at least if you run sqlite with unicode support), so the sqlite backend will need additional work to make it always return utf8 bytestrings.

The problem with solution 2 is, it will break when people expect strings to be one-position-is-one-character - that's only true for unicode, not for utf-8. But as long as people use highlevel functions to work on utf-8 strings it should be ok. And if we do know that all bytestrings must be utf-8 encoded, we can just as easily turn them into unicode if we need this for a function. But as soon as somebody mixes iso-8859-1 bytestrings and utf8 bytestrings, this will break in ugly ways ...

My patch does introduce a utf8 helper function in django.utils.text that will return an utf-8 encoded string and does autodiscovery on strings. Of course this only works for pure utf-8 strings - if you mix them up with iso-8859-1 strings, this will return doubly encoded utf-8 chars ...

Here is the patch:

Index: utils/httpwrappers.py
===================================================================
--- utils/httpwrappers.py       (revision 335)
+++ utils/httpwrappers.py       (working copy)
@@ -1,5 +1,6 @@
 from Cookie import SimpleCookie
 from pprint import pformat
+from django.utils.text import utf8
 import datastructures
 
 DEFAULT_MIME_TYPE = 'text/html'
@@ -52,7 +53,7 @@
                     'content': submessage.get_payload(),
                 })
             else:
-                POST.appendlist(name_dict['name'], submessage.get_payload())
+                POST.appendlist(name_dict['name'], utf8(submessage.get_payload()))
     return POST, FILES
 
 class QueryDict(datastructures.MultiValueDict):
Index: utils/text.py
===================================================================
--- utils/text.py       (revision 335)
+++ utils/text.py       (working copy)
@@ -1,5 +1,20 @@
 import re
 
+def utf8(ob, onlystr=True):
+    """
+    Turns a given object into an utf-8 string. If onlystr is True,  only string
+    types will be handled, other objects will pass unchanged. if onlystr is False,
+    objects will be str()ed and the result will be handled. This function will
+    handle utf-8 and iso-8859-1 autodiscovery automatically.
+    """
+    if onlystr:
+        if type(ob) not in (str, unicode): return ob
+    else: ob = str(ob)
+    if type(ob) == str:
+        try: ob = ob.decode('utf-8')
+        except UnicodeError: ob = ob.decode('iso-8859-1')
+    return ob.encode('utf-8')
+
 def wrap(text, width):
     """
     A word-wrap function that preserves existing line breaks and most spaces in

With this patch I can just use special characters with django, as long as I use the postgresql backend. The sqlite backend still won't work.
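
For readers without a Python 2 interpreter, here is a rough Python 3 rendering of the utf8 helper from the patch above. It is an adaptation and not the patch itself: bytes plays the role of the old str and str the role of unicode. The last lines reproduce the double-encoding hazard described in this comment:

```python
def utf8(ob, onlystr=True):
    """Return ob as UTF-8 bytes, guessing UTF-8 first and ISO-8859-1 second."""
    if not isinstance(ob, (bytes, str)):
        if onlystr:
            return ob  # non-string objects pass through unchanged
        ob = str(ob)
    if isinstance(ob, bytes):
        try:
            ob = ob.decode("utf-8")
        except UnicodeError:
            ob = ob.decode("iso-8859-1")
    return ob.encode("utf-8")


from_text = utf8("ä")            # text is simply encoded
from_latin1 = utf8(b"\xe4")      # invalid UTF-8, so the Latin-1 fallback applies
from_utf8 = utf8(b"\xc3\xa4")    # valid UTF-8 passes through unchanged

# Hazard: UTF-8 bytes that were wrongly decoded as ISO-8859-1 upstream
# come out encoded twice.
wrongly_decoded = b"\xc3\xa4".decode("iso-8859-1")
double_encoded = utf8(wrongly_decoded)

print(from_text, from_latin1, from_utf8, double_encoded)
```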

comment:12 by Moof <moof@…>, 19 years ago

Cc: moof@… added

It seems a bit strange to me, as a person who uses the unicode object, that you'd prefer to have everything as UTF-8 strings rather than unicode strings. Seriously. The whole point of the unicode string is you no longer need to worry about charset issues when you use it.

More specifically, not all databases will take UTF-8 strings correctly, and some database drivers will happily do the conversion from a unicode string to the correct charset, whereas if they don't know it's unicode, then they won't bother. I'm thinking of mx.ODBC here, specifically, but anything that knows how to cope with unicode will find it annoying to have to convert to and from UTF-8 all the time.

The correct way to deal with unicode in a program is:

  • Convert everything to unicode on input. This requires you to know the character set being input.
    • This is actually a lot more difficult than it sounds, especially on the web where there is still no sensible way to send charset data along with form gets and posts. The general rule of thumb is if your web page outputs as UTF-8 then all form posting will be done as UTF-8. It's not infallible, but it's the best we got.
  • Use Unicode strings for everything inside the application.
    • This is, in a way, related to i18n (see #65), in that wrapping everything in _() markers can be done at the same time. Also, it means you don't have to know the character set of the language file, just assume it'll be converted to unicode.
    • Unicode strings handle %s sensibly, sort-of. u"%s" % "bar" will return u"bar" having run "bar" through unicode(). However, "%s" % u"áéíóú" will return "áéíóú", or, more commonly, a UnicodeEncodeError, which is what is happening in this case, as it runs the unicode string through str(). Both unicode and str() will use sys.getdefaultencoding(), which is, stupidly, "ascii" unless you've gone and done the right thing and edited sitecustomize.py or site.py. PEP 263 is only a half-useful way of solving this issue.
      • In summary: By changing all internal user strings to unicode, you might trigger UnicodeEncodeErrors if someone tries to interpolate a non-ASCII string where it might not have done before, but the solution is simple (add u in front of all strings) and it gives people who do have to cope with this stuff day-in and day-out much less in the way of headaches
    • exceptions to this rule include:
      • SQL commands and column names. To my knowledge, no database allows non-ASCII identifiers. This does not extend to data values. cursor.execute("INSERT INTO atable (foo, bar) VALUES (?,?)", (u"côsa", u"thïngy")) is normally the correct way to do things for dbapi2-compliant drivers. Also, most DB drivers will return unicode strings if the contents are not ASCII, or can be convinced to do so.
      • Python identifiers, which have to be ASCII, amongst a bunch of other restrictions.
  • Replace all references to str() to unicode(). Save yourself from the madness.
  • Re-encode everything the application outputs.
    • In terms of web things this is best done by having an explicitly-stated encoding attribute in the Content-Type header.
    • For reasons to do with knowing your input, as expressed above, this should default to "UTF-8", but should be settable by the programmer.

The "solution" advocated above of doing string.decode("ISO-8859-1") is blatantly wrong, because not everybody uses ISO-8859-1. This wouldn't help the original poster, who is trying to cope with one of the Chinese character sets, and might even make it worse for him.

It's not an easy thing to solve. Unicode handling in python sucks on many levels, but not using it sucks even more. I'm willing to put in some legwork towards encoding all strings in unicode, and even try and identify all the inputs and outputs. But it's a massive enough job that I'd want to do it in a branch and then merge it. And I'd be willing to write some proper "Coping with Unicode in Django" documentation.
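
The decode-on-input, text-inside, encode-on-output rule listed in this comment can be sketched as follows. This is Python 3, where str is the unicode type; the function names and the UTF-8 default are assumptions made for the example, not Django APIs:

```python
def decode_input(raw: bytes, charset: str = "utf-8") -> str:
    """Boundary in: turn request bytes into text, given a known charset."""
    return raw.decode(charset)


def encode_output(text: str, charset: str = "utf-8") -> bytes:
    """Boundary out: turn text back into bytes for the response body."""
    return text.encode(charset)


# Hypothetical form value as it would arrive over the wire.
raw_form_value = "côsa".encode("utf-8")

# Inside the application only text is handled, so length, case changes
# and %s interpolation all operate on characters.
name = decode_input(raw_form_value)
greeting = "Hello, %s!" % name.upper()

body = encode_output(greeting)
content_type = "text/html; charset=utf-8"

print(len(name), greeting, body)
```

The charset stated in the Content-Type header has to be the one passed to the output step, which is why the comment suggests a single configurable default.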

comment:13 by Moof <moof@…>, 19 years ago

Incidentally, the original poster can work around his problem for the moment by placing this in site-packages/sitecustomize.py:

import sys
import locale

encoding = "ascii"

loc = locale.getdefaultlocale()
if loc[1]:
    encoding = loc[1]

sys.setdefaultencoding(encoding)

Or, if he's on a non-locale-aware machine, then by simply placing:

import sys
encoding = "ISO-8859-1" # or whatever encoding you are using
sys.setdefaultencoding(encoding)

in his sitecustomize.py. And no, you can't call it outside sitecustomize.py, which is just one of the many ways python's unicode handling sucks.
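
The workaround above is specific to Python 2. As a point of comparison only, this sketch shows what the same settings look like on a Python 3 interpreter, where the default encoding is fixed and sys.setdefaultencoding does not exist; the variable names are illustrative:

```python
import locale
import sys

# Encoding used for implicit str/bytes conversions; fixed in Python 3.
default_encoding = sys.getdefaultencoding()

# The setter used in the workaround is not available in Python 3.
has_setter = hasattr(sys, "setdefaultencoding")

# Locale-derived encoding, the closest analogue of loc[1] above.
preferred = locale.getpreferredencoding(False)

print(default_encoding, has_setter, preferred)
```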

by Adrian Holovaty, 19 years ago

Attachment: main.py.patch added

Experimental patch to django.views.admin.main

comment:14 by hugo <gb@…>, 19 years ago

For the record: Adrian's patch worked for me, at least in one of two scenarios (the other one will be tested later).

comment:15 by Adrian Holovaty, 19 years ago

I implemented a fix in [340]. Please let me know whether you're still having the problem.

comment:16 by hugo <gb@…>, 19 years ago

With Safari and the sqlite backend I still get the following traceback:

There's been an error:

Traceback (most recent call last):

  File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/site-packages/django/core/handlers/base.py", line 63, in get_response
    return callback(request, **param_dict)

  File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/site-packages/django/views/admin/main.py", line 427, in change_list
    result_repr = strip_tags(str(field_val))

UnicodeEncodeError: 'ascii' codec can't encode character u'\xdc' in position 0: ordinal not in range(128)

But Firefox and Safari against the builtin server on Linux and OS X with the PostgreSQL backend (both on Linux and OS X) do work. So it looks like sqlite always returning unicode objects might be the last problem to fix.

comment:17 by hugo <gb@…>, 19 years ago

This can be circumvented by registering converters for varchar types - but the problem is, you need to register converters for "varchar(200)" and not "varchar" alone - sqlite parses the first word, and the "(200)" is part of the word ...
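
A sketch of the converter approach using Python 3's standard sqlite3 module. One caveat: the current sqlite3 documentation describes the lookup key as the first word of the declared type with a parenthesised size dropped, so the "varchar(200)" limitation described here applies to the pysqlite2 of the time and may not be reproducible today. The table name and converter function are hypothetical:

```python
import sqlite3


def keep_utf8_bytes(raw: bytes) -> bytes:
    # Converters receive the stored value as bytes; returning it unchanged
    # yields UTF-8 bytestrings instead of decoded text.
    return raw


# Registered once per process; the type name is matched case-insensitively.
sqlite3.register_converter("varchar", keep_utf8_bytes)

conn = sqlite3.connect(":memory:", detect_types=sqlite3.PARSE_DECLTYPES)
conn.execute("CREATE TABLE polls (question varchar(200))")
conn.execute("INSERT INTO polls VALUES (?)", ("Über uns",))
value = conn.execute("SELECT question FROM polls").fetchone()[0]
conn.close()

print(value)
```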

comment:18 by hugo <gb@…>, 19 years ago

a fix for sqlite and unicode is in ticket 227: http://code.djangoproject.com/ticket/227

comment:19 by alang.yl@…, 19 years ago

Ok. I'll try it and report later.

comment:20 by alang.yl@…, 19 years ago

Ok. I'll test it and report later.

comment:21 by Jacob, 19 years ago

Resolution: fixed
Status: reopened → closed

Since #227 is closed, I'm going to assume this ticket is fixed as well. Please re-open if I'm incorrect.

comment:22 by pedrolfurtado@…, 19 years ago

This issue may not have been fixed. My Portuguese unicode strings in the models file don't work. The error and my model code are pasted below. I may be doing something wrong, of course.

There's been an error:

Traceback (most recent call last):

  File "C:\Python24\Lib\site-packages\django-1.0.0-py2.4.egg\django\core\handlers\base.py", line 64, in get_response
    response = callback(request, **param_dict)

  File "C:\Python24\Lib\site-packages\django-1.0.0-py2.4.egg\django\views\admin\main.py", line 978, in change_stage
    return HttpResponse(t.render(c), mimetype='text/html; charset=utf-8')

  File "C:\Python24\Lib\site-packages\django-1.0.0-py2.4.egg\django\core\template.py", line 115, in render
    return self.nodelist.render(context)

  File "C:\Python24\Lib\site-packages\django-1.0.0-py2.4.egg\django\core\template.py", line 436, in render
    bits.append(node.render(context))

  File "C:\Python24\Lib\site-packages\django-1.0.0-py2.4.egg\django\core\template_loader.py", line 97, in render
    return compiled_parent.render(context)

  File "C:\Python24\Lib\site-packages\django-1.0.0-py2.4.egg\django\core\template.py", line 115, in render
    return self.nodelist.render(context)

  File "C:\Python24\Lib\site-packages\django-1.0.0-py2.4.egg\django\core\template.py", line 436, in render
    bits.append(node.render(context))

  File "C:\Python24\Lib\site-packages\django-1.0.0-py2.4.egg\django\core\template_loader.py", line 97, in render
    return compiled_parent.render(context)

  File "C:\Python24\Lib\site-packages\django-1.0.0-py2.4.egg\django\core\template.py", line 115, in render
    return self.nodelist.render(context)

  File "C:\Python24\Lib\site-packages\django-1.0.0-py2.4.egg\django\core\template.py", line 436, in render
    bits.append(node.render(context))

  File "C:\Python24\Lib\site-packages\django-1.0.0-py2.4.egg\django\core\template_loader.py", line 54, in render
    result = self.nodelist.render(context)

  File "C:\Python24\Lib\site-packages\django-1.0.0-py2.4.egg\django\core\template.py", line 439, in render
    return ''.join(bits)

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 206: ordinal not in range(128)

from django.core import meta

# Create your models here.

class Secao(meta.Model):
	fields = (
		meta.CharField('secao', u'seção', maxlength=30),
	)
	def __repr__(self):
		return self.secao
	admin = meta.Admin()
	
class Cliente(meta.Model):
	fields = (
		meta.CharField('cliente', maxlength=50),
	)
	def __repr__(self):
		return self.cliente
	admin = meta.Admin()
	
class Comprador(meta.Model):
	fields = (
		meta.ForeignKey(Cliente),
		meta.CharField('nome', maxlength=50),
		meta.CharField('email', maxlength=60),
	)
	def __repr__(self):
		return self.nome
	admin = meta.Admin()

class Ficha(meta.Model):
	fields = (
		meta.ForeignKey(Comprador),
		meta.ForeignKey(Secao),
		meta.DateTimeField('criado_date', u'criado em'),
		meta.CharField('referencia', u'referência', maxlength=20 ),
		meta.TextField('descricao', u'descrição'),
	)
	def __repr__(self):
		return self.referencia
	admin = meta.Admin(
		fields = (
			(None, {'fields': ('criado_date', 'secao_id', 'comprador_id', 'referencia', 'descricao', )}),
		),
		list_display = ('referencia', 'criado_date')
	)

comment:23 by anonymous, 19 years ago

Resolution: fixed
Status: closed → reopened
Summary: There is a UnicodeEncodeError → Unicode field names cause UnicodeEncodeError in main admin handler

(reopened; changed summary)

comment:24 by pedrolfurtado@…, 19 years ago

This brand new PEP may help us all. There's even a patch, but I don't know how to build a patched Python on my Windows XP. I'm hoping someone will make binaries so I can test it.

PEP 349

comment:25 by Gabor Farkas <gabor@…>, 18 years ago

About the latest error description by pedrolfurtado@…: I cannot reproduce it because the model syntax seems to have changed since then, but I think the problem is the following.

Django does not support unicode strings. And Python, when it is forced to do a unicode => bytestring conversion, uses the system's default charset, which is ascii (you can change it, but that is very strongly discouraged).

That conversion will fail because your string contains non-ascii characters. You have 2 possibilities:

  • Use something like: u'referência'.encode('utf-8')

or

  • I think this is the best: simply use 'referência' (as a bytestring, not as a unicode string), and indicate the encoding of the source code at the beginning of the file as defined in PEP 263. For example:

#!/usr/bin/python
# -*- coding: UTF-8 -*-

(p.s: i think this ticket is a NOTABUG or something like that. what's the policy there? who can/should close/resolve a ticket?)
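
A sketch of what the PEP 263 declaration changes, using compile() on raw source bytes under Python 3. The semantics differ from the Python 2 case discussed here (Python 3 literals are text and the default source encoding is UTF-8), so this only illustrates the role of the declaration; the ISO-8859-1 source and the file name are invented for the example:

```python
# Source bytes as they would sit on disk in an ISO-8859-1 encoded file.
declared = "# -*- coding: iso-8859-1 -*-\nlabel = 'referência'\n".encode("iso-8859-1")

namespace = {}
exec(compile(declared, "<models>", "exec"), namespace)
label = namespace["label"]

# The same bytes without the declaration are read as UTF-8 and rejected.
undeclared = "label = 'referência'\n".encode("iso-8859-1")
try:
    compile(undeclared, "<models>", "exec")
    rejected = False
except (SyntaxError, UnicodeDecodeError):
    rejected = True

print(label, rejected)
```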

comment:26 by Jacob, 18 years ago

Resolution: invalid
Status: reopened → closed

As Gabor said, you can solve this with a correct coding declaration. Marking INVALID.

comment:27 by Main, 18 years ago

Type: defect