smart_unicode in newforms/util.py shouldn't assume utf-8 encoded strings in a utf-8 environment
|Reported by:||Owned by:||Adrian Holovaty|
|Has patch:||yes||Needs documentation:||no|
|Needs tests:||yes||Patch needs improvement:||yes|
In a utf-8 environment:
>>> from django import newforms as forms
>>> f = forms.CharField()
>>> f.clean('aaa')
u'aaa'
>>> f.clean('ąąą')    <---- these characters are latin2-encoded, not utf-8
Traceback (most recent call last):
  File "<console>", line 1, in ?
  File "/usr/lib/python2.4/site-packages/django/newforms/fields.py", line 99, in clean
    value = smart_unicode(value)
  File "/usr/lib/python2.4/site-packages/django/newforms/util.py", line 15, in smart_unicode
    s = unicode(s, settings.DEFAULT_CHARSET)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xb1 in position 0: unexpected code byte
Django should not trust anything the browser sends. It is trivial (and common in non-latin1 countries) for a user to set a custom
encoding instead of auto-detect; this is a workaround for wrongly configured web servers that declare latin1 but in fact send
characters in a local encoding (e.g. latin2, koi8-r, ...). Input submitted that way currently causes a 500 server error.
Maybe the solution would be to catch UnicodeError and decode the incoming string with replacement characters, like:
>>> "ąąą".decode('ascii', 'replace')    <--- this is again latin2-encoded, not utf-8
u'\ufffd\ufffd\ufffd'
>>> print "ąąą".decode('ascii', 'replace')
���
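The catch-and-replace idea could look roughly like the sketch below. It is not the actual smart_unicode code: the name `smart_decode` is hypothetical, it is written for Python 3 (where the ticket's byte strings are `bytes` and `unicode(s, charset)` corresponds to `bytes.decode`), and it assumes the fallback reuses the configured charset with the `'replace'` error handler rather than `'ascii'`.

```python
def smart_decode(raw, charset="utf-8"):
    """Decode bytes with `charset`; never raise on undecodable input.

    Hypothetical helper, not part of Django. Undecodable bytes are
    replaced with U+FFFD instead of raising UnicodeDecodeError.
    """
    if isinstance(raw, str):
        # Already text, nothing to decode.
        return raw
    try:
        return raw.decode(charset)
    except UnicodeDecodeError:
        # Second attempt with the 'replace' error handler: lossy, but
        # the request no longer fails with a 500 error.
        return raw.decode(charset, "replace")


# latin2 (ISO-8859-2) bytes for 'ąąą', as a misconfigured browser might send them
latin2_bytes = "ąąą".encode("iso-8859-2")
print(repr(smart_decode(latin2_bytes)))
```

The trade-off is that the original characters are lost: the form sees replacement characters rather than an error, so validation can still reject the value if needed.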
Another solution could be to add a FALLBACK_CHARSET variable that is used to decode the string when decoding with the default charset fails.
This variable could be set per language in i18n environments, so one can set latin1 for de and fr,
latin2 for cs and pl, and other encodings as appropriate.
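A minimal sketch of the per-language fallback, under stated assumptions: `FALLBACK_CHARSETS` and `decode_with_fallback` are hypothetical names (no such setting exists in Django), the language codes and charsets in the mapping are illustrative (`cs` is the code for Czech), and the code targets Python 3 `bytes`.

```python
# Hypothetical per-language fallback table; not an existing Django setting.
FALLBACK_CHARSETS = {
    "de": "iso-8859-1",  # latin1
    "fr": "iso-8859-1",
    "cs": "iso-8859-2",  # latin2
    "pl": "iso-8859-2",
}


def decode_with_fallback(raw, language, default_charset="utf-8"):
    """Decode with the default charset, then with the language's fallback."""
    try:
        return raw.decode(default_charset)
    except UnicodeDecodeError:
        fallback = FALLBACK_CHARSETS.get(language)
        if fallback is None:
            # No fallback configured: degrade to replacement characters.
            return raw.decode(default_charset, "replace")
        return raw.decode(fallback, "replace")


# The bytes from the traceback above (0xb1 is 'ą' in latin2)
print(decode_with_fallback(b"\xb1\xb1\xb1", "pl"))
```

Note that a single-byte fallback such as latin1 or latin2 accepts almost any byte sequence, so a wrong guess yields mojibake rather than an error; the fallback only helps when the language really predicts the encoding.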