id	summary	reporter	owner	description	type	status	component	version	severity	resolution	keywords	cc	stage	has_patch	needs_docs	needs_tests	needs_better_patch	easy	ui_ux
3690	the smart_unicode in newforms/util.py shouldn't assume utf-8 encoded strings in utf-8 environment	fback+django@…	Adrian Holovaty	"In utf-8 environment:


{{{

>>> from django import newforms as forms
>>> f = forms.CharField()
>>> f.clean('aaa')
u'aaa'
>>> f.clean('ąąą')  <---- these are latin2-encoded characters, not utf-8
Traceback (most recent call last):
  File ""<console>"", line 1, in ?
  File ""/usr/lib/python2.4/site-packages/django/newforms/fields.py"", line 99, in clean
    value = smart_unicode(value)
  File ""/usr/lib/python2.4/site-packages/django/newforms/util.py"", line 15, in smart_unicode
    s = unicode(s, settings.DEFAULT_CHARSET)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xb1 in position 0: unexpected code byte
>>> 

}}}
Django should not trust anything that the browser sends. It is trivial (and common in non-latin1 countries) to set a custom
encoding in the browser instead of auto-detect; this helps with wrongly configured web servers that declare latin1 but in fact send
locally encoded characters (e.g. latin2, or koi8-r, or...). Input submitted in such an encoding then causes a 500 server error.

One solution might be to catch UnicodeError and decode the incoming string with replacement characters, like:

{{{
>>> ""ąąą"".decode('ascii', 'replace')  <--- again latin2-encoded, not utf-8
u'\ufffd\ufffd\ufffd'
>>> print ""ąąą"".decode('ascii', 'replace')
���
}}}
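
As a rough sketch of that first idea (written for Python 3 byte strings; decode_lenient is a hypothetical name and not Django's actual smart_unicode), a strict decode could be tried first and the replacing decode used only on failure:

```python
# Hypothetical helper for illustration only; not part of Django.
def decode_lenient(raw, charset='utf-8'):
    # Try a strict decode first so that valid input comes back unchanged.
    try:
        return raw.decode(charset)
    except UnicodeError:
        # Undecodable bytes become U+FFFD instead of raising an error.
        return raw.decode(charset, 'replace')


# Three latin2-encoded bytes (0xb1, the byte from the traceback above).
latin2_bytes = b'\xb1\xb1\xb1'
print(repr(decode_lenient(latin2_bytes)))
print(repr(decode_lenient(b'aaa')))
```

The user's characters are lost this way, but the request no longer ends in a 500 error.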

Another solution could be to add a FALLBACK_CHARSET variable that is used to decode the string when DEFAULT_CHARSET fails.
This variable could be set per language for i18n environments, so one can set latin1 for de, fr,
latin2 for cz, pl, and other encodings as appropriate.
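
A minimal sketch of how such a fallback might behave, again for Python 3 byte strings. The FALLBACK_CHARSETS mapping and the decode_with_fallback function are assumptions made up for illustration; nothing like them exists in Django:

```python
# Hypothetical per-language fallback table, following the examples above.
FALLBACK_CHARSETS = {
    'de': 'iso-8859-1',
    'fr': 'iso-8859-1',
    'cz': 'iso-8859-2',
    'pl': 'iso-8859-2',
}


def decode_with_fallback(raw, language, default_charset='utf-8'):
    # Hypothetical helper: strict decode with the default charset first.
    try:
        return raw.decode(default_charset)
    except UnicodeError:
        fallback = FALLBACK_CHARSETS.get(language)
        if fallback is None:
            # No fallback configured: replace the undecodable bytes.
            return raw.decode(default_charset, 'replace')
        # Decode with the language's fallback charset, never raising.
        return raw.decode(fallback, 'replace')


print(repr(decode_with_fallback(b'\xb1\xb1\xb1', 'pl')))
```

One caveat with this approach: the fallback is only reached when the default decode fails, so bytes in a local encoding that happen to form valid utf-8 would still be decoded as utf-8.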

Regards,
fback"		closed	Forms	dev		fixed	unicode-branch		Accepted	1	0	1	1	0	0