#3690 closed (fixed)
the smart_unicode in newforms/util.py shouldn't assume utf-8 encoded strings in utf-8 environment
Reported by: | | Owned by: | Adrian Holovaty |
---|---|---|---|
Component: | Forms | Version: | dev |
Severity: | | Keywords: | unicode-branch |
Cc: | | Triage Stage: | Accepted |
Has patch: | yes | Needs documentation: | no |
Needs tests: | yes | Patch needs improvement: | yes |
Easy pickings: | no | UI/UX: | no |
Description
In utf-8 environment:
>>> from django import newforms as forms
>>> f = forms.CharField()
>>> f.clean('aaa')
u'aaa'
>>> f.clean('ąąą')   <---- these are latin2 characters, not utf-8
Traceback (most recent call last):
  File "<console>", line 1, in ?
  File "/usr/lib/python2.4/site-packages/django/newforms/fields.py", line 99, in clean
    value = smart_unicode(value)
  File "/usr/lib/python2.4/site-packages/django/newforms/util.py", line 15, in smart_unicode
    s = unicode(s, settings.DEFAULT_CHARSET)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xb1 in position 0: unexpected code byte
>>>
Django should not trust anything that the browser sends. It's trivial (and common in non-latin1 countries) to set a custom
encoding instead of auto-detect (this helps with wrongly configured web servers that declare latin1 but in fact send
characters in a local encoding, e.g. latin2, KOI8-R, ...), and doing so will cause 500 server errors.
Maybe the solution would be to catch UnicodeError and decode the incoming string with replacement, like:
>>> "ąąą".decode('ascii', 'replace')   <--- this is again in latin2, not utf8
u'\ufffd\ufffd\ufffd'
>>> print "ąąą".decode('ascii', 'replace')
���
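The catch-and-replace idea above can be sketched roughly as follows. This is only an illustration in Python 3 syntax (where the incoming byte string is a `bytes` object), not the attached patch and not Django's actual code; the helper name `decode_or_replace` and its `charset` parameter are hypothetical.

```python
def decode_or_replace(raw, charset="utf-8"):
    """Decode raw bytes with charset; on failure, substitute U+FFFD.

    Hypothetical helper for illustration, not part of Django.
    """
    try:
        return raw.decode(charset)
    except UnicodeDecodeError:
        # Undecodable bytes become U+FFFD instead of raising,
        # so the request no longer ends in a 500 error.
        return raw.decode(charset, "replace")


# 'ąąą' encoded as latin2 is not valid utf-8.
latin2_bytes = "ąąą".encode("iso-8859-2")
print(repr(decode_or_replace(latin2_bytes)))
print(repr(decode_or_replace(b"aaa")))
```

The trade-off is that the original characters are lost: the user sees replacement characters rather than an error page.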
Another solution could be to add a FALLBACK_CHARSET variable that would be used to decode the string when the default charset fails.
This variable could be set per language for i18n environments, so one can set latin1 for de, fr,
latin2 for cz, pl and other encodings as appropriate.
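A per-language fallback along these lines might look like the sketch below (Python 3 syntax). No such setting exists in Django; `FALLBACK_CHARSETS`, `decode_with_fallback` and the language-to-encoding table are assumptions made for illustration only. The table uses the language code `cs` for Czech.

```python
# Hypothetical per-language fallback table (not an existing Django setting).
FALLBACK_CHARSETS = {
    "de": "iso-8859-1",
    "fr": "iso-8859-1",
    "cs": "iso-8859-2",
    "pl": "iso-8859-2",
}
DEFAULT_CHARSET = "utf-8"


def decode_with_fallback(raw, language_code):
    """Try the default charset first, then the language's fallback."""
    try:
        return raw.decode(DEFAULT_CHARSET)
    except UnicodeDecodeError:
        fallback = FALLBACK_CHARSETS.get(language_code)
        if fallback is None:
            # No fallback known for this language: degrade to U+FFFD.
            return raw.decode(DEFAULT_CHARSET, "replace")
        return raw.decode(fallback, "replace")


print(decode_with_fallback("ąąą".encode("iso-8859-2"), "pl"))
```

This only helps when the active language is a good predictor of the encoding the browser actually used, which is not guaranteed.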
Regards,
fback
Attachments (1)
Change History (6)
by , 18 years ago
Attachment: | util.py.diff added |
---|
comment:1 by , 18 years ago
Has patch: | set |
---|---|
Patch needs improvement: | set |
This is the easiest patch that solves it.
Problems:
- this will work for European languages, but not for Russian / Asian ones
- after discussion on #django we did not agree whether this should be done here or in some middleware
- this could be improved in two ways: one could pass another encoding to use as an additional argument to smart_unicode(), or it could try to guess the incoming encoding.
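The two suggested improvements (an extra encoding argument, or guessing) can be combined in a rough sketch like the one below, written in Python 3 syntax. It is not the attached util.py.diff; the function name `decode_first_match` and its `encodings` parameter are hypothetical. It also shows why naive guessing falls short for Russian input: a single-byte encoding such as latin2 accepts any byte sequence, so the first single-byte candidate always "succeeds", even when the result is wrong.

```python
def decode_first_match(raw, encodings=("utf-8", "iso-8859-2")):
    """Return (text, encoding) for the first candidate that decodes raw.

    Hypothetical helper; `encodings` must be a non-empty sequence.
    """
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Nothing matched: fall back to the first candidate with U+FFFD.
    return raw.decode(encodings[0], "replace"), None


# latin2 input is recovered correctly...
print(decode_first_match("ąąą".encode("iso-8859-2")))

# ...but KOI8-R input is silently mis-decoded as latin2.
text, used = decode_first_match("привет".encode("koi8-r"))
print(used, text == "привет")
```

Real encoding detection needs statistical heuristics or knowledge of what the client actually sent, which is part of why doing this inside smart_unicode() was disputed.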
comment:2 by , 18 years ago
Needs tests: | set |
---|---|
Triage Stage: | Unreviewed → Accepted |
comment:3 by , 18 years ago
Keywords: | unicode-branch added |
---|
This was fixed on the unicode branch in [5197], in a different way to what is given here: we fix it at the source (input), rather than later on.
I will close the ticket when the branch is merged back into trunk.
comment:4 by , 18 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
proposed patch