Django

Ticket #3690 (closed: fixed)

Opened 1 year ago

Last modified 1 year ago

the smart_unicode in newforms/util.py shouldn't assume utf-8 encoded strings in utf-8 environment

Reported by: fback+django@fback.net
Assigned to: adrian
Milestone:
Component: django.newforms
Version: SVN
Keywords: unicode-branch
Cc:
Triage Stage: Accepted
Has patch: 1
Needs documentation: 0
Needs tests: 1
Patch needs improvement: 1

Description

In a UTF-8 environment:

>>> from django import newforms as forms
>>> f = forms.CharField()
>>> f.clean('aaa')
u'aaa'
>>> f.clean('ąąą')  # <-- these characters are latin2-encoded, not UTF-8
Traceback (most recent call last):
  File "<console>", line 1, in ?
  File "/usr/lib/python2.4/site-packages/django/newforms/fields.py", line 99, in clean
    value = smart_unicode(value)
  File "/usr/lib/python2.4/site-packages/django/newforms/util.py", line 15, in smart_unicode
    s = unicode(s, settings.DEFAULT_CHARSET)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xb1 in position 0: unexpected code byte
>>> 

Django should not trust anything the browser sends. It is trivial (and common in non-latin1 countries) to set a custom encoding in the browser instead of relying on auto-detection. Users do this to cope with misconfigured web servers that declare latin1 but actually send locally encoded characters (e.g. latin2 or KOI8). Input sent in such an encoding currently causes 500 server errors.

Maybe the solution would be to catch UnicodeError and decode the incoming string like this:

>>> "ąąą".decode('ascii', 'replace')  # <-- again latin2-encoded, not UTF-8
u'\ufffd\ufffd\ufffd'
>>> print "ąąą".decode('ascii', 'replace')
���
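The catch-and-replace idea from the console session above could be sketched as follows. This is only an illustration, not the attached patch: `decode_or_replace` is a hypothetical name, and the sketch uses Python 3 `bytes` where the ticket uses a Python 2 `str`.

```python
def decode_or_replace(raw, charset="utf-8"):
    """Decode raw bytes strictly; on failure, fall back to a lossy decode.

    Hypothetical helper, not part of django.newforms.
    """
    try:
        return raw.decode(charset)
    except UnicodeDecodeError:
        # Same fallback as in the console session: every non-ASCII byte
        # becomes U+FFFD instead of raising an exception.
        return raw.decode("ascii", "replace")


latin2_bytes = b"\xb1\xb1\xb1"  # three latin2-encoded "a with ogonek" characters
print(repr(decode_or_replace(b"aaa")))
print(repr(decode_or_replace(latin2_bytes)))
```

The input is lost in the failing case, but the request no longer ends in a 500 error.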

Another solution could be to add a FALLBACK_CHARSET variable that would be used to decode the string. This variable could be set per language in i18n environments, so one could set latin1 for de and fr, latin2 for cz and pl, and other encodings as appropriate.
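A minimal sketch of the per-language fallback, assuming a plain dictionary in place of a real setting. `FALLBACK_CHARSETS` and `decode_with_fallback` are hypothetical names, the language keys are the ones used in this ticket, and the code is written for Python 3 `bytes`.

```python
# Hypothetical per-language fallback table; this is not an actual Django setting.
FALLBACK_CHARSETS = {
    "de": "latin1",
    "fr": "latin1",
    "cz": "latin2",
    "pl": "latin2",
}


def decode_with_fallback(raw, language, default_charset="utf-8"):
    """Try the default charset first, then the fallback for the language."""
    try:
        return raw.decode(default_charset)
    except UnicodeDecodeError:
        fallback = FALLBACK_CHARSETS.get(language)
        if fallback is None:
            # No fallback configured: degrade to replacement characters.
            return raw.decode(default_charset, "replace")
        # Decode lossily so that bad input never raises.
        return raw.decode(fallback, "replace")


print(decode_with_fallback(b"\xb1\xb1\xb1", "pl"))
```

With this table, the latin2 bytes from the traceback decode to the intended Polish characters when the active language is pl.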

Regards, fback

Attachments

util.py.diff (483 bytes) - added by fback+django@fback.net on 03/10/07 11:09:43.
proposed patch

Change History

03/10/07 11:09:43 changed by fback+django@fback.net

  • attachment util.py.diff added.

proposed patch

03/10/07 11:10:37 changed by fback+django@fback.net

  • needs_better_patch set to 1.
  • has_patch set to 1.
  • needs_tests changed.
  • needs_docs changed.

This is the easiest patch that solves it.

Problems:

  • this will work for European languages, but not for Russian or Asian languages
  • after a discussion on #django we did not agree on whether this should be done here or in some middleware
  • this could be improved in two ways: another encoding could be passed as an additional argument to smart_unicode(), or it could try to guess the incoming encoding.
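The "additional argument" variant from the last point might look like the sketch below. `smart_decode` and its `encodings` parameter are assumptions made for illustration, not the signature of smart_unicode(), and the code targets Python 3 `bytes`.

```python
def smart_decode(raw, encodings=("utf-8",)):
    """Return the first strict decode that succeeds among the candidates.

    Hypothetical helper; `encodings` is an ordered tuple of charset names.
    """
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # No candidate matched: decode lossily with the first candidate.
    return raw.decode(encodings[0], "replace")


print(smart_decode(b"\xb1\xb1\xb1", ("utf-8", "latin2")))
```

Candidate order matters here: single-byte charsets such as latin2 or koi8-r accept any byte sequence, so whichever is listed first wins. That is one reason this kind of guessing cannot tell European input from Russian input.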

03/18/07 05:03:11 changed by Simon G. <dev@simon.net.nz>

  • needs_tests set to 1.
  • stage changed from Unreviewed to Accepted.

05/12/07 05:59:16 changed by mtredinnick

  • keywords set to unicode-branch.

This was fixed on the unicode branch in [5197], in a different way to what is given here: we fix it at the source (input), rather than later on.

I will close the ticket when the branch is merged back into trunk.

07/04/07 07:11:05 changed by mtredinnick

  • status changed from new to closed.
  • resolution set to fixed.

(In [5609]) Merged Unicode branch into trunk (r4952:5608). This should be fully backwards compatible for all practical purposes.

Fixed #2391, #2489, #2996, #3322, #3344, #3370, #3406, #3432, #3454, #3492, #3582, #3690, #3878, #3891, #3937, #4039, #4141, #4227, #4286, #4291, #4300, #4452, #4702

