Ticket #3878 (closed: fixed)

Opened 2 years ago

Last modified 1 year ago

(JSON)-serializing utf8 data fails

Reported by: alex@gc-web.de
Assigned to: mtredinnick
Milestone:
Component: Serialization
Version: 0.96
Keywords: utf8 unicode-branch
Cc: django@sparemint.com, reza@zeerak.ir
Triage Stage: Accepted
Has patch: 1
Needs documentation: 0
Needs tests: 1
Patch needs improvement: 0

Description

If I try to serialize UTF-8-encoded data from the database (for example, using fixtures), the JSON output contains Unicode escapes (\uXXXX) that are not loaded back correctly.

Example:

>>> obj = Blah()
>>> obj.test = "ö"
>>> obj.save()
$ ./manage.py dumpdata > blah.json
$ ./manage.py loaddata blah
>>> obj = Blah.objects.all()[0]
>>> print obj.test
blök
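
For comparison, here is a minimal sketch of the same round trip at the JSON level. It assumes Python 3's standard json module rather than the simplejson copy bundled with Django 0.96, and the sample string is only illustrative. It suggests that the \uXXXX escapes are lossless by themselves as long as the value handed to the encoder is already text:

```python
import json

# Illustrative sample value; not taken from a real fixture.
original = "blök"

# ensure_ascii defaults to True, so non-ASCII characters become \uXXXX escapes.
encoded = json.dumps(original)

# Decoding the escaped form restores the original text.
decoded = json.loads(encoded)

print(encoded)
print(decoded == original)
```

If this holds, the escapes in the dump are not the problem; what matters is what kind of string reaches the encoder.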

Attachments

xml_serializer_error.txt (2.2 kB) - added by Saik on 04/16/07 03:20:57.
utf8 problem with xml serializer

Change History

03/30/07 13:37:35 changed by Gábor Farkas <gabor@nekomancer.net>

  • needs_better_patch changed.
  • needs_tests changed.
  • needs_docs changed.

i haven't checked the django fixture code, but this looks very similar to a problem with the simplejson serializer, which is probably the cause. for example:

>>> from django.utils.simplejson import dumps,loads
>>>
>>> byte_text = '\xe7\x8c\xab' # the utf-8 representation of the japanese 'cat' character
>>> uni_text = byte_text.decode('utf-8')
>>> uni_text
u'\u732b'
>>>
>>> loads(dumps(byte_text))
u'\xe7\x8c\xab'

and of course this is wrong. but:

>>> loads(dumps(uni_text))
u'\u732b'

is ok.

so in short: when working with simplejson and non-ascii characters, every string that goes into dumps has to be a unicode string (not a bytestring).
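
A rough modern equivalent of the session above, as a sketch only. It assumes Python 3's standard json module, which refuses bytes outright, so the old byte-per-character behaviour is simulated here by decoding as latin-1:

```python
import json

byte_text = b"\xe7\x8c\xab"  # UTF-8 bytes for U+732B, as in the session above

# Decoding with the right codec yields one character.
good = byte_text.decode("utf-8")

# Decoding as latin-1 maps each byte to its own code point, which mirrors
# the broken u'\xe7\x8c\xab' result shown above.
bad = byte_text.decode("latin-1")

round_trip_good = json.loads(json.dumps(good))
round_trip_bad = json.loads(json.dumps(bad))

# Python 3's json module does not accept bytes at all.
try:
    json.dumps(byte_text)
    bytes_rejected = False
except TypeError:
    bytes_rejected = True

print(len(round_trip_good), len(round_trip_bad), bytes_rejected)
```

The round trip itself is faithful in both cases; the damage happens when the bytes are interpreted with the wrong codec before encoding.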

04/09/07 10:30:34 changed by mrts

  • has_patch set to 1.
  • version changed from SVN to 0.96.
  • needs_tests set to 1.

I fixed this with the following simple patch:

--- Django-0.96/django/utils/simplejson/encoder.py      2007-01-31 00:34:15.000000000 +0200
+++ /usr/lib/python2.4/site-packages/django/utils/simplejson/encoder.py 2007-04-09 18:04:29.000000000 +0300
@@ -247,7 +247,7 @@ class JSONEncoder(object):
                 encoder = encode_basestring_ascii
             else:
                 encoder = encode_basestring
-            yield encoder(o)
+            yield encoder(o.decode('utf-8'))
         elif o is None:
             yield 'null'
         elif o is True:

04/09/07 14:01:35 changed by Gábor Farkas

i haven't tested the patch, but unfortunately there's a problem with it:

you're assuming that the bytestring-data the user has is encoded in UTF-8.

and that's not always true.

(another approach would be to use settings.DEFAULT_CHARSET, but that is still not 100% correct)

but there is good news too: a django branch has been created to switch everything over to unicode. once that is done, this problem will go away.
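
To illustrate the encoding assumption, here is a small sketch in Python 3. The to_text helper is hypothetical (it is not a Django or simplejson function), and the sample bytes are made up for the example:

```python
def to_text(value, encoding="utf-8"):
    """Hypothetical helper: decode bytes with an explicit encoding, pass text through."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# "ö" stored as latin-1 is the single byte 0xF6, which is not valid UTF-8.
latin1_bytes = "ö".encode("latin-1")

# With the correct encoding the text is recovered.
recovered = to_text(latin1_bytes, encoding="latin-1")

# Assuming UTF-8 for the same bytes fails.
try:
    to_text(latin1_bytes)
    utf8_assumption_failed = False
except UnicodeDecodeError:
    utf8_assumption_failed = True

print(recovered == "ö", utf8_assumption_failed)
```

Passing the encoding explicitly only moves the question to the caller: something still has to know how the bytes were encoded, which is why a single hard-coded codec (or a single setting) cannot be right for every input.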

04/09/07 18:07:16 changed by mtredinnick

  • owner changed from jacob to mtredinnick.
  • stage changed from Unreviewed to Accepted.

This will be easiest to fix in the unicode branch. It's on the TODO list there. It's intended to be a short-lived sprinting branch, so I think it's best to leave this to be fixed there and then merged back.

The good news is that on that branch, your fix is absolutely the right idea, although we have some helper functions to make it easier.

Leaving the ticket open so that we remember to ensure it really is fixed.

04/16/07 03:20:57 changed by Saik

  • attachment xml_serializer_error.txt added.

utf8 problem with xml serializer

04/17/07 12:29:23 changed by anonymous

Saik: use the patch given above. Report back if you still have problems.

05/07/07 19:16:46 changed by James Wheare

  • cc set to django@sparemint.com.

05/11/07 06:46:13 changed by mtredinnick

  • summary changed from (JSON)-serializing utf8 data fails to [unicode] (JSON)-serializing utf8 data fails.

05/15/07 11:14:56 changed by mtredinnick

(In [5248]) unicode: Made the serializers unicode-aware. Refs #3878, #4227.

05/15/07 11:16:42 changed by mtredinnick

  • keywords changed from utf8 to utf8 unicode-branch.
  • summary changed from [unicode] (JSON)-serializing utf8 data fails to (JSON)-serializing utf8 data fails.

This was fixed in the unicode branch in [5248] (without changing simplejson.py at all, since that already works well with bytestrings and unicode). I'll close this ticket when the branch is merged back into trunk.

05/31/07 08:06:34 changed by anonymous

  • cc changed from django@sparemint.com to django@sparemint.com, reza@zeerak.ir.

07/04/07 07:11:05 changed by mtredinnick

  • status changed from new to closed.
  • resolution set to fixed.

(In [5609]) Merged Unicode branch into trunk (r4952:5608). This should be fully backwards compatible for all practical purposes.

Fixed #2391, #2489, #2996, #3322, #3344, #3370, #3406, #3432, #3454, #3492, #3582, #3690, #3878, #3891, #3937, #4039, #4141, #4227, #4286, #4291, #4300, #4452, #4702

