#3878 closed (fixed)
(JSON)-serializing utf8 data fails
| Reported by: | Owned by: | Malcolm Tredinnick | |
|---|---|---|---|
| Component: | Core (Serialization) | Version: | 0.96 |
| Severity: | Keywords: | utf8 unicode-branch | |
| Cc: | django@…, reza@… | Triage Stage: | Accepted |
| Has patch: | yes | Needs documentation: | no |
| Needs tests: | yes | Patch needs improvement: | no |
| Easy pickings: | no | UI/UX: | no |
Description
If I try to serialize UTF-8-encoded data from the database (for example using fixtures), the JSON output contains
Unicode escapes (\uXXXX) that are not loaded back correctly.
Example:
>>> obj = Blah()
>>> obj.test = "ö"
>>> obj.save()

./manage.py dumpdata > blah.json
./manage.py loaddata blah

>>> obj = Blah.objects.all()[0]
>>> print obj.test
blök
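One plausible way to reproduce this kind of symptom outside Django is sketched below in modern Python 3 (the ticket itself concerns Python 2 and Django 0.96). The Latin-1 decoding step is an assumption used to mimic a serializer that treats each byte of an undecoded bytestring as its own character; it is not the actual code path in Django.

```python
import json

original = "\u00f6"  # the character "ö"

# UTF-8 encodes this character as two bytes.
utf8_bytes = original.encode("utf-8")

# Assumption for illustration: each byte is treated as a separate character,
# as if the bytestring had been decoded as Latin-1.
misread = utf8_bytes.decode("latin-1")

# Serializing and reloading the misread text preserves the wrong characters.
round_tripped = json.loads(json.dumps(misread))

print(ascii(round_tripped))
print(round_tripped == original)
```

The reloaded value has two characters where the original had one, so the comparison on the last line is false.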
Attachments (1)
Change History (13)
comment:1 by , 19 years ago
comment:2 by , 19 years ago
| Has patch: | set |
|---|---|
| Needs tests: | set |
| Version: | SVN → 0.96 |
I fixed this with the following simple patch:
--- Django-0.96/django/utils/simplejson/encoder.py 2007-01-31 00:34:15.000000000 +0200
+++ /usr/lib/python2.4/site-packages/django/utils/simplejson/encoder.py 2007-04-09 18:04:29.000000000 +0300
@@ -247,7 +247,7 @@ class JSONEncoder(object):
                 encoder = encode_basestring_ascii
             else:
                 encoder = encode_basestring
-            yield encoder(o)
+            yield encoder(o.decode('utf-8'))
         elif o is None:
             yield 'null'
         elif o is True:
comment:3 by , 19 years ago
I haven't tested the patch, but unfortunately there's a problem with it:
it assumes that the user's bytestring data is encoded in UTF-8,
and that's not always true.
(Another approach would be to use settings.DEFAULT_CHARSET,
but that is still not 100% correct.)
The good news is that a Django branch has been created to switch
everything to unicode. Once that is done, this problem goes away.
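The objection about assuming UTF-8 can be illustrated with a short Python 3 sketch. The helper `to_text` and its `charset` parameter are hypothetical names chosen for this example; they are not Django APIs.

```python
def to_text(data, charset="utf-8"):
    """Decode a bytestring with the given charset; return text unchanged.

    Hypothetical helper for illustration only.
    """
    if isinstance(data, bytes):
        return data.decode(charset)
    return data


# The character "ö" encoded as Latin-1 is the single byte 0xF6,
# which is not valid UTF-8.
latin1_bytes = "\u00f6".encode("latin-1")

try:
    to_text(latin1_bytes)  # assumes UTF-8, as the patch does
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

print(decoded_ok)
print(ascii(to_text(latin1_bytes, charset="latin-1")))
```

Decoding succeeds only when the caller supplies the charset the data was actually written in, which is why a hard-coded `'utf-8'` is fragile for arbitrary bytestrings.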
comment:4 by , 19 years ago
| Owner: | changed from to |
|---|---|
| Triage Stage: | Unreviewed → Accepted |
This will be easiest to fix in the unicode branch. It's on the TODO list there. It's intended to be a short-lived sprinting branch, so I think it's best to leave this to be fixed there and then merged back.
The good news is that on that branch, your fix is absolutely the right idea, although we have some helper functions to make it easier.
Leaving the ticket open so that we remember to ensure it really is fixed.
comment:5 by , 19 years ago
Saik: use the patch given above. Report back if you still have problems.
comment:6 by , 19 years ago
| Cc: | added |
|---|
comment:7 by , 19 years ago
| Summary: | (JSON)-serializing utf8 data fails → [unicode] (JSON)-serializing utf8 data fails |
|---|
comment:8 by , 18 years ago
comment:9 by , 18 years ago
| Keywords: | unicode-branch added |
|---|---|
| Summary: | [unicode] (JSON)-serializing utf8 data fails → (JSON)-serializing utf8 data fails |
This was fixed in the unicode branch in [5248] (without changing simplejson.py at all, since that already works well with bytestrings and unicode). I'll close this ticket when the branch is merged back into trunk.
comment:10 by , 18 years ago
| Cc: | added |
|---|
comment:11 by , 18 years ago
| Resolution: | → fixed |
|---|---|
| Status: | new → closed |
I haven't checked the Django fixture code, but this problem is very similar to a problem
with the simplejson serializer, so that is probably the cause.
For example:
>>> from django.utils.simplejson import dumps,loads
>>>
>>> byte_text = '\xe7\x8c\xab' # the utf-8 representation of the japanese 'cat' character
>>> uni_text = byte_text.decode('utf-8')
>>> uni_text
u'\u732b'
>>>
>>> print loads(dumps(byte_text))
u'\xe7\x8c\xab'
and of course this is wrong.
but:
is ok.
So in short, when working with simplejson and non-ASCII characters,
every string passed to dumps has to be a unicode string, not a bytestring.
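The same rule can be checked with the standard library `json` module in Python 3, used here as a stand-in for `django.utils.simplejson` (an assumption; the two are related but not identical). In this sketch the bytes are decoded to text before they reach `dumps`, and the round trip then preserves the character.

```python
import json

byte_text = b"\xe7\x8c\xab"           # UTF-8 bytes for the Japanese 'cat' character
uni_text = byte_text.decode("utf-8")  # decode before serializing

encoded = json.dumps(uni_text)        # ensure_ascii is True by default, so the output is escaped
decoded = json.loads(encoded)

print(encoded)
print(decoded == uni_text)
```

Note that Python 3's `json.dumps` rejects `bytes` input with a `TypeError` rather than silently producing wrong output, so the mistake described in this comment is harder to make there.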