Opened 11 years ago

Closed 11 years ago

Last modified 11 years ago

#22399 closed Bug (fixed)

loaddata doesn't work correctly when importing utf-8 encoded files

Reported by: bacilla
Owned by: nobody
Component: Core (Management commands)
Version: 1.6
Severity: Normal
Keywords: loaddata utf-8 python3
Cc:
Triage Stage: Accepted
Has patch: no
Needs documentation: no
Needs tests: no
Patch needs improvement: no
Easy pickings: no
UI/UX: no

Description

Environment: Windows 7, Python 3.3, Django 1.6.2, PyYAML 3.11

When initializing the DB with a yaml fixture that contains Russian characters, like this:

- model: testapp.City
  fields:
    name: Санкт-Петербург

or Unicode escape sequences, like this:

- model: testapp.City
  fields:
    name: "\u040c\u00ae\u045e\u00ae\u0431\u0401\u040e\u0401\u0430\u0431\u0404"

garbage appears in the 'name' column.

This seems to happen because the fixture file is not opened with UTF-8 encoding: the open call at line 122 of the source file 'django/core/management/commands/loaddata.py' is missing the parameter 'encoding="utf-8"'.

Related Python discussion:
https://mail.python.org/pipermail/python-ideas/2013-June/021230.html
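
To illustrate the suspected cause, here is a minimal sketch that is independent of Django. It assumes cp1251 as a stand-in for a non-UTF-8 default encoding (the ticket does not say which encoding the reporter's system used) and compares three ways of reading the same UTF-8 bytes:

```python
import os
import tempfile

text = "Санкт-Петербург"

# Save the text as UTF-8 bytes, the way the fixture file is stored on disk.
with tempfile.NamedTemporaryFile("wb", suffix=".yaml", delete=False) as tmp:
    tmp.write(text.encode("utf-8"))
    path = tmp.name

try:
    # Assumption: cp1251 stands in for a platform default that is not UTF-8.
    with open(path, "r", encoding="cp1251") as f:
        garbled = f.read()

    # Text mode with an explicit encoding (Python 3 only).
    with open(path, "r", encoding="utf-8") as f:
        decoded = f.read()

    # Binary mode, decoding the bytes explicitly afterwards.
    with open(path, "rb") as f:
        from_bytes = f.read().decode("utf-8")
finally:
    os.remove(path)

print(garbled == text, decoded == text, from_bytes == text)
```

Only the first read depends on an encoding other than the one the file was written in, which is the situation that text mode without an explicit encoding can produce.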

Attachments (1)

testproj.tar.gz (6.3 KB ) - added by bacilla 11 years ago.
sample project


Change History (11)

by bacilla, 11 years ago

Attachment: testproj.tar.gz added

sample project

comment:1 by Claude Paroz, 11 years ago

Keywords: python3 added
Triage Stage: Unreviewed → Accepted

encoding="utf-8" is a Python 3 addition to the built-in open() function (and it only makes sense when reading the file in text mode).

I think that for best compatibility with the other open methods (gzip, zip, bzip), it would be easier to simply force opening the file in binary mode ('rb'); the deserializing step should then take care of decoding the file as UTF-8 automatically. Could you test whether using fixture = open_method(fixture_file, 'rb') solves your issue?
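
A rough sketch of that suggestion outside Django: the same 'rb' call works for both the built-in open() and gzip.GzipFile, and the bytes are decoded as UTF-8 afterwards. The helper read_fixture_text is hypothetical and only stands in for the decoding that the deserializer would do in loaddata.

```python
import gzip
import os
import tempfile

payload = "name: Санкт-Петербург\n".encode("utf-8")


def read_fixture_text(open_method, path):
    """Hypothetical helper: open in binary mode, then decode as UTF-8."""
    fixture = open_method(path, "rb")
    try:
        return fixture.read().decode("utf-8")
    finally:
        fixture.close()


with tempfile.TemporaryDirectory() as tmpdir:
    plain_path = os.path.join(tmpdir, "city.yaml")
    gz_path = os.path.join(tmpdir, "city.yaml.gz")

    # Write the same UTF-8 bytes uncompressed and gzip-compressed.
    with open(plain_path, "wb") as f:
        f.write(payload)
    with gzip.GzipFile(gz_path, "wb") as f:
        f.write(payload)

    results = {
        "plain": read_fixture_text(open, plain_path),
        "gz": read_fixture_text(gzip.GzipFile, gz_path),
    }

print(results["plain"] == results["gz"])
```

Because no text-mode decoding happens at open time, the result does not depend on the platform's default encoding.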

comment:2 by bacilla, 11 years ago

Triage Stage: Accepted → Unreviewed

This fixes the first case (characters in the yaml file), but doesn't fix the second (unicode escape sequences).

comment:3 by Claude Paroz, 11 years ago

Triage Stage: Unreviewed → Accepted

comment:4 by Claude Paroz, 11 years ago

As for the escaped sequence, what are you expecting? Looking at your proposed sequence, the result really is "Ќ®ў®бЁЎЁабЄ"... (\u040c = Ќ, \u00ae = ®, etc.)

comment:5 by bacilla, 11 years ago

Oh, you're right, it is my fault. The correct string is '\u041d\u043e\u0432\u043e\u0441\u0438\u0431\u0438\u0440\u0441\u043a' and it works perfectly.
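
For reference, a small sketch comparing the two escape sequences from this thread. The cp866/cp1251 round trip is only a guess at how the first sequence came about (the ticket does not say), but it does reproduce it exactly:

```python
# The corrected sequence from comment:5.
intended = "\u041d\u043e\u0432\u043e\u0441\u0438\u0431\u0438\u0440\u0441\u043a"

# The sequence from the original description.
reported = "\u040c\u00ae\u045e\u00ae\u0431\u0401\u040e\u0401\u0430\u0431\u0404"

# Assumption: the text was encoded as cp866 (DOS Cyrillic) and then
# misread as cp1251 (Windows Cyrillic) before being escaped.
roundtrip = intended.encode("cp866").decode("cp1251")

print(intended)
print(reported)
print(roundtrip == reported)
```

If that guess is right, the garbage in the second case was already present in the fixture itself, so loaddata loaded it faithfully.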

comment:6 by Claude Paroz, 11 years ago

OK, then there is an obvious fix, always reading in binary mode:

diff --git a/django/core/management/commands/loaddata.py b/django/core/management/commands/loaddata.py
index 44583bd..44946fe 100644
--- a/django/core/management/commands/loaddata.py
+++ b/django/core/management/commands/loaddata.py
@@ -125,7 +125,7 @@ class Command(BaseCommand):
         for fixture_file, fixture_dir, fixture_name in self.find_fixtures(fixture_label):
             _, ser_fmt, cmp_fmt = self.parse_name(os.path.basename(fixture_file))
             open_method = self.compression_formats[cmp_fmt]
-            fixture = open_method(fixture_file, 'r')
+            fixture = open_method(fixture_file, 'rb')
             try:
                 self.fixture_count += 1
                 objects_in_fixture = 0

Or a more elaborate patch that tries to take advantage of reading in text mode on Python 3:

diff --git a/django/core/management/commands/loaddata.py b/django/core/management/commands/loaddata.py
index 44583bd..5938770 100644
--- a/django/core/management/commands/loaddata.py
+++ b/django/core/management/commands/loaddata.py
@@ -14,7 +14,7 @@ from django.core.management.base import BaseCommand, CommandError
 from django.core.management.color import no_style
 from django.db import (connections, router, transaction, DEFAULT_DB_ALIAS,
       IntegrityError, DatabaseError)
-from django.utils import lru_cache
+from django.utils import lru_cache, six
 from django.utils.encoding import force_text
 from django.utils.functional import cached_property
 from django.utils._os import upath
@@ -76,13 +76,14 @@ class Command(BaseCommand):
         self.models = set()
 
         self.serialization_formats = serializers.get_public_serializer_formats()
+        kwargs = {'encoding': 'utf-8'} if six.PY3 else {}
         self.compression_formats = {
-            None: open,
-            'gz': gzip.GzipFile,
-            'zip': SingleZipReader
+            None: (open, kwargs),
+            'gz': (gzip.GzipFile, kwargs),
+            'zip': (SingleZipReader, {}),
         }
         if has_bz2:
-            self.compression_formats['bz2'] = bz2.BZ2File
+            self.compression_formats['bz2'] = (bz2.BZ2File, kwargs)
 
         with connection.constraint_checks_disabled():
             for fixture_label in fixture_labels:
@@ -124,8 +125,8 @@ class Command(BaseCommand):
         """
         for fixture_file, fixture_dir, fixture_name in self.find_fixtures(fixture_label):
             _, ser_fmt, cmp_fmt = self.parse_name(os.path.basename(fixture_file))
-            open_method = self.compression_formats[cmp_fmt]
-            fixture = open_method(fixture_file, 'r')
+            open_method, kwargs = self.compression_formats[cmp_fmt]
+            fixture = open_method(fixture_file, 'rb', **kwargs)
             try:
                 self.fixture_count += 1
                 objects_in_fixture = 0

comment:7 by Claude Paroz <claude@…>, 11 years ago

Resolution: fixed
Status: new → closed

In ed532a6a1ee675432940e69cec866b52aca96575:

Fixed #22399 -- Forced fixture reading in binary mode

This might help on systems where default encoding is not UTF-8 (and
on Python 3).
Thanks bacilla for the report.

comment:8 by Claude Paroz <claude@…>, 11 years ago

In 8d7023dc714acc957fac7ef422ccee4d83429b09:

[1.7.x] Fixed #22399 -- Forced fixture reading in binary mode

This might help on systems where default encoding is not UTF-8 (and
on Python 3).
Thanks bacilla for the report.
Backport of ed532a6a1 from master.

comment:9 by Claude Paroz <claude@…>, 11 years ago

In 275811a93c1e5bc6505605967cf2da01f1c038fe:

Adapted fixture read mode to file type

Binary mode added in ed532a6a1e is not supported by ZipFile.
Refs #22399.
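
A minimal sketch of the idea behind this follow-up, not the committed code: each opener is paired with a read mode it accepts. Here zipfile.ZipFile stands in for the command's SingleZipReader, and the mapping layout and the helper zipfile_accepts_mode are assumptions made for illustration.

```python
import gzip
import io
import zipfile

# Hypothetical layout: opener plus the read mode that opener supports.
compression_formats = {
    None: (open, "rb"),
    "gz": (gzip.GzipFile, "rb"),
    "zip": (zipfile.ZipFile, "r"),
}


def zipfile_accepts_mode(mode):
    """Return True if zipfile.ZipFile can open an archive with this mode."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("city.yaml", "name: test\n")
    buf.seek(0)
    try:
        zipfile.ZipFile(buf, mode).close()
    except (ValueError, RuntimeError):
        # The exception type for an unsupported mode varies by Python version.
        return False
    return True


print(zipfile_accepts_mode("r"), zipfile_accepts_mode("rb"))
```

With such a mapping, the call site would unpack the opener and its mode together instead of hard-coding 'rb' for every format.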

comment:10 by Claude Paroz <claude@…>, 11 years ago

In 13340df76984d019ff9d4612ed6f38507546aade:

[1.7.x] Adapted fixture read mode to file type

Binary mode added in ed532a6a1e is not supported by ZipFile.
Refs #22399.
Backport of 275811a93 from master.
