Opened 12 months ago

Closed 11 months ago

Last modified 11 months ago

#22399 closed Bug (fixed)

loaddata doesn't work correctly when importing utf-8 encoded files

Reported by: bacilla Owned by: nobody
Component: Core (Management commands) Version: 1.6
Severity: Normal Keywords: loaddata utf-8 python3
Cc: Triage Stage: Accepted
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

Environment: Windows 7, Python 3.3, Django 1.6.2, PyYAML 3.11

When initializing the DB with a YAML fixture that contains Russian characters, like this:

- model: testapp.City
  fields:
    name: Санкт-Петербург

or Unicode escape sequences, like this:

- model: testapp.City
  fields:
    name: "\u040c\u00ae\u045e\u00ae\u0431\u0401\u040e\u0401\u0430\u0431\u0404"

garbage appears in the 'name' column.

This seems to happen because the fixture file is not opened with UTF-8 encoding: line 122 of 'django/core/management/commands/loaddata.py' is missing the parameter 'encoding="utf-8"'.

Related Python discussion:
https://mail.python.org/pipermail/python-ideas/2013-June/021230.html
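
To illustrate the suspected cause, here is a minimal sketch of what happens when UTF-8 bytes are read in text mode with a different encoding. The cp1251 encoding is an assumption standing in for the default encoding of a Russian-locale Windows machine; the reporter's actual default is not stated in the ticket.

```python
import os
import tempfile

FIXTURE = "- model: testapp.City\n  fields:\n    name: Санкт-Петербург\n"

# Save the fixture as UTF-8 bytes, the way an editor would.
with tempfile.NamedTemporaryFile("wb", suffix=".yaml", delete=False) as tmp:
    tmp.write(FIXTURE.encode("utf-8"))
    path = tmp.name

try:
    # Assumption: cp1251 simulates a non-UTF-8 default encoding.
    with open(path, "r", encoding="cp1251") as fh:
        garbled = fh.read()
    # Passing the encoding explicitly gives the intended text.
    with open(path, "r", encoding="utf-8") as fh:
        correct = fh.read()
finally:
    os.remove(path)

print(repr(garbled))
print(repr(correct))
```

Without an explicit encoding, Python 3's open() falls back to a platform-dependent default, so the same fixture can load correctly on one machine and produce garbage on another.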

Attachments (1)

testproj.tar.gz (6.3 KB) - added by bacilla 12 months ago.
sample project


Change History (11)

Changed 12 months ago by bacilla

sample project

comment:1 Changed 12 months ago by claudep

  • Keywords python3 added
  • Needs documentation unset
  • Needs tests unset
  • Patch needs improvement unset
  • Triage Stage changed from Unreviewed to Accepted

encoding="utf-8" is a Python 3 addition to the built-in open() function (and it only makes sense when reading the file in text mode).

I think that for best compatibility with other open methods (gzip, zip, bzip), it would be easier to simply force opening the file in binary mode ('rb'); the deserializing step should then automatically take care of decoding the file as 'utf-8'. Could you test whether using fixture = open_method(fixture_file, 'rb') solves your issue?
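
The idea can be sketched outside Django as follows. The helper read_fixture_bytes is hypothetical and only stands in for the deserializing step; it is not Django's code. The point is that a plain stream and a gzip stream can both be read as bytes and decoded in one place, independent of the locale.

```python
import gzip
import io

TEXT = "name: Санкт-Петербург\n"
raw = TEXT.encode("utf-8")


def read_fixture_bytes(stream):
    """Hypothetical helper: read raw bytes and decode them as UTF-8."""
    return stream.read().decode("utf-8")


# Plain (uncompressed) binary stream.
from_plain = read_fixture_bytes(io.BytesIO(raw))

# Gzip-compressed stream, written and then read back in binary mode.
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
    gz.write(raw)
buf.seek(0)
with gzip.GzipFile(fileobj=buf, mode="rb") as gz:
    from_gzip = read_fixture_bytes(gz)

print(from_plain == from_gzip == TEXT)
```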

comment:2 Changed 12 months ago by bacilla

  • Triage Stage changed from Accepted to Unreviewed

This fixes the first case (characters in the yaml file), but doesn't fix the second (Unicode escape sequences).

comment:3 Changed 12 months ago by claudep

  • Triage Stage changed from Unreviewed to Accepted

comment:4 Changed 12 months ago by claudep

As for the escaped sequence, what are you expecting? If I'm looking at your proposed sequence, the result is really "Ќ®ў®бЁЎЁабЄ"... (\u040c = Ќ, \u00ae = ®, etc.)

comment:5 Changed 12 months ago by bacilla

Oh, you're right, it is my fault. The correct string is '\u041d\u043e\u0432\u043e\u0441\u0438\u0431\u0438\u0440\u0441\u043a' and it works perfectly.
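
The two escape sequences can be checked directly in Python 3. The last step is only a guess at how the first string came about (cp866 bytes interpreted as cp1251); the ticket does not say how it was produced.

```python
# The sequence from the original report and the corrected one.
reported = "\u040c\u00ae\u045e\u00ae\u0431\u0401\u040e\u0401\u0430\u0431\u0404"
corrected = "\u041d\u043e\u0432\u043e\u0441\u0438\u0431\u0438\u0440\u0441\u043a"

print(reported)
print(corrected)

# Assumption: the reported string is the corrected one encoded as cp866
# and then decoded as cp1251 before being turned into escapes.
mojibake = corrected.encode("cp866").decode("cp1251")
print(mojibake == reported)
```

So the escapes were loaded faithfully; the garbage was already present in the fixture itself.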

comment:6 Changed 12 months ago by claudep

OK, then there is an obvious fix, always reading in binary mode:

diff --git a/django/core/management/commands/loaddata.py b/django/core/management/commands/loaddata.py
index 44583bd..44946fe 100644
--- a/django/core/management/commands/loaddata.py
+++ b/django/core/management/commands/loaddata.py
@@ -125,7 +125,7 @@ class Command(BaseCommand):
         for fixture_file, fixture_dir, fixture_name in self.find_fixtures(fixture_label):
             _, ser_fmt, cmp_fmt = self.parse_name(os.path.basename(fixture_file))
             open_method = self.compression_formats[cmp_fmt]
-            fixture = open_method(fixture_file, 'r')
+            fixture = open_method(fixture_file, 'rb')
             try:
                 self.fixture_count += 1
                 objects_in_fixture = 0

Or a more elaborate patch that tries to take advantage of reading in text mode on Python 3:

diff --git a/django/core/management/commands/loaddata.py b/django/core/management/commands/loaddata.py
index 44583bd..5938770 100644
--- a/django/core/management/commands/loaddata.py
+++ b/django/core/management/commands/loaddata.py
@@ -14,7 +14,7 @@ from django.core.management.base import BaseCommand, CommandError
 from django.core.management.color import no_style
 from django.db import (connections, router, transaction, DEFAULT_DB_ALIAS,
       IntegrityError, DatabaseError)
-from django.utils import lru_cache
+from django.utils import lru_cache, six
 from django.utils.encoding import force_text
 from django.utils.functional import cached_property
 from django.utils._os import upath
@@ -76,13 +76,14 @@ class Command(BaseCommand):
         self.models = set()
 
         self.serialization_formats = serializers.get_public_serializer_formats()
+        kwargs = {'encoding': 'utf-8'} if six.PY3 else {}
         self.compression_formats = {
-            None: open,
-            'gz': gzip.GzipFile,
-            'zip': SingleZipReader
+            None: (open, kwargs),
+            'gz': (gzip.GzipFile, kwargs),
+            'zip': (SingleZipReader, {}),
         }
         if has_bz2:
-            self.compression_formats['bz2'] = bz2.BZ2File
+            self.compression_formats['bz2'] = (bz2.BZ2File, kwargs)
 
         with connection.constraint_checks_disabled():
             for fixture_label in fixture_labels:
@@ -124,8 +125,8 @@ class Command(BaseCommand):
         """
         for fixture_file, fixture_dir, fixture_name in self.find_fixtures(fixture_label):
             _, ser_fmt, cmp_fmt = self.parse_name(os.path.basename(fixture_file))
-            open_method = self.compression_formats[cmp_fmt]
-            fixture = open_method(fixture_file, 'r')
+            open_method, kwargs = self.compression_formats[cmp_fmt]
+            fixture = open_method(fixture_file, 'rb', **kwargs)
             try:
                 self.fixture_count += 1
                 objects_in_fixture = 0

comment:7 Changed 11 months ago by Claude Paroz <claude@…>

  • Resolution set to fixed
  • Status changed from new to closed

In ed532a6a1ee675432940e69cec866b52aca96575:

Fixed #22399 -- Forced fixture reading in binary mode

This might help on systems where default encoding is not UTF-8 (and
on Python 3).
Thanks bacilla for the report.

comment:8 Changed 11 months ago by Claude Paroz <claude@…>

In 8d7023dc714acc957fac7ef422ccee4d83429b09:

[1.7.x] Fixed #22399 -- Forced fixture reading in binary mode

This might help on systems where default encoding is not UTF-8 (and
on Python 3).
Thanks bacilla for the report.
Backport of ed532a6a1 from master.

comment:9 Changed 11 months ago by Claude Paroz <claude@…>

In 275811a93c1e5bc6505605967cf2da01f1c038fe:

Adapted fixture read mode to file type

Binary mode added in ed532a6a1e is not supported by ZipFile.
Refs #22399.

comment:10 Changed 11 months ago by Claude Paroz <claude@…>

In 13340df76984d019ff9d4612ed6f38507546aade:

[1.7.x] Adapted fixture read mode to file type

Binary mode added in ed532a6a1e is not supported by ZipFile.
Refs #22399.
Backport of 275811a93 from master.
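
The ZipFile limitation mentioned in the last two commits can be reproduced with the standard library alone. This is a small sketch, not the Django change itself; the archive member name cities.yaml is made up for the example.

```python
import io
import zipfile

# Build a small in-memory archive holding one UTF-8 encoded fixture.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("cities.yaml", "name: Санкт-Петербург\n".encode("utf-8"))

# ZipFile does not accept 'rb' as a mode; the exception type has varied
# between Python versions, so both are caught here.
buf.seek(0)
try:
    zipfile.ZipFile(buf, "rb")
    rejected = False
except (ValueError, RuntimeError):
    rejected = True

# Mode 'r' works, and the member content still comes back as bytes.
buf.seek(0)
with zipfile.ZipFile(buf, "r") as zf:
    text = zf.read(zf.namelist()[0]).decode("utf-8")

print(rejected, text)
```

This is why the read mode has to depend on the file type: 'rb' for plain, gzip and bz2 files, and 'r' for zip archives.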
