Opened 8 months ago

Closed 8 months ago

#32439 closed Bug (duplicate)

Dumpdata fails on Windows due to non-utf8 system locale

Reported by: helmstedt
Owned by: nobody
Component: Uncategorized
Version: 3.1
Severity: Normal
Keywords: windows, utf8, encoding, dumpdata
Cc:
Triage Stage: Unreviewed
Has patch: no
Needs documentation: no
Needs tests: no
Patch needs improvement: no
Easy pickings: no
UI/UX: no

Description (last modified by helmstedt)

The command "python manage.py dumpdata -o output.json" fails on Windows when the database contains characters that the system locale encoding cannot represent. An example of the error is:

"CommandError: Unable to serialize database: 'charmap' codec can't encode character '\u0107' in position 8: character maps to <undefined>" (The character is "ć" in this case.)

The reason for the error is, I think, described in https://stackoverflow.com/questions/64457733/django-dumpdata-fails-on-special-characters/65186947#65186947 with a "hacky" solution. I quote:

"To save json data in django the TextIOWrapper is used:

The default encoding is now locale.getpreferredencoding(False) (...)
In the documentation of the locale.getpreferredencoding function we can read:

Return the encoding used for text data, according to user preferences. User preferences are expressed differently on different systems, and might not be available programmatically on some systems, so this function only returns a guess.

Here I found "hacky" but working method to overwrite these settings:

In file settings.py of your django project add these lines:

import _locale
_locale._getdefaultlocale = (lambda *args: ['en_US', 'utf8'])"

In Python I can inspect the value returned by "_locale._getdefaultlocale" on my (Danish) Windows installation:

import _locale
_locale._getdefaultlocale()

('da_DK', 'cp1252')

Because the default encoding on my system is cp1252 instead of utf-8, dumpdata tries to create a json file encoded in cp1252 instead of utf-8 and fails when it encounters a character not supported by this encoding.
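The failure mode described above can be reproduced without Django. The following is a minimal sketch that uses only the character from the error message. `can_encode` is a hypothetical helper introduced for illustration; it is not part of Django or the standard library.

```python
import locale

# Encoding Python falls back to for text files opened without encoding=...
# (reported as cp1252 on the Danish Windows install described above).
default_encoding = locale.getpreferredencoding(False)
print("default text encoding:", default_encoding)


def can_encode(text, encoding):
    """Hypothetical helper: True if `text` is representable in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


character = "\u0107"  # "ć", the character from the error message

# cp1252 has no mapping for U+0107, which is what the 'charmap' error reports.
print("cp1252:", can_encode(character, "cp1252"))
# UTF-8 can represent every Unicode code point, including this one.
print("utf-8:", can_encode(character, "utf-8"))
```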

I can confirm that the "hacky" solution to override those values will make the data dump work.

Since there doesn't seem to be a setting in Windows to set the default locale encoding to UTF-8, Django should provide a way to force UTF-8 encoding (overriding the system encoding) when using the dumpdata command.
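One possible way to get UTF-8 as the default without patching `_locale` is Python's UTF-8 mode (the `-X utf8` option or the `PYTHONUTF8=1` environment variable, available since Python 3.7). This is not mentioned in the ticket. The sketch below only checks, in a child interpreter, which default encoding that mode reports; whether it fully resolves the dumpdata failure on a given setup is an assumption that would need testing.

```python
import subprocess
import sys

# Start a child interpreter in UTF-8 mode and ask it for the default
# text encoding. The parent process is left unchanged.
result = subprocess.run(
    [
        sys.executable,
        "-X", "utf8",
        "-c", "import locale; print(locale.getpreferredencoding(False))",
    ],
    capture_output=True,
    text=True,
    check=True,
)

encoding_in_utf8_mode = result.stdout.strip()
# The exact spelling ('UTF-8' or 'utf-8') depends on the Python version.
print("default encoding in UTF-8 mode:", encoding_in_utf8_mode)
```

Under that assumption, the command from the report would be run as "python -X utf8 manage.py dumpdata -o output.json".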

Traceback is attached.

Attachments (1)

traceback.txt (2.0 KB) - added by helmstedt 8 months ago.
Traceback of the error


Change History (8)

Changed 8 months ago by helmstedt

Attachment: traceback.txt added

Traceback of the error

comment:1 Changed 8 months ago by helmstedt

Description: modified (diff)

comment:2 Changed 8 months ago by helmstedt

Description: modified (diff)

comment:3 Changed 8 months ago by helmstedt

Description: modified (diff)

comment:4 Changed 8 months ago by David Smith

Duplicate of #26721?

comment:5 in reply to:  4 Changed 8 months ago by helmstedt

Replying to David Smith:

Duplicate of #26721?

Maybe both have to do with encodings on Windows, but the behavior described in that issue is different, since the dump is actually created, but with wrong encoding (if I understand it correctly). In my case the dump file is only written until the character before the non-supported character. Also, when comparing my traceback to the traceback from https://groups.google.com/forum/#!topic/django-users/NAHD058Gh_Q/discussion, it does seem like two different issues.

I'm not a Python or Django expert by any means, but as you can see from the end of my traceback...

File "C:\Users\Morten\AppData\Local\Programs\Python\Python38\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]

Django ends up encoding the output with Python's cp1252 codec even though the database is UTF-8, because it assumes the encoding set in Windows is the right one to use.

But if I am right that Windows will never have utf8 set as a default locale, Django should provide a way to override the default locale encoding and set utf8 instead.

Last edited 8 months ago by helmstedt

comment:6 Changed 8 months ago by Carlton Gibson

Do we enforce utf8 internally? If so we could patch dumpdata to pass encoding to the open() call.
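The idea of passing an encoding to open() can be sketched as follows. This is not Django's dumpdata code; `dump_rows`, the sample row, and the temporary path are hypothetical and only illustrate that an explicit encoding makes the output independent of the system locale.

```python
import json
import os
import tempfile


def dump_rows(rows, path):
    """Hypothetical helper: write rows as JSON, always encoded as UTF-8."""
    # encoding="utf-8" overrides the locale default (cp1252 in the report).
    with open(path, "w", encoding="utf-8") as stream:
        # ensure_ascii=False writes "ć" as-is instead of as a \u escape,
        # which is the situation where the file encoding matters.
        json.dump(rows, stream, ensure_ascii=False)


rows = [{"model": "app.person", "pk": 1, "fields": {"name": "Ivi\u0107"}}]
path = os.path.join(tempfile.mkdtemp(), "output.json")
dump_rows(rows, path)

# Read the file back as bytes and as JSON to confirm the round trip.
with open(path, "rb") as stream:
    raw = stream.read()
with open(path, encoding="utf-8") as stream:
    loaded = json.load(stream)

print(loaded == rows)
```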

However...

But if I am right that Windows will never have utf8 set as a default locale...

That's not quite right. In Settings there's a "Use Unicode UTF-8 for worldwide language support" box under "Language" - "Administrative Language Settings" - "Change system locale" - "Region Settings".

If we apply that, and reboot, then we get a sensible, modern, default encoding from Python:

>>> import _locale
>>> _locale._getdefaultlocale()
('en_GB', 'cp65001')  # This is cp1252 without the setting checked.

This has come up on Django Developers before.

My inclination would be to document this, maybe as a fix to #26721, and leave it there. What do we think?

comment:7 Changed 8 months ago by Carlton Gibson

Resolution: duplicate
Status: new → closed

OK, I'm going to close this as a duplicate of #26721 for now. It feels like a documentation issue to me, solved by enabling UTF-8 system-wide. If that's not correct, I'm happy to take follow-up and re-open if needed.
