Opened 2 years ago

Closed 15 months ago

Last modified 15 months ago

#23271 closed Bug (fixed)

Makemessages can corrupt existing .po files on Windows

Reported by: danielmenzel Owned by: nobody
Component: Internationalization Version: master
Severity: Release blocker Keywords: makemessages utf8 unicode
Cc: Triage Stage: Accepted
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

Description

Seen on German Windows7 SP1 with 64bit Python 3.4.1 and gettext 0.18.1.

When you have an existing .po file with translations, e.g.

msgid ""
msgstr ""
"Project-Id-Version: \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2014-03-03 10:44+0100\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: \n"
"Language-Team: \n"
"Language: de\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=UTF-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Plural-Forms: nplurals=2; plural=(n != 1);\n"

msgid "Size"
msgstr "Größe"

and then run

manage.py makemessages --no_location --no_wrap -l de

to update the .po file, you get a corrupted .po file:

msgid ""
msgstr ""
"Project-Id-Version: \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2014-03-03 10:44+0100\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: \n"
"Language-Team: \n"
"Language: de\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=UTF-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Plural-Forms: nplurals=2; plural=(n != 1);\n"

msgid "Size"
msgstr "Größe"

Investigation

Setting environment variables like LANG, LANGUAGE, LC_ALL, LC_MESSAGES, etc. has no effect on the outcome. Also calling chcp 65001 on the Windows console does not fix the problem.

However, it seems like the described behavior was introduced as a side-effect with this commit https://github.com/django/django/commit/dbb48d2bb99a5f660cf2d85f137b8d87fc12d99f:
All file accesses in makemessages.py were changed to explicitly use utf-8, but the stdout of the gettext binaries (like msgmerge etc.) are still interpreted with the Windows encoding cp1252, as popen_wrapper sets universal_newline=True which in turn uses the encoding returned by locale.getpreferredencoding().

As a test I patched popen_wrapper to interpret output of external processes with utf-8 encoding instead of locale.getpreferredencoding():

def popen_wrapper_utf8(args, os_err_exc_type=django.core.management.base.CommandError):
    """Monkey-patch for django.core.management.utils.popen_wrapper"""
    try:
        p = Popen(args, shell=False, stdout=PIPE, stderr=PIPE,
                  close_fds=os.name != 'nt', universal_newlines=True)
    except OSError as e:
        six.reraise(os_err_exc_type, os_err_exc_type('Error executing %s: %s' %
                                                     (args[0], e.strerror)), sys.exc_info()[2])
    output, errors = p.communicate()

    # Additional utf-8 decoding
    output = output.encode(locale.getpreferredencoding(False)).decode('utf-8')
    #

    return (
        output,
        force_text(errors, DEFAULT_LOCALE_ENCODING, strings_only=True),
        p.returncode
    )

This has the desired effect and prevents the corruption of the .po files.

I also tried changing the value returned by locale.getpreferredencoding() to "utf-8", but that seems impossible on Windows, as Python uses the win32 API GetACP(), which according to MSDN http://msdn.microsoft.com/en-us/library/windows/desktop/dd318070(v=vs.85).aspx only returns ANSI codepages and thus will never return "utf-8".

Change History (13)

comment:1 Changed 2 years ago by claudep

  • Needs documentation unset
  • Needs tests unset
  • Patch needs improvement unset
  • Severity changed from Normal to Release blocker
  • Triage Stage changed from Unreviewed to Accepted

I'm guilty for the regression, but I'm afraid I will not be able to debug this issue as I don't have access to Windows machines.

If noone else can fix it, I can revert the patch in the 1.7 branch to remove the release blocking for 1.7.

comment:2 Changed 2 years ago by andrewgodwin

I am also without a Windows dev environment. If we have to remove the patch from 1.7, and we think we can do it cleanly, then it's an option.

Last edited 2 years ago by andrewgodwin (previous) (diff)

comment:3 Changed 2 years ago by claudep

  • Version changed from 1.7-rc-2 to master

[cdfefbec721b59695e28] has been reverted on the 1.7.x branch. I've applied [67870137b9e1f1] instead to fix #22686, but I'd like to keep the more extensive fix on master. The release blocking flag is now only applicable to master.

comment:4 Changed 20 months ago by timgraham

Ramiro said he would have some time to look at this before 1.8 alpha.

comment:5 Changed 19 months ago by timgraham

  • Has patch set
  • Triage Stage changed from Accepted to Ready for checkin

comment:6 Changed 19 months ago by Ramiro Morales <ramiro@…>

  • Resolution set to fixed
  • Status changed from new to closed

In 6fb9dee470d57882e378247fd2706d5f9867b5f9:

Fixed #23271 -- Don't corrupt PO files on Windows when updating them.

Make sure PO catalog text fetched from gettext programs via standard
output isn't corrupted by mismatch between assumed (UTF-8) and real
(CP1252) encodings. This can cause mojibake to be written when creating
or updating PO files.

Also fixes #23311.

Thanks to contributor with Trac nick 'danielmenzel' for the report,
excellent research and fix.

comment:7 Changed 19 months ago by timgraham

  • Has patch unset
  • Resolution fixed deleted
  • Status changed from closed to new
  • Triage Stage changed from Ready for checkin to Accepted

Python 3 looks good, but now I have test failures on Python 2 and Windows. Can you reproduce?

comment:8 Changed 19 months ago by Ramiro Morales <cramm0@…>

In 002a8ffe478b7520a64c9176f9f640633f643b9c:

Fixed breakage by 6fb9dee4 under Python2+Windows.

Refs #23271

comment:9 Changed 19 months ago by ramiro

  • Resolution set to fixed
  • Status changed from new to closed

comment:10 Changed 16 months ago by daphshez

  • Resolution fixed deleted
  • Status changed from closed to new

Regression fails on my machine (Windows 7 64, Django development version head, with

locale.getpreferredencoding()=='cp1255'
======================================================================
ERROR: test_po_file_encoding_when_updating (i18n.test_extraction.BasicExtractorTests)
Update of PO file doesn't corrupt it with non-UTF-8 encoding on Python3+Windows (#23271)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\data\dev\django\django\tests\i18n\test_extraction.py", line 407, in test_po_file_encoding_when_updating
    management.call_command('makemessages', locale=['pt_BR'], verbosity=0)
  File "C:\data\dev\django\django\django\core\management\__init__.py", line 118, in call_command
    return command.execute(*args, **defaults)
  File "C:\data\dev\django\django\django\core\management\base.py", line 398, in execute
    output = self.handle(*args, **options)
  File "C:\data\dev\django\django\django\core\management\commands\makemessages.py", line 322, in handle
    self.write_po_file(potfile, locale)
  File "C:\data\dev\django\django\django\core\management\commands\makemessages.py", line 455, in write_po_file
    msgs, errors, status = gettext_popen_wrapper(args)
  File "C:\data\dev\django\django\django\core\management\commands\makemessages.py", line 39, in
gettext_popen_wrapper
    stdout, stderr, status_code = popen_wrapper(args, os_err_exc_type=os_err_exc_type)
  File "C:\data\dev\django\django\django\core\management\utils.py", line 27, in popen_wrapper
    output, errors = p.communicate()
  File "C:\Python34\lib\subprocess.py", line 959, in communicate
    stdout, stderr = self._communicate(input, endtime, timeout)
  File "C:\Python34\lib\subprocess.py", line 1234, in _communicate
    stdout = stdout[0]
IndexError: list index out of range

Also, this seems also to be the source of #21928 which isn't resolved for other people.

The problem is that Popen still tries to encode msgmerge output stream using DEFAULT_LOCALE_ENCODING, when the data is actually in UTF-8.

A way around it is to call Popen with universal_newlines=true, and then take care of line ending inside gettext_popen_wrapper. I'll post a patch soon.

(BTW the current solution is for Windows only. Linux uses have to set undocumented environment variables. I think we should strive to solve this for Linux users as well).

comment:11 Changed 16 months ago by daphshez

Well, I patched the code and sent a pull request at

https://github.com/django/django/pull/4532

This is my first attempt to contribute code and I seem to have messed up with the process. I am sorry about it. I hope my fix will get considered anyway...

comment:12 Changed 15 months ago by Tim Graham <timograham@…>

  • Resolution set to fixed
  • Status changed from new to closed

In 57202a11:

Fixed #23271 -- Fixed makemessages crash/test failure for some locales.

comment:13 Changed 15 months ago by Tim Graham <timograham@…>

In c45fd57f:

[1.8.x] Fixed #23271 -- Fixed makemessages crash/test failure for some locales.

Backport of 57202a112a966593857725071ecd652a87c157fb from master

Note: See TracTickets for help on using tickets.
Back to Top