Opened 12 months ago

Closed 3 months ago

Last modified 3 months ago

#23271 closed Bug (fixed)

Makemessages can corrupt existing .po files on Windows

Reported by: danielmenzel Owned by: nobody
Component: Internationalization Version: master
Severity: Release blocker Keywords: makemessages utf8 unicode
Cc: Triage Stage: Accepted
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

Description

Seen on German Windows7 SP1 with 64bit Python 3.4.1 and gettext 0.18.1.

When you have an existing .po file with translations, e.g.

msgid ""
msgstr ""
"Project-Id-Version: \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2014-03-03 10:44+0100\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: \n"
"Language-Team: \n"
"Language: de\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=UTF-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Plural-Forms: nplurals=2; plural=(n != 1);\n"

msgid "Size"
msgstr "Größe"

and then run

manage.py makemessages --no_location --no_wrap -l de

to update the .po file, you get a corrupted .po file:

msgid ""
msgstr ""
"Project-Id-Version: \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2014-03-03 10:44+0100\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: \n"
"Language-Team: \n"
"Language: de\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=UTF-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Plural-Forms: nplurals=2; plural=(n != 1);\n"

msgid "Size"
msgstr "Größe"

Investigation

Setting environment variables like LANG, LANGUAGE, LC_ALL, LC_MESSAGES, etc. has no effect on the outcome. Also calling chcp 65001 on the Windows console does not fix the problem.

However, it seems like the described behavior was introduced as a side-effect with this commit https://github.com/django/django/commit/dbb48d2bb99a5f660cf2d85f137b8d87fc12d99f:
All file accesses in makemessages.py were changed to explicitly use utf-8, but the stdout of the gettext binaries (like msgmerge etc.) are still interpreted with the Windows encoding cp1252, as popen_wrapper sets universal_newline=True which in turn uses the encoding returned by locale.getpreferredencoding().

As a test I patched popen_wrapper to interpret output of external processes with utf-8 encoding instead of locale.getpreferredencoding():

def popen_wrapper_utf8(args, os_err_exc_type=django.core.management.base.CommandError):
    """Monkey-patch for django.core.management.utils.popen_wrapper"""
    try:
        p = Popen(args, shell=False, stdout=PIPE, stderr=PIPE,
                  close_fds=os.name != 'nt', universal_newlines=True)
    except OSError as e:
        six.reraise(os_err_exc_type, os_err_exc_type('Error executing %s: %s' %
                                                     (args[0], e.strerror)), sys.exc_info()[2])
    output, errors = p.communicate()

    # Additional utf-8 decoding
    output = output.encode(locale.getpreferredencoding(False)).decode('utf-8')
    #

    return (
        output,
        force_text(errors, DEFAULT_LOCALE_ENCODING, strings_only=True),
        p.returncode
    )

This has the desired effect and prevents the corruption of the .po files.

I also tried changing the value returned by locale.getpreferredencoding() to "utf-8", but that seems impossible on Windows, as Python uses the win32 API GetACP(), which according to MSDN http://msdn.microsoft.com/en-us/library/windows/desktop/dd318070(v=vs.85).aspx only returns ANSI codepages and thus will never return "utf-8".

Change History (13)

comment:1 Changed 12 months ago by claudep

  • Needs documentation unset
  • Needs tests unset
  • Patch needs improvement unset
  • Severity changed from Normal to Release blocker
  • Triage Stage changed from Unreviewed to Accepted

I'm guilty for the regression, but I'm afraid I will not be able to debug this issue as I don't have access to Windows machines.

If noone else can fix it, I can revert the patch in the 1.7 branch to remove the release blocking for 1.7.

comment:2 Changed 12 months ago by andrewgodwin

I am also without a Windows dev environment. If we have to remove the patch from 1.7, and we think we can do it cleanly, then it's an option.

Last edited 12 months ago by andrewgodwin (previous) (diff)

comment:3 Changed 12 months ago by claudep

  • Version changed from 1.7-rc-2 to master

[cdfefbec721b59695e28] has been reverted on the 1.7.x branch. I've applied [67870137b9e1f1] instead to fix #22686, but I'd like to keep the more extensive fix on master. The release blocking flag is now only applicable to master.

comment:4 Changed 7 months ago by timgraham

Ramiro said he would have some time to look at this before 1.8 alpha.

comment:5 Changed 7 months ago by timgraham

  • Has patch set
  • Triage Stage changed from Accepted to Ready for checkin

comment:6 Changed 7 months ago by Ramiro Morales <ramiro@…>

  • Resolution set to fixed
  • Status changed from new to closed

In 6fb9dee470d57882e378247fd2706d5f9867b5f9:

Fixed #23271 -- Don't corrupt PO files on Windows when updating them.

Make sure PO catalog text fetched from gettext programs via standard
output isn't corrupted by mismatch between assumed (UTF-8) and real
(CP1252) encodings. This can cause mojibake to be written when creating
or updating PO files.

Also fixes #23311.

Thanks to contributor with Trac nick 'danielmenzel' for the report,
excellent research and fix.

comment:7 Changed 7 months ago by timgraham

  • Has patch unset
  • Resolution fixed deleted
  • Status changed from closed to new
  • Triage Stage changed from Ready for checkin to Accepted

Python 3 looks good, but now I have test failures on Python 2 and Windows. Can you reproduce?

comment:8 Changed 7 months ago by Ramiro Morales <cramm0@…>

In 002a8ffe478b7520a64c9176f9f640633f643b9c:

Fixed breakage by 6fb9dee4 under Python2+Windows.

Refs #23271

comment:9 Changed 7 months ago by ramiro

  • Resolution set to fixed
  • Status changed from new to closed

comment:10 Changed 3 months ago by daphshez

  • Resolution fixed deleted
  • Status changed from closed to new

Regression fails on my machine (Windows 7 64, Django development version head, with

locale.getpreferredencoding()=='cp1255'
======================================================================
ERROR: test_po_file_encoding_when_updating (i18n.test_extraction.BasicExtractorTests)
Update of PO file doesn't corrupt it with non-UTF-8 encoding on Python3+Windows (#23271)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\data\dev\django\django\tests\i18n\test_extraction.py", line 407, in test_po_file_encoding_when_updating
    management.call_command('makemessages', locale=['pt_BR'], verbosity=0)
  File "C:\data\dev\django\django\django\core\management\__init__.py", line 118, in call_command
    return command.execute(*args, **defaults)
  File "C:\data\dev\django\django\django\core\management\base.py", line 398, in execute
    output = self.handle(*args, **options)
  File "C:\data\dev\django\django\django\core\management\commands\makemessages.py", line 322, in handle
    self.write_po_file(potfile, locale)
  File "C:\data\dev\django\django\django\core\management\commands\makemessages.py", line 455, in write_po_file
    msgs, errors, status = gettext_popen_wrapper(args)
  File "C:\data\dev\django\django\django\core\management\commands\makemessages.py", line 39, in
gettext_popen_wrapper
    stdout, stderr, status_code = popen_wrapper(args, os_err_exc_type=os_err_exc_type)
  File "C:\data\dev\django\django\django\core\management\utils.py", line 27, in popen_wrapper
    output, errors = p.communicate()
  File "C:\Python34\lib\subprocess.py", line 959, in communicate
    stdout, stderr = self._communicate(input, endtime, timeout)
  File "C:\Python34\lib\subprocess.py", line 1234, in _communicate
    stdout = stdout[0]
IndexError: list index out of range

Also, this seems also to be the source of #21928 which isn't resolved for other people.

The problem is that Popen still tries to encode msgmerge output stream using DEFAULT_LOCALE_ENCODING, when the data is actually in UTF-8.

A way around it is to call Popen with universal_newlines=true, and then take care of line ending inside gettext_popen_wrapper. I'll post a patch soon.

(BTW the current solution is for Windows only. Linux uses have to set undocumented environment variables. I think we should strive to solve this for Linux users as well).

comment:11 Changed 3 months ago by daphshez

Well, I patched the code and sent a pull request at

https://github.com/django/django/pull/4532

This is my first attempt to contribute code and I seem to have messed up with the process. I am sorry about it. I hope my fix will get considered anyway...

comment:12 Changed 3 months ago by Tim Graham <timograham@…>

  • Resolution set to fixed
  • Status changed from new to closed

In 57202a11:

Fixed #23271 -- Fixed makemessages crash/test failure for some locales.

comment:13 Changed 3 months ago by Tim Graham <timograham@…>

In c45fd57f:

[1.8.x] Fixed #23271 -- Fixed makemessages crash/test failure for some locales.

Backport of 57202a112a966593857725071ecd652a87c157fb from master

Note: See TracTickets for help on using tickets.
Back to Top