Opened 7 months ago

Closed 2 months ago

#23271 closed Bug (fixed)

Makemessages can corrupt existing .po files on Windows

Reported by: danielmenzel Owned by: nobody
Component: Internationalization Version: master
Severity: Release blocker Keywords: makemessages utf8 unicode
Cc: Triage Stage: Accepted
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

Description

Seen on German Windows7 SP1 with 64bit Python 3.4.1 and gettext 0.18.1.

When you have an existing .po file with translations, e.g.

msgid ""
msgstr ""
"Project-Id-Version: \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2014-03-03 10:44+0100\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: \n"
"Language-Team: \n"
"Language: de\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=UTF-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Plural-Forms: nplurals=2; plural=(n != 1);\n"

msgid "Size"
msgstr "Größe"

and then run

manage.py makemessages --no_location --no_wrap -l de

to update the .po file, you get a corrupted .po file:

msgid ""
msgstr ""
"Project-Id-Version: \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2014-03-03 10:44+0100\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: \n"
"Language-Team: \n"
"Language: de\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=UTF-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Plural-Forms: nplurals=2; plural=(n != 1);\n"

msgid "Size"
msgstr "Größe"

Investigation

Setting environment variables like LANG, LANGUAGE, LC_ALL, LC_MESSAGES, etc. has no effect on the outcome. Also calling chcp 65001 on the Windows console does not fix the problem.

However, it seems like the described behavior was introduced as a side-effect with this commit https://github.com/django/django/commit/dbb48d2bb99a5f660cf2d85f137b8d87fc12d99f:
All file accesses in makemessages.py were changed to explicitly use utf-8, but the stdout of the gettext binaries (like msgmerge etc.) are still interpreted with the Windows encoding cp1252, as popen_wrapper sets universal_newline=True which in turn uses the encoding returned by locale.getpreferredencoding().

As a test I patched popen_wrapper to interpret output of external processes with utf-8 encoding instead of locale.getpreferredencoding():

def popen_wrapper_utf8(args, os_err_exc_type=django.core.management.base.CommandError):
    """Monkey-patch for django.core.management.utils.popen_wrapper"""
    try:
        p = Popen(args, shell=False, stdout=PIPE, stderr=PIPE,
                  close_fds=os.name != 'nt', universal_newlines=True)
    except OSError as e:
        six.reraise(os_err_exc_type, os_err_exc_type('Error executing %s: %s' %
                                                     (args[0], e.strerror)), sys.exc_info()[2])
    output, errors = p.communicate()

    # Additional utf-8 decoding
    output = output.encode(locale.getpreferredencoding(False)).decode('utf-8')
    #

    return (
        output,
        force_text(errors, DEFAULT_LOCALE_ENCODING, strings_only=True),
        p.returncode
    )

This has the desired effect and prevents the corruption of the .po files.

I also tried changing the value returned by locale.getpreferredencoding() to "utf-8", but that seems impossible on Windows, as Python uses the win32 API GetACP(), which according to MSDN http://msdn.microsoft.com/en-us/library/windows/desktop/dd318070(v=vs.85).aspx only returns ANSI codepages and thus will never return "utf-8".

Change History (9)

comment:1 Changed 7 months ago by claudep

  • Needs documentation unset
  • Needs tests unset
  • Patch needs improvement unset
  • Severity changed from Normal to Release blocker
  • Triage Stage changed from Unreviewed to Accepted

I'm guilty for the regression, but I'm afraid I will not be able to debug this issue as I don't have access to Windows machines.

If noone else can fix it, I can revert the patch in the 1.7 branch to remove the release blocking for 1.7.

comment:2 Changed 7 months ago by andrewgodwin

I am also without a Windows dev environment. If we have to remove the patch from 1.7, and we think we can do it cleanly, then it's an option.

Last edited 7 months ago by andrewgodwin (previous) (diff)

comment:3 Changed 7 months ago by claudep

  • Version changed from 1.7-rc-2 to master

[cdfefbec721b59695e28] has been reverted on the 1.7.x branch. I've applied [67870137b9e1f1] instead to fix #22686, but I'd like to keep the more extensive fix on master. The release blocking flag is now only applicable to master.

comment:4 Changed 2 months ago by timgraham

Ramiro said he would have some time to look at this before 1.8 alpha.

comment:5 Changed 2 months ago by timgraham

  • Has patch set
  • Triage Stage changed from Accepted to Ready for checkin

comment:6 Changed 2 months ago by Ramiro Morales <ramiro@…>

  • Resolution set to fixed
  • Status changed from new to closed

In 6fb9dee470d57882e378247fd2706d5f9867b5f9:

Fixed #23271 -- Don't corrupt PO files on Windows when updating them.

Make sure PO catalog text fetched from gettext programs via standard
output isn't corrupted by mismatch between assumed (UTF-8) and real
(CP1252) encodings. This can cause mojibake to be written when creating
or updating PO files.

Also fixes #23311.

Thanks to contributor with Trac nick 'danielmenzel' for the report,
excellent research and fix.

comment:7 Changed 2 months ago by timgraham

  • Has patch unset
  • Resolution fixed deleted
  • Status changed from closed to new
  • Triage Stage changed from Ready for checkin to Accepted

Python 3 looks good, but now I have test failures on Python 2 and Windows. Can you reproduce?

comment:8 Changed 2 months ago by Ramiro Morales <cramm0@…>

In 002a8ffe478b7520a64c9176f9f640633f643b9c:

Fixed breakage by 6fb9dee4 under Python2+Windows.

Refs #23271

comment:9 Changed 2 months ago by ramiro

  • Resolution set to fixed
  • Status changed from new to closed
Note: See TracTickets for help on using tickets.
Back to Top