Opened 5 years ago

Closed 5 years ago

#26093 closed Bug (fixed)

makemessages messes up unicode characters on Python 3

Reported by: Sylvain Fankhauser
Owned by: nobody
Component: Internationalization
Version: 1.9
Severity: Normal
Keywords: python3
Cc:
Triage Stage: Ready for checkin
Has patch: yes
Needs documentation: no
Needs tests: no
Patch needs improvement: no
Easy pickings: no
UI/UX: no

Description

If you run the makemessages command on Python 3 (tested on Python 3.4.2, Django 1.9.1) and your strings contain Unicode characters, those characters are incorrectly escaped or even stripped from the generated PO file.

For example, in a template with {% trans "hello world" %} (the space here is the Unicode character U+202F, narrow no-break space), you end up with the msgid "hello\u202fworld", so the original string is no longer recognized as a translation key. With the non-breaking space character (U+00A0), the character disappears completely and the msgid becomes "helloworld".

The same command works fine on Python 2: the Unicode characters are preserved in the resulting PO file.

Change History (7)

comment:1 Changed 5 years ago by Claude Paroz

I was not able to reproduce. It might be nice to provide a test in the Django test suite to ensure the behavior is correct (or not!).

comment:2 Changed 5 years ago by Claude Paroz

Triage Stage: Unreviewed → Accepted

I may have spoken too soon. There is no problem with É, for example, but with a non-breaking space, makemessages outputs: ./templates/ext_edit.html.py:25: invalid multibyte sequence.

comment:4 Changed 5 years ago by Claude Paroz

This issue is related to the way xgettext interprets escape sequences in Python source files.
u'sequence: \xa0' (note the prefix) is interpreted as ending with a Unicode non-breaking space (correct).
'sequence: \xa0' (without the prefix) is interpreted as ending with a raw \xa0 byte (which is not valid UTF-8).
There are not many characters that %r outputs as an escape, but the non-breaking space is still an important use case.

So xgettext is still interpreting strings in the Python 2 way, as it cannot differentiate between Python versions by simply reading the source file.
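The two readings described above can be illustrated in plain Python 3. This is only a sketch of the behaviour, not of xgettext itself: the byte string stands in for the Python 2 style reading of an unprefixed literal, and is_valid_utf8 is a helper introduced here for illustration.

```python
# Text ending in U+00A0 (no-break space), as in the example above.
text = "sequence: \u00a0"

# On Python 3, %r keeps printable characters such as "É" as they are,
# but writes the no-break space as the escape sequence \xa0.
literal = "%r" % text
print(literal)

# Unicode reading (Python 3, or a u'' prefix): \xa0 is the code point
# U+00A0, which encodes to two bytes in UTF-8.
as_unicode = text.encode("utf-8")

# Python 2 style reading of an unprefixed literal: \xa0 is one raw byte.
as_py2_bytes = b"sequence: \xa0"


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8 (illustrative helper)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(is_valid_utf8(as_unicode))    # the Unicode reading is fine
print(is_valid_utf8(as_py2_bytes))  # the lone 0xA0 byte is rejected
```

A lone 0xA0 byte is a UTF-8 continuation byte with no lead byte, which is consistent with the "invalid multibyte sequence" message reported earlier in this ticket.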

A possible workaround would be to force outputting the u'' prefix on Python 3 when we templatize templates during the extraction process.
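A minimal sketch of that workaround follows. to_gettext_call is a hypothetical helper, not the code from the actual patch, and it assumes the templatized output wraps each message in a gettext() call rendered with %r.

```python
import ast


def to_gettext_call(message: str) -> str:
    """Hypothetical helper: render a message as a gettext() call.

    The explicit u prefix is meant to make xgettext read escape
    sequences in the literal as Unicode code points, not raw bytes.
    """
    return " gettext(u%r) " % message


call = to_gettext_call("hello\u202fworld")
print(call)

# The u'' prefix is still valid Python 3 syntax, so the generated
# literal evaluates back to the original string.
literal = call.strip()[len("gettext("):-1]
round_tripped = ast.literal_eval(literal)
print(round_tripped == "hello\u202fworld")
```

Because Python 3 accepts the u'' prefix as a no-op, forcing it changes only how an external tool reads the generated file, not what the string means to Python.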

comment:5 Changed 5 years ago by Claude Paroz

Has patch: set

comment:6 Changed 5 years ago by Tim Graham

Triage Stage: Accepted → Ready for checkin

comment:7 Changed 5 years ago by Claude Paroz <claude@…>

Resolution: fixed
Status: new → closed

In 104eddb:

Fixed #26093 -- Allowed escape sequences extraction by gettext on Python 3

Thanks Sylvain Fankhauser for the report and Tim Graham for the review.
