Opened 8 years ago

Closed 8 years ago

#3923 closed (fixed)

[unicode] make-messages.py does not work well with unicode strings

Reported by: mtredinnick
Owned by: mtredinnick
Component: Internationalization
Version: master
Severity:
Keywords:
Cc:
Triage Stage: Accepted
Has patch: yes
Needs documentation: no
Needs tests: yes
Patch needs improvement: no
Easy pickings:
UI/UX:

Description

In some cases we would like to write

gettext_lazy(u"Baden-Württemberg")

using a unicode string that contains non-ASCII characters. Running make-messages.py over this fails, because xgettext cannot work out the encoding of the string and aborts.

Not sure what the right solution is yet (pygettext.py is deprecated in favour of xgettext, in theory).

Attachments (2)

omit-headers-workaround.diff (1.2 KB) - added by Evren Esat Özkan <sleytr@…> 8 years ago.
Patch for make-messages.py to work around the --omit-header bug in xgettext.
omit-headers-workaround.2.diff (1.4 KB) - added by Evren Esat Özkan <sleytr@…> 8 years ago.
Copied the same two lines of code that are used for .py files.


Change History (17)

comment:1 Changed 8 years ago by mtredinnick

  • Needs documentation unset
  • Needs tests unset
  • Patch needs improvement unset
  • Triage Stage changed from Unreviewed to Accepted

comment:2 Changed 8 years ago by mtredinnick

  • Resolution set to wontfix
  • Status changed from new to closed

I've done enough research in email archives and enough source code reading to convince myself this isn't solvable. The only acceptable solution is to ensure that msgid strings (the arguments to gettext() and friends) are ASCII. No non-ASCII characters at all. Part of the reason is that gettext does binary byte-wise comparisons when looking up messages in the catalog, so if the catalog value is, say, UTF-8 and the source file is something else, you're doomed.
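To make the byte-wise lookup problem concrete, here is a minimal sketch in Python 3. The `catalog` dictionary is a hypothetical stand-in for a compiled message catalog, not gettext's real data structure; the only assumption is that keys are compared as raw bytes. The translated value is a placeholder.

```python
# Hypothetical stand-in for a compiled catalog whose msgids are stored as
# UTF-8 bytes. This is NOT gettext's real data structure.
msgid = "Baden-Württemberg"
catalog = {msgid.encode("utf-8"): "placeholder translation"}

# The same text as it would be read from a UTF-8 and a Latin-1 source file.
utf8_key = msgid.encode("utf-8")
latin1_key = msgid.encode("latin-1")

found_utf8 = catalog.get(utf8_key)      # identical bytes, lookup succeeds
found_latin1 = catalog.get(latin1_key)  # different bytes, lookup misses

print(utf8_key)
print(latin1_key)
print(found_utf8, found_latin1)
```

The two keys represent the same text but differ as byte strings, so a lookup that compares bytes cannot match them. A pure-ASCII msgid avoids this because ASCII bytes are identical in both encodings.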

This means that if somebody wants their original text to be in a language other than North American English, they need to use the translation infrastructure (it cannot be done with USE_I18N=False). A shame, but cannot be helped.

comment:3 follow-up: Changed 8 years ago by spacetaxi@…

What about this possible solution?

Step 1:
make-messages.py converts all non-ASCII characters in the msgid strings to HTML-entities, e.g. 'ä' gets replaced by '&auml;'.
So the msgids will be clean ASCII strings containing HTML-entities instead of 8bit- or utf8-characters.

Step 2:
In Django, when a string is going to be translated, it first has to be converted in the same manner before gettext is called to retrieve the translated string.

This would allow the msgid strings to contain non-ASCII characters, as make-messages.py and Django would take care of converting them to their clean ASCII representation as HTML-entities. Wouldn't this work?
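The two steps could be sketched roughly as follows in Python 3. The function names `to_ascii_msgid` and `translate` are hypothetical and exist only for this illustration; nothing like them is part of Django or make-messages.py. Falling back to numeric references for characters without a named HTML 4 entity is also an assumption of the sketch.

```python
from html.entities import codepoint2name


def to_ascii_msgid(text):
    """Step 1 (hypothetical): replace each non-ASCII character with an entity."""
    parts = []
    for ch in text:
        code = ord(ch)
        if code < 128:
            parts.append(ch)
        elif code in codepoint2name:
            # Named HTML 4 entity, e.g. 'ä' becomes '&auml;'.
            parts.append("&%s;" % codepoint2name[code])
        else:
            # No named entity available: fall back to a numeric reference.
            parts.append("&#%d;" % code)
    return "".join(parts)


def translate(text, lookup):
    """Step 2 (hypothetical): apply the same conversion before the lookup."""
    return lookup(to_ascii_msgid(text))


print(to_ascii_msgid("Baden-Württemberg"))
```

Both the extraction script and the runtime wrapper would have to apply exactly the same conversion, otherwise the msgids would no longer match.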

comment:4 in reply to: ↑ 3 ; follow-up: Changed 8 years ago by spacetaxi@…

Replying to spacetaxi@gmail.com:

This would allow the msgid strings to contain non-ASCII characters [...]

Sorry, the last sentence should read: "This would allow the strings, that are going to be translated, to contain non-ASCII characters [...]"

comment:5 in reply to: ↑ 4 ; follow-up: Changed 8 years ago by Michael Axiak <axiak@…>

Replying to spacetaxi@gmail.com:

Replying to spacetaxi@gmail.com:

This would allow the msgid strings to contain non-ASCII characters [...]

Sorry, the last sentence should read: "This would allow the strings, that are going to be translated, to contain non-ASCII characters [...]"

From what I understand, regardless of what make_messages.py generates, the gettext library still needs to work at runtime. That is, it expects ASCII characters to do the byte comparisons. make_messages.py is only used to generate the appropriate files to help gettext do its work.

comment:6 in reply to: ↑ 5 ; follow-up: Changed 8 years ago by spacetaxi@…

Replying to Michael Axiak <axiak@mit.edu>:

From what I understand, regardless of what make_messages.py generates, the gettext library still needs to work at runtime. That is, it expects ASCII characters to do the byte comparisons. make_messages.py is only used to generate the appropriate files to help gettext do its work.

That's what I understand, too. That's the reason for "Step 2", which I've explained above. Django would have to convert these strings to plain ASCII on the fly (by replacing non-ASCII characters with their representation as HTML entities). So Django would do the same character conversion as make_messages.py.

comment:7 in reply to: ↑ 6 Changed 8 years ago by anonymous

Replying to spacetaxi@gmail.com:

That's what I understand, too. That's the reason for "Step 2", which I've explained above. Django would have to convert these strings to plain ASCII on the fly (by replacing non-ASCII characters with their representation as HTML entities). So Django would do the same character conversion as make_messages.py.

So you are suggesting changing gettext/ugettext? I could see that working...patch?

-Mike

comment:8 Changed 8 years ago by Michael Axiak <axiak@…>

  • Resolution wontfix deleted
  • Status changed from closed to reopened

I'm interested in seeing what happens if we use smart_quote in the gettext wrappers.

comment:9 Changed 8 years ago by Michael Axiak <axiak@…>

  • Summary changed from make-messages.py does not work well with unicode strings to [unicode] make-messages.py does not work well with unicode strings

Changing title to unicode.

While it's flagged as [unicode], I certainly don't think this should slow down the unicode branch. I'll write a patch against the unicode branch -- if it gets into trunk after unicode is merged, that should be fine.

comment:10 Changed 8 years ago by mtredinnick

  • Resolution set to wontfix
  • Status changed from reopened to closed

Please open another ticket if you're going to explore replacing all of the third-party gettext support tools. That's a much bigger and different issue than using existing well-debugged libraries that we don't have to maintain.

This ticket was, and remains, about the fact that all of the supporting infrastructure for extracting message catalogs, compiling them into MO files and accessing them at runtime relies on msgids being in ASCII. At the moment, there isn't a viable alternative to switching away from those tools, hence the conclusion of the ticket. It's really "notabug", rather than "wontfix", but that's close enough.

If you want to totally change the way the i18n support works, then open another ticket please, because you are talking about replacing all of the gettext pieces that are supplied by existing tools (including message extraction), writing a lot of code and doing lots of timing tests to show that it isn't slower than the existing code.

By the way, step 1 in comment 3 is not a good idea. It breaks whatever translation memories people might be using with their tools to speed up the translation process. You are also asking people who are not software developers or web coders (i.e. translators) to understand all HTML entities (and since they aren't very comprehensive, also be able to look up random things like &#123D; to know what it means).

comment:11 follow-up: Changed 8 years ago by spacetaxi@…

I've found a strange make-messages.py/xgettext behaviour: make-messages.py/xgettext is able to extract utf8-strings from *one* file! But as soon as there is *more than one* file to process, make-messages.py/xgettext fails.

This problem has something to do with the way xgettext is called: when a pot file already exists (e.g. if xgettext is called on the *second* file to be processed), xgettext is called with the "--omit-header" option... and will fail on utf8 encoded files. Without "--omit-header" the message extraction works all right.

Maybe make-messages.py could be changed to take this into account, so one could use utf8 encoded messages for translation?

Translation of utf8 encoded strings seems to work without problems...
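The calling pattern described above can be sketched as follows in Python 3. The helper `build_xgettext_argv` is hypothetical. The options shown (`-d`, `--language`, `-o`, `--omit-header`) are real xgettext options, but the exact set that make-messages.py passes is an assumption here, and the sketch only builds the command line without running it.

```python
def build_xgettext_argv(source_path, pot_exists, domain="django"):
    """Build an xgettext command line roughly as described in this comment.

    Hypothetical helper: the real script may pass different options.
    """
    argv = ["xgettext", "-d", domain, "--language=Python", "-o", "-"]
    if pot_exists:
        # Every file after the first one is extracted without a header.
        argv.append("--omit-header")
    argv.append(source_path)
    return argv


# First file: no pot file yet, so the header is written.
first = build_xgettext_argv("views.py", pot_exists=False)
# Second file: the pot file exists, so --omit-header is added.
second = build_xgettext_argv("models.py", pot_exists=True)

print(" ".join(first))
print(" ".join(second))
```

Only the second command contains "--omit-header", which matches the observation that extraction works for one file and fails once a second file is processed.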

Changed 8 years ago by Evren Esat Özkan <sleytr@…>

Patch for make-messages.py to work around the --omit-header bug in xgettext.

comment:12 in reply to: ↑ 11 ; follow-up: Changed 8 years ago by Evren Esat Özkan <sleytr@…>

  • Has patch set
  • Needs tests set
  • Resolution wontfix deleted
  • Status changed from closed to reopened

I've attached a patch for make-messages.py to work around this xgettext problem. The patch simply removes the first 17 lines from the output of xgettext to clip the headers.

Replying to spacetaxi@gmail.com:

I've found a strange make-messages.py/xgettext behaviour: make-messages.py/xgettext is able to extract utf8-strings from *one* file! But as soon as there is *more than one* file to process, make-messages.py/xgettext fails.

This problem has something to do with the way xgettext is called: when a pot file already exists (e.g. if xgettext is called on the *second* file to be processed), xgettext is called with the "--omit-header" option... and will fail on utf8 encoded files. Without "--omit-header" the message extraction works all right.

Maybe make-messages.py could be changed to take this into account, so one could use utf8 encoded messages for translation?

Translation of utf8 encoded strings seems to work without problems...

comment:13 in reply to: ↑ 12 ; follow-up: Changed 8 years ago by ramiro

Replying to Evren Esat Özkan <sleytr@gmail.com>:

I've attached a patch for make-messages.py to work around this xgettext problem. The patch simply removes the first 17 lines from the output of xgettext to clip the headers.

I don't know if the issue reported by spacetaxi@… in comment 11 (IMHO it should have been reported in a separate ticket) is still valid, because a fix for it has been applied in [5722] for strings extracted from Python source files and templates.

Your patch means you are seeing the same behavior for extraction from JavaScript source files. I don't know if that's a real make-messages.py bug, because I don't know if string literals in JavaScript can be UTF-8 encoded (if they can't, this ticket should be closed), but in case it is, I'd suggest taking an approach similar to the one implemented in that revision. Using fixed line stripping is doomed to fail at some point, as shown by #4899.
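As an illustration of why a fixed line count is fragile, here is a sketch in Python 3 of stripping the header by structure instead. It assumes the header entry is the leading run of non-empty lines, terminated by the first blank line. The helper name `strip_pot_header` is hypothetical, and the sketch is not claimed to be the exact code in [5722].

```python
from itertools import dropwhile


def strip_pot_header(pot_text):
    """Drop the leading header entry, i.e. everything before the first blank line.

    Hypothetical helper; assumes the header is one block of non-empty lines.
    """
    lines = pot_text.split("\n")
    # dropwhile(len, ...) skips lines while they are non-empty and stops at
    # the first empty line, however many header lines there are.
    return "\n".join(dropwhile(len, lines))


sample = (
    '# SOME DESCRIPTIVE TITLE.\n'
    'msgid ""\n'
    'msgstr ""\n'
    '"Content-Type: text/plain; charset=UTF-8\\n"\n'
    '\n'
    '#: app.py:1\n'
    'msgid "Hello"\n'
    'msgstr ""\n'
)

stripped = strip_pot_header(sample)
print(stripped)
```

Because the cut-off is found from the content, the result stays correct if the tool emits more or fewer header lines than expected.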

Changed 8 years ago by Evren Esat Özkan <sleytr@…>

Copied the same two lines of code that are used for .py files.

comment:14 in reply to: ↑ 13 Changed 8 years ago by Evren Esat Özkan <sleytr@…>

I tried non-ASCII msgids in js files on Opera 9.5, Firefox 2 and IE6 and didn't see any problems; everything works properly.
I've uploaded a second patch following your suggestion.

Thanks,

Replying to ramiro:

I don't know if the issue reported by spacetaxi@… in comment 11 (IMHO it should have been reported in a separate ticket) is still valid, because a fix for it has been applied in [5722] for strings extracted from Python source files and templates.

Your patch means you are seeing the same behavior for extraction from JavaScript source files. I don't know if that's a real make-messages.py bug, because I don't know if string literals in JavaScript can be UTF-8 encoded (if they can't, this ticket should be closed), but in case it is, I'd suggest taking an approach similar to the one implemented in that revision. Using fixed line stripping is doomed to fail at some point, as shown by #4899.

comment:15 Changed 8 years ago by mtredinnick

  • Resolution set to fixed
  • Status changed from reopened to closed

Please open a new ticket to describe whatever problem these patches are trying to address.

The initial issue reported in this ticket was fixed a long time ago and let's not bring it back to life by reopening just for a related thing. New ticket for a new bug, please.
