Opened 18 years ago
Closed 17 years ago
#3923 closed (fixed)
[unicode] make-messages.py does not work well with unicode strings
Reported by: | Malcolm Tredinnick | Owned by: | Malcolm Tredinnick |
---|---|---|---|
Component: | Internationalization | Version: | dev |
Severity: | Keywords: | ||
Cc: | Triage Stage: | Accepted | |
Has patch: | yes | Needs documentation: | no |
Needs tests: | yes | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
In some cases we would like to write
gettext_lazy(u"Baden-Württemberg")
using a unicode string that contains non-ASCII characters. Running make-messages.py over this causes an error because xgettext cannot work out the encoding of the string and errors out.
Not sure what the right solution is yet (pygettext.py is deprecated in favour of xgettext, in theory).
Attachments (2)
Change History (17)
comment:1 by , 18 years ago
Triage Stage: | Unreviewed → Accepted |
---|
comment:2 by , 17 years ago
Resolution: | → wontfix |
---|---|
Status: | new → closed |
follow-up: 4 comment:3 by , 17 years ago
What about this possible solution?
Step 1:
make-messages.py converts all non-ASCII characters in the msgid strings to HTML-entities, e.g. 'ä' gets replaced by '&auml;'.
So the msgids will be clean ASCII strings containing HTML-entities instead of 8bit- or utf8-characters.
Step 2:
In Django, when a string is going to be translated, it first has to be converted in the same manner before gettext is called to retrieve the translated string.
This would allow the msgid strings to contain non-ASCII characters, as make-messages.py and Django would take care of converting them to their clean ASCII representation as HTML-entities. Wouldn't this work?
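A minimal sketch of the conversion proposed in the two steps above, assuming Python 3 and the standard `html.entities` table. The function name `to_ascii_msgid` is hypothetical and is not part of make-messages.py or Django; it only illustrates what "replace non-ASCII characters with HTML entities" could look like.

```python
from html.entities import codepoint2name


def to_ascii_msgid(text: str) -> str:
    """Return text with every non-ASCII character replaced by an HTML entity.

    Hypothetical helper for illustration only. Named entities are used when
    the HTML 4 table has one; otherwise a numeric character reference is used.
    """
    parts = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            # Plain ASCII passes through unchanged.
            parts.append(ch)
        elif cp in codepoint2name:
            parts.append("&%s;" % codepoint2name[cp])
        else:
            parts.append("&#%d;" % cp)
    return "".join(parts)


print(to_ascii_msgid("Baden-Württemberg"))
```

Under this proposal the same function would have to run both at extraction time and at lookup time, so that both sides agree on the ASCII form of the msgid.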
follow-up: 5 comment:4 by , 17 years ago
Replying to spacetaxi@gmail.com:
This would allow the msgid strings to contain non-ASCII characters [...]
Sorry, the last sentence should read: "This would allow the strings, that are going to be translated, to contain non-ASCII characters [...]"
follow-up: 6 comment:5 by , 17 years ago
Replying to spacetaxi@gmail.com:
Replying to spacetaxi@gmail.com:
This would allow the msgid strings to contain non-ASCII characters [...]
Sorry, the last sentence should read: "This would allow the strings, that are going to be translated, to contain non-ASCII characters [...]"
From what I understand, regardless of what make_messages.py generates, the gettext library still needs to work at runtime. That is, it expects ASCII characters to do the byte comparisons. make_messages.py is only used to generate the appropriate files to help gettext do its work.
follow-up: 7 comment:6 by , 17 years ago
Replying to Michael Axiak <axiak@mit.edu>:
From what I understand, regardless of what make_messages.py generates, the gettext library still needs to work at runtime. That is, it expects ASCII characters to do the byte comparisons. make_messages.py is only used to generate the appropriate files to help gettext do its work.
That's what I understand, too. That's the reason for "Step 2", which I've explained above. Django would have to convert these strings to plain ASCII on the fly (by replacing non-ASCII characters with their representation as HTML entities). So Django would do the same character conversion as make_messages.py.
comment:7 by , 17 years ago
Replying to spacetaxi@gmail.com:
That's what I understand, too. That's the reason for "Step 2", which I've explained above. Django would have to convert these strings to plain ASCII on the fly (by replacing non-ASCII characters with their representation as HTML entities). So Django would do the same character conversion as make_messages.py.
So you are suggesting changing gettext/ugettext? I could see that working...patch?
-Mike
comment:8 by , 17 years ago
Resolution: | wontfix |
---|---|
Status: | closed → reopened |
I'm interested in seeing what happens if we use smart_quote in the gettext wrappers.
comment:9 by , 17 years ago
Summary: | make-messages.py does not work well with unicode strings → [unicode] make-messages.py does not work well with unicode strings |
---|
Changing title to unicode.
While it's flagged as [unicode], I certainly don't think this should slow down the unicode branch. I'll write a patch against the unicode branch -- if it gets into trunk after unicode is merged, that should be fine.
comment:10 by , 17 years ago
Resolution: | → wontfix |
---|---|
Status: | reopened → closed |
Please open another ticket if you're going to explore replacing all of the third-party gettext support tools. That's a much bigger and different issue than using existing well-debugged libraries that we don't have to maintain.
This ticket was, and remains, about the fact that all of the supporting infrastructure for extracting message catalogs, compiling them into MO files and accessing them at runtime relies on msgids being in ASCII. At the moment, there isn't a viable alternative to those tools, hence the conclusion of the ticket. It's really "notabug", rather than "wontfix", but that's close enough.
If you want to totally change the way the i18n support works, then open another ticket please, because you are talking about replacing all of the gettext pieces that are supplied by existing tools (including message extraction), writing a lot of code and doing lots of timing tests to show that it isn't slower than the existing code.
By the way, step 1 in comment 3 is not a good idea. It breaks whatever translation memories people might be using with their tools to speed up the translation process. You are also asking people who are not software developers or web coders (i.e. translators) to understand all HTML entities (and since they aren't very comprehensive, also be able to look up random things like {D; to know what it means).
follow-up: 12 comment:11 by , 17 years ago
I've found some strange make-messages.py/xgettext behaviour: make-messages.py/xgettext is able to extract utf8 strings from *one* file, but as soon as there is *more than one* file to process, make-messages.py/xgettext fails.
This problem has something to do with the way xgettext is called: when a pot file already exists (e.g. when xgettext is called on the *second* file to be processed), xgettext is called with the "--omit-header" option... and will fail on utf8 encoded files. Without "--omit-header" the message extraction works fine.
Maybe make-messages.py could be changed to take this into account, so one could use utf8 encoded messages for translation?
Translation of utf8 encoded strings seems to work without problems...
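To make the two invocations described in this comment concrete, here is a rough sketch of how the command line might differ between the first file and later files. The helper `build_xgettext_cmd` and the particular argument choices are assumptions for illustration and are not copied from make-messages.py; the flags themselves (`-d`, `-L`, `--keyword`, `--from-code`, `--omit-header`, `-o`) are standard xgettext options.

```python
def build_xgettext_cmd(domain, source_file, pot_exists):
    """Build an xgettext argument list (hypothetical helper, not Django code)."""
    cmd = [
        "xgettext",
        "-d", domain,
        "-L", "Python",
        "--keyword=gettext_lazy",
        # Declares the input encoding; xgettext assumes ASCII otherwise.
        "--from-code=UTF-8",
        "-o", "-",  # write the extracted messages to stdout
    ]
    if pot_exists:
        # The option the comment identifies as triggering the failure
        # when the extracted strings are not pure ASCII.
        cmd.append("--omit-header")
    cmd.append(source_file)
    return cmd


print(build_xgettext_cmd("django", "views.py", pot_exists=False))
print(build_xgettext_cmd("django", "models.py", pot_exists=True))
```

The only difference between the two printed commands is the `--omit-header` flag, which matches the observation that the first file is processed fine and later ones are not.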
by , 17 years ago
Attachment: | omit-headers-workaround.diff added |
---|
patch for make-messages.py to work around the omit-header bug in xgettext
follow-up: 13 comment:12 by , 17 years ago
Has patch: | set |
---|---|
Needs tests: | set |
Resolution: | wontfix |
Status: | closed → reopened |
I've attached a patch for make-messages.py to work around this xgettext problem. The patch simply removes the first 17 lines from the output of xgettext to clip the headers.
Replying to spacetaxi@gmail.com:
I've found some strange make-messages.py/xgettext behaviour: make-messages.py/xgettext is able to extract utf8 strings from *one* file, but as soon as there is *more than one* file to process, make-messages.py/xgettext fails.
This problem has something to do with the way xgettext is called: when a pot file already exists (e.g. when xgettext is called on the *second* file to be processed), xgettext is called with the "--omit-header" option... and will fail on utf8 encoded files. Without "--omit-header" the message extraction works fine.
Maybe make-messages.py could be changed to take this into account, so one could use utf8 encoded messages for translation?
Translation of utf8 encoded strings seems to work without problems...
follow-up: 14 comment:13 by , 17 years ago
Replying to Evren Esat Özkan <sleytr@gmail.com>:
I've attached a patch for make-messages.py to work around this xgettext problem. The patch simply removes the first 17 lines from the output of xgettext to clip the headers.
I don't know if the issue reported by spacetaxi@… in comment 11 (IMHO it should have been reported in a different ticket) is still valid, because a fix that solves this has been applied for strings extracted from Python source code files and templates in [5722].
Your patch means you are seeing the same behavior for extraction from JavaScript source files. I don't know if that's a real make-messages.py bug because I don't know if string literals in JavaScript can be UTF-8 encoded (otherwise this ticket should be closed), but in case it is I'd suggest taking an approach similar to the one implemented in that revision. Using fixed line stripping is doomed to fail at some point, as shown by #4899.
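As an illustration of why structural stripping is more robust than removing a fixed number of lines, here is a small sketch that drops the header entry by cutting at the first blank line of the xgettext output. It assumes the header entry is the first block and is terminated by an empty line, as in ordinary PO files. The function name `strip_po_header` is hypothetical, and this is not claimed to be the exact code from the revision mentioned above.

```python
from itertools import dropwhile


def strip_po_header(po_text: str) -> str:
    """Remove the leading header entry from PO-formatted text.

    Skips lines until the first empty one, so it does not depend on how
    many lines the header happens to contain.
    """
    lines = po_text.split("\n")
    # len(line) is truthy for non-empty lines, so dropwhile discards the
    # header block and stops at the blank line that separates entries.
    return "\n".join(dropwhile(len, lines))


sample = (
    '# SOME DESCRIPTIVE TITLE.\n'
    'msgid ""\n'
    'msgstr ""\n'
    '"Content-Type: text/plain; charset=UTF-8\\n"\n'
    '\n'
    '#: views.py:1\n'
    'msgid "Hello"\n'
    'msgstr ""\n'
)
print(strip_po_header(sample))
```

The result keeps the separating blank line, which is convenient when the remaining entries are appended to an existing pot file.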
by , 17 years ago
Attachment: | omit-headers-workaround.2.diff added |
---|
Copied the same two lines of code that are used for .py files.
comment:14 by , 17 years ago
I tried non-ASCII msgids in JS files on Opera 9.5, Firefox 2 and IE6 and didn't see any problems; everything works properly.
I've uploaded a second patch according to your suggestion.
Thanks,
Replying to ramiro:
I don't know if the issue reported by spacetaxi@… in comment 11 (IMHO it should have been reported in a different ticket) is still valid, because a fix that solves this has been applied for strings extracted from Python source code files and templates in [5722].
Your patch means you are seeing the same behavior for extraction from JavaScript source files. I don't know if that's a real make-messages.py bug because I don't know if string literals in JavaScript can be UTF-8 encoded (otherwise this ticket should be closed), but in case it is I'd suggest taking an approach similar to the one implemented in that revision. Using fixed line stripping is doomed to fail at some point, as shown by #4899.
comment:15 by , 17 years ago
Resolution: | → fixed |
---|---|
Status: | reopened → closed |
Please open a new ticket to describe whatever problem these patches are trying to address.
The initial issue reported in this ticket was fixed a long time ago and let's not bring it back to life by reopening just for a related thing. New ticket for a new bug, please.
I've done enough research of email archives and source code reading to convince myself this isn't solvable. The only acceptable solution is to ensure that msgid strings (the arguments to gettext() and friends) are ASCII. No non-ASCII characters at all. Part of the reason for this is because gettext does binary byte-wise comparisons when looking up messages in the catalog, so if the catalog value is, say, UTF-8 and the source file is something else, you're doomed.
This means that if somebody wants their original text to be in a language other than North American English, they need to use the translation infrastructure (it cannot be done with USE_I18N=False). A shame, but cannot be helped.
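The byte-wise comparison problem can be illustrated with a toy model. This is only a sketch: the dictionary stands in for a compiled catalog and the `lookup` function is hypothetical, not gettext's actual implementation. It shows that the same msgid encoded in two different ways produces different byte strings, while a pure-ASCII msgid does not.

```python
def lookup(catalog, msgid, source_encoding):
    """Look up msgid by its raw bytes, mimicking a byte-wise comparison."""
    return catalog.get(msgid.encode(source_encoding))


# Toy catalog whose keys were stored as UTF-8 bytes (placeholder values).
catalog = {
    "Baden-Württemberg".encode("utf-8"): "translation A",
    "Bavaria".encode("utf-8"): "translation B",
}

# Source file in the same encoding as the catalog: the lookup succeeds.
print(lookup(catalog, "Baden-Württemberg", "utf-8"))
# Source file in another encoding: the bytes differ, so the lookup misses.
print(lookup(catalog, "Baden-Württemberg", "latin-1"))
# ASCII-only msgids encode identically in both, so they always match.
print(lookup(catalog, "Bavaria", "latin-1"))
```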