Opened 7 years ago

Closed 6 years ago

#9212 closed (fixed)

German Umlauts and possible other foreign languages special characters

Reported by: nekron Owned by: nobody
Component: Internationalization Version: 1.1
Severity: Keywords: Umlauts
Cc: mtredinnick Triage Stage: Accepted
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: UI/UX:

Description

SVN: 9084

Today I played around with internationalization. My source code labels and template files contain German language by default which I wanted to translate into English. All files are utf-8 encoded so that I can use special Umlauts like "äöüß".

Creating the .po file with "makemessages -l en" I find that e.g. the word "Straße" (=street) will be shown in the .po file as

#: .\survey\models.py:71
msgid "Straße"   <-- strange chars here!
msgstr "Street"

I was editing the .po file with VIM utf-8 encoding set on so in my opinion I should see "Straße" instead of that strange looking word. I am not sure if this is a Windows gettext related bug and will try it on my Linux box tomorrow. Anyway other languages might be affected by this, too. On the other hand this is only a little quirk and translation within the application works fine for me.

Attachments (1)

9212.diff (1.9 KB) - added by kmtracey 7 years ago.


Change History (17)

comment:1 Changed 7 years ago by nekron

  • Needs documentation unset
  • Needs tests unset
  • Patch needs improvement unset

The next test I did on my Windows box was to set the .py file coding to #-*- coding: iso-8859-1 -*-# and change the special chars to that encoding. The result was that makemessages produced a .po file with Umlauts that displayed correctly in VIM (:set encoding=iso-8859-1) and in an external .po editor (http://www.poedit.net/). As mentioned in the docs, the .po file should always be utf-8, so somewhere in the makemessages process the encoding is treated as iso-8859-1 rather than utf-8. The already utf-8-encoded chars get encoded again as if they were iso-8859-1, hence the strange-looking chars.



comment:2 Changed 7 years ago by mtredinnick

I suspect there's something Windows-specific going on here, possibly with respect to the native system encoding or possibly the behaviour of msgmerge or maybe something else. I cannot repeat this problem on Linux.

Somebody with a Windows box who can repeat this problem will need to debug this. Probably by printing repr(msgs) at each step throughout django/core/management/commands/makemessages.py to see where the problem occurs. If at all possible, I would like to avoid having to use codecs.open() in that file, since that might restrict the portability of that tool to places like Jython, IronPython, PyPy, etc. It also shouldn't make a difference, since writing to the filesystem should not be magically re-encoding the bytes we are writing out.

I do notice that line 156 of makemessages.py doesn't open for writing in wb mode (just w mode), although I don't think that should be making a difference on the encoding of the written data.

So this needs somebody who can confirm it (and/or provide a repeatable test case, since the one in the bug description doesn't seem to be enough to repeat it for me) and maybe somebody on Windows to work out what's going wrong there.

comment:3 Changed 7 years ago by kmtracey

I've got this on my list of things to look into, since I've got Windows to play with. It'll probably be a few more days before I have any time to give it, though.

comment:4 Changed 7 years ago by kmtracey

This appears to be an xgettext bug in the version of xgettext available from http://sourceforge.net/projects/gettext. The latest version there is 0.13.1, dated Jan 15, 2004. Based on its behavior and this comment:

  /* We assume the program source is in ISO-8859-1 (for consistency with
     Python's \ooo and \xnn syntax inside strings), but we produce a POT
     file in UTF-8 encoding.  */

in the source (gettext-tools/src/x-python.c) it assumes Python source files are encoded using ISO-8859-1. So if you give it a utf-8 encoded source file you get nonsense output as it re-encodes the already utf8-encoded data bytes assuming they are ISO-8859-1 to start with. (The --from-code arg is ignored for Python files.)

Looking at the 0.17 (latest I could find) source, this file is rather different and that comment is gone. But the gettext for Win32 project at SourceForge does not offer anything more recent than 0.13. Apparently the base project started supporting Win32 directly so this other project has been abandoned? Except "support" means only that it's possible to build the base project on Windows, not that binaries are provided, and I didn't find it trivially easy to build the gettext package on Windows. So I went searching for pre-built binaries. The only ones I found that seem to be recent enough are the ones you get from cygwin, currently 0.15. With that version of xgettext installed via cygwin, the xgettext output is correct for a utf-8 encoded Python source file.

So, I suppose we could update the docs here: http://docs.djangoproject.com/en/dev/topics/i18n/#gettext-on-windows to mention getting gettext utilities via cygwin in addition to (or instead of?) the other site. The problem with cygwin is it's rather harder to give instructions and it's quite possible there are more dependencies than just the obvious gettext packages that need to be installed (I also had to install the libexpat package to resolve a dll not found error, but who knows what else might be required that I already had, since I had installed rather a lot of cygwin packages already on this particular machine).

Opinions? Anyone know if there are more recent easily-installed binaries for gettext tools on Windows out there somewhere? (The 0.14.4 ones I found here: http://gnuwin32.sourceforge.net/packages/gettext.htm show the same problem as the 0.13 ones, so I'm guessing 0.15 or later is what's needed.)

comment:5 Changed 7 years ago by mtredinnick

  • Cc mtredinnick added
  • Triage Stage changed from Unreviewed to Accepted

Oh, I've played this game before. :-( gettext seemed to make a new release every couple of months in the early years of this decade and working out which version added new features was hard.

I'm inclined to be nice to older versions -- in the sense of working around their limitations -- because it isn't always easy to upgrade and I don't want to put a big learning curve in the form of compiling the toolchain in the way of people trying to localise their applications. If somebody can work out how to get the version number out of xgettext so that we can do it in the Python code, I'm very happy to put something into makemessages.py that looks like

if version < (0, 15, 0):
   output_text = output_text.decode('iso-8859-1').encode('utf-8')

I have a feeling that using a reg-exp on the first line of output from xgettext --version is going to be enough here. I seem to recollect (from working on GNOME's intltool) that that was enough to work out the version pretty reliably. We'll only need the major and minor number.

We have a requirement that the source text must be in UTF-8 for strings being sent to gettext and so we don't support things like coding: iso-8859-1. Gettext and other tools aren't smart enough to understand that sort of stuff (they just do a lexical scan of the file, they don't understand too much about Python), so I'm comfortable with making that a requirement for people writing the source. But that means we have to actually support UTF-8. So if older gettext versions are treating things as iso-8859-1 by default, I can live with using programmatic hammers to force them back to UTF-8. No bytes get lost in the conversion process; it's just that the intermediate text doesn't make any sense (as noted in the original bug report).

comment:6 Changed 7 years ago by kmtracey

OK, I can take a look at fixing it by figuring out the xgettext version and undoing the mangling for xgettexts older than 0.15 (maybe I should verify that's specifically when it was fixed also). Only I think the encoding names need to be reversed there -- what you show is essentially what xgettext is doing and we need to reverse that:

>>> u = u'\xdf'
>>> print u
ß
>>> e1 = u.encode('utf-8')
>>> e1
'\xc3\x9f'
>>> e2 = e1.decode('iso-8859-1').encode('utf-8')
>>> e2
'\xc3\x83\xc2\x9f'
>>> e3 = e2.decode('utf-8').encode('iso-8859-1')
>>> e3
'\xc3\x9f'

e1 is the bytes in the source .py file, e2 is what the older xgettext outputs, e3 is what we want instead (same as e1).

comment:7 Changed 7 years ago by mtredinnick

Oh, I see what's happening. Yes, it's treating the input as iso-8859-1. So you're right: We have to decode from utf-8 back to unicode, then see what it would look like if treated as iso-8859-1 and then know that that is really the UTF-8 bytestrings. Good grief.

It also just occurred to me how fragile this is. It means any Python input files using bytes that don't fit into ISO-8859-1 cannot be handled by that old xgettext version. That's just going to be tough luck for those people and they'll have to get a more recent gettext; the tool isn't flexible enough to handle varying encodings. But we will have to document that limitation (in the "coming one day soon now, hopefully" rewrite of i18n.txt to be something comprehensible).

comment:8 Changed 7 years ago by Karen Tracey <kmtracey@…>

Yes, it's ugly and backwards-looking but I don't think it's fragile actually. The old xgettext doesn't reject any bytes as invalid, it simply takes any byte that has the high bit set and distributes its original 8 bits between two new bytes, with the high bits of the new bytes set as appropriate for a 2-byte utf-8 sequence. This transformation is completely reversible and by doing a .decode('utf-8').encode('iso-8859-1') we get back the originally-provided bytes...which may well be the utf-8 encoding of something not representable in iso-8859-1.
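The reversibility argument above can be checked with a small sketch. It is written for Python 3 (where bytes and text are separate types, unlike the Python 2 sessions in this ticket), and the helper names are hypothetical. It only models the behaviour as described in this comment; it is not taken from the xgettext source or from the attached patch.

```python
# Sketch (Python 3) of the re-encoding described above; helper names are hypothetical.

def mangle_like_old_xgettext(data: bytes) -> bytes:
    """Treat every input byte as an ISO-8859-1 character and emit UTF-8."""
    out = bytearray()
    for b in data:
        if b < 0x80:
            out.append(b)                  # ASCII passes through unchanged
        else:
            out.append(0xC0 | (b >> 6))    # lead byte carries the top 2 bits
            out.append(0x80 | (b & 0x3F))  # continuation byte carries the low 6 bits
    return bytes(out)


def undo_mangling(data: bytes) -> bytes:
    """The reversal discussed in this ticket."""
    return data.decode('utf-8').encode('iso-8859-1')


# The bit-level version agrees with the codec-based description for every byte value.
all_bytes = bytes(range(256))
assert mangle_like_old_xgettext(all_bytes) == all_bytes.decode('iso-8859-1').encode('utf-8')

# Text that is not representable in ISO-8859-1 still survives the round trip,
# because the reversal operates on the UTF-8 bytes, not on the characters.
original = 'Straße\u2009\u4e2d'.encode('utf-8')
assert undo_mangling(mangle_like_old_xgettext(original)) == original
```

The round trip only holds when the input really was mangled first; applying undo_mangling to output that was never re-encoded is a different case.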

I'll attach a patch for review (since I'm not entirely comfortable with this code so would prefer someone review it before committing) that attempts to retrieve/interpret the xgettext version and do the decode if necessary. I did verify via the online cvs history for the GNU gettext package that 0.15 is the first version of xgettext that does not assume Python source is encoded in iso-8859-1, so checking for lower than 0.15 is the right check. I tested and verified that the patch works with:

1 - the Windows xgettext binary pointed at by our docs (0.13.1). Here we do the decode to restore the original bytes.

2 - the current Windows xgettext binary from cygwin (0.15). Here we do not do the decode, which is correct since xgettext doesn't reencode its input.

3 - the xgettext version on my Linux box (0.16.1). Here again we do not do the extra decode since it's not necessary.

However I ran into a hiccup with the Windows 0.14.4 binary I found online (mentioned earlier). This version does mangle its output. However it traps when you try to run 'xgettext --version' (or 'xgettext -V'). That is you get a Windows popup "xgettext.exe has generated errors and will be closed by Windows. You will need to restart the program. An error log is being created". You get this if you try the version flag from the command line or as part of makemessages.

Once you hit "OK" the makemessages script continues and produces incorrect output because I coded the default in case of trouble determining version to be to NOT do the extra decode. (I'm uncomfortable with the idea of reversing that...if we can't reliably determine that the version is one that mangles output I'm thinking we should not go ahead and potentially mangle perfectly good output.)

But that leaves this one version I'm aware of where we do the wrong thing. The user gets an indication that there's something wrong in the form of that popup...but it's pretty cryptic. Of course this version seems to be rather broken: besides trapping when you try to get it to return the version string, it reports its name as (null). For example, invoking it with no arguments produces:

xgettext: no input file given
Try `(null) --help' for more information.

I did try raising a CommandError if 'xgettext --version' fails to return anything. However that produces another cryptic error:

Error: xgettext --version returns no version information
d:\u\kmt\django\trunk\django\core\management\base.py:234: RuntimeWarning: tp_compare didn't return -1 or -2 for exception
  sys.exit(1)

that I really don't feel like tracking down. So I'm inclined to not worry about working correctly with this version, because it's fundamentally broken. Other opinions?

Changed 7 years ago by kmtracey

comment:9 Changed 7 years ago by mtredinnick

This patch looks good, Karen, with the one exception noted below. We can document that 0.14.4 binary for Windows as "don't use that" in the i18n documentation (put it there for now; I'm going to make time this week to rewrite that particular piece of documentation). I'm happy to be somewhat accommodating of people using pre-compiled binaries, but there's a limit to how nice I'm willing to be if the binary is just compiled to be broken out of the box.

The only thing I'd change in the patch (and it might be more style than substance, but it jumped out at me) is line 88, the regular expression line. Firstly, there's no need to call re.compile() when you're only using the reg-exp once. Also, you can drop the leading .*? bit and just use search() instead of match(), since that will find the first occurrence of the pattern. So

match = re.search(r'(?P<major>\d+)\.(?P<minor>\d+)', stdout.read())

is equivalent. Go ahead and commit whenever you like.
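For illustration, here is a minimal sketch of how the pattern above could feed the version check discussed in this ticket. The function names and the sample version strings are assumptions made for the example, not the contents of the attached patch. The 0.15 threshold and the choice to skip the decode when no version can be found both come from the earlier comments.

```python
import re

# Hypothetical helper names; the 0.15 threshold and the "do not decode when
# unsure" default come from the discussion in this ticket.

def parse_xgettext_version(version_output):
    """Return (major, minor) found in `xgettext --version` output, or None."""
    match = re.search(r'(?P<major>\d+)\.(?P<minor>\d+)', version_output)
    if match is None:
        return None
    return int(match.group('major')), int(match.group('minor'))


def needs_latin1_undo(version_output):
    """True only when the version is known and older than 0.15."""
    version = parse_xgettext_version(version_output)
    if version is None:
        # Version could not be determined: leave the output untouched.
        return False
    return version < (0, 15)


# The strings below are assumed examples of the first line of the output.
print(needs_latin1_undo('xgettext (GNU gettext-tools) 0.13.1'))  # True
print(needs_latin1_undo('xgettext (GNU gettext-tools) 0.16.1'))  # False
print(needs_latin1_undo(''))                                      # False
```

Comparing tuples of integers avoids the pitfall of comparing version strings, where '0.9' would sort after '0.15'.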

comment:10 Changed 7 years ago by kmtracey

  • Resolution set to fixed
  • Status changed from new to closed

(In [9155]) Fixed #9212: Added code to check the xgettext version, and if it is lower than 0.15, undo an incorrect encoding to utf-8 done by xgettext. This bug was fixed in xgettext 0.15, but the most-easily-installed Windows gettext binaries are older (0.13.1), so we work around it.

comment:11 Changed 7 years ago by kmtracey

(In [9156]) [1.0.X] Fixed #9212: Added code to check the xgettext version, and if it is lower than 0.15, undo an incorrect encoding to utf-8 done by xgettext. This bug was fixed in xgettext 0.15, but the most-easily-installed Windows gettext binaries are older (0.13.1), so we work around it.

Backport of r9155 from trunk.

comment:12 Changed 7 years ago by kmtracey

Thanks, made the change and committed. Also included a note about not using broken gettext binaries which kind of reads like stating the obvious to me -- if you're going to be editing that section anyway maybe you can come up with something better.

comment:13 Changed 6 years ago by anonymous

  • milestone post-1.0 deleted


comment:14 follow-up: Changed 6 years ago by thie1210

  • Resolution fixed deleted
  • Status changed from closed to reopened
  • Version changed from SVN to 1.1

Hi guys,

I'm running xgettext version 0.14.1 here. The makemessages command was working fine until I updated from v1.0 to v1.1. I now get the following error:

File "/usr/lib/python2.3/site-packages/django/core/management/commands/makemessages.py", line 166, in make_messages
    msgs = msgs.decode('utf-8').encode('iso-8859-1')
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2009' in position 600: ordinal not in range(256)

The file in question, views.py, contains unicode strings and characters (u2009, the thin space, for instance).

It's true that version 0.14.1 of xgettext assumes ASCII for its input files. However, you can specify the encoding of the input files with the "--from-code" option, and that's what makemessages.py does when it calls xgettext at line 128.

I've bypassed the xgettext_reencodes_utf8 test in my makemessages.py file and I'm now back up and running. I'm new to this so I don't know how to make an official change. I guess we need a new solution for the original issue too?

comment:15 in reply to: ↑ 14 Changed 6 years ago by kmtracey

Replying to thie1210:

Why are you using version 0.14.1? The easiest fix would be to update to a more recent level. At the time this change went in it was hard to find precompiled binaries for Windows higher than 0.13. Since then a source has been found (see http://docs.djangoproject.com/en/dev/topics/i18n/#gettext-on-windows) so there is less motivation to work around gettext problems in Django code.

What you seem to be reporting is that the version 0.14.1 you have respects the --from-code for .py files, which the versions I found that exhibited the original problem did not do. The 0.13.1 likely used by the original reporter did not, nor did some other 0.14.4 Windows binaries I found. These would both re-encode the input file to utf-8 assuming an input encoding of iso-8859-1 (not ASCII). Based on the error you are getting, the version you have is not doing this, so our attempt to un-do it fails.
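A short sketch of why the undo step cannot safely be applied to output that was never mangled. It is written for Python 3, the helper name is hypothetical, and the sample strings are made up for the example.

```python
# Python 3 sketch: apply the undo step to xgettext output that is already
# correct UTF-8. The helper name and sample strings are hypothetical.

def undo_mangling(data: bytes) -> bytes:
    return data.decode('utf-8').encode('iso-8859-1')


# A thin space (U+2009) has no ISO-8859-1 equivalent, so the encode step fails,
# which matches the UnicodeEncodeError reported in comment 14.
correct_output = 'msgid "a\u2009b"'.encode('utf-8')
try:
    undo_mangling(correct_output)
except UnicodeEncodeError as exc:
    print('undo failed:', exc.reason)

# With only Latin-1 characters there is no exception at all: the bytes are
# converted silently and are no longer valid UTF-8.
damaged = undo_mangling('Straße'.encode('utf-8'))
print(damaged)  # b'Stra\xdfe'
```

The second case is arguably the worse one, since nothing signals that the .po file now contains bytes in the wrong encoding.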

That leaves us unable to figure out, based on the reported version, whether the command we are calling mangles its input. If we cannot figure that out, we cannot know whether to undo the mangling. Given we've now updated our docs to include pointers to how to get a level of gettext utilities for Windows that does not exhibit the original problem, my initial inclination for how to fix this new problem is to remove the fix we put in for the original problem, and add a requirement that you must use gettext 0.15 or higher.

comment:16 Changed 6 years ago by ramiro

  • Resolution set to fixed
  • Status changed from reopened to closed

More than four months have passed since Karen asked the user who reopened this already properly fixed ticket about the reasons for reopening it, and for additional information about what seems to be an uncommon working environment. I'm closing it with status fixed because there hasn't been any answer.

Additionally, the next stable release will have GNU gettext >= 0.15 as a requirement.
