Opened 7 years ago

Closed 5 years ago

#9753 closed (fixed)

makemessages failed on long Chinese text

Reported by: Will Owned by: nobody
Component: Internationalization Version: 1.0
Severity: Keywords: django-admin.py makemessages
Cc: 1.0.2 Triage Stage: Design decision needed
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: UI/UX:

Description

  1. Enclose a Chinese string longer than 76 Chinese characters in a {% trans "" %} tag, e.g., "四千年前有一个姑娘叫姜嫄,她有一天觉得很空虚,就到郊外玩,看见一只巨人脚印,也许是外星人留下的,她想上去比一比,看看谁的脚丫子更大,就踩上去。踩上去就发现肚子里乱动,跟怀了孕似的。回去以后,肚子里的小孩,又老不出来,过了十二个月才生下来。"
  2. Run django-admin.py makemessages -l en -e htm
  3. You will see an error. I don't remember the exact error message, but basically it means "a string ends unexpectedly", probably because the code doesn't handle multi-byte characters correctly and truncates the string in the middle of a Chinese character. An English string of the same length works fine.
  4. Because of this, we have to write our program in English and then provide the Chinese version using Django's internationalization tools.

Change History (7)

comment:1 Changed 7 years ago by mtredinnick

  • Needs documentation unset
  • Needs tests unset
  • Patch needs improvement unset

Django handles multibyte characters (either UTF-8 files or with settings.FILE_CHARSET set appropriately) without any problems.

I just tested your example as described, and it works without error for me. Remember that template tags must all be on a single line (they cannot contain newlines), but even if I mess this up, the template tag is simply ignored, rather than raising an error.

To diagnose this further, could you please attach a very short template file that causes the error to be raised for you.

comment:2 Changed 7 years ago by kmtracey

This may be related to #9212. Using gettext utilities 0.15 (from cygwin) on Windows I have no problem with the specified Chinese string in a template. However, using gettext utilities 0.13 on Windows I get these errors:

Error: errors happened while running msguniq
D:\u\kmt\software\web\xword\locale\en\LC_MESSAGES\django.pot:48:76: invalid multibyte sequence
D:\u\kmt\software\web\xword\locale\en\LC_MESSAGES\django.pot:48:77: invalid multibyte sequence
D:\u\kmt\software\web\xword\locale\en\LC_MESSAGES\django.pot:49:2: invalid multibyte sequence

Lines 45-50 from the .pot file are:

#: .\crossword\templates\500.html.py:7
msgid ""
"四千年前有一个姑娘叫姜嫄,她有一天觉得很空虚,就到郊外玩,看见一只巨人脚印"
",也许是外星人留下的,她想上去比一比,看看谁的脚丫子更大,就踩上去。踩上去�"
"�发现肚子里乱动,跟怀了孕似的。回去以后,肚子里的小孩,又老不出来,过了十二个月才生下来。"
msgstr ""

which (if it displays as it is in the composition window) shows an invalid utf-8 sequence at the end of the 2nd message line and the beginning of the 3rd (the lines identified in the error messages).

The problem identified in #9212 was that the older xgettext assumes iso8859-1 encoding for Python files, and takes that assumed iso8859-1 input and encodes it to utf-8. However, Django requires the source to already be utf-8 encoded, so xgettext outputs doubly utf-8 encoded data. We "fixed" that by un-doing the extra utf-8 encoding done by xgettext. Here, however, xgettext apparently also splits long messages. When it splits the doubly-encoded utf-8 data, it may do so at a point that, once the 2nd utf-8 encoding is un-done, leaves one of the original utf-8 encoded Unicode chars split across a line boundary. The extra close quote and newline chars then sit in the middle of an original n-byte utf-8 sequence, resulting in the error.
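
The splitting failure described in the previous paragraph can be reproduced in a few lines. This is a minimal sketch in Python 3, not code taken from Django or xgettext; the helper names (double_encode, undo_double_encoding, is_valid_utf8) and the split offset of 4 are illustrative assumptions.

```python
def double_encode(utf8_bytes):
    """Mimic the old xgettext: read UTF-8 bytes as ISO-8859-1, write UTF-8."""
    return utf8_bytes.decode("iso-8859-1").encode("utf-8")


def undo_double_encoding(data):
    """Reverse the extra step: decode UTF-8, re-encode as ISO-8859-1."""
    return data.decode("utf-8").encode("iso-8859-1")


def is_valid_utf8(data):
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Each of these Chinese characters occupies 3 bytes in UTF-8.
original = "踩上去就发现".encode("utf-8")
doubled = double_encode(original)

# Wrap the doubled text after 4 of its characters. Each of those characters
# stands for one ORIGINAL byte, so the cut lands inside the second
# original character (3 bytes + 1 byte).
wrapped = doubled.decode("utf-8")
first = wrapped[:4].encode("utf-8")
second = wrapped[4:].encode("utf-8")

# Both halves are valid while still doubly encoded, but not after the undo.
first_fixed = undo_double_encoding(first)
second_fixed = undo_double_encoding(second)
```

Undoing the extra encoding on the unwrapped data restores the original bytes, whereas each wrapped half on its own no longer decodes as utf-8, which matches the "invalid multibyte sequence" reports at the end of one line and the start of the next.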

Now, this is a template file and not a Python file -- it isn't entirely clear to me why we say the language is Python for all of the "extra" extensions processed. If that isn't really necessary then perhaps this could be worked around by not specifying language Python for extra extensions. But Python code would still potentially suffer from this problem (I'm assuming we do need to specify language Python for actual Python source).

Anyway a more straightforward way to get around this problem, for all files where we specify language Python to xgettext, is to specify --no-wrap to xgettext. That way it doesn't split lines and we don't wind up with invalid utf-8 sequences at line boundaries even after un-doing the extra utf-8 encoding. I thought it would be best to only specify --no-wrap when we are in this oddball case of needing to un-do xgettext's incorrect double encoding of the original utf-8 data, but it seems msguniq (which we run after xgettext) also splits lines, so even when we specify --no-wrap to xgettext, we get nicely wrapped lines in the ultimate output file. So I think it would be OK to just add --no-wrap unconditionally to the args for xgettext. Sound OK?
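
As a rough illustration of the proposal, the argument list passed to xgettext might be built as follows. This is a sketch only: build_xgettext_args is a hypothetical helper, not the function makemessages actually uses, and the real command line carries additional options (keywords and so on) that are omitted here. The flags shown (-d, --language, --no-wrap, -o) are standard xgettext options.

```python
def build_xgettext_args(domain, path, no_wrap=True):
    """Build an xgettext argument list (hypothetical helper, for illustration)."""
    args = ["xgettext", "-d", domain, "--language=Python"]
    if no_wrap:
        # Keep each message on one line so no character can be split
        # across a line boundary before the extra encoding is undone.
        args.append("--no-wrap")
    # Write the extracted messages to standard output.
    args += ["-o", "-", path]
    return args


args = build_xgettext_args("django", "500.html.py")
```

The resulting list could be handed to subprocess.run; since msguniq re-wraps its input afterwards, the final file would still contain wrapped lines.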

All this is assuming the problem I am seeing is the same as the reported one -- as the original report lacks the actual error messages seen and any details on platform used or gettext utilities version in use, I can't be 100% sure of that.

comment:3 Changed 7 years ago by mtredinnick

Ah, another Windows problem. Fantastic. :-(

A couple of comments:

  1. The reason we don't have --no-wrap on unconditionally is that it leads to pretty ugly PO files, particularly for those working in text editors. However, it's becoming clear that there's a need to allow arbitrary gettext options to be specified and passed through our wrappers to xgettext, with --no-wrap being one of those things. With luck, we'll get that change in for 1.1.
  2. We mark the template files as being of type "Python" because the conversion process to the intermediate form that we ultimately run xgettext over converts variables to Python format strings (not valid for {% trans %}, but certainly so for {% blocktrans %} tags). The advantage of doing that is that translation tools know about Python format and can mark them appropriately (e.g. syntax highlighting) and compilemessages will warn about any misspellings of format strings, or omitted format strings. That catches a lot of errors. If the gettext tools somehow magically knew about Django markup, we'd use "Django" format, but that isn't the case today, so the conversion to intermediate Python format, which is then processed with xgettext is actually a pretty neat way of working with existing tools for translators.
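
To make the second point concrete, here is a much simplified sketch of turning template variables into Python format strings. It is not Django's actual conversion code: to_python_format and VAR_RE are hypothetical names, and the regular expression only handles plain {{ name }} variables without filters.

```python
import re

# Matches a bare template variable such as {{ name }} (no filters or dots).
VAR_RE = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def to_python_format(message):
    """Rewrite {{ name }} placeholders as %(name)s, escaping literal % signs."""
    escaped = message.replace("%", "%%")
    return VAR_RE.sub(r"%(\1)s", escaped)


msgid = to_python_format("Hello {{ name }}, you have {{ count }} messages")
rendered = msgid % {"name": "Will", "count": 3}
```

Because the msgid now contains %(name)s style placeholders, gettext tools that understand Python format strings can flag a translation that misspells or drops one of them.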

If the original poster can confirm this is just a problem with an older gettext on Windows then the resolution is to upgrade to a more recent version of gettext (it's been over 2 years since gettext 0.15 was released, so we're not requiring anything that's just been released in the last few months) and, in the future, we'll also allow the arbitrary parameter passing.

comment:4 Changed 7 years ago by kmtracey

I figured there might be a good reason for specifying Python as the language, I just didn't know what it was. I've not done much with translation besides trying to recreate (since I do have Windows boxes to test on) these problems that seem to affect only Windows due to the older level of gettext that is most prevalent there.

The thing I tried to say about specifying --no-wrap to xgettext was that it doesn't affect the .po file output by our makemessages management command, since the subsequent invocation of msguniq appears to wrap the lines that weren't wrapped by xgettext. That is, when I tried fixing the problem by adding --no-wrap on xgettext (conditionally, only when running the old gettext level), I was surprised to find that the output file still had that very long Chinese message wrapped. So I was thinking it would be OK to fix this problem (assuming it's actually what the original poster was seeing) by always specifying --no-wrap on xgettext, since the xgettext output (as near as I can tell) is always run through another processing step that will take care of wrapping the long lines. But as I said, I'm not really familiar with any of these utilities, so perhaps I am missing something?

comment:5 Changed 6 years ago by anonymous

  • milestone post-1.0 deleted

Milestone post-1.0 deleted

comment:6 Changed 6 years ago by jacob

  • Triage Stage changed from Unreviewed to Design decision needed

comment:7 Changed 5 years ago by ramiro

  • Resolution set to fixed
  • Status changed from new to closed

I'm closing this ticket because in r12296 we've raised the minimum gettext version required to 0.15, which shouldn't have the double utf-8 encoding problems, and this makes the workaround of passing --no-wrap in the gettext invocation unnecessary.
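
A minimum-version check of this kind could look roughly like the sketch below. It is an assumption-laden illustration, not the code from r12296: parse_gettext_version and meets_minimum are hypothetical helpers, and the sample strings only approximate the first line printed by xgettext --version.

```python
import re

MINIMUM_GETTEXT = (0, 15)


def parse_gettext_version(output):
    """Extract (major, minor) from the first version number in the output."""
    match = re.search(r"(\d+)\.(\d+)", output)
    if match is None:
        raise ValueError("no version number found in %r" % output)
    return int(match.group(1)), int(match.group(2))


def meets_minimum(output, minimum=MINIMUM_GETTEXT):
    """Return True if the reported version is at least the minimum."""
    return parse_gettext_version(output) >= minimum


# Assumed sample outputs, for illustration only.
old = "xgettext (GNU gettext-tools) 0.13.1"
new = "xgettext (GNU gettext-tools) 0.15"
```

Comparing the versions as integer tuples avoids the trap of string comparison, where "0.9" would sort after "0.15".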
