Code

Opened 7 years ago

Closed 7 years ago

#5056 closed (wontfix)

gettext/ngettext do not accept UTF-8 and Unicode input strings.

Reported by: Christopher Lenz <cmlenz@…> Owned by: mtredinnick
Component: Internationalization Version: master
Severity: Keywords: gettext unicode
Cc: bugs@… Triage Stage: Accepted
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: UI/UX:

Description

Which basically breaks things when you have non-ASCII chars in your localizations and try to use them with _(). Patch follows.

Attachments (5)

ugettext.diff (1.2 KB) - added by Christopher Lenz <cmlenz@…> 7 years ago.
Map _() to ugettext
ugettext.2.diff (5.4 KB) - added by Christopher Lenz <cmlenz@…> 7 years ago.
Updated patch
ugettext.3.diff (575 bytes) - added by Christopher Lenz <cmlenz@…> 7 years ago.
Oops, fixed a problem with previous patch
ugettext.4.diff (5.5 KB) - added by Christopher Lenz <cmlenz@…> 7 years ago.
New combined patch
ugettext.5.diff (5.5 KB) - added by Christopher Lenz <cmlenz@…> 7 years ago.
Fix typo in previous patch

Download all attachments as: .zip

Change History (14)

Changed 7 years ago by Christopher Lenz <cmlenz@…>

Map _() to ugettext

comment:1 Changed 7 years ago by Christopher Lenz <cmlenz@…>

  • Has patch set
  • Needs documentation unset
  • Needs tests unset
  • Patch needs improvement unset

And BTW, any string will look like a unicode string on the Django traceback pages.

I.e. I had a string returned from _() that was in fact a UTF-8 encoded byte-string, but was displayed in the "Locals Vars" section as u"bla bla". What was actually in the local vars, though, was a byte string.

comment:2 Changed 7 years ago by Johan Bergström <bugs@…>

  • Cc bugs@… added
  • Keywords gettext unicode added

While were at it, doesn't the python docs generally recommend to use

    import __builtin__
    __builtin__._ = stuff

instead of the current

    __builtins__['_'] = stuff

(see: http://docs.python.org/ref/naming.html , and http://www.mail-archive.com/pypy-dev@codespeak.net/msg02678.html)

comment:3 Changed 7 years ago by Christopher Lenz <cmlenz@…>

  • Summary changed from _() is still mapped to gettext instead of ugettext after unicode merge to gettext/ngettext should no longer be available after unicode merge

Okay, it was pointed out to me that the implicit _() function is supposed to go away according to #2920.

So in that case I'd say the point of this ticket is slightly broader: the non-unicode versions of the gettext functions make no sense in a unicode framework/application. So I'd suggest simply aliasing ugettext to gettext and ungettext to ngettext. Still providing the raw byte-string versions of those functions is prone to break the app in possibly unexpected ways: it just takes a translation using non-ASCII letters, and all the sudden you have a UnicodeDecodingError.

I'm attaching a patch that does this and also simplifies the related code a bit. Of course, it still fixes the implicit _() builtin until that is actually removed.

Changed 7 years ago by Christopher Lenz <cmlenz@…>

Updated patch

Changed 7 years ago by Christopher Lenz <cmlenz@…>

Oops, fixed a problem with previous patch

Changed 7 years ago by Christopher Lenz <cmlenz@…>

New combined patch

Changed 7 years ago by Christopher Lenz <cmlenz@…>

Fix typo in previous patch

comment:4 follow-up: Changed 7 years ago by mtredinnick

  • Summary changed from gettext/ngettext should no longer be available after unicode merge to gettext/ngettext do not accept UTF-8 and Unicode input strings.
  • Triage Stage changed from Unreviewed to Accepted

This can't go in as it is.

The problem is return types: ugettext() returns a Unicode string, gettext() returns a UTF-8 encoded bytestring. This makes a difference in some places (e.g. __str__ and __unicode__ methods). I would also, in general, be reluctant to change the return types to differ from the similarly named Python standard library functions to avoid any confusion for users, but it turns out that "neatness" urge is supported by the functionality requirements.

Fixing #2920 removes the ability for people to accidentally shoot themselves in the foot on output. They will need to explicitly import one or other translation function. If we also ensure that gettext() and ugettext() can both accept UTF-8 bytestrings and Unicode strings, we avoid accidents on input as well, which I gather might have been the main motivation. That fixes all the main usability problems, as I understand it.

Are there other usability issues you were trying to fix?

(Changing the title to reflect the more refined problem we are fixing. Removing has_patch checkbox, since all the current patches are not appropriate.)

comment:5 Changed 7 years ago by mtredinnick

  • Has patch unset

comment:6 in reply to: ↑ 4 Changed 7 years ago by Christopher Lenz <cmlenz@…>

Replying to mtredinnick:

This can't go in as it is.

The problem is return types: ugettext() returns a Unicode string, gettext() returns a UTF-8 encoded bytestring. This makes a difference in some places (e.g. __str__ and __unicode__ methods).

Oh, I thought that after the unicode merge, throwing byte-strings around the framework was undesirable. If that is not the case, you can close this ticket.

However, I'd like to emphasize that I think allowing/encouraging the use of byte-strings is a bad idea and asking for trouble. Even if you make all Django code accept both UTF-8 byte-strings and unicode interchangeably, there are still tons of ways in normal Python usage where this will break left and right. I think I know about this stuff, and the problem that caused me file this ticket and write the patch really threw me off for a couple of hours (the fact that bytestrings are rendered as unicode representations on the debug screen did not help ;-).

In particular for translations, where most gettext() calls start out returning just ASCII or unicode (basically the untranslated string), but when things actually get translated, all the sudden you have UTF-8 encoded bytestrings, which inevitably breaks things. Keeping the byte-string versions of the gettext functions available in a unicode application is not a feature, it's a design bug IMHO.

I would also, in general, be reluctant to change the return types to differ from the similarly named Python standard library functions to avoid any confusion for users, but it turns out that "neatness" urge is supported by the functionality requirements.

Note that the naming of the gettext functions in the standard library is due to it having to support both unicode and non-unicode applications. The actual gettext C library doesn't actually have a uxxx function for every other function. Personally, I see no problem with a framework defaulting to the unicode variants, and thus encouraging (or even forcing) the user to do the right thing. (I suspect we both know how many developers are ignorant about unicode and encodings.)

Fixing #2920 removes the ability for people to accidentally shoot themselves in the foot on output. They will need to explicitly import one or other translation function. If we also ensure that gettext() and ugettext() can both accept UTF-8 bytestrings and Unicode strings, we avoid accidents on input as well, which I gather might have been the main motivation. That fixes all the main usability problems, as I understand it.

I actually don't care much about the input of the gettext functions, as I use ASCII for that for better translation tooling interop. It definitely isn't the point I intended to make with this ticket.

comment:7 Changed 7 years ago by mtredinnick

  • Resolution set to invalid
  • Status changed from new to closed

Allowing bytestrings is a useful feature and is there by design. It means that code which uses only ASCII requires essentially no porting, for example.

I'll probably fix the input case at some point, but since it's not the point of this ticket, I'll close it. Thanks for putting the necessary thinking into the changes, but we do want to keep bytestring support; it's not pragmatic to insist everybody always uses Unicode.

comment:8 Changed 7 years ago by mtredinnick

  • Resolution invalid deleted
  • Status changed from closed to reopened

comment:9 Changed 7 years ago by mtredinnick

  • Resolution set to wontfix
  • Status changed from reopened to closed

Add Comment

Modify Ticket

Change Properties
<Author field>
Action
as closed
as The resolution will be set. Next status will be 'closed'
The resolution will be deleted. Next status will be 'new'
Author


E-mail address and user name can be saved in the Preferences.

 
Note: See TracTickets for help on using tickets.