Opened 17 years ago

Closed 17 years ago

#5056 closed (wontfix)

gettext/ngettext do not accept UTF-8 and Unicode input strings.

Reported by: Christopher Lenz <cmlenz@…> Owned by: Malcolm Tredinnick
Component: Internationalization Version: dev
Severity: Keywords: gettext unicode
Cc: bugs@… Triage Stage: Accepted
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

Which basically breaks things when you have non-ASCII chars in your localizations and try to use them with _(). Patch follows.

Attachments (5)

ugettext.diff (1.2 KB ) - added by Christopher Lenz <cmlenz@…> 17 years ago.
Map _() to ugettext
ugettext.2.diff (5.4 KB ) - added by Christopher Lenz <cmlenz@…> 17 years ago.
Updated patch
ugettext.3.diff (575 bytes ) - added by Christopher Lenz <cmlenz@…> 17 years ago.
Oops, fixed a problem with previous patch
ugettext.4.diff (5.5 KB ) - added by Christopher Lenz <cmlenz@…> 17 years ago.
New combined patch
ugettext.5.diff (5.5 KB ) - added by Christopher Lenz <cmlenz@…> 17 years ago.
Fix typo in previous patch

Download all attachments as: .zip

Change History (14)

by Christopher Lenz <cmlenz@…>, 17 years ago

Attachment: ugettext.diff added

Map _() to ugettext

comment:1 by Christopher Lenz <cmlenz@…>, 17 years ago

Has patch: set

And BTW, any string will look like a unicode string on the Django traceback pages.

I.e. I had a string returned from _() that was in fact a UTF-8 encoded byte-string, but was displayed in the "Locals Vars" section as u"bla bla". What was actually in the local vars, though, was a byte string.

comment:2 by Johan Bergström <bugs@…>, 17 years ago

Cc: bugs@… added
Keywords: gettext unicode added

While were at it, doesn't the python docs generally recommend to use

    import __builtin__
    __builtin__._ = stuff

instead of the current

    __builtins__['_'] = stuff

(see: http://docs.python.org/ref/naming.html , and http://www.mail-archive.com/pypy-dev@codespeak.net/msg02678.html)

comment:3 by Christopher Lenz <cmlenz@…>, 17 years ago

Summary: _() is still mapped to gettext instead of ugettext after unicode mergegettext/ngettext should no longer be available after unicode merge

Okay, it was pointed out to me that the implicit _() function is supposed to go away according to #2920.

So in that case I'd say the point of this ticket is slightly broader: the non-unicode versions of the gettext functions make no sense in a unicode framework/application. So I'd suggest simply aliasing ugettext to gettext and ungettext to ngettext. Still providing the raw byte-string versions of those functions is prone to break the app in possibly unexpected ways: it just takes a translation using non-ASCII letters, and all the sudden you have a UnicodeDecodingError.

I'm attaching a patch that does this and also simplifies the related code a bit. Of course, it still fixes the implicit _() builtin until that is actually removed.

by Christopher Lenz <cmlenz@…>, 17 years ago

Attachment: ugettext.2.diff added

Updated patch

by Christopher Lenz <cmlenz@…>, 17 years ago

Attachment: ugettext.3.diff added

Oops, fixed a problem with previous patch

by Christopher Lenz <cmlenz@…>, 17 years ago

Attachment: ugettext.4.diff added

New combined patch

by Christopher Lenz <cmlenz@…>, 17 years ago

Attachment: ugettext.5.diff added

Fix typo in previous patch

comment:4 by Malcolm Tredinnick, 17 years ago

Summary: gettext/ngettext should no longer be available after unicode mergegettext/ngettext do not accept UTF-8 and Unicode input strings.
Triage Stage: UnreviewedAccepted

This can't go in as it is.

The problem is return types: ugettext() returns a Unicode string, gettext() returns a UTF-8 encoded bytestring. This makes a difference in some places (e.g. __str__ and __unicode__ methods). I would also, in general, be reluctant to change the return types to differ from the similarly named Python standard library functions to avoid any confusion for users, but it turns out that "neatness" urge is supported by the functionality requirements.

Fixing #2920 removes the ability for people to accidentally shoot themselves in the foot on output. They will need to explicitly import one or other translation function. If we also ensure that gettext() and ugettext() can both accept UTF-8 bytestrings and Unicode strings, we avoid accidents on input as well, which I gather might have been the main motivation. That fixes all the main usability problems, as I understand it.

Are there other usability issues you were trying to fix?

(Changing the title to reflect the more refined problem we are fixing. Removing has_patch checkbox, since all the current patches are not appropriate.)

comment:5 by Malcolm Tredinnick, 17 years ago

Has patch: unset

in reply to:  4 comment:6 by Christopher Lenz <cmlenz@…>, 17 years ago

Replying to mtredinnick:

This can't go in as it is.

The problem is return types: ugettext() returns a Unicode string, gettext() returns a UTF-8 encoded bytestring. This makes a difference in some places (e.g. __str__ and __unicode__ methods).

Oh, I thought that after the unicode merge, throwing byte-strings around the framework was undesirable. If that is not the case, you can close this ticket.

However, I'd like to emphasize that I think allowing/encouraging the use of byte-strings is a bad idea and asking for trouble. Even if you make all Django code accept both UTF-8 byte-strings and unicode interchangeably, there are still tons of ways in normal Python usage where this will break left and right. I think I know about this stuff, and the problem that caused me file this ticket and write the patch really threw me off for a couple of hours (the fact that bytestrings are rendered as unicode representations on the debug screen did not help ;-).

In particular for translations, where most gettext() calls start out returning just ASCII or unicode (basically the untranslated string), but when things actually get translated, all the sudden you have UTF-8 encoded bytestrings, which inevitably breaks things. Keeping the byte-string versions of the gettext functions available in a unicode application is not a feature, it's a design bug IMHO.

I would also, in general, be reluctant to change the return types to differ from the similarly named Python standard library functions to avoid any confusion for users, but it turns out that "neatness" urge is supported by the functionality requirements.

Note that the naming of the gettext functions in the standard library is due to it having to support both unicode and non-unicode applications. The actual gettext C library doesn't actually have a uxxx function for every other function. Personally, I see no problem with a framework defaulting to the unicode variants, and thus encouraging (or even forcing) the user to do the right thing. (I suspect we both know how many developers are ignorant about unicode and encodings.)

Fixing #2920 removes the ability for people to accidentally shoot themselves in the foot on output. They will need to explicitly import one or other translation function. If we also ensure that gettext() and ugettext() can both accept UTF-8 bytestrings and Unicode strings, we avoid accidents on input as well, which I gather might have been the main motivation. That fixes all the main usability problems, as I understand it.

I actually don't care much about the input of the gettext functions, as I use ASCII for that for better translation tooling interop. It definitely isn't the point I intended to make with this ticket.

comment:7 by Malcolm Tredinnick, 17 years ago

Resolution: invalid
Status: newclosed

Allowing bytestrings is a useful feature and is there by design. It means that code which uses only ASCII requires essentially no porting, for example.

I'll probably fix the input case at some point, but since it's not the point of this ticket, I'll close it. Thanks for putting the necessary thinking into the changes, but we do want to keep bytestring support; it's not pragmatic to insist everybody always uses Unicode.

comment:8 by Malcolm Tredinnick, 17 years ago

Resolution: invalid
Status: closedreopened

comment:9 by Malcolm Tredinnick, 17 years ago

Resolution: wontfix
Status: reopenedclosed
Note: See TracTickets for help on using tickets.
Back to Top