Opened 17 years ago
Closed 17 years ago
#5056 closed (wontfix)
gettext/ngettext do not accept UTF-8 and Unicode input strings.
Reported by: | Owned by: | Malcolm Tredinnick | |
---|---|---|---|
Component: | Internationalization | Version: | dev |
Severity: | Keywords: | gettext unicode | |
Cc: | bugs@… | Triage Stage: | Accepted |
Has patch: | no | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
Which basically breaks things when you have non-ASCII chars in your localizations and try to use them with _()
. Patch follows.
Attachments (5)
Change History (14)
by , 17 years ago
Attachment: | ugettext.diff added |
---|
comment:1 by , 17 years ago
Has patch: | set |
---|
And BTW, any string will look like a unicode string on the Django traceback pages.
I.e. I had a string returned from _()
that was in fact a UTF-8 encoded byte-string, but was displayed in the "Locals Vars" section as u"bla bla"
. What was actually in the local vars, though, was a byte string.
comment:2 by , 17 years ago
Cc: | added |
---|---|
Keywords: | gettext unicode added |
While were at it, doesn't the python docs generally recommend to use
import __builtin__ __builtin__._ = stuff
instead of the current
__builtins__['_'] = stuff
(see: http://docs.python.org/ref/naming.html , and http://www.mail-archive.com/pypy-dev@codespeak.net/msg02678.html)
comment:3 by , 17 years ago
Summary: | _() is still mapped to gettext instead of ugettext after unicode merge → gettext/ngettext should no longer be available after unicode merge |
---|
Okay, it was pointed out to me that the implicit _()
function is supposed to go away according to #2920.
So in that case I'd say the point of this ticket is slightly broader: the non-unicode versions of the gettext functions make no sense in a unicode framework/application. So I'd suggest simply aliasing ugettext
to gettext
and ungettext
to ngettext
. Still providing the raw byte-string versions of those functions is prone to break the app in possibly unexpected ways: it just takes a translation using non-ASCII letters, and all the sudden you have a UnicodeDecodingError
.
I'm attaching a patch that does this and also simplifies the related code a bit. Of course, it still fixes the implicit _()
builtin until that is actually removed.
follow-up: 6 comment:4 by , 17 years ago
Summary: | gettext/ngettext should no longer be available after unicode merge → gettext/ngettext do not accept UTF-8 and Unicode input strings. |
---|---|
Triage Stage: | Unreviewed → Accepted |
This can't go in as it is.
The problem is return types: ugettext() returns a Unicode string, gettext() returns a UTF-8 encoded bytestring. This makes a difference in some places (e.g. __str__
and __unicode__
methods). I would also, in general, be reluctant to change the return types to differ from the similarly named Python standard library functions to avoid any confusion for users, but it turns out that "neatness" urge is supported by the functionality requirements.
Fixing #2920 removes the ability for people to accidentally shoot themselves in the foot on output. They will need to explicitly import one or other translation function. If we also ensure that gettext() and ugettext() can both accept UTF-8 bytestrings and Unicode strings, we avoid accidents on input as well, which I gather might have been the main motivation. That fixes all the main usability problems, as I understand it.
Are there other usability issues you were trying to fix?
(Changing the title to reflect the more refined problem we are fixing. Removing has_patch checkbox, since all the current patches are not appropriate.)
comment:5 by , 17 years ago
Has patch: | unset |
---|
comment:6 by , 17 years ago
Replying to mtredinnick:
This can't go in as it is.
The problem is return types: ugettext() returns a Unicode string, gettext() returns a UTF-8 encoded bytestring. This makes a difference in some places (e.g.
__str__
and__unicode__
methods).
Oh, I thought that after the unicode merge, throwing byte-strings around the framework was undesirable. If that is not the case, you can close this ticket.
However, I'd like to emphasize that I think allowing/encouraging the use of byte-strings is a bad idea and asking for trouble. Even if you make all Django code accept both UTF-8 byte-strings and unicode interchangeably, there are still tons of ways in normal Python usage where this will break left and right. I think I know about this stuff, and the problem that caused me file this ticket and write the patch really threw me off for a couple of hours (the fact that bytestrings are rendered as unicode representations on the debug screen did not help ;-).
In particular for translations, where most gettext()
calls start out returning just ASCII or unicode (basically the untranslated string), but when things actually get translated, all the sudden you have UTF-8 encoded bytestrings, which inevitably breaks things. Keeping the byte-string versions of the gettext functions available in a unicode application is not a feature, it's a design bug IMHO.
I would also, in general, be reluctant to change the return types to differ from the similarly named Python standard library functions to avoid any confusion for users, but it turns out that "neatness" urge is supported by the functionality requirements.
Note that the naming of the gettext functions in the standard library is due to it having to support both unicode and non-unicode applications. The actual gettext
C library doesn't actually have a uxxx
function for every other function. Personally, I see no problem with a framework defaulting to the unicode variants, and thus encouraging (or even forcing) the user to do the right thing. (I suspect we both know how many developers are ignorant about unicode and encodings.)
Fixing #2920 removes the ability for people to accidentally shoot themselves in the foot on output. They will need to explicitly import one or other translation function. If we also ensure that gettext() and ugettext() can both accept UTF-8 bytestrings and Unicode strings, we avoid accidents on input as well, which I gather might have been the main motivation. That fixes all the main usability problems, as I understand it.
I actually don't care much about the input of the gettext functions, as I use ASCII for that for better translation tooling interop. It definitely isn't the point I intended to make with this ticket.
comment:7 by , 17 years ago
Resolution: | → invalid |
---|---|
Status: | new → closed |
Allowing bytestrings is a useful feature and is there by design. It means that code which uses only ASCII requires essentially no porting, for example.
I'll probably fix the input case at some point, but since it's not the point of this ticket, I'll close it. Thanks for putting the necessary thinking into the changes, but we do want to keep bytestring support; it's not pragmatic to insist everybody always uses Unicode.
comment:8 by , 17 years ago
Resolution: | invalid |
---|---|
Status: | closed → reopened |
comment:9 by , 17 years ago
Resolution: | → wontfix |
---|---|
Status: | reopened → closed |
Map _() to ugettext