Opened 18 years ago

Closed 18 years ago

Last modified 17 years ago

#924 closed defect (duplicate)

[patch] String filters (lower, upper, capfirst etc.) don't work with international strings

Reported by: Maniac <Maniac@…> Owned by: hugo
Component: contrib.admin Version:
Severity: normal Keywords:
Cc: Triage Stage: Unreviewed
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

String filters operate on 8-bit strigns but Python's useful string functions only work with international characters in unicode strings. UTF8ed strings are not changed.

Patch follows.

This, however, will break the patch in ticket 393 where it assumes a string passed to a filter to be 8-bit str. But I beleive this whole thing should be unicode.

Attachments (5)

924.1.diff (608 bytes ) - added by Maniac <Maniac@…> 18 years ago.
Patch: operate on unicode strings
924.2.diff (3.4 KB ) - added by Maniac <Maniac@…> 18 years ago.
Patch: convert str-unicode-str where needed
924.3.diff (3.5 KB ) - added by Maniac <Maniac@…> 18 years ago.
Updated to current trunk (was broken by fixing #1110)
924.4.diff (3.3 KB ) - added by Maniac <Maniac@…> 18 years ago.
Patch for magic-removal branch
924.5.diff (3.3 KB ) - added by Maniac@… 18 years ago.
Updated to current trunk

Download all attachments as: .zip

Change History (24)

by Maniac <Maniac@…>, 18 years ago

Attachment: 924.1.diff added

Patch: operate on unicode strings

comment:1 by hugo, 18 years ago

Actually this will break much more, I think. We should decide on wether Django should use byte strings or should use unicode strings internally (I would really prefer to change to unicode-only before 1.0, because that's a much saner environment internally than the current utf-8 bytestrings) and then do it completely for the whole system. With your patch programmers would have to keep in mind where there code is running - is it in a filter, it sees unicode, is it outside, it sees utf-8 strings.

So for the filters, I would say that if a filter needs the unicode handling for it's string function to work correctly, it should itself convert to unicode and convert back - at least for so long as we keep Django running on utf-8 bytestrings.

comment:2 by Maniac <Maniac@…>, 18 years ago

I can see that it would break something only if people make custom filters that do explicit str(). But it will be broken anyway in the long run as you suggest unicodisation. So why not earlier? Also I think that such thing can (and should) be made step-by-step instead of making it in the form of a huge patch, but this is a personal preference :-)

Considering you proposal to change specific functions... This means that this rather badly readable conversion will be in 16 places.

comment:3 by Maniac <Maniac@…>, 18 years ago

BTW... I should confess that I never run any test because I just don't know how :-(. Could someone point me to a doc or describe shortly how to run Django's tests?

comment:4 by hugo, 18 years ago

The breakage is somewhere else: you will call code in your filter code. Not all filters are necessarily just using local code. If you call code from outside, that code - if it is Django code - will currently assume that strings are utf-8 encoded bytestrings. But your filters will handle unicode strings, so you will either have to do conversion yourself (so you could have done them in the first place already instead of changing the filter interface) or you will pass them on unchanged and then code outside will break.

A -1 from me on this change. Unicode vs. utf-8 needs to be handled consistently and the current - mostly consistent - way is to have utf-8 bytestrings and to convert to unicode yourself. It's not as if every filter _requires_ unicode strings, it's only those that use functions that themselves are unicode aware, so in the current situation it would be "the right thing to do"[tm] to add the conversions and back-conversions to exactly those filters.

The other thing - possible unicodeization - would first need a round of discussion on the list and quite some talking over, because it's a major step. And if it is done, it should be done _complete_. But that's most definitely outside the scope of this ticket ;-)

Running tests: you need a settings file that has the database set (and it should be a specific testing database, as tables are created content added and removed) and then you can go to django_src/tests and do ./runtests with that settings file active.

comment:5 by Maniac <Maniac@…>, 18 years ago

I got your point. My thought were that if a filter accepts a str and uses str's methods then it will be happy with unicode also since these types are very similar by interface. But I haven't really looked if some filter does something specific to str...

Ok. I'll redo the patch with a decorator converting from and to DEFAULT_CHARSET and hook it to every string function that needs it.

comment:6 by Maniac <Maniac@…>, 18 years ago

Ouch... I really should have looked more closely before. There are plenty of explicit str(value) all over the filters code. Then it won't be that fast and simple :-)

comment:7 by Maniac <Maniac@…>, 18 years ago

BTW, string functions do everywhere str(value) whereas it would be better, imho, to do repr(value). That way one could use string filters on model instances in templates. Is there a reason for str() here?

comment:8 by Tom Tobin <korpios@…>, 18 years ago

Maniac, see #710; models are supposed to use __str__ for their string values now, and __repr__ goes back to being a proper representation useful for, e.g., debugging. (The old way will still work, since str() will just grab the value of __repr__ if there is no __str__.)

by Maniac <Maniac@…>, 18 years ago

Attachment: 924.2.diff added

Patch: convert str-unicode-str where needed

comment:9 by Maniac <Maniac@…>, 18 years ago

Here's the patch as discussed with Hugo.

Just one glitch. There is a filter 'title' which uses regexp ([a-z])'([A-Z]). Does anyone knows if it can be internationalised? I mean, are there general 'capital letter' and 'small letter' in Python regexp?

comment:10 by Maniac <Maniac@…>, 18 years ago

Forgot to add... The patch is real and working. Mentioned glitch is not breaking anything it just doesn't work for non-ascii as it never was.

comment:11 by Maniac <Maniac@…>, 18 years ago

Summary: String filters (lower, upper, capfirst etc.) don't work with international strings[patch] String filters (lower, upper, capfirst etc.) don't work with international strings

Just read the FAQ. Adding [patch] to summary. Sorry for spamming :-(

comment:12 by Maniac <Maniac@…>, 18 years ago

Uhm... Not wanting to be annoying, just need to clarify... This ticket was inactive for some time, is it just lack of time or there's something broken in my patch that should be fixed?

comment:13 by hugo, 18 years ago

Owner: changed from Adrian Holovaty to hugo

by Maniac <Maniac@…>, 18 years ago

Attachment: 924.3.diff added

Updated to current trunk (was broken by fixing #1110)

comment:14 by hugo, 18 years ago

There is an open discussion on wether we switch to full unicode - so this patch won't be needed. Either this or a full unicodefication should go in.

comment:15 by Maniac <Maniac@…>, 18 years ago

Hugo, you said in this very ticket earlier that unicodization is out of scope of it. And this it's true because that thing is huge. Then I made this small patch that completely solves one specific problem that exists today (people already do step on it as seen in django-users). Commiting it won't hurt anyone and will make the world a little better until the bright unicode future comes. Isn't it? :-)

comment:16 by hugo, 18 years ago

Actually unicodefication is out of the scope of this _ticket_, but not so huge as not to do it :-)

At the moment unicodefication is under analysis - if the result is that we will do it, it will definitely be something in the near future. So it wouldn't make sense to apply this to later rip it out again. If the result is to not do it, I am all for adding this as an alternative. But please let us first analyse wether to do unicodefication or not, and if doing it, on the time frame in which to do it.

by Maniac <Maniac@…>, 18 years ago

Attachment: 924.4.diff added

Patch for magic-removal branch

comment:17 by Andrey <aela@…>, 18 years ago

Unicodification is a Good Thing and even Python-3000 is going to make all strings unicode. Better do it now guys, or you are risking to get buried in bug reports some day. =)

comment:18 by Adrian Holovaty, 18 years ago

Resolution: duplicate
Status: newclosed

Closing in favor of #2489.

by Maniac@…, 18 years ago

Attachment: 924.5.diff added

Updated to current trunk

comment:19 by Chris Beaven, 17 years ago

#393 fixes this too.

Note: See TracTickets for help on using tickets.
Back to Top