Code

Opened 8 years ago

Closed 8 years ago

Last modified 7 years ago

#924 closed defect (duplicate)

[patch] String filters (lower, upper, capfirst etc.) don't work with international strings

Reported by: Maniac <Maniac@…> Owned by: hugo
Component: contrib.admin Version:
Severity: normal Keywords:
Cc: Triage Stage: Unreviewed
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: UI/UX:

Description

String filters operate on 8-bit strigns but Python's useful string functions only work with international characters in unicode strings. UTF8ed strings are not changed.

Patch follows.

This, however, will break the patch in ticket 393 where it assumes a string passed to a filter to be 8-bit str. But I beleive this whole thing should be unicode.

Attachments (5)

924.1.diff (608 bytes) - added by Maniac <Maniac@…> 8 years ago.
Patch: operate on unicode strings
924.2.diff (3.4 KB) - added by Maniac <Maniac@…> 8 years ago.
Patch: convert str-unicode-str where needed
924.3.diff (3.5 KB) - added by Maniac <Maniac@…> 8 years ago.
Updated to current trunk (was broken by fixing #1110)
924.4.diff (3.3 KB) - added by Maniac <Maniac@…> 8 years ago.
Patch for magic-removal branch
924.5.diff (3.3 KB) - added by Maniac@… 8 years ago.
Updated to current trunk

Download all attachments as: .zip

Change History (24)

Changed 8 years ago by Maniac <Maniac@…>

Patch: operate on unicode strings

comment:1 Changed 8 years ago by hugo

Actually this will break much more, I think. We should decide on wether Django should use byte strings or should use unicode strings internally (I would really prefer to change to unicode-only before 1.0, because that's a much saner environment internally than the current utf-8 bytestrings) and then do it completely for the whole system. With your patch programmers would have to keep in mind where there code is running - is it in a filter, it sees unicode, is it outside, it sees utf-8 strings.

So for the filters, I would say that if a filter needs the unicode handling for it's string function to work correctly, it should itself convert to unicode and convert back - at least for so long as we keep Django running on utf-8 bytestrings.

comment:2 Changed 8 years ago by Maniac <Maniac@…>

I can see that it would break something only if people make custom filters that do explicit str(). But it will be broken anyway in the long run as you suggest unicodisation. So why not earlier? Also I think that such thing can (and should) be made step-by-step instead of making it in the form of a huge patch, but this is a personal preference :-)

Considering you proposal to change specific functions... This means that this rather badly readable conversion will be in 16 places.

comment:3 Changed 8 years ago by Maniac <Maniac@…>

BTW... I should confess that I never run any test because I just don't know how :-(. Could someone point me to a doc or describe shortly how to run Django's tests?

comment:4 Changed 8 years ago by hugo

The breakage is somewhere else: you will call code in your filter code. Not all filters are necessarily just using local code. If you call code from outside, that code - if it is Django code - will currently assume that strings are utf-8 encoded bytestrings. But your filters will handle unicode strings, so you will either have to do conversion yourself (so you could have done them in the first place already instead of changing the filter interface) or you will pass them on unchanged and then code outside will break.

A -1 from me on this change. Unicode vs. utf-8 needs to be handled consistently and the current - mostly consistent - way is to have utf-8 bytestrings and to convert to unicode yourself. It's not as if every filter _requires_ unicode strings, it's only those that use functions that themselves are unicode aware, so in the current situation it would be "the right thing to do"[tm] to add the conversions and back-conversions to exactly those filters.

The other thing - possible unicodeization - would first need a round of discussion on the list and quite some talking over, because it's a major step. And if it is done, it should be done _complete_. But that's most definitely outside the scope of this ticket ;-)

Running tests: you need a settings file that has the database set (and it should be a specific testing database, as tables are created content added and removed) and then you can go to django_src/tests and do ./runtests with that settings file active.

comment:5 Changed 8 years ago by Maniac <Maniac@…>

I got your point. My thought were that if a filter accepts a str and uses str's methods then it will be happy with unicode also since these types are very similar by interface. But I haven't really looked if some filter does something specific to str...

Ok. I'll redo the patch with a decorator converting from and to DEFAULT_CHARSET and hook it to every string function that needs it.

comment:6 Changed 8 years ago by Maniac <Maniac@…>

Ouch... I really should have looked more closely before. There are plenty of explicit str(value) all over the filters code. Then it won't be that fast and simple :-)

comment:7 Changed 8 years ago by Maniac <Maniac@…>

BTW, string functions do everywhere str(value) whereas it would be better, imho, to do repr(value). That way one could use string filters on model instances in templates. Is there a reason for str() here?

comment:8 Changed 8 years ago by Tom Tobin <korpios@…>

Maniac, see #710; models are supposed to use __str__ for their string values now, and __repr__ goes back to being a proper representation useful for, e.g., debugging. (The old way will still work, since str() will just grab the value of __repr__ if there is no __str__.)

Changed 8 years ago by Maniac <Maniac@…>

Patch: convert str-unicode-str where needed

comment:9 Changed 8 years ago by Maniac <Maniac@…>

Here's the patch as discussed with Hugo.

Just one glitch. There is a filter 'title' which uses regexp ([a-z])'([A-Z]). Does anyone knows if it can be internationalised? I mean, are there general 'capital letter' and 'small letter' in Python regexp?

comment:10 Changed 8 years ago by Maniac <Maniac@…>

Forgot to add... The patch is real and working. Mentioned glitch is not breaking anything it just doesn't work for non-ascii as it never was.

comment:11 Changed 8 years ago by Maniac <Maniac@…>

  • Summary changed from String filters (lower, upper, capfirst etc.) don't work with international strings to [patch] String filters (lower, upper, capfirst etc.) don't work with international strings

Just read the FAQ. Adding [patch] to summary. Sorry for spamming :-(

comment:12 Changed 8 years ago by Maniac <Maniac@…>

Uhm... Not wanting to be annoying, just need to clarify... This ticket was inactive for some time, is it just lack of time or there's something broken in my patch that should be fixed?

comment:13 Changed 8 years ago by hugo

  • Owner changed from adrian to hugo

Changed 8 years ago by Maniac <Maniac@…>

Updated to current trunk (was broken by fixing #1110)

comment:14 Changed 8 years ago by hugo

There is an open discussion on wether we switch to full unicode - so this patch won't be needed. Either this or a full unicodefication should go in.

comment:15 Changed 8 years ago by Maniac <Maniac@…>

Hugo, you said in this very ticket earlier that unicodization is out of scope of it. And this it's true because that thing is huge. Then I made this small patch that completely solves one specific problem that exists today (people already do step on it as seen in django-users). Commiting it won't hurt anyone and will make the world a little better until the bright unicode future comes. Isn't it? :-)

comment:16 Changed 8 years ago by hugo

Actually unicodefication is out of the scope of this _ticket_, but not so huge as not to do it :-)

At the moment unicodefication is under analysis - if the result is that we will do it, it will definitely be something in the near future. So it wouldn't make sense to apply this to later rip it out again. If the result is to not do it, I am all for adding this as an alternative. But please let us first analyse wether to do unicodefication or not, and if doing it, on the time frame in which to do it.

Changed 8 years ago by Maniac <Maniac@…>

Patch for magic-removal branch

comment:17 Changed 8 years ago by Andrey <aela@…>

Unicodification is a Good Thing and even Python-3000 is going to make all strings unicode. Better do it now guys, or you are risking to get buried in bug reports some day. =)

comment:18 Changed 8 years ago by adrian

  • Resolution set to duplicate
  • Status changed from new to closed

Closing in favor of #2489.

Changed 8 years ago by Maniac@…

Updated to current trunk

comment:19 Changed 7 years ago by SmileyChris

#393 fixes this too.

Add Comment

Modify Ticket

Change Properties
<Author field>
Action
as closed
as The resolution will be set. Next status will be 'closed'
The resolution will be deleted. Next status will be 'new'
Author


E-mail address and user name can be saved in the Preferences.

 
Note: See TracTickets for help on using tickets.