Opened 18 years ago
Closed 18 years ago
#4662 closed (fixed)
[unicode] truncate_html_words doesn't work for non-latin characters
Description
django.utils.text.truncate_html_words (and, for that matter, the 'truncatewords_html' filter) uses a regular expression to find words. The pattern is "[A-Za-z0-9]", which doesn't match non-Latin words. It should be replaced with "\w" and compiled with the re.UNICODE flag.
Patch follows. There is a small backwards-incompatible change in behavior: "\w" matches underscores while "[A-Za-z0-9]" doesn't, so an isolated "_" is now counted as a word. I believe this is not very bad, since isolated underscores are rare in human-oriented texts.
P.S. This whole thing makes sense only for the unicode branch, since regular expressions in unicode mode don't work correctly for byte strings anyway.
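To illustrate the difference, here is a minimal sketch, assuming Python 3 and simplified word-only patterns (the real pattern in truncate_html_words also has to deal with tags, which is left out here). The names OLD_WORD, NEW_WORD and count_words are hypothetical and are not taken from the patch.

```python
import re

# Simplified stand-in for the pattern described above: ASCII letters and digits only.
OLD_WORD = re.compile(r"[A-Za-z0-9]+")
# Simplified stand-in for the proposed pattern: \w compiled with re.UNICODE.
NEW_WORD = re.compile(r"\w+", re.UNICODE)


def count_words(pattern, text):
    """Hypothetical helper: how many 'words' the pattern finds in text."""
    return len(pattern.findall(text))


russian = "\u041f\u0440\u0438\u0432\u0435\u0442 \u043c\u0438\u0440"  # two Cyrillic words

print(count_words(OLD_WORD, russian))  # 0: no Cyrillic letter is in [A-Za-z0-9]
print(count_words(NEW_WORD, russian))  # 2

# The backwards-incompatible part: an isolated underscore now counts as a word.
print(OLD_WORD.findall("a _ b"))  # ['a', 'b']
print(NEW_WORD.findall("a _ b"))  # ['a', '_', 'b']
```

With the old pattern a purely non-Latin text contains no words at all, so there is nothing for the truncation to count.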
Change History (5)
comment:1 by , 18 years ago
Comment from my original patch regarding unicode: http://code.djangoproject.com/ticket/2027#comment:6
Do we need to compile with re.UNICODE?
If you were worried about backwards compatibility, then you could use [^\W_], but I don't really think that's necessary ;P
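For reference, a minimal sketch of how the [^\W_] alternative behaves, assuming Python 3. The class matches every word character except the underscore, so the old treatment of "_" is kept. The name WORD_NO_UNDERSCORE is hypothetical.

```python
import re

# [^\W_] reads as "not a non-word character and not an underscore",
# i.e. any Unicode letter or digit, but never "_".
WORD_NO_UNDERSCORE = re.compile(r"[^\W_]+", re.UNICODE)

# An isolated underscore is skipped, non-Latin words are still found.
print(WORD_NO_UNDERSCORE.findall("foo _ \u0431\u0430\u0440"))  # two words: 'foo' and the Cyrillic one

# Side effect: underscores inside identifiers split them, as [A-Za-z0-9] did.
print(WORD_NO_UNDERSCORE.findall("snake_case"))  # ['snake', 'case']
```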
comment:2 by , 18 years ago
Do we need to compile with re.UNICODE?
Yes, because otherwise \w means only [A-Za-z0-9_]. re.UNICODE makes re consult the Unicode character database to decide what counts as a "letter".
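A small sketch of the flag's effect, with one caveat: on Python 3, str patterns are Unicode-aware by default, so re.UNICODE is redundant there and the ASCII-only meaning of \w has to be requested with re.ASCII. The re.ASCII call below is used as a stand-in for the no-flag behavior discussed in this ticket; that mapping is an assumption of the example.

```python
import re

text = "na\u00efve caf\u00e9"  # two words containing accented letters

# ASCII-only \w, i.e. [A-Za-z0-9_]: accented letters break the words apart.
ascii_words = re.findall(r"\w+", text, re.ASCII)
# Unicode-aware \w: accented letters count as letters.
unicode_words = re.findall(r"\w+", text, re.UNICODE)

print(ascii_words)    # ['na', 've', 'caf']
print(unicode_words)  # both words intact
```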
comment:3 by , 18 years ago
Triage Stage: Unreviewed → Ready for checkin
Oh, I actually missed the fact you were already compiling with re.U. :)
comment:4 by , 18 years ago
Resolution: → fixed
Status: new → closed
Patch