Opened 17 years ago
Closed 17 years ago
#4662 closed (fixed)
[unicode] truncate_html_words doesn't work for non-latin characters
Reported by: | Owned by: | Jacob | |
---|---|---|---|
Component: | Uncategorized | Version: | unicode |
Severity: | Keywords: | ||
Cc: | Triage Stage: | Ready for checkin | |
Has patch: | yes | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
django.utils.text.truncate_html_words (and, for that matter, the 'truncatewords_html' filter) uses a regular expression to find words. The pattern is "[A-Za-z0-9]", which doesn't match non-Latin words. It should be replaced with "\w" and compiled with the re.UNICODE flag.
Patch follows. There is a small backwards-incompatible change in behavior: "\w" matches underscores while "[A-Za-z0-9]" doesn't, so an isolated "_" is now counted as a word. I believe this is not very bad, since isolated underscores are rare in human-oriented text.
P.S. This whole thing makes sense only for the unicode branch, since regular expressions in unicode mode don't work correctly for byte strings anyway.
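To make the difference concrete, here is a minimal sketch comparing the two character classes. The pattern names and sample text are made up for illustration and are not taken from the Django source or the attached patch; the snippet is written for Python 3, where str patterns are already Unicode-aware, so the re.UNICODE flag is kept only to mirror the proposal.

```python
import re

# Hypothetical stand-ins for the old and proposed word patterns.
ascii_word = re.compile(r"[A-Za-z0-9]+")
unicode_word = re.compile(r"\w+", re.UNICODE)

text = "Привет мир hello_world _"

# The ASCII-only class skips the Cyrillic words and splits on the underscore.
ascii_matches = ascii_word.findall(text)
# \w finds the Cyrillic words, but also treats "_" as a word character.
unicode_matches = unicode_word.findall(text)

print(ascii_matches)    # ['hello', 'world']
print(unicode_matches)  # ['Привет', 'мир', 'hello_world', '_']
```

The last match in the second list is the backwards-incompatible case mentioned above: an isolated underscore is counted as a word.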
Attachments (1)
Change History (5)
comment:1 by , 17 years ago
Comment from my original patch regarding unicode: http://code.djangoproject.com/ticket/2027#comment:6
Do we need to compile with re.UNICODE?
If you were worried about backwards compatibility, then you could use [^\W_], but I don't really think that's necessary ;P
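For reference, a short sketch of how the suggested [^\W_] class behaves. It matches any character that is a word character but not an underscore, so it keeps the old treatment of "_" while still accepting non-Latin letters. The pattern name and sample text are hypothetical, and the snippet assumes Python 3.

```python
import re

# "Not a non-word character and not an underscore": letters and digits only.
strict_word = re.compile(r"[^\W_]+", re.UNICODE)

strict_matches = strict_word.findall("Привет _ foo_bar 42")

print(strict_matches)  # ['Привет', 'foo', 'bar', '42']
```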
comment:2 by , 17 years ago
Do we need to compile with re.UNICODE?
Yes, because otherwise \w means only [A-Za-z0-9_]. re.UNICODE makes re consult the Unicode character database to decide what counts as a letter.
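The effect of the flag can be sketched as follows. This assumes Python 3, where str patterns use Unicode matching by default and re.ASCII restores the ASCII-only meaning of \w described above; at the time of this ticket (Python 2), the ASCII-only behavior was the default and re.UNICODE had to be passed explicitly.

```python
import re

text = "слово word"

# ASCII-only \w, i.e. [A-Za-z0-9_]: the Cyrillic word is not matched.
ascii_only = re.findall(r"\w+", text, re.ASCII)
# Unicode-aware \w: letters are taken from the Unicode character database.
unicode_aware = re.findall(r"\w+", text, re.UNICODE)

print(ascii_only)     # ['word']
print(unicode_aware)  # ['слово', 'word']
```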
comment:3 by , 17 years ago
Triage Stage: | Unreviewed → Ready for checkin |
---|
Oh, I actually missed the fact you were already compiling with re.U. :)
comment:4 by , 17 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
Patch