[unicode] truncate_html_words doesn't work for non-latin characters
|Reported by:||Ivan Sagalaev <Maniac@…>||Owned by:||jacob|
|Cc:||Triage Stage:||Ready for checkin|
|Has patch:||yes||Needs documentation:||no|
|Needs tests:||no||Patch needs improvement:||no|
django.utils.text.truncate_html_words (and, for that matter, the 'truncatewords_html' filter) uses a regular expression to find words. The pattern is "[A-Za-z0-9]", which doesn't match non-Latin words. It should be replaced with "\w" and compiled with the re.UNICODE flag.
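To illustrate the difference between the two character classes, here is a minimal sketch. The two patterns are simplified stand-ins, not the actual expression in django.utils.text, and the sample string is made up. It assumes Python 3, where "\w" on str patterns is already Unicode-aware; re.UNICODE is passed only to mirror the proposal.

```python
import re

# Simplified stand-ins for the word-matching part of the pattern;
# the real expression in django.utils.text is not reproduced here.
ascii_words = re.compile(r"[A-Za-z0-9]+")
unicode_words = re.compile(r"\w+", re.UNICODE)

# Hypothetical sample text mixing Cyrillic and Latin words.
text = "Привет мир hello"

ascii_found = ascii_words.findall(text)
unicode_found = unicode_words.findall(text)

print(ascii_found)    # ['hello']
print(unicode_found)  # ['Привет', 'мир', 'hello']
```

With the ASCII-only class the Cyrillic words are never seen as words, so a word-counting truncation would skip over them.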
Patch follows. There is a small backwards-incompatible change in behavior: "\w" matches underscores while "[A-Za-z0-9]" doesn't, so an isolated "_" is now counted as a word. I believe this is not very bad, since isolated underscores are rare in human-oriented texts.
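The underscore difference can be sketched the same way. Again, both patterns are simplified stand-ins chosen for illustration rather than the patched expression itself, and the sample string is hypothetical.

```python
import re

# Simplified stand-ins (assumption): old ASCII-only class vs. proposed "\w".
old_pattern = re.compile(r"[A-Za-z0-9]+")
new_pattern = re.compile(r"\w+", re.UNICODE)

# An isolated underscore between two ordinary words.
sample = "one _ two"

old_count = len(old_pattern.findall(sample))
new_count = len(new_pattern.findall(sample))

print(old_count)  # 2: the underscore is ignored
print(new_count)  # 3: the underscore counts as a word
```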
P.S. This whole thing makes sense only for the unicode branch, since regular expressions in Unicode mode don't work correctly on byte strings anyway.
Change History (5)
Changed 7 years ago by Ivan Sagalaev <Maniac@…>
comment:1 Changed 7 years ago by SmileyChris
- Needs documentation unset
- Needs tests unset
- Patch needs improvement unset
comment:3 Changed 7 years ago by SmileyChris
- Triage Stage changed from Unreviewed to Ready for checkin