Opened 18 years ago

Closed 18 years ago

#4662 closed (fixed)

[unicode] truncate_html_words doesn't work for non-latin characters

Reported by: Ivan Sagalaev <Maniac@…>
Owned by: Jacob
Component: Uncategorized
Version: unicode
Severity:
Keywords:
Cc:
Triage Stage: Ready for checkin
Has patch: yes
Needs documentation: no
Needs tests: no
Patch needs improvement: no
Easy pickings: no
UI/UX: no

Description

django.utils.text.truncate_html_words (and, for that matter, the 'truncatewords_html' filter) uses a regular expression to find words. The pattern is "[A-Za-z0-9]", which doesn't match non-Latin words. It should be replaced with "\w" and compiled with the re.UNICODE flag.

Patch follows. There is a small backwards-incompatible change in behavior: "\w" matches underscores while "[A-Za-z0-9]" doesn't, so an isolated "_" is now counted as a word. I believe this is not very bad since isolated underscores are rare in human-oriented texts.

P.S. This whole thing makes sense only for the unicode branch, since regular expressions in unicode mode don't work correctly for byte strings anyway.
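To illustrate the difference described above, here is a minimal sketch that counts words with each pattern. It is not the code from the patch: the names ASCII_WORD, UNICODE_WORD and count_words are made up for this example, and it assumes Python 3, where re.UNICODE is already the default for str patterns (the flag is spelled out only to mirror the ticket).

```python
import re

# Illustrative patterns only; these are not copied from Django's source.
ASCII_WORD = re.compile(r"[A-Za-z0-9]+")
UNICODE_WORD = re.compile(r"\w+", re.UNICODE)


def count_words(pattern, text):
    """Return how many non-overlapping matches of `pattern` occur in `text`."""
    return len(pattern.findall(text))


russian = "Привет мир"   # two Cyrillic words
mixed = "hello _ world"  # an isolated underscore between two words

ascii_russian = count_words(ASCII_WORD, russian)      # ASCII class finds no words
unicode_russian = count_words(UNICODE_WORD, russian)  # \w finds both words
ascii_mixed = count_words(ASCII_WORD, mixed)          # underscore is ignored
unicode_mixed = count_words(UNICODE_WORD, mixed)      # underscore counts as a word
```

The last two values show the backwards-incompatible part: the isolated "_" is invisible to the old pattern but is counted as a word by "\w".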

Change History (5)

by Ivan Sagalaev <Maniac@…>, 18 years ago

Attachment: 4662.diff added

Patch

comment:1 by Chris Beaven, 18 years ago

Comment from my original patch regarding unicode: http://code.djangoproject.com/ticket/2027#comment:6

Do we need to compile with re.UNICODE?

If you were worried about backwards compatibility, then you could use [^\W_], but I don't really think that's necessary ;P
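For reference, a small sketch of how the [^\W_] alternative would behave. The name STRICT_WORD and the sample string are made up for this example, and it assumes Python 3 str patterns rather than the unicode branch itself.

```python
import re

# "[^\W_]" reads as "any word character except the underscore".
# STRICT_WORD is a hypothetical name used only for this illustration.
STRICT_WORD = re.compile(r"[^\W_]+", re.UNICODE)

strict_matches = STRICT_WORD.findall("Привет _ snake_case мир")
# The isolated "_" is skipped, and "snake_case" is split at the underscore,
# which is also how the original "[A-Za-z0-9]" pattern would treat it.
```

So this variant keeps the old treatment of underscores while still matching non-Latin letters.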

comment:2 by Ivan Sagalaev <Maniac@…>, 18 years ago

> Do we need to compile with re.UNICODE?

Yes. Because otherwise \w means only [A-Za-z0-9_]. re.UNICODE switches re to use the Unicode database to decide what counts as a "letter".
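A quick way to see the effect of the flag, as a sketch. It assumes Python 3, where str patterns are Unicode-aware by default, so re.ASCII is used here to reproduce the "\w means only [A-Za-z0-9_]" behavior described in this comment.

```python
import re

# "naïve café", written with escapes so the accented letters are
# guaranteed to be single precomposed code points.
text = "na\u00efve caf\u00e9"

# With re.ASCII, \w is limited to [A-Za-z0-9_], so accented letters break words.
ascii_only = re.findall(r"\w+", text, re.ASCII)

# With re.UNICODE, \w consults the Unicode database and keeps the words whole.
unicode_aware = re.findall(r"\w+", text, re.UNICODE)
```

With the ASCII meaning of \w the two words come back as three fragments; with the Unicode meaning they come back intact.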

comment:3 by Chris Beaven, 18 years ago

Triage Stage: Unreviewed → Ready for checkin

Oh, I actually missed the fact you were already compiling with re.U. :)

comment:4 by Malcolm Tredinnick, 18 years ago

Resolution: fixed
Status: new → closed

(In [5533]) unicode: Fixed #4662 -- Fixed a remaining ASCII assumption in
truncatewords_html(). Thanks, Ivan Sagalaev.
