#4662 closed (fixed)

[unicode] truncate_html_words doesn't work for non-latin characters

Reported by:	Ivan Sagalaev <Maniac@…>	Owned by:	Jacob
Component:	Uncategorized	Version:	unicode
Severity:		Keywords:
Cc:		Triage Stage:	Ready for checkin
Has patch:	yes	Needs documentation:	no
Needs tests:	no	Patch needs improvement:	no
Easy pickings:	no	UI/UX:	no

Description

django.utils.text.truncate_html_words (and, for that matter, the 'truncatewords_html' filter) uses a regular expression to find words. The pattern is "[A-Za-z0-9]", which doesn't match non-latin words. It should be replaced with "\w" and compiled with the re.UNICODE flag.

Patch follows. There is a small backwards-incompatible change in behavior: "\w" matches underscores while "[A-Za-z0-9]" doesn't, so an isolated "_" is now counted as a word. I believe this is not very bad since isolated underscores are rare in human-oriented texts.

P.S. This whole thing makes sense only for the unicode branch, since regular expressions in unicode mode don't work correctly for byte strings anyway.
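
For illustration, here is a minimal sketch of the difference between the two character classes. The patterns are simplified stand-ins (the "+" quantifier and the names ascii_word and unicode_word are assumptions, not the actual regex in django.utils.text), and the snippet targets Python 3, where re.UNICODE is already the default for str patterns and is spelled out only to mirror the proposal.

```python
import re

# Hypothetical, simplified word patterns; not the real Django regex.
ascii_word = re.compile(r"[A-Za-z0-9]+")        # old behaviour: ASCII letters and digits only
unicode_word = re.compile(r"\w+", re.UNICODE)   # proposed: any Unicode word character

text = "Привет мир hello_world _"

ascii_matches = ascii_word.findall(text)
unicode_matches = unicode_word.findall(text)

print(ascii_matches)    # Cyrillic words are skipped; the underscore splits hello/world
print(unicode_matches)  # Cyrillic words are found; "_" counts as a word character
```

With the ASCII class the Cyrillic words are not seen at all, while "\w" finds them but also treats the lone "_" as a word, which is the incompatibility mentioned above.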

Attachments (1)

4662.diff (589 bytes) - added by Ivan Sagalaev <Maniac@…> 17 years ago: Patch


Change History (5)

by Ivan Sagalaev <Maniac@…>, 17 years ago

Attachment:	4662.diff added

Patch

comment:1 by Chris Beaven, 17 years ago

Comment from my original patch regarding unicode: http://code.djangoproject.com/ticket/2027#comment:6

Do we need to compile with re.UNICODE?

If you were worried about backwards compatibility, then you could use [^\W_], but I don't really think that's necessary ;P
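
A quick sketch of what the [^\W_] alternative would do, assuming Python 3 and simplified patterns (the "+" quantifier and the variable names are illustrative, not taken from the patch): the class matches every word character except the underscore.

```python
import re

# Illustrative patterns only; the actual Django regex is more involved.
word_with_underscore = re.compile(r"\w+", re.UNICODE)
word_without_underscore = re.compile(r"[^\W_]+", re.UNICODE)  # "\w" minus "_"

sample = "слово _ snake_case"

with_us = word_with_underscore.findall(sample)
without_us = word_without_underscore.findall(sample)

print(with_us)     # the isolated "_" is counted as a word
print(without_us)  # "_" is never part of a word, as with the old ASCII class
```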

comment:2 by Ivan Sagalaev <Maniac@…>, 17 years ago

Do we need to compile with re.UNICODE?

Yes, because otherwise \w means only [A-Za-z0-9_]. re.UNICODE makes re use the Unicode character database to decide what counts as a "letter".
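
To see the effect of the flag in isolation, here is a small sketch. It assumes Python 3, where str patterns are Unicode-aware by default, so re.ASCII is used to reproduce the no-flag behaviour described in this comment; the variable names are made up for the example.

```python
import re

text = "тест test_1"

# re.ASCII restricts \w to [A-Za-z0-9_], like the old default without re.UNICODE.
ascii_only = re.findall(r"\w+", text, re.ASCII)

# re.UNICODE makes \w follow the Unicode character database.
unicode_aware = re.findall(r"\w+", text, re.UNICODE)

print(ascii_only)
print(unicode_aware)
```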

comment:3 by Chris Beaven, 17 years ago

Triage Stage:	Unreviewed → Ready for checkin

Oh, I actually missed the fact you were already compiling with re.U. :)

comment:4 by Malcolm Tredinnick, 17 years ago

Resolution:	→ fixed
Status:	new → closed

(In [5533]) unicode: Fixed #4662 -- Fixed a remaining ASCII assumption in
truncatewords_html(). Thanks, Ivan Sagalaev.
