id	summary	reporter	owner	description	type	status	component	version	severity	resolution	keywords	cc	stage	has_patch	needs_docs	needs_tests	needs_better_patch	easy	ui_ux
4662	[unicode] truncate_html_words doesn't work for non-latin characters	Ivan Sagalaev <Maniac@…>	Jacob	"django.utils.text.truncate_html_words (and, for that matter, the 'truncatewords_html' filter) uses a regular expression to find words. The pattern is ""[A-Za-z0-9]"", which doesn't match non-latin words. It should be replaced with ""\w"" and compiled with the re.UNICODE flag.
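
A minimal sketch of the mismatch, assuming Python 3 (where str is already Unicode and re.UNICODE is the default for str patterns, passed here only to mirror the proposal). The two patterns are simplified stand-ins, not the actual regex in django.utils.text, which also deals with tags and entities:

```python
import re

# Hypothetical, simplified word patterns for illustration only.
ascii_word = re.compile(r'[A-Za-z0-9]+')        # current character class
unicode_word = re.compile(r'\w+', re.UNICODE)   # proposed replacement

text = 'Привет мир hello'

# The ASCII class skips the Cyrillic words entirely.
ascii_matches = ascii_word.findall(text)
# \w in Unicode mode matches letters from any script.
unicode_matches = unicode_word.findall(text)

print(ascii_matches)    # ['hello']
print(unicode_matches)  # ['Привет', 'мир', 'hello']
```

With the ASCII-only class the two Cyrillic words are never counted, so truncating to N words cuts the text in the wrong place.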

Patch follows. There is a small backwards-incompatible change in behavior: ""\w"" matches underscores while ""[A-Za-z0-9]"" doesn't, so an isolated ""_"" is now counted as a word. I believe this is not a big problem, since isolated underscores are rare in human-oriented text.
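
To illustrate the underscore difference, here is a small sketch under the same assumptions (Python 3, simplified stand-in patterns rather than the patched regex itself):

```python
import re

# Hypothetical, simplified word patterns for illustration only.
old_word = re.compile(r'[A-Za-z0-9]+')
new_word = re.compile(r'\w+', re.UNICODE)

text = 'one _ two'

# The old class ignores the lone underscore; \w treats it as a word.
old_count = len(old_word.findall(text))
new_count = len(new_word.findall(text))

print(old_count)  # 2
print(new_count)  # 3
```

An underscore inside an identifier such as some_name does not add to the count under the new pattern; it only merges what the old pattern saw as two words into one.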

P.S. This whole thing makes sense only for the unicode branch, since regular expressions in unicode mode don't work correctly for byte strings anyway."		closed	Uncategorized	unicode		fixed			Ready for checkin	1	0	0	0	0	0