Ticket #4662 (closed: fixed)

Opened 1 year ago

Last modified 1 year ago

[unicode] truncate_html_words doesn't work for non-latin characters

Reported by: Ivan Sagalaev <Maniac@SoftwareManiacs.Org>
Assigned to: jacob
Milestone:
Component: Uncategorized
Version: unicode
Keywords:
Cc:
Triage Stage: Ready for checkin
Has patch: 1
Needs documentation: 0
Needs tests: 0
Patch needs improvement: 0

Description

django.utils.text.truncate_html_words (and, for that matter, the 'truncatewords_html' filter) uses a regular expression to find words. The pattern is "[A-Za-z0-9]", which doesn't match non-Latin words. It should be replaced with "\w" and compiled with the re.UNICODE flag.

Patch follows. There is a small backwards-incompatible change in behavior: "\w" matches underscores while "[A-Za-z0-9]" doesn't, so an isolated "_" is now counted as a word. I believe this is not very bad since isolated underscores are rare in human-oriented texts.

P.S. This whole thing makes sense only for the unicode branch, since regular expressions in unicode mode don't work correctly for byte strings anyway.
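
To illustrate the difference described above, here is a minimal sketch. The two patterns are simplified stand-ins for the word-matching part of the expression, not the actual Django source, and the sample text is made up. It assumes Python 3, where str patterns are already Unicode-aware, so re.UNICODE is spelled out only to mirror the proposal.

```python
import re

# Simplified stand-ins for the patterns discussed in the description;
# these are illustrative, not copied from django.utils.text.
ASCII_WORD = re.compile(r"[A-Za-z0-9]+")
UNICODE_WORD = re.compile(r"\w+", re.UNICODE)

# Hypothetical sample: two Russian words, an isolated underscore, one English word.
text = "Привет мир _ hello"

# The ASCII character class only finds the Latin word.
ascii_words = ASCII_WORD.findall(text)

# \w with re.UNICODE finds the Cyrillic words too, and also
# treats the isolated underscore as a word.
unicode_words = UNICODE_WORD.findall(text)

print(ascii_words)
print(unicode_words)
```

The second result shows both effects at once: non-Latin words are now found, and the lone "_" is counted as a word, which is the backwards-incompatible change mentioned above.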

Attachments

4662.diff (0.6 kB) - added by Ivan Sagalaev <Maniac@SoftwareManiacs.Org> on 06/22/07 06:47:15.
Patch

Change History

06/22/07 06:47:15 changed by Ivan Sagalaev <Maniac@SoftwareManiacs.Org>

  • attachment 4662.diff added.

Patch

06/22/07 08:45:47 changed by SmileyChris

  • needs_better_patch changed.
  • needs_tests changed.
  • needs_docs changed.

Comment from my original patch regarding unicode: http://code.djangoproject.com/ticket/2027#comment:6

Do we need to compile with re.UNICODE?

If you were worried about backwards compatibility, then you could use [^\W_], but I don't really think that's necessary ;P
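
A rough sketch of the [^\W_] idea, assuming it is used for the first character of a word and followed by \w* for the rest. The full pattern and the sample text are assumptions for illustration, not taken from the patch.

```python
import re

# [^\W_] means "a word character that is not an underscore".
# Using it for the first character keeps isolated underscores from
# counting as words while still accepting non-Latin letters.
# The trailing \w* is an assumption about how the rest of a word is matched.
WORD = re.compile(r"[^\W_]\w*", re.UNICODE)

# Hypothetical sample text with an isolated underscore.
text = "foo _ бар baz_qux"

words = WORD.findall(text)
print(words)
```

Here the lone "_" is skipped, while an underscore inside a word such as "baz_qux" is still kept as part of that word.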

06/22/07 09:43:24 changed by Ivan Sagalaev <Maniac@SoftwareManiacs.Org>

> Do we need to compile with re.UNICODE?

Yes, because otherwise \w means only [A-Za-z0-9_]. re.UNICODE switches re to use the Unicode database to get the notion of "letter".
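
A small sketch of the flag's effect, written for Python 3. There, str patterns use Unicode matching by default, so the ASCII-only behavior described above (the default in the Python 2 of this ticket's era) has to be requested explicitly with re.ASCII. The sample word is made up.

```python
import re

# Hypothetical sample: a single Russian word.
sample = "слово"

# re.ASCII restricts \w to [A-Za-z0-9_], the ASCII-only meaning of \w.
ascii_match = re.findall(r"\w+", sample, re.ASCII)

# re.UNICODE makes \w consult the Unicode character database
# (already the default for str patterns in Python 3).
unicode_match = re.findall(r"\w+", sample, re.UNICODE)

print(ascii_match)
print(unicode_match)
```

With the ASCII meaning nothing matches at all, while the Unicode meaning returns the whole word.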

06/22/07 17:31:07 changed by SmileyChris

  • stage changed from Unreviewed to Ready for checkin.

Oh, I actually missed the fact you were already compiling with re.U. :)

06/25/07 08:11:10 changed by mtredinnick

  • status changed from new to closed.
  • resolution set to fixed.

(In [5533]) unicode: Fixed #4662 -- Fixed a remaining ASCII assumption in truncatewords_html(). Thanks, Ivan Sagalaev.

