Opened 9 years ago

Closed 9 years ago

#2027 closed defect (fixed)

[patch] truncatewords filter can invalidate your HTML

Reported by: ubernostrum Owned by: adrian
Component: Template system Version:
Severity: normal Keywords: truncatewords templates
Cc: Triage Stage: Ready for checkin
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: UI/UX:

Description

The truncatewords filter is not HTML-aware, which means that when it's used to clip out part of a block of HTML text (say, for purposes of providing an excerpt), it can cut off at text in the middle of a tag. This will, usually, invalidate the HTML and can easily cause severe layout problems.

Perhaps an alternative filter could be added which is HTML-safe?

Attachments (5)

truncatewords_html.patch (5.1 KB) - added by SmileyChris 9 years ago.
HTML aware word truncation
truncatewords_html.2.patch (5.2 KB) - added by SmileyChris 9 years ago.
This time I'll upload the proper HTML-aware truncation patch
truncatewords_html.3.patch (5.1 KB) - added by SmileyChris 9 years ago.
No, uh I should actually use the unit tests I created... this one works.
better_truncatewords_html.patch (4.6 KB) - added by SmileyChris 9 years ago.
This is a much better (and shorter) HTML truncater
better_truncatewords_html.2.patch (4.6 KB) - added by SmileyChris 9 years ago.
Teensy change ensuring that self-closing tags are properly identified

Download all attachments as: .zip

Change History (14)

Changed 9 years ago by SmileyChris

HTML aware word truncation

comment:1 Changed 9 years ago by SmileyChris

  • Summary changed from truncatewords filter can invalidate your HTML to [patch] truncatewords filter can invalidate your HTML

Here's a basic HTML-aware word truncation filter.

It only counts words outside of tags and HTML comments.
It only closes off tags if they were closed off (properly) in the HTML being filtered.

Changed 9 years ago by SmileyChris

This time I'll upload the proper HTML-aware truncation patch

Changed 9 years ago by SmileyChris

No, uh I should actually use the unit tests I created... this one works.

comment:2 Changed 9 years ago by SmileyChris

And speaking of unit tests... Django's unit tests don't actually work that well. When creating the unit tests, I found that:

>>> truncatewords_html('<p>one <a href="#">two - three <br>four</a> five</p>', 5)
'<p>one <a href="#">two - three <br>four</a> five...</p>'

and

>>> truncatewords_html('<p>one <a href="#">two - three <br>four</a> five</p>', 5)
'<p>one <a href="#">two - three <br>four</a> five</p>'

both validate... (changing a letter still causes a unit test fail)

So it's not actually a very good test now, is it?

comment:3 Changed 9 years ago by adrian

That truncate_html_words() function in the patch looks wayyyyyy long and complex. Is there any way you can simplify it, perhaps by doing more of it in regular expressions rather than iterating over each character of the string?

comment:4 Changed 9 years ago by SmileyChris

Yep, it was. After a sleep, here's a much better version.

Changed 9 years ago by SmileyChris

This is a much better (and shorter) HTML truncater

comment:5 Changed 9 years ago by SmileyChris

Actually, it's not that much shorter... but it is better :)

I still need to iterate through the string (to check for open tags and to get the right truncation point) but I do so with a regular expression like Adrian suggested.

Changed 9 years ago by SmileyChris

Teensy change ensuring that self-closing tags are properly identified

comment:6 Changed 9 years ago by SmileyChris

Just a thought, the patch doesn't consider unicode at the moment - should it? Easy change if it is a consideration:

  1. Compile re_words with re.UNICODE
  1. In re_words, replace '[A-Za-z0-9]' with '\w' (should probably just be that anyway)

comment:7 Changed 9 years ago by Simon G. <dev@…>

  • Keywords truncatewords templates added
  • Triage Stage changed from Unreviewed to Ready for checkin

I believe this is ready for check-in, pending a core decision on SmileyChris's coment above re: unicode.

comment:8 Changed 9 years ago by mtredinnick

I'll check in what's here, since all text processing has to be unicode audited at some point. I'm not convinced this is as efficient as it could be, but we can revisit that later if it's a problem, since just having something that works is a good idea right now. Let's have people test it in practice and get some feedback; it doesn't affect any existing code, after all.

comment:9 Changed 9 years ago by mtredinnick

  • Resolution set to fixed
  • Status changed from new to closed

(In [4468]) Fixed #2027 -- added truncatewords_html filter that respects HTML tags whilst
truncating. Patch from SmileyChris.

Note: See TracTickets for help on using tickets.
Back to Top