Opened 10 years ago

Closed 10 years ago

#2027 closed defect (fixed)

[patch] truncatewords filter can invalidate your HTML

Reported by: James Bennett Owned by: Adrian Holovaty
Component: Template system Version:
Severity: normal Keywords: truncatewords templates
Cc: Triage Stage: Ready for checkin
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: UI/UX:

Description

The truncatewords filter is not HTML-aware, which means that when it's used to clip out part of a block of HTML text (say, for purposes of providing an excerpt), it can cut off at text in the middle of a tag. This will, usually, invalidate the HTML and can easily cause severe layout problems.

Perhaps an alternative filter could be added which is HTML-safe?

Attachments (5)

truncatewords_html.patch (5.1 KB) - added by Chris Beaven 10 years ago.
HTML aware word truncation
truncatewords_html.2.patch (5.2 KB) - added by Chris Beaven 10 years ago.
This time I'll upload the proper HTML-aware truncation patch
truncatewords_html.3.patch (5.1 KB) - added by Chris Beaven 10 years ago.
No, uh I should actually use the unit tests I created... this one works.
better_truncatewords_html.patch (4.6 KB) - added by Chris Beaven 10 years ago.
This is a much better (and shorter) HTML truncater
better_truncatewords_html.2.patch (4.6 KB) - added by Chris Beaven 10 years ago.
Teensy change ensuring that self-closing tags are properly identified

Download all attachments as: .zip

Change History (14)

Changed 10 years ago by Chris Beaven

Attachment: truncatewords_html.patch added

HTML aware word truncation

comment:1 Changed 10 years ago by Chris Beaven

Summary: truncatewords filter can invalidate your HTML[patch] truncatewords filter can invalidate your HTML

Here's a basic HTML-aware word truncation filter.

It only counts words outside of tags and HTML comments.
It only closes off tags if they were closed off (properly) in the HTML being filtered.

Changed 10 years ago by Chris Beaven

Attachment: truncatewords_html.2.patch added

This time I'll upload the proper HTML-aware truncation patch

Changed 10 years ago by Chris Beaven

Attachment: truncatewords_html.3.patch added

No, uh I should actually use the unit tests I created... this one works.

comment:2 Changed 10 years ago by Chris Beaven

And speaking of unit tests... Django's unit tests don't actually work that well. When creating the unit tests, I found that:

>>> truncatewords_html('<p>one <a href="#">two - three <br>four</a> five</p>', 5)
'<p>one <a href="#">two - three <br>four</a> five...</p>'

and

>>> truncatewords_html('<p>one <a href="#">two - three <br>four</a> five</p>', 5)
'<p>one <a href="#">two - three <br>four</a> five</p>'

both validate... (changing a letter still causes a unit test fail)

So it's not actually a very good test now, is it?

comment:3 Changed 10 years ago by Adrian Holovaty

That truncate_html_words() function in the patch looks wayyyyyy long and complex. Is there any way you can simplify it, perhaps by doing more of it in regular expressions rather than iterating over each character of the string?

comment:4 Changed 10 years ago by Chris Beaven

Yep, it was. After a sleep, here's a much better version.

Changed 10 years ago by Chris Beaven

This is a much better (and shorter) HTML truncater

comment:5 Changed 10 years ago by Chris Beaven

Actually, it's not that much shorter... but it is better :)

I still need to iterate through the string (to check for open tags and to get the right truncation point) but I do so with a regular expression like Adrian suggested.

Changed 10 years ago by Chris Beaven

Teensy change ensuring that self-closing tags are properly identified

comment:6 Changed 10 years ago by Chris Beaven

Just a thought, the patch doesn't consider unicode at the moment - should it? Easy change if it is a consideration:

  1. Compile re_words with re.UNICODE
  1. In re_words, replace '[A-Za-z0-9]' with '\w' (should probably just be that anyway)

comment:7 Changed 10 years ago by Simon G. <dev@…>

Keywords: truncatewords templates added
Triage Stage: UnreviewedReady for checkin

I believe this is ready for check-in, pending a core decision on SmileyChris's coment above re: unicode.

comment:8 Changed 10 years ago by Malcolm Tredinnick

I'll check in what's here, since all text processing has to be unicode audited at some point. I'm not convinced this is as efficient as it could be, but we can revisit that later if it's a problem, since just having something that works is a good idea right now. Let's have people test it in practice and get some feedback; it doesn't affect any existing code, after all.

comment:9 Changed 10 years ago by Malcolm Tredinnick

Resolution: fixed
Status: newclosed

(In [4468]) Fixed #2027 -- added truncatewords_html filter that respects HTML tags whilst
truncating. Patch from SmileyChris.

Note: See TracTickets for help on using tickets.
Back to Top