Opened 18 years ago

Closed 18 years ago

#2027 closed defect (fixed)

[patch] truncatewords filter can invalidate your HTML

Reported by: James Bennett Owned by: Adrian Holovaty
Component: Template system Version:
Severity: normal Keywords: truncatewords templates
Cc: Triage Stage: Ready for checkin
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

The truncatewords filter is not HTML-aware, which means that when it's used to clip out part of a block of HTML text (say, for purposes of providing an excerpt), it can cut off at text in the middle of a tag. This will, usually, invalidate the HTML and can easily cause severe layout problems.

Perhaps an alternative filter could be added which is HTML-safe?

Attachments (5)

truncatewords_html.patch (5.1 KB ) - added by Chris Beaven 18 years ago.
HTML aware word truncation
truncatewords_html.2.patch (5.2 KB ) - added by Chris Beaven 18 years ago.
This time I'll upload the proper HTML-aware truncation patch
truncatewords_html.3.patch (5.1 KB ) - added by Chris Beaven 18 years ago.
No, uh I should actually use the unit tests I created... this one works.
better_truncatewords_html.patch (4.6 KB ) - added by Chris Beaven 18 years ago.
This is a much better (and shorter) HTML truncater
better_truncatewords_html.2.patch (4.6 KB ) - added by Chris Beaven 18 years ago.
Teensy change ensuring that self-closing tags are properly identified

Download all attachments as: .zip

Change History (14)

by Chris Beaven, 18 years ago

Attachment: truncatewords_html.patch added

HTML aware word truncation

comment:1 by Chris Beaven, 18 years ago

Summary: truncatewords filter can invalidate your HTML[patch] truncatewords filter can invalidate your HTML

Here's a basic HTML-aware word truncation filter.

It only counts words outside of tags and HTML comments.
It only closes off tags if they were closed off (properly) in the HTML being filtered.

by Chris Beaven, 18 years ago

Attachment: truncatewords_html.2.patch added

This time I'll upload the proper HTML-aware truncation patch

by Chris Beaven, 18 years ago

Attachment: truncatewords_html.3.patch added

No, uh I should actually use the unit tests I created... this one works.

comment:2 by Chris Beaven, 18 years ago

And speaking of unit tests... Django's unit tests don't actually work that well. When creating the unit tests, I found that:

>>> truncatewords_html('<p>one <a href="#">two - three <br>four</a> five</p>', 5)
'<p>one <a href="#">two - three <br>four</a> five...</p>'

and

>>> truncatewords_html('<p>one <a href="#">two - three <br>four</a> five</p>', 5)
'<p>one <a href="#">two - three <br>four</a> five</p>'

both validate... (changing a letter still causes a unit test fail)

So it's not actually a very good test now, is it?

comment:3 by Adrian Holovaty, 18 years ago

That truncate_html_words() function in the patch looks wayyyyyy long and complex. Is there any way you can simplify it, perhaps by doing more of it in regular expressions rather than iterating over each character of the string?

comment:4 by Chris Beaven, 18 years ago

Yep, it was. After a sleep, here's a much better version.

by Chris Beaven, 18 years ago

This is a much better (and shorter) HTML truncater

comment:5 by Chris Beaven, 18 years ago

Actually, it's not that much shorter... but it is better :)

I still need to iterate through the string (to check for open tags and to get the right truncation point) but I do so with a regular expression like Adrian suggested.

by Chris Beaven, 18 years ago

Teensy change ensuring that self-closing tags are properly identified

comment:6 by Chris Beaven, 18 years ago

Just a thought, the patch doesn't consider unicode at the moment - should it? Easy change if it is a consideration:

  1. Compile re_words with re.UNICODE
  1. In re_words, replace '[A-Za-z0-9]' with '\w' (should probably just be that anyway)

comment:7 by Simon G. <dev@…>, 18 years ago

Keywords: truncatewords templates added
Triage Stage: UnreviewedReady for checkin

I believe this is ready for check-in, pending a core decision on SmileyChris's coment above re: unicode.

comment:8 by Malcolm Tredinnick, 18 years ago

I'll check in what's here, since all text processing has to be unicode audited at some point. I'm not convinced this is as efficient as it could be, but we can revisit that later if it's a problem, since just having something that works is a good idea right now. Let's have people test it in practice and get some feedback; it doesn't affect any existing code, after all.

comment:9 by Malcolm Tredinnick, 18 years ago

Resolution: fixed
Status: newclosed

(In [4468]) Fixed #2027 -- added truncatewords_html filter that respects HTML tags whilst
truncating. Patch from SmileyChris.

Note: See TracTickets for help on using tickets.
Back to Top