Ticket #2027 (closed: fixed)

Opened 2 years ago

Last modified 2 years ago

[patch] truncatewords filter can invalidate your HTML

Reported by: ubernostrum
Assigned to: adrian
Milestone:
Component: Template system
Version:
Keywords: truncatewords templates
Cc:
Triage Stage: Ready for checkin
Has patch: 1
Needs documentation: 0
Needs tests: 0
Patch needs improvement: 0

Description

The truncatewords filter is not HTML-aware, which means that when it's used to clip out part of a block of HTML text (say, to provide an excerpt), it can cut the text off in the middle of a tag. This will usually invalidate the HTML and can easily cause severe layout problems.

Perhaps an alternative filter could be added which is HTML-safe?
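
To make the problem concrete, here is a minimal sketch of a tag-unaware word truncator. The function naive_truncate_words is a hypothetical stand-in written for illustration, not Django's actual filter code; it simply splits on whitespace, which is enough to show how a tag can be cut in half.

```python
def naive_truncate_words(text, num_words):
    """Hypothetical tag-unaware truncator: splits on whitespace only."""
    words = text.split()
    if len(words) > num_words:
        # Attribute fragments such as href="/docs/" count as "words" here.
        words = words[:num_words] + ["..."]
    return " ".join(words)


html = '<p>Read <a href="/docs/" title="Full documentation">the docs</a> first.</p>'
clipped = naive_truncate_words(html, 3)

# The <a ...> tag is cut off before its closing ">", and neither <a> nor <p>
# is ever closed.
print(clipped)
```

The excerpt ends inside the opening anchor tag, so everything that follows it on the page is at risk of being swallowed into a broken attribute list.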

Attachments

truncatewords_html.patch (5.1 kB) - added by SmileyChris on 05/30/06 23:06:50.
HTML aware word truncation
truncatewords_html.2.patch (5.2 kB) - added by SmileyChris on 05/30/06 23:31:32.
This time I'll upload the proper HTML-aware truncation patch
truncatewords_html.3.patch (5.1 kB) - added by SmileyChris on 05/30/06 23:37:53.
No, uh I should actually use the unit tests I created... this one works.
better_truncatewords_html.patch (4.6 kB) - added by SmileyChris on 06/02/06 03:07:05.
This is a much better (and shorter) HTML truncater
better_truncatewords_html.2.patch (4.6 kB) - added by SmileyChris on 06/02/06 03:18:21.
Teensy change ensuring that self-closing tags are properly identified

Change History

05/30/06 23:06:50 changed by SmileyChris

  • attachment truncatewords_html.patch added.

HTML aware word truncation

05/30/06 23:11:50 changed by SmileyChris

  • summary changed from truncatewords filter can invalidate your HTML to [patch] truncatewords filter can invalidate your HTML.

Here's a basic HTML-aware word truncation filter.

It only counts words outside of tags and HTML comments. It only closes off tags if they were closed off (properly) in the HTML being filtered.

05/30/06 23:31:32 changed by SmileyChris

  • attachment truncatewords_html.2.patch added.

This time I'll upload the proper HTML-aware truncation patch

05/30/06 23:37:53 changed by SmileyChris

  • attachment truncatewords_html.3.patch added.

No, uh I should actually use the unit tests I created... this one works.

05/30/06 23:45:19 changed by SmileyChris

And speaking of unit tests... Django's unit tests don't actually work that well. When creating the unit tests, I found that:

>>> truncatewords_html('<p>one <a href="#">two - three <br>four</a> five</p>', 5)
'<p>one <a href="#">two - three <br>four</a> five...</p>'

and

>>> truncatewords_html('<p>one <a href="#">two - three <br>four</a> five</p>', 5)
'<p>one <a href="#">two - three <br>four</a> five</p>'

both validate... (although changing a letter still causes a unit test failure)

So it's not actually a very good test now, is it?
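
One plausible explanation for the behaviour described above, offered as an assumption rather than a confirmed diagnosis: if the test runner enables doctest's ELLIPSIS option, a literal "..." in the expected output acts as a wildcard that can match any substring, including an empty one. The sketch below uses the standard library's doctest.OutputChecker directly and assumes the filter returned the untruncated string.

```python
import doctest

checker = doctest.OutputChecker()

# Assumed actual output of the filter (the string without the trailing dots).
got = "'<p>one <a href=\"#\">two - three <br>four</a> five</p>'\n"

# The two expected outputs from the comment above.
want_with_dots = "'<p>one <a href=\"#\">two - three <br>four</a> five...</p>'\n"
want_exact = "'<p>one <a href=\"#\">two - three <br>four</a> five</p>'\n"

# With ELLIPSIS enabled, "..." in the expected text matches the empty string,
# so both expectations are accepted.
loose_dots = checker.check_output(want_with_dots, got, doctest.ELLIPSIS)
loose_exact = checker.check_output(want_exact, got, doctest.ELLIPSIS)

# Without the option, the "..." is compared literally and the check fails.
strict_dots = checker.check_output(want_with_dots, got, 0)

print(loose_dots, loose_exact, strict_dots)
```

Under that assumption, changing a letter still fails because only the "..." itself is treated as a wildcard; the rest of the string must match exactly.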

05/31/06 23:40:42 changed by adrian

That truncate_html_words() function in the patch looks wayyyyyy long and complex. Is there any way you can simplify it, perhaps by doing more of it in regular expressions rather than iterating over each character of the string?

06/02/06 03:06:33 changed by SmileyChris

Yep, it was. After a sleep, here's a much better version.

06/02/06 03:07:05 changed by SmileyChris

  • attachment better_truncatewords_html.patch added.

This is a much better (and shorter) HTML truncater

06/02/06 03:17:40 changed by SmileyChris

Actually, it's not that much shorter... but it is better :)

I still need to iterate through the string (to check for open tags and to get the right truncation point) but I do so with a regular expression like Adrian suggested.
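
For readers without the attachment to hand, the general idea of walking the string with a regular expression can be sketched as follows. This is not the code from the patch: the patterns, the VOID_TAGS set, the " ..." ending and the name truncate_html_words_sketch are all assumptions made for illustration.

```python
import re

# Assumed patterns: a token is an HTML comment, a tag, or a word that starts
# with a word character. Other characters (a lone "-", whitespace) are skipped.
TOKEN_RE = re.compile(r"<!--.*?-->|<[^>]*>|\w[^<\s]*", re.DOTALL)
TAG_RE = re.compile(r"<(/)?([A-Za-z][A-Za-z0-9]*)[^>]*?(/)?>")
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}  # never need closing


def truncate_html_words_sketch(html, num_words, ending=" ..."):
    """Keep the first num_words words outside tags, then close open tags."""
    if num_words <= 0:
        return ""
    count = 0
    cut = None
    open_tags = []
    tags_at_cut = []
    for match in TOKEN_RE.finditer(html):
        token = match.group(0)
        if token.startswith("<!--"):
            continue  # comments contribute no words
        if token.startswith("<"):
            tag = TAG_RE.match(token)
            if not tag:
                continue  # e.g. a doctype declaration
            closing, name, self_closing = tag.groups()
            name = name.lower()
            if self_closing or name in VOID_TAGS:
                continue
            if closing:
                if name in open_tags:
                    # Drop everything opened since the matching start tag.
                    while open_tags.pop() != name:
                        pass
            else:
                open_tags.append(name)
            continue
        count += 1
        if count == num_words:
            # Remember where to cut and which tags are open at that point.
            cut = match.end()
            tags_at_cut = list(open_tags)
        elif count > num_words:
            closers = "".join("</%s>" % t for t in reversed(tags_at_cut))
            return html[:cut] + ending + closers
    return html  # not enough words to need truncating


sample = '<p>one <a href="#">two - three <br>four</a> five</p>'
print(truncate_html_words_sketch(sample, 3))
print(truncate_html_words_sketch(sample, 5))
```

The sketch still visits every token in order, as described above, but the regular expression decides where tokens begin and end, so no per-character bookkeeping is needed.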

06/02/06 03:18:21 changed by SmileyChris

  • attachment better_truncatewords_html.2.patch added.

Teensy change ensuring that self-closing tags are properly identified

06/02/06 03:46:20 changed by SmileyChris

Just a thought, the patch doesn't consider unicode at the moment - should it? Easy change if it is a consideration:

1. Compile re_words with re.UNICODE

2. In re_words, replace '[A-Za-z0-9]' with '\w' (should probably just be that anyway)
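
The effect of the two steps above can be illustrated with a small sketch. The pattern in the patch is not reproduced here, so ascii_words and unicode_words below are hypothetical stand-ins for the "before" and "after" forms of re_words. In Python 3, str patterns already match Unicode by default, so the re.UNICODE flag is redundant there but harmless; note also that '\w' additionally matches the underscore.

```python
import re

# Hypothetical "before" and "after" versions of a word-matching pattern.
ascii_words = re.compile(r"[A-Za-z0-9]+")
unicode_words = re.compile(r"\w+", re.UNICODE)

text = "na\u00efve caf\u00e9"  # "naïve café", written with escapes

# The ASCII-only class splits or truncates words at accented letters.
ascii_result = ascii_words.findall(text)

# \w treats the accented letters as word characters.
unicode_result = unicode_words.findall(text)

print(ascii_result)
print(unicode_result)
```

With the ASCII-only class, two words turn into three fragments, so a word count based on it would be wrong for non-English text.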

02/02/07 07:23:43 changed by Simon G. <dev@simon.net.nz>

  • keywords set to truncatewords templates.
  • stage changed from Unreviewed to Ready for checkin.

I believe this is ready for check-in, pending a core decision on SmileyChris's comment above re: unicode.

02/09/07 20:49:36 changed by mtredinnick

I'll check in what's here, since all text processing has to be unicode audited at some point. I'm not convinced this is as efficient as it could be, but we can revisit that later if it's a problem, since just having something that works is a good idea right now. Let's have people test it in practice and get some feedback; it doesn't affect any existing code, after all.

02/09/07 20:51:27 changed by mtredinnick

  • status changed from new to closed.
  • resolution set to fixed.

(In [4468]) Fixed #2027 -- added truncatewords_html filter that respects HTML tags whilst truncating. Patch from SmileyChris.

