Opened 18 years ago
Closed 18 years ago
#2027 closed defect (fixed)
[patch] truncatewords filter can invalidate your HTML
Reported by: | James Bennett | Owned by: | Adrian Holovaty |
---|---|---|---|
Component: | Template system | Version: | |
Severity: | normal | Keywords: | truncatewords templates |
Cc: | Triage Stage: | Ready for checkin | |
Has patch: | yes | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
The truncatewords filter is not HTML-aware, which means that when it's used to clip out part of a block of HTML text (say, for purposes of providing an excerpt), it can cut off at text in the middle of a tag. This will, usually, invalidate the HTML and can easily cause severe layout problems.
Perhaps an alternative filter could be added which is HTML-safe?
Attachments (5)
Change History (14)
by , 18 years ago
Attachment: | truncatewords_html.patch added |
---|
comment:1 by , 18 years ago
Summary: | truncatewords filter can invalidate your HTML → [patch] truncatewords filter can invalidate your HTML |
---|
Here's a basic HTML-aware word truncation filter.
It only counts words outside of tags and HTML comments.
It only closes off tags if they were closed off (properly) in the HTML being filtered.
by , 18 years ago
Attachment: | truncatewords_html.2.patch added |
---|
This time I'll upload the proper HTML-aware truncation patch
by , 18 years ago
Attachment: | truncatewords_html.3.patch added |
---|
No, uh I should actually use the unit tests I created... this one works.
comment:2 by , 18 years ago
And speaking of unit tests... Django's unit tests don't actually work that well. When creating the unit tests, I found that:
>>> truncatewords_html('<p>one <a href="#">two - three <br>four</a> five</p>', 5) '<p>one <a href="#">two - three <br>four</a> five...</p>'
and
>>> truncatewords_html('<p>one <a href="#">two - three <br>four</a> five</p>', 5) '<p>one <a href="#">two - three <br>four</a> five</p>'
both validate... (changing a letter still causes a unit test fail)
So it's not actually a very good test now, is it?
comment:3 by , 18 years ago
That truncate_html_words()
function in the patch looks wayyyyyy long and complex. Is there any way you can simplify it, perhaps by doing more of it in regular expressions rather than iterating over each character of the string?
by , 18 years ago
Attachment: | better_truncatewords_html.patch added |
---|
This is a much better (and shorter) HTML truncater
comment:5 by , 18 years ago
Actually, it's not that much shorter... but it is better :)
I still need to iterate through the string (to check for open tags and to get the right truncation point) but I do so with a regular expression like Adrian suggested.
by , 18 years ago
Attachment: | better_truncatewords_html.2.patch added |
---|
Teensy change ensuring that self-closing tags are properly identified
comment:6 by , 18 years ago
Just a thought, the patch doesn't consider unicode at the moment - should it? Easy change if it is a consideration:
- Compile
re_words
withre.UNICODE
- In
re_words
, replace '[A-Za-z0-9]
' with '\w
' (should probably just be that anyway)
comment:7 by , 18 years ago
Keywords: | truncatewords templates added |
---|---|
Triage Stage: | Unreviewed → Ready for checkin |
I believe this is ready for check-in, pending a core decision on SmileyChris's coment above re: unicode.
comment:8 by , 18 years ago
I'll check in what's here, since all text processing has to be unicode audited at some point. I'm not convinced this is as efficient as it could be, but we can revisit that later if it's a problem, since just having something that works is a good idea right now. Let's have people test it in practice and get some feedback; it doesn't affect any existing code, after all.
comment:9 by , 18 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
HTML aware word truncation