Opened 10 years ago

Closed 19 months ago

#1919 closed Bug (wontfix)

filter truncatewords is inefficient and destroys white space

Reported by: derelm
Owned by: Chris Wilson
Component: Template system
Version: master
Severity: Normal
Keywords: sprintsept14
Cc:
Triage Stage: Accepted
Has patch: yes
Needs documentation: no
Needs tests: no
Patch needs improvement: yes
Easy pickings: no
UI/UX: no

Description

I tried to chain truncatewords with markdown, but this doesn't work, as truncatewords removes newlines from the input string.
Newlines are crucial for markdown and possibly other markup parsers.
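
A minimal sketch of the problem. The function below is a simplified paraphrase of the filter's whitespace handling at the time, not the actual Django source, and `truncate_words_old` is a hypothetical name:

```python
def truncate_words_old(s, num):
    # Simplified paraphrase of the filter's behaviour at the time (not the
    # exact Django source): str.split() with no argument collapses ALL
    # whitespace, including the blank lines that separate markdown paragraphs.
    words = s.split()
    if len(words) > num:
        words = words[:num] + ['...']
    return ' '.join(words)

text = "Paragraph one.\n\nParagraph two."
result = truncate_words_old(text, 10)
# Nothing was truncated, yet the paragraph break is gone, so a markdown
# filter applied afterwards sees one run-on paragraph.
```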

Attachments (7)

truncate_words_preserving_whitespace.diff (1.2 KB) - added by Tom Tobin <korpios@…> 10 years ago.
patch against r2991 (trunk); cause truncatewords filter to preserve whitespace
truncate_words_preserving_whites.diff (962 bytes) - added by didier Belot <dib@…> 9 years ago.
patch against trunk rev. 5783, simplified
1919-1.diff (2.5 KB) - added by Matt McClanahan 9 years ago.
Add a regression test
truncate_tests.diff (2.0 KB) - added by deryck 9 years ago.
Tests for django.utils.text.truncate_*.
truncate.py (2.5 KB) - added by Chris Beaven 9 years ago.
truncate_words.diff (1.8 KB) - added by arien <regexbot@…> 9 years ago.
truncate_words that does what it says on the package; includes tests
truncate_words_faster.diff (2.0 KB) - added by arien <regexbot@…> 9 years ago.
faster version of the same idea; includes additional tests for trailing whitespace handling


Change History (35)

Changed 10 years ago by Tom Tobin <korpios@…>

Attachment: truncate_words_preserving_whitespace.diff added

patch against r2991 (trunk); cause truncatewords filter to preserve whitespace

comment:1 Changed 10 years ago by Tom Tobin <korpios@…>

I've attached a patch that will cause the truncatewords filter to preserve all whitespace (including leading and trailing whitespace).

comment:2 Changed 10 years ago by (none)

milestone: Version 1.0

Milestone Version 1.0 deleted

comment:3 Changed 10 years ago by Chris Beaven

Triage Stage: Unreviewed → Ready for checkin

comment:4 Changed 10 years ago by Chris Beaven

Has patch: set

comment:5 Changed 10 years ago by Malcolm Tredinnick

Owner: changed from Adrian Holovaty to Malcolm Tredinnick
Patch needs improvement: set
Triage Stage: Ready for checkin → Accepted

I'm going to bounce this patch back for improvement (I'm just doing a "check in the obviously correct" pass at the moment, so I don't want to stop and fix it right now).

I think you can get away with just one call to re.split() without the subsequent re.search() every time through the list. After the call to re.split(), every second element in the returned list is the whitespace. So you can iterate through the "words" list (in the patch's terminology) two elements at a time: the first one will be a word, the second the whitespace. That avoids the need to call whitespace_re.search() each time. It will improve performance a bit (and reduce the line count, too, I suspect).

A test of this would be nice, too, since it's not immediately clear that it works 100%.
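
A sketch of what Malcolm suggests (hypothetical, not the attached patch): a single `re.split()` call with a capturing group puts each whitespace run at the odd indices of the result, so no per-word `re.search()` is needed:

```python
import re

def truncate_words(s, num):
    # One re.split() with a capturing group: even indices hold the words
    # (possibly '' at the start), odd indices hold the whitespace runs.
    bits = re.split(r'(\s+)', s)
    out = []
    count = 0
    for i in range(0, len(bits), 2):
        if bits[i]:
            count += 1
            if count > num:
                out.append('...')
                break
        out.append(bits[i])
        if i + 1 < len(bits):
            out.append(bits[i + 1])   # the whitespace following the word
    else:
        return s                      # no more than num words: unchanged
    return ''.join(out)
```

Because the whitespace runs are re-emitted verbatim, newlines survive as long as they fall before the cut-off.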

comment:6 Changed 10 years ago by Chris Beaven

Needs tests: set

Changed 9 years ago by didier Belot <dib@…>

Attachment: truncate_words_preserving_whites.diff added

patch against trunk rev. 5783, simplified

comment:7 Changed 9 years ago by didier Belot <dib@…>

Yes, the word_re in my patch can be as simple as the first of the two below:

# split at word boundaries
word_re = re.compile(r'(\w+)', re.UNICODE)

# split at **markdown** or _such marked_ up words
word_re = re.compile(r'((?:\*|_)*\w+(?:\*|_)*)', re.UNICODE)

Anyway, it was just for fun, as I don't think this is the way to go: chaining truncatewords to markdown will break anyway! I think it's better to apply markdown first, then chain to htmltruncatewords.

What sort of tests are needed for such a patch?

Changed 9 years ago by Matt McClanahan

Attachment: 1919-1.diff added

Add a regression test

comment:8 Changed 9 years ago by Matt McClanahan

Needs tests: unset
Patch needs improvement: unset

comment:9 Changed 9 years ago by Matt McClanahan

Keywords: sprintsept14 added

comment:10 Changed 9 years ago by James Bennett

Resolution: fixed
Status: new → closed

An HTML-safe truncation filter was added some time ago, so the solution here is to do the markup filtering before truncating.
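
In template terms, the ordering James describes would look something like this (a sketch; it assumes a markdown filter is available, e.g. from the markup contrib app of the era, alongside the built-in truncatewords_html):

```
{# Render the markup first, then truncate with the HTML-aware filter: #}
{{ body|markdown|truncatewords_html:30 }}

{# Not this -- truncatewords collapses the newlines markdown depends on: #}
{{ body|truncatewords:30|markdown }}
```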

comment:11 Changed 9 years ago by Chris Beaven

Resolution: fixed
Status: closed → reopened

I disagree that it's been fixed. truncate_words should preserve existing white space. We have a patch here and one in #4655. Either one of those seems to do the job.

comment:12 Changed 9 years ago by deryck

I agree with James that this should be closed. Just use truncate_html_words to preserve space. Changing truncate_words will affect the performance of that function.

Not closing it again, though, because I'm attaching tests to show truncate_html_words will do what's needed and just so django.utils.text.truncate_* gets a little test coverage.

Changed 9 years ago by deryck

Attachment: truncate_tests.diff added

Tests for django.utils.text.truncate_*.

comment:13 Changed 9 years ago by anonymous

Rubbish about affecting performance. My version is faster than the current one anyway!
Run truncate.py from a manage.py shell if you want a test comparison

Changed 9 years ago by Chris Beaven

Attachment: truncate.py added

comment:14 Changed 9 years ago by Chris Beaven

Stupid basic HTTP auth. Twas my anonymous comment.

Changed 9 years ago by arien <regexbot@…>

Attachment: truncate_words.diff added

truncate_words that does what it says on the package; includes tests

Changed 9 years ago by arien <regexbot@…>

Attachment: truncate_words_faster.diff added

faster version of the same idea; includes additional tests for trailing whitespace handling

comment:15 Changed 9 years ago by arien <regexbot@…>

The last patch contains a version of truncate_words that preserves all whitespace before the nth "word" (sequence of non-blanks); see the tests for the exact behaviour.

(This version of truncate_words differs from the previous one in that it doesn't use re.UNICODE for the regex and updates the counter by hand instead of using enumerate. This patch also contains a couple more tests for trailing whitespace handling.)

This version is somewhat slower than the current one when the string contains few words or is to be truncated after a large fraction of its words, but faster when truncating after a small fraction of the words in a string containing many words. YMMV.

I tried a version using re.split as well (splitting on and capturing non-whitespace, with maxsplit=length) but it consistently performed worse than the attached version, IIRC.

One comment about benchmarking the current version against the alternatives: be sure to compare like with like, i.e., make sure your benchmarking code looks like this:

from django.utils.text import truncate_words
from django.utils.encoding import force_unicode
from django.utils.functional import allow_lazy

def my_truncate_words(s, num):
    length = int(num)
    s = force_unicode(s)
    # code to return a Unicode string
my_truncate_words = allow_lazy(my_truncate_words, unicode)

# other versions follow same pattern

comment:16 Changed 9 years ago by Chris Beaven

Looks good, arien. My only question is whether this is the correct behaviour:

>>> truncatewords(u'  Text with leading and trailing whitespace  ', 6)
u'  Text with leading and trailing whitespace ...'

It seems like the filter should be counting the words, not the spaces between them. So if there are 6 words, it should just output the original string.
Thoughts?

comment:17 Changed 9 years ago by arien <regexbot@…>

I think you can argue either way. The approach I took was to truncate immediately after the nth word if the input string had any characters following it. The reason: "Truncates a string after a certain number of words." ;-)

What you seem to be proposing is to return a string that has no more than n words in it. (I think that'll mean counting spaces and not words. We'll see...) Does this look reasonable?

>>> truncatewords(u'  Double-spaced  text.  ', 2) # and above
u'  Double-spaced  text.  '
>>> truncatewords(u'  Double-spaced  text.  ', 1)
u'  Double-spaced  ...'
>>> truncatewords(u'  Double-spaced  text.  ', 0) # and below
u'  ...'
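
For illustration, the semantics sketched above could be implemented like this (`truncate_words_ws` is a hypothetical name, not the patch's code; splitting on runs of non-whitespace keeps each whitespace run intact, and the examples above fall out directly):

```python
import re

def truncate_words_ws(s, num):
    # Split on runs of non-whitespace, capturing them: parts[1::2] are the
    # words, the even indices are the untouched whitespace runs around them.
    parts = re.split(r'(\S+)', s)
    if len(parts[1::2]) <= num:
        return s                       # no more than num words: unchanged
    # Keep everything up to and including the whitespace after the num-th
    # word, then append the ellipsis.
    return ''.join(parts[:2 * num + 1]) + '...'
```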

comment:18 Changed 9 years ago by Chris Beaven

I agree, it could be taken either way. Don't take my word as the way it should be ;)

How I see that it should work: if the string has no more than the requested number of words, the original string should be returned. Otherwise the string should be truncated to the Xth word and appended with ' ...' (one space, then an ellipsis).

comment:19 Changed 6 years ago by Adam Nelson

Patch needs improvement: set

Trunk has changed since the patch was uploaded.

comment:20 Changed 6 years ago by Łukasz Rekucki

Severity: minor → Normal
Type: enhancement → Bug

comment:21 Changed 5 years ago by Aymeric Augustin

UI/UX: unset

Change UI/UX from NULL to False.

comment:22 Changed 5 years ago by Aymeric Augustin

Easy pickings: unset

Change Easy pickings from NULL to False.

comment:23 Changed 4 years ago by Aymeric Augustin

Status: reopened → new

comment:24 Changed 3 years ago by Chris Wilson

Owner: changed from nobody to Chris Wilson
Status: new → assigned

I've merged arien's patch and used SmileyChris' comments to create a pull request for review.

comment:25 Changed 3 years ago by Chris Wilson

Summary: filter truncatewords wipes newlines from string, so not chainable with markup filters → filter truncatewords is inefficient and destroys white space

comment:26 Changed 2 years ago by Tim Graham

I left comments for improvement on the PR. Please uncheck "Patch needs improvement" when you update it, thanks.

comment:27 Changed 19 months ago by tanner

I think this bug (actually a feature request) can be closed, as the change would be backwards incompatible and I don't see a need for a new filter such as "truncatewords_white" (pick any name).

comment:28 Changed 19 months ago by Tim Graham

Resolution: wontfix
Status: assigned → closed