Opened 18 years ago

Closed 10 years ago

#1919 closed Bug (wontfix)

filter truncatewords is inefficient and destroys white space

Reported by: derelm Owned by: Chris Wilson
Component: Template system Version: dev
Severity: Normal Keywords: sprintsept14
Cc: Triage Stage: Accepted
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: yes
Easy pickings: no UI/UX: no

Description

i tried to chain truncatewords with markdown, but this doesn't work as truncatewords removes newlines from the input string.
newlines are crucial for markdown and possibly other markup-parsers.

Attachments (7)

truncate_words_preserving_whitespace.diff (1.2 KB ) - added by Tom Tobin <korpios@…> 18 years ago.
patch against r2991 (trunk); cause truncatewords filter to preserve whitespace
truncate_words_preserving_whites.diff (962 bytes ) - added by didier Belot <dib@…> 17 years ago.
patch against trunk rev. 5783, simplified
1919-1.diff (2.5 KB ) - added by Matt McClanahan 17 years ago.
Add a regression test
truncate_tests.diff (2.0 KB ) - added by deryck 17 years ago.
Tests for django.utils.text.trunacte_*.
truncate.py (2.5 KB ) - added by Chris Beaven 17 years ago.
truncate_words.diff (1.8 KB ) - added by arien <regexbot@…> 17 years ago.
truncate_words that does what it says on the package; includes tests
truncate_words_faster.diff (2.0 KB ) - added by arien <regexbot@…> 17 years ago.
faster version of the same idea; includes additional tests for trailing whitespace handling

Download all attachments as: .zip

Change History (35)

by Tom Tobin <korpios@…>, 18 years ago

patch against r2991 (trunk); cause truncatewords filter to preserve whitespace

comment:1 by Tom Tobin <korpios@…>, 18 years ago

I've attached a patch that will cause the truncatewords filter to preserve all whitespace (including leading and trailing whitespace).

comment:2 by (none), 18 years ago

milestone: Version 1.0

Milestone Version 1.0 deleted

comment:3 by Chris Beaven, 18 years ago

Triage Stage: UnreviewedReady for checkin

comment:4 by Chris Beaven, 18 years ago

Has patch: set

comment:5 by Malcolm Tredinnick, 18 years ago

Owner: changed from Adrian Holovaty to Malcolm Tredinnick
Patch needs improvement: set
Triage Stage: Ready for checkinAccepted

I'm going to bounce this patch back for improvement (I'm just doing a "check in the obviously correct" pass at the moment, so I don't want to stop and fix it right now).

I think you can get away with just one call to re.split() without the subsequent re.search() every time through the list. After the call to re.split(), every second element in the returned list is the whitespace. So you can iterate through the "words" list (in the patch's terminology) two elements at a time: the first one will be a word, the second the whitespace. That avoids the need to call whitespace_re.search() each time. It will improve performance a bit (and reduce the line count, too, I suspect).

A test of this would be nice, too, since it's not immediately clear that it works 100%.

comment:6 by Chris Beaven, 18 years ago

Needs tests: set

by didier Belot <dib@…>, 17 years ago

patch against trunk rev. 5783, simplified

comment:7 by didier Belot <dib@…>, 17 years ago

Yes, the word_re in my patch can be as simple as the first below

# split at words boundaries
word_re = re.compile(r'(\w+)',re.UNICODE) 

# split at **markdown** or _such marked_  up words
word_re = re.compile(r'((?:\*|_)*\w+(?:\*|_)*)',re.UNICODE) 

Anyway, it was just for the fun, as I don't think this is the way to go: chaining truncatewords to markdown will break anyway! I Think it's better to use markdon first, then chain to htmltruncatewords.

What sort of test are needed for such a patch ?

by Matt McClanahan, 17 years ago

Attachment: 1919-1.diff added

Add a regression test

comment:8 by Matt McClanahan, 17 years ago

Needs tests: unset
Patch needs improvement: unset

comment:9 by Matt McClanahan, 17 years ago

Keywords: sprintsept14 added

comment:10 by James Bennett, 17 years ago

Resolution: fixed
Status: newclosed

An HTML-safe truncation filter was added some time ago, so the solution here is to do the markup filtering before truncating.

comment:11 by Chris Beaven, 17 years ago

Resolution: fixed
Status: closedreopened

I disagree that it's been fixed. truncate_words should preserve existing white space. We have a patch here and one in #4655. Either one of those seems to do the job.

comment:12 by deryck, 17 years ago

I agree with James that this should be closed. Just use truncate_html_words to preserve space. Changing truncate_words will affect the performance of that function.

Not closing it again, though, because I'm attaching tests to show truncate_html_words will do what's needed and just so django.utils.text.truncate_* gets a little test coverage.

by deryck, 17 years ago

Attachment: truncate_tests.diff added

Tests for django.utils.text.trunacte_*.

comment:13 by anonymous, 17 years ago

Rubbish about affecting performance. My version is faster than the current one anyway!
Run truncate.py from a manage.py shell if you want a test comparison

by Chris Beaven, 17 years ago

Attachment: truncate.py added

comment:14 by Chris Beaven, 17 years ago

Stupid basic HTTP auth. Twas my anonymous comment.

by arien <regexbot@…>, 17 years ago

Attachment: truncate_words.diff added

truncate_words that does what it says on the package; includes tests

by arien <regexbot@…>, 17 years ago

Attachment: truncate_words_faster.diff added

faster version of the same idea; includes additional tests for trailing whitespace handling

comment:15 by arien <regexbot@…>, 17 years ago

The last patch contains a version of trunacte_words that preserves all whitespace before the nth "word" (sequence of non-blanks); see the tests for exact behaviour.

(This version of truncate_words differs from the previous version in that it doesn't use re.UNICODE for the regex and that it updates the counter by hand instead of using enumerate. This patch also contains a couple more tests for trailing whitespace handling.)

This version is somewhat slower than the current version when the string contains few words or is to be truncated after a large fraction of the words in the string, but is faster when truncating after a small fraction of the words in a strings containing many words. YMMV.

I tried to a version using re.split as well (splitting on and capturing non-whitespace, with max_split=length) but it consistently performed worse than the attached version IIRC.

One comment about testing the performance of the current version and alternatives to each other: be sure to compare like with like, i.e., make sure your benchmarking code looks like this:

from django.utils.text import truncate_words
from django.utils.encoding import force_unicode
from django.utils.functional import allow_lazy

def my_truncate_words(s, num):
    length = int(num)
    s = force_unicode(s)
    # code to return a Unicode string
my_truncate_words = allow_lazy(my_truncate_words, unicode)

# other versions follow same pattern

comment:16 by Chris Beaven, 17 years ago

Looks good, arien. Only question is that I'm not sure this is the correct behaviour:

>>> truncatewords(u'  Text with leading and trailing whitespace  ', 6)
u'  Text with leading and trailing whitespace ...'

It seems like the filter should be counting the words, not the spaces between them. So if there's 6 words, it should just output the original string.
Thoughts?

comment:17 by arien <regexbot@…>, 17 years ago

I think you can argue either way. The approach I took was to truncate immediately after the nth word if the the input string had any characters following it. The reason: "Truncates a string after a certain number of words." ;-)

What seem to be proposing is to return a string that has no more than n words in it. (I think that'll mean counting spaces and not words. We'll see...) Does this look reasonable?

>>> truncatewords(u'  Double-spaced  text.  ', 2) # and above
u'  Double-spaced  text.  '
>>> truncatewords(u'  Double-spaced  text.  ', 1)
u'  Double-spaced  ...'
>>> truncatewords(u'  Double-spaced  text.  ', 0) # and below
u'  ...'

comment:18 by Chris Beaven, 17 years ago

I agree, it could be taken either way. Don't take my word as the way it should be ;)

How I see that it should work: If the string has less than or equal to the number of words, then the original string should be returned.
Otherwise the string should be truncated to the Xth word and appended with ' ...' (one space then an elipsis).

comment:19 by Adam Nelson, 14 years ago

Patch needs improvement: set

Trunk has changed since patch was uploaded.

comment:20 by Łukasz Rekucki, 14 years ago

Severity: minorNormal
Type: enhancementBug

comment:21 by Aymeric Augustin, 13 years ago

UI/UX: unset

Change UI/UX from NULL to False.

comment:22 by Aymeric Augustin, 13 years ago

Easy pickings: unset

Change Easy pickings from NULL to False.

comment:23 by Aymeric Augustin, 12 years ago

Status: reopenednew

comment:24 by Chris Wilson, 11 years ago

Owner: changed from nobody to Chris Wilson
Status: newassigned

I've merged arien's patch and used SmileyChris' comments to create a pull request for review.

comment:25 by Chris Wilson, 11 years ago

Summary: filter truncatewords wipes newlines from string, so not chainable with markup filtersfilter truncatewords is inefficient and destroys white space

comment:26 by Tim Graham, 10 years ago

I left comments for improvement on the PR. Please uncheck "Patch needs improvement" when you update it, thanks.

comment:27 by tanner, 10 years ago

I think this bug (actually feature request) can be closed as the change would be backwards incompatible
and I don't see a need for a new filter "truncatewords_white" (pick any name).

comment:28 by Tim Graham, 10 years ago

Resolution: wontfix
Status: assignedclosed
Note: See TracTickets for help on using tickets.
Back to Top