Code

Opened 8 years ago

Last modified 11 months ago

#1919 assigned Bug

filter truncatewords is inefficient and destroys white space

Reported by: derelm Owned by: gcc
Component: Template system Version: master
Severity: Normal Keywords: sprintsept14
Cc: Triage Stage: Accepted
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: yes
Easy pickings: no UI/UX: no

Description

i tried to chain truncatewords with markdown, but this doesn't work as truncatewords removes newlines from the input string.
newlines are crucial for markdown and possibly other markup-parsers.

Attachments (7)

truncate_words_preserving_whitespace.diff (1.2 KB) - added by Tom Tobin <korpios@…> 8 years ago.
patch against r2991 (trunk); cause truncatewords filter to preserve whitespace
truncate_words_preserving_whites.diff (962 bytes) - added by didier Belot <dib@…> 7 years ago.
patch against trunk rev. 5783, simplified
1919-1.diff (2.5 KB) - added by mattmcc 7 years ago.
Add a regression test
truncate_tests.diff (2.0 KB) - added by deryck 7 years ago.
Tests for django.utils.text.trunacte_*.
truncate.py (2.5 KB) - added by SmileyChris 7 years ago.
truncate_words.diff (1.8 KB) - added by arien <regexbot@…> 6 years ago.
truncate_words that does what it says on the package; includes tests
truncate_words_faster.diff (2.0 KB) - added by arien <regexbot@…> 6 years ago.
faster version of the same idea; includes additional tests for trailing whitespace handling

Download all attachments as: .zip

Change History (32)

Changed 8 years ago by Tom Tobin <korpios@…>

patch against r2991 (trunk); cause truncatewords filter to preserve whitespace

comment:1 Changed 8 years ago by Tom Tobin <korpios@…>

I've attached a patch that will cause the truncatewords filter to preserve all whitespace (including leading and trailing whitespace).

comment:2 Changed 7 years ago by anonymous

  • milestone Version 1.0 deleted

Milestone Version 1.0 deleted

comment:3 Changed 7 years ago by SmileyChris

  • Triage Stage changed from Unreviewed to Ready for checkin

comment:4 Changed 7 years ago by SmileyChris

  • Has patch set

comment:5 Changed 7 years ago by mtredinnick

  • Owner changed from adrian to mtredinnick
  • Patch needs improvement set
  • Triage Stage changed from Ready for checkin to Accepted

I'm going to bounce this patch back for improvement (I'm just doing a "check in the obviously correct" pass at the moment, so I don't want to stop and fix it right now).

I think you can get away with just one call to re.split() without the subsequent re.search() every time through the list. After the call to re.split(), every second element in the returned list is the whitespace. So you can iterate through the "words" list (in the patch's terminology) two elements at a time: the first one will be a word, the second the whitespace. That avoids the need to call whitespace_re.search() each time. It will improve performance a bit (and reduce the line count, too, I suspect).

A test of this would be nice, too, since it's not immediately clear that it works 100%.

comment:6 Changed 7 years ago by SmileyChris

  • Needs tests set

Changed 7 years ago by didier Belot <dib@…>

patch against trunk rev. 5783, simplified

comment:7 Changed 7 years ago by didier Belot <dib@…>

Yes, the word_re in my patch can be as simple as the first below

# split at words boundaries
word_re = re.compile(r'(\w+)',re.UNICODE) 

# split at **markdown** or _such marked_  up words
word_re = re.compile(r'((?:\*|_)*\w+(?:\*|_)*)',re.UNICODE) 

Anyway, it was just for the fun, as I don't think this is the way to go: chaining truncatewords to markdown will break anyway! I Think it's better to use markdon first, then chain to htmltruncatewords.

What sort of test are needed for such a patch ?

Changed 7 years ago by mattmcc

Add a regression test

comment:8 Changed 7 years ago by mattmcc

  • Needs tests unset
  • Patch needs improvement unset

comment:9 Changed 7 years ago by mattmcc

  • Keywords sprintsept14 added

comment:10 Changed 7 years ago by ubernostrum

  • Resolution set to fixed
  • Status changed from new to closed

An HTML-safe truncation filter was added some time ago, so the solution here is to do the markup filtering before truncating.

comment:11 Changed 7 years ago by SmileyChris

  • Resolution fixed deleted
  • Status changed from closed to reopened

I disagree that it's been fixed. truncate_words should preserve existing white space. We have a patch here and one in #4655. Either one of those seems to do the job.

comment:12 Changed 7 years ago by deryck

I agree with James that this should be closed. Just use truncate_html_words to preserve space. Changing truncate_words will affect the performance of that function.

Not closing it again, though, because I'm attaching tests to show truncate_html_words will do what's needed and just so django.utils.text.truncate_* gets a little test coverage.

Changed 7 years ago by deryck

Tests for django.utils.text.trunacte_*.

comment:13 Changed 7 years ago by anonymous

Rubbish about affecting performance. My version is faster than the current one anyway!
Run truncate.py from a manage.py shell if you want a test comparison

Changed 7 years ago by SmileyChris

comment:14 Changed 7 years ago by SmileyChris

Stupid basic HTTP auth. Twas my anonymous comment.

Changed 6 years ago by arien <regexbot@…>

truncate_words that does what it says on the package; includes tests

Changed 6 years ago by arien <regexbot@…>

faster version of the same idea; includes additional tests for trailing whitespace handling

comment:15 Changed 6 years ago by arien <regexbot@…>

The last patch contains a version of trunacte_words that preserves all whitespace before the nth "word" (sequence of non-blanks); see the tests for exact behaviour.

(This version of truncate_words differs from the previous version in that it doesn't use re.UNICODE for the regex and that it updates the counter by hand instead of using enumerate. This patch also contains a couple more tests for trailing whitespace handling.)

This version is somewhat slower than the current version when the string contains few words or is to be truncated after a large fraction of the words in the string, but is faster when truncating after a small fraction of the words in a strings containing many words. YMMV.

I tried to a version using re.split as well (splitting on and capturing non-whitespace, with max_split=length) but it consistently performed worse than the attached version IIRC.

One comment about testing the performance of the current version and alternatives to each other: be sure to compare like with like, i.e., make sure your benchmarking code looks like this:

from django.utils.text import truncate_words
from django.utils.encoding import force_unicode
from django.utils.functional import allow_lazy

def my_truncate_words(s, num):
    length = int(num)
    s = force_unicode(s)
    # code to return a Unicode string
my_truncate_words = allow_lazy(my_truncate_words, unicode)

# other versions follow same pattern

comment:16 Changed 6 years ago by SmileyChris

Looks good, arien. Only question is that I'm not sure this is the correct behaviour:

>>> truncatewords(u'  Text with leading and trailing whitespace  ', 6)
u'  Text with leading and trailing whitespace ...'

It seems like the filter should be counting the words, not the spaces between them. So if there's 6 words, it should just output the original string.
Thoughts?

comment:17 Changed 6 years ago by arien <regexbot@…>

I think you can argue either way. The approach I took was to truncate immediately after the nth word if the the input string had any characters following it. The reason: "Truncates a string after a certain number of words." ;-)

What seem to be proposing is to return a string that has no more than n words in it. (I think that'll mean counting spaces and not words. We'll see...) Does this look reasonable?

>>> truncatewords(u'  Double-spaced  text.  ', 2) # and above
u'  Double-spaced  text.  '
>>> truncatewords(u'  Double-spaced  text.  ', 1)
u'  Double-spaced  ...'
>>> truncatewords(u'  Double-spaced  text.  ', 0) # and below
u'  ...'

comment:18 Changed 6 years ago by SmileyChris

I agree, it could be taken either way. Don't take my word as the way it should be ;)

How I see that it should work: If the string has less than or equal to the number of words, then the original string should be returned.
Otherwise the string should be truncated to the Xth word and appended with ' ...' (one space then an elipsis).

comment:19 Changed 4 years ago by adamnelson

  • Patch needs improvement set

Trunk has changed since patch was uploaded.

comment:20 Changed 3 years ago by lrekucki

  • Severity changed from minor to Normal
  • Type changed from enhancement to Bug

comment:21 Changed 2 years ago by aaugustin

  • UI/UX unset

Change UI/UX from NULL to False.

comment:22 Changed 2 years ago by aaugustin

  • Easy pickings unset

Change Easy pickings from NULL to False.

comment:23 Changed 13 months ago by aaugustin

  • Status changed from reopened to new

comment:24 Changed 11 months ago by gcc

  • Owner changed from nobody to gcc
  • Status changed from new to assigned

I've merged arien's patch and used SmileyChris' comments to create a pull request for review.

comment:25 Changed 11 months ago by gcc

  • Summary changed from filter truncatewords wipes newlines from string, so not chainable with markup filters to filter truncatewords is inefficient and destroys white space

Add Comment

Modify Ticket

Change Properties
<Author field>
Action
as assigned
The owner will be changed from gcc to anonymous. Next status will be 'assigned'
The ticket will be disowned. Next status will be 'new'
as The resolution will be set. Next status will be 'closed'
Author


E-mail address and user name can be saved in the Preferences.

 
Note: See TracTickets for help on using tickets.