Opened 19 years ago
Closed 10 years ago
#1919 closed Bug (wontfix)
filter truncatewords is inefficient and destroys white space
| Reported by: | derelm | Owned by: | Chris Wilson |
|---|---|---|---|
| Component: | Template system | Version: | dev |
| Severity: | Normal | Keywords: | sprintsept14 |
| Cc: | | Triage Stage: | Accepted |
| Has patch: | yes | Needs documentation: | no |
| Needs tests: | no | Patch needs improvement: | yes |
| Easy pickings: | no | UI/UX: | no |
Description
I tried to chain truncatewords with markdown, but this doesn't work because truncatewords removes newlines from the input string. Newlines are crucial for markdown and possibly other markup parsers.
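For context, the problem comes down to how the filter rebuilds the string: it splits on any whitespace and rejoins with single spaces, so the blank lines between Markdown paragraphs disappear. A minimal reproduction of the symptom (not the exact Django source of the time):

```python
# Not Django's actual implementation, just an illustration of the symptom:
# splitting on whitespace and rejoining with single spaces throws away the
# blank line that Markdown needs to keep two paragraphs separate.
text = "First paragraph.\n\nSecond paragraph."

def naive_truncate_words(s, num):
    words = s.split()                 # newlines collapse along with all other whitespace
    if len(words) > num:
        words = words[:num] + ['...']
    return ' '.join(words)

print(naive_truncate_words(text, 10))
# -> First paragraph. Second paragraph.   (the paragraph break is gone)
```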
Attachments (7)
Change History (35)
by , 19 years ago
Attachment: truncate_words_preserving_whitespace.diff added
comment:1 by , 19 years ago
I've attached a patch that will cause the truncatewords filter to preserve all whitespace (including leading and trailing whitespace).
comment:3 by , 18 years ago
Triage Stage: Unreviewed → Ready for checkin
comment:4 by , 18 years ago
Has patch: set
comment:5 by , 18 years ago
Owner: changed
Patch needs improvement: set
Triage Stage: Ready for checkin → Accepted
I'm going to bounce this patch back for improvement (I'm just doing a "check in the obviously correct" pass at the moment, so I don't want to stop and fix it right now).
I think you can get away with just one call to re.split() without the subsequent re.search() every time through the list. After the call to re.split(), every second element in the returned list is the whitespace. So you can iterate through the "words" list (in the patch's terminology) two elements at a time: the first one will be a word, the second the whitespace. That avoids the need to call whitespace_re.search() each time. It will improve performance a bit (and reduce the line count, too, I suspect).
A test of this would be nice, too, since it's not immediately clear that it works 100%.
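A minimal sketch of the approach described above (hypothetical helper name, not the attached patch): with a capturing group, re.split returns word and whitespace chunks alternating, so the words can be counted by index parity without a per-element re.search.

```python
import re

whitespace_re = re.compile(r'(\s+)')   # capturing group keeps the separators in the result

def truncate_words_preserving_ws(s, num):
    # Sketch only; assumes num >= 1.  re.split with a capturing group yields
    # alternating chunks: even indices hold the words (possibly an empty string
    # at the ends), odd indices hold the whitespace between them.
    parts = whitespace_re.split(s)
    out, words_seen = [], 0
    for i, chunk in enumerate(parts):
        out.append(chunk)
        if i % 2 == 0 and chunk:       # a non-empty word (not a separator)
            words_seen += 1
            if words_seen == num:
                break
    if words_seen == num and ''.join(parts[i + 1:]).strip():
        return ''.join(out) + ' ...'   # more words follow the cut point
    return s
```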
comment:6 by , 18 years ago
Needs tests: set
by , 17 years ago
Attachment: truncate_words_preserving_whites.diff added
patch against trunk rev. 5783, simplified
comment:7 by , 17 years ago
Yes, the word_re in my patch can be as simple as the first below
```python
# split at word boundaries
word_re = re.compile(r'(\w+)', re.UNICODE)

# split at **markdown** or _such marked_ up words
word_re = re.compile(r'((?:\*|_)*\w+(?:\*|_)*)', re.UNICODE)
```
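For illustration only (not part of any attached patch), here is how the two patterns behave differently on marked-up input:

```python
import re

simple_re = re.compile(r'(\w+)', re.UNICODE)
markup_re = re.compile(r'((?:\*|_)*\w+(?:\*|_)*)', re.UNICODE)

s = "**bold** word"
print(simple_re.split(s))   # ['**', 'bold', '** ', 'word', '']
print(markup_re.split(s))   # ['', '**bold**', ' ', 'word', '']
# The second pattern keeps the surrounding ** attached to its word, so a cut
# between tokens never separates a marked-up word from its markers.
```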
Anyway, it was just for fun, as I don't think this is the way to go: chaining truncatewords to markdown will break anyway! I think it's better to run markdown first, then chain to truncatewords_html.
What sort of tests are needed for such a patch?
comment:8 by , 17 years ago
Needs tests: unset
Patch needs improvement: unset
comment:9 by , 17 years ago
Keywords: sprintsept14 added
comment:10 by , 17 years ago
Resolution: → fixed
Status: new → closed
An HTML-safe truncation filter was added some time ago, so the solution here is to do the markup filtering before truncating.
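A rough sketch of that ordering outside the template layer, assuming the third-party markdown package and the django.utils.text.truncate_html_words helper that backed the truncatewords_html filter in Django versions of that era:

```python
# Markup first, then HTML-aware truncation (assumed APIs of that era; the
# modern equivalent lives in django.utils.text.Truncator).
import markdown
from django.utils.text import truncate_html_words

source = "First paragraph.\n\nSecond paragraph with *emphasis* and more words."
html = markdown.markdown(source)       # blank lines become real <p> elements
print(truncate_html_words(html, 5))    # truncates by word count, keeping tags balanced
```

In a template the same chain would be roughly `{{ value|markdown|truncatewords_html:5 }}`, with the markdown filter coming from django.contrib.markup at the time.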
comment:11 by , 17 years ago
Resolution: fixed
Status: closed → reopened
I disagree that it's been fixed. truncate_words should preserve existing white space. We have a patch here and one in #4655. Either one of those seems to do the job.
comment:12 by , 17 years ago
I agree with James that this should be closed. Just use truncate_html_words to preserve space. Changing truncate_words will affect the performance of that function.
Not closing it again, though, because I'm attaching tests to show truncate_html_words will do what's needed and just so django.utils.text.truncate_* gets a little test coverage.
comment:13 by , 17 years ago
Rubbish about affecting performance. My version is faster than the current one anyway!
Run truncate.py from a manage.py shell if you want a test comparison.
by , 17 years ago
Attachment: truncate.py added
by , 17 years ago
Attachment: truncate_words.diff added
truncate_words that does what it says on the package; includes tests
by , 17 years ago
Attachment: truncate_words_faster.diff added
faster version of the same idea; includes additional tests for trailing whitespace handling
comment:15 by , 17 years ago
The last patch contains a version of truncate_words that preserves all whitespace before the nth "word" (sequence of non-blanks); see the tests for exact behaviour.
(This version of truncate_words differs from the previous version in that it doesn't use re.UNICODE for the regex and that it updates the counter by hand instead of using enumerate. This patch also contains a couple more tests for trailing whitespace handling.)
This version is somewhat slower than the current version when the string contains few words or is to be truncated after a large fraction of the words in the string, but is faster when truncating after a small fraction of the words in a string containing many words. YMMV.
I tried a version using re.split as well (splitting on and capturing non-whitespace, with maxsplit=length), but it consistently performed worse than the attached version, IIRC.
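For reference, such an re.split-with-maxsplit version might have looked roughly like the following (a hypothetical reconstruction, shown only to make the comparison concrete, not the code that was actually benchmarked):

```python
import re

non_ws_re = re.compile(r'(\S+)')   # capture runs of non-whitespace ("words")

def truncate_words_maxsplit(s, num):
    # Hypothetical sketch; assumes num >= 1 (re.split treats maxsplit=0 as "no limit").
    parts = non_ws_re.split(s, num)           # splits off at most `num` words
    if len(parts) < 2 * num + 1 or not parts[2 * num]:
        return s                              # num words or fewer, nothing after the last one
    return ''.join(parts[:2 * num]) + ' ...'  # cut immediately after the num-th word
```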
One comment about comparing the performance of the current version and the alternatives: be sure to compare like with like, i.e., make sure your benchmarking code looks like this:

```python
from django.utils.text import truncate_words
from django.utils.encoding import force_unicode
from django.utils.functional import allow_lazy

def my_truncate_words(s, num):
    length = int(num)
    s = force_unicode(s)
    # code to return a Unicode string

my_truncate_words = allow_lazy(my_truncate_words, unicode)

# other versions follow the same pattern
```
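A hypothetical harness for such a like-with-like comparison (names are placeholders, not anything attached to this ticket):

```python
# Each candidate is defined with the same force_unicode + allow_lazy wrapping
# shown above, so every version pays the same per-call overhead.
import timeit

setup = (
    "from __main__ import my_truncate_words\n"
    "s = u'lorem ipsum dolor sit amet ' * 200"
)
print(timeit.timeit("my_truncate_words(s, 50)", setup=setup, number=2000))
```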
comment:16 by , 17 years ago
Looks good, arien. Only question is that I'm not sure this is the correct behaviour:
```
>>> truncatewords(u' Text with leading and trailing whitespace ', 6)
u' Text with leading and trailing whitespace ...'
```
It seems like the filter should be counting the words, not the spaces between them. So if there's 6 words, it should just output the original string.
Thoughts?
comment:17 by , 17 years ago
I think you can argue either way. The approach I took was to truncate immediately after the nth word if the input string had any characters following it. The reason: "Truncates a string after a certain number of words."
;-)
What you seem to be proposing is to return a string that has no more than n words in it. (I think that'll mean counting spaces and not words. We'll see...) Does this look reasonable?
```
>>> truncatewords(u' Double-spaced text. ', 2)  # and above
u' Double-spaced text. '
>>> truncatewords(u' Double-spaced text. ', 1)
u' Double-spaced ...'
>>> truncatewords(u' Double-spaced text. ', 0)  # and below
u' ...'
```
comment:18 by , 17 years ago
I agree, it could be taken either way. Don't take my word as the way it should be ;)
How I see that it should work: if the string has a number of words less than or equal to the requested count, then the original string should be returned. Otherwise the string should be truncated to the Xth word and appended with ' ...' (one space, then an ellipsis).
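A sketch of that rule (hypothetical helper, counting words rather than looking at what follows them); with this behaviour the six-word example from comment:16 would come back unchanged:

```python
import re

word_re = re.compile(r'\S+')

def truncate_to_word_count(s, num):
    # Sketch of the rule proposed here: num words or fewer -> return unchanged;
    # otherwise cut right after the num-th word and append ' ...'.
    matches = list(word_re.finditer(s))
    if len(matches) <= num:
        return s
    end = matches[num - 1].end() if num > 0 else 0
    return s[:end] + ' ...'
```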
comment:19 by , 14 years ago
Patch needs improvement: set
Trunk has changed since the patch was uploaded.
comment:20 by , 14 years ago
Severity: minor → Normal
Type: enhancement → Bug
comment:23 by , 12 years ago
Status: reopened → new
comment:24 by , 12 years ago
Owner: changed
Status: new → assigned
I've merged arien's patch and used SmileyChris' comments to create a pull request for review.
comment:25 by , 12 years ago
Summary: filter truncatewords wipes newlines from string, so not chainable with markup filters → filter truncatewords is inefficient and destroys white space
comment:26 by , 10 years ago
I left comments for improvement on the PR. Please uncheck "Patch needs improvement" when you update it, thanks.
comment:27 by , 10 years ago
I think this bug (actually a feature request) can be closed, as the change would be backwards incompatible, and I don't see a need for a new filter "truncatewords_white" (pick any name).
comment:28 by , 10 years ago
Resolution: → wontfix
Status: assigned → closed
Patch against r2991 (trunk); causes the truncatewords filter to preserve whitespace.