Opened 5 years ago
Last modified 9 months ago
#30686 closed Bug
Improve utils.text.Truncator &co to use a full HTML parser. — at Version 7
Reported by: | Thomas Hooper | Owned by: | nobody |
---|---|---|---|
Component: | Utilities | Version: | dev |
Severity: | Normal | Keywords: | |
Cc: | Matthias Kestenholz | Triage Stage: | Ready for checkin |
Has patch: | yes | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description (last modified by )
Original description:
I'm using Truncator.chars to truncate wikis, and it sometimes truncates in the middle of " entities, resulting in '<p>some text &qu</p>'
This is a limitation of the regex based implementation (which has had security issues, and presents an intractable problem).
Better to move to use a HTML parser, for Truncate, and strip_tags(), via html5lib and bleach.
Change History (8)
comment:1 by , 5 years ago
Description: | modified (diff) |
---|
follow-up: 4 comment:2 by , 5 years ago
comment:3 by , 5 years ago
I think now that the security release are out let's just add bleach as dependency on master and be done with it?
comment:4 by , 5 years ago
Here's an example https://repl.it/@tdhooper/Django-truncate-entities-bug
comment:5 by , 5 years ago
btw I confused truncator
with strip_tags
. So in this case the answer would be to rewrite the parser using html5lib
, while split_tags
would use bleach
which in turn then uses html5lib
as well.
comment:6 by , 5 years ago
Looks like it can be fixed with this regex change https://github.com/django/django/pull/11633/files
by , 5 years ago
Attachment: | possible-html5lib-truncator-implementation.patch added |
---|
Example implemetation of _truncate_html() using html5lib, by Florian Apolloner
comment:7 by , 5 years ago
Description: | modified (diff) |
---|---|
Summary: | Truncator.chars splits HTML entities → Improve utils.text.Truncator &co to use a full HTML parser. |
Triage Stage: | Unreviewed → Accepted |
Version: | 2.2 → master |
Right, good news is this isn't a regression from 7f65974f8219729c047fbbf8cd5cc9d80faefe77.
- The new example case fails on v2.2.3 &co.
- The suggestion for the regex change is in the part not changed as part of 7f65974f8219729c047fbbf8cd5cc9d80faefe77. (Which is why the new case fails, I suppose :)
I don't want to accept a tweaking of the regex here. Rather, we should move to using html5lib
as Florian suggests.
Possibly this would entail small changes in behaviour around edge cases, to be called out in release notes, but
would be a big win overall.
This has previously been discussed by the Security Team as the required way forward.
I've updated the title/description and will Accept accordingly.
I've attached an initial WIP patch by Florian of an html5lib
implementation of the core _truncate_html()
method.
An implementation of strip_tags()
using bleach
would go something like:
bleach.clean(text, tags=[], strip=True, strip_comments=True)
Thomas, would taking on making changes like these be something you'd be willing/keen to do? If so, I'm very happy to input to assist in any way. :)
Hi Thomas. Any chance of an example string (hopefully minimal) that creates the behaviour so we can have a look?