Changes between Version 1 and Version 7 of Ticket #30686


Timestamp:
Aug 7, 2019, 4:17:47 AM (5 years ago)
Author:
Carlton Gibson
Comment:

Right, good news is this isn't a regression from 7f65974f8219729c047fbbf8cd5cc9d80faefe77.

  • The new example case fails on v2.2.3 &co.
  • The suggested regex change is in a part of the code that 7f65974f8219729c047fbbf8cd5cc9d80faefe77 did not touch. (Which is why the new case fails, I suppose :)

I don't want to accept a tweaking of the regex here. Rather, we should move to using html5lib as Florian suggests. Possibly this would entail small changes in behaviour around edge cases, to be called out in release notes, but would be a big win overall.

This has previously been discussed by the Security Team as the required way forward. I've updated the title/description and will Accept accordingly.

I've attached an initial WIP patch by Florian of an html5lib implementation of the core _truncate_html() method.

An implementation of strip_tags() using bleach would go something like:

bleach.clean(text, tags=[], strip=True, strip_comments=True)
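
To illustrate the parser-based approach without the third-party dependency, here is a minimal sketch that uses the standard library's html.parser as a stand-in. It is not the bleach call above and not Django's implementation; the names _TextCollector and strip_tags_sketch are hypothetical. Unlike bleach.clean(), which returns escaped output, this sketch decodes entities into plain text.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Hypothetical helper: keep text nodes, drop tags and comments."""

    def __init__(self):
        # convert_charrefs=True decodes references such as &quot; into text
        # before they reach handle_data().
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Tags and comments go to other handlers, which default to no-ops,
        # so only text content is collected here.
        self.parts.append(data)


def strip_tags_sketch(text):
    """Return the text content of `text` with all markup removed."""
    parser = _TextCollector()
    parser.feed(text)
    parser.close()
    return "".join(parser.parts)
```

For example, strip_tags_sketch('<p>some <b>text</b></p>') gives 'some text'. The point is that the tokenizer, not a regex, decides where tags begin and end.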

Thomas, would you be willing/keen to take on changes like these? If so, I'm very happy to give input or assist in any way. :)

  • Ticket #30686

    • Property Triage Stage: Unreviewed → Accepted
    • Property Summary: Truncator.chars splits HTML entities → Improve utils.text.Truncator &co to use a full HTML parser.
    • Property Version: 2.2 → master
  • Ticket #30686 – Description

    Removed (v1, line 1):

      I'm using Truncator.chars to truncate wikis, and it sometimes truncates in the middle of &quot; entities, resulting in '<p>some text &qu</p>'

    Added (v7, lines 1 to 7):

      Original description:

      > I'm using Truncator.chars to truncate wikis, and it sometimes truncates in the middle of &quot; entities, resulting in '<p>some text &qu</p>'

      This is a limitation of the regex-based implementation (which has had security issues, and presents an intractable problem).

      Better to move to an HTML parser for Truncator and strip_tags(), via html5lib and bleach.
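
The entity-splitting problem in the original description can be avoided once truncation works on parser events rather than raw characters. The following is a minimal sketch under stated assumptions: it uses the standard library's html.parser instead of html5lib, it is not the attached WIP patch, and the names _Truncating, truncate_html_chars, and VOID are hypothetical. It counts each character reference as one character, never cuts inside one, and closes any tags left open. It adds no ellipsis and drops comments and doctypes.

```python
from html.parser import HTMLParser

# Assumption: a small, incomplete set of void elements that never get a close tag.
VOID = {"br", "hr", "img", "input", "meta", "link"}


class _Truncating(HTMLParser):
    """Hypothetical helper: copy markup through until `limit` text chars are seen."""

    def __init__(self, limit):
        # convert_charrefs=False so references arrive as separate events
        # and can be emitted whole instead of being cut in the middle.
        super().__init__(convert_charrefs=False)
        self.limit = limit
        self.count = 0        # text characters emitted so far
        self.out = []         # output fragments
        self.open_tags = []   # stack of tags still awaiting a close
        self.done = False

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        self.out.append(self.get_starttag_text())
        if tag not in VOID:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.done:
            return
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.done:
            return
        room = self.limit - self.count
        self.out.append(data[:room])
        self.count += min(len(data), room)
        self.done = self.count >= self.limit

    def _emit_ref(self, raw):
        # A character reference counts as one character and is kept intact.
        if self.done:
            return
        self.out.append(raw)
        self.count += 1
        self.done = self.count >= self.limit

    def handle_entityref(self, name):
        self._emit_ref(f"&{name};")

    def handle_charref(self, name):
        self._emit_ref(f"&#{name};")


def truncate_html_chars(html, limit):
    """Truncate `html` to `limit` text characters, keeping entities and tags whole."""
    parser = _Truncating(limit)
    parser.feed(html)
    parser.close()
    closing = "".join(f"</{tag}>" for tag in reversed(parser.open_tags))
    return "".join(parser.out) + closing
```

With the input from the report, truncate_html_chars('<p>some text &quot;quoted&quot;</p>', 11) yields '<p>some text &quot;</p>' rather than '<p>some text &qu</p>'.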