Opened 5 years ago

Closed 9 months ago

Last modified 8 months ago

#30686 closed Bug (fixed)

Improve utils.text.Truncator &co to use a full HTML parser.

Reported by: Thomas Hooper Owned by: David Smith
Component: Utilities Version: dev
Severity: Normal Keywords:
Cc: Matthias Kestenholz Triage Stage: Ready for checkin
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description (last modified by Carlton Gibson)

Original description:

I'm using Truncator.chars to truncate wikis, and it sometimes truncates in the middle of &quot; entities, resulting in '<p>some text &qu</p>'

This is a limitation of the regex based implementation (which has had security issues, and presents an intractable problem).

Better to move to use a HTML parser, for Truncate, and strip_tags(), via html5lib and bleach.

Attachments (1)

possible-html5lib-truncator-implementation.patch (4.6 KB ) - added by Carlton Gibson 5 years ago.
Example implemetation of _truncate_html() using html5lib, by Florian Apolloner

Download all attachments as: .zip

Change History (40)

comment:1 by Thomas Hooper, 5 years ago

Description: modified (diff)

comment:2 by Carlton Gibson, 5 years ago

Hi Thomas. Any chance of an example string (hopefully minimal) that creates the behaviour so we can have a look?

comment:3 by Florian Apolloner, 5 years ago

I think now that the security release are out let's just add bleach as dependency on master and be done with it?

in reply to:  2 comment:4 by Thomas Hooper, 5 years ago

comment:5 by Florian Apolloner, 5 years ago

btw I confused truncator with strip_tags. So in this case the answer would be to rewrite the parser using html5lib, while split_tags would use bleach which in turn then uses html5lib as well.

comment:6 by Thomas Hooper, 5 years ago

Looks like it can be fixed with this regex change https://github.com/django/django/pull/11633/files

by Carlton Gibson, 5 years ago

Example implemetation of _truncate_html() using html5lib, by Florian Apolloner

comment:7 by Carlton Gibson, 5 years ago

Description: modified (diff)
Summary: Truncator.chars splits HTML entitiesImprove utils.text.Truncator &co to use a full HTML parser.
Triage Stage: UnreviewedAccepted
Version: 2.2master

Right, good news is this isn't a regression from 7f65974f8219729c047fbbf8cd5cc9d80faefe77.

  • The new example case fails on v2.2.3 &co.
  • The suggestion for the regex change is in the part not changed as part of 7f65974f8219729c047fbbf8cd5cc9d80faefe77. (Which is why the new case fails, I suppose :)

I don't want to accept a tweaking of the regex here. Rather, we should move to using html5lib as Florian suggests.
Possibly this would entail small changes in behaviour around edge cases, to be called out in release notes, but
would be a big win overall.

This has previously been discussed by the Security Team as the required way forward.
I've updated the title/description and will Accept accordingly.

I've attached an initial WIP patch by Florian of an html5lib implementation of the core _truncate_html() method.

An implementation of strip_tags() using bleach would go something like:

bleach.clean(text, tags=[], strip=True, strip_comments=True)

Thomas, would taking on making changes like these be something you'd be willing/keen to do? If so, I'm very happy to input to assist in any way. :)


comment:8 by Thomas Hooper, 5 years ago

Hi Carlton, that would be fun, but this is bigger than I have time for now. It looks like you all have it in hand.

comment:9 by Claude Paroz, 5 years ago

Do we want to make both html5lib and bleach required dependencies of Django?
html5lib latest release is now 20 months ago, and when I read issues like https://github.com/html5lib/html5lib-python/issues/419 without any maintainer feedback, I'm a bit worried. What about the security report workflow for those libs? What if a security issue is discovered in html5 lib and the maintainers are unresponsive? Sorry to sound a bit negative, but I think those questions must be asked.

comment:10 by Carlton Gibson, 5 years ago

Yep Claude, absolutely.

I think there's two difficulties we could face:

  • trying to successfully sanitize HTML with regexes.
  • (Help) Make sure html5lib-python is maintained.

The first of these is intractable. The second not. 🙂

I've put out some feelers to try and find out more.

  • This is pressing for Python and pip now, not for us for a while yet.
  • If we look at https://github.com/html5lib/html5lib-python/issues/361 it seems there's some money on the table from tidelift potentially.
  • We COULD allocate some time in a pinch I think.
  • AND it's just (even with the emphasis, cough) a wrapper around the underlying C library, so whilst 20 months seems a long time, I'm not sure the release cadence is really an issue.

BUT, yes, absolutely. Let's hammer this out properly before we commit. 👍
I will open a mailing list thread when I know more.

Last edited 5 years ago by Carlton Gibson (previous) (diff)

comment:11 by Carlton Gibson, 5 years ago

AND it's just (even with the emphasis, cough) a wrapper around the underlying C library, so whilst 20 months seems a long time, I'm not sure the release cadence is really an issue.

OK, that last one isn't at all true. (Looking at the source it's the entire implementation.)

Update: I had `lxml` in mind.

Last edited 5 years ago by Carlton Gibson (previous) (diff)

comment:12 by Claude Paroz, 5 years ago

To be clear, I'm also convinced parsing is more reliable than regexes. I just think we have to double-think before adding a dependency, because as the name implies, we depend on it and therefore we must be able to trust its maintainers. Some guarantees about the security process and serious bugs fixing should be obtained. Without that, we are just outsourcing problems.

comment:13 by Carlton Gibson, 5 years ago

@Claude: 💯👍 Totally agree.

comment:14 by Carlton Gibson, 5 years ago

Duplicate in #30700, with failing test case provided.

I've tried contacting maintainers of HTML5lib with no success.

I've re-opened https://github.com/django/django/pull/11633 (original regex based suggestion) so we can at least assess it as a possible stop-gap.

comment:15 by Carlton Gibson, 5 years ago

Cc: Jon Dufresne added

Paging Jon, to ask his opinion on this.

Hey Jon, I see you've made a number of PRs to both html5lib, and bleach.

To me, at this point, html5lib essentially looks unmaintained. I don't have personal capacity to give to it, as cool as it is as a project. Arguably we (Fellows) could allocate it _some_ time, since we spend a fair bit already messing around with regexes but that would be small, and we couldn't take it on whole, so can I ask your thoughts?

Is html5lib in trouble? If so, as a user, what are your plans, if any? And from that, what do you think about Django adopting it? What's the alternative?

Thanks for the thought and insight.

comment:16 by Jon Dufresne, 5 years ago

To me, at this point, html5lib essentially looks unmaintained.

I agree with this observation. The previous main maintainer looks to have stopped working on the project. Responses to issues and PRs have stopped.

Is html5lib in trouble? If so, as a user, what are your plans, if any? And from that, what do you think about Django adopting it? What's the alternative?

For my own projects, I'll probably continue using html5lib until its staleness creates an observable bug for me. I haven't hit that point yet.

Bleach, on the other hand, looks like maintenance has slowed, but not stopped. I believe they have vendored html5lib to allow them to make changes internally. FWIW, I also still use Bleach.

---

I'm not familiar with all the details of this ticket, but would the stdlib HTML parser be sufficient?

https://docs.python.org/3/library/html.parser.html

comment:17 by Carlton Gibson, 5 years ago

Hi Jon,

Thank you for the comments. I will email Will, the maintainer of Bleach, and ask his thoughts too. Bleach has slowed down, but that's because it's Stable/Mature now I would have thought.

...would the stdlib HTML parser be sufficient?

Yes. Maybe. Ideally we just thought to bring in Bleach, and with it html5lib since, in theory, that's already working code. (Florian already had a Truncate prototype...)

Anyhow... will follow-up.

Last edited 5 years ago by Carlton Gibson (previous) (diff)

comment:18 by David Smith, 22 months ago

Owner: changed from nobody to David Smith
Status: newassigned

comment:19 by Carlton Gibson, 22 months ago

Cc: Matthias Kestenholz added

Adding some detail after the last post, since you're looking at it David.

There was a discussion (with various folks from html5lib, and Mozilla, and ...) about whether html5lib could be put on a better footing.
I'm not sure how that panned out in the medium term. (I didn't check what the rhythm looks like now.)

There was also talk about whether bleach (or an alternate) could build off html5ever which is the HTML parser from the Mozilla servo project.

That would be pretty cool, but it was clearly a lot of work, and then 2020 happened, so...

The other candidate in this space is Matthias' html-sanitizer: https://github.com/matthiask/html-sanitizer — which is built on lxml.

That's just to lay down the notes I had gathered. I'm not sure the way forward, but hopefully it's helpful.
Very open to ideas though! Thanks for picking it up.

Last edited 22 months ago by Carlton Gibson (previous) (diff)

comment:20 by Jon Dufresne, 22 months ago

Cc: Jon Dufresne removed

comment:21 by Matthias Kestenholz, 22 months ago

Hi all

lxml is quite a heavy dependency. It works very well but you'll wait for the compilation a long time if you do not have wheels. (see https://pypi.org/project/lxml/#files) I think Python packaging is almost a non-issue these days except when it comes to transitive dependencies, and I wouldn't want to be in charge of specifying and updating the supported range of lxml versions. That being said, I encountered almost no breaking changes in lxml since ~2009, I use lxml in almost all projects and I can heartily recommend it to anyone.

I'm sure that the regex-based solution has some problems; I'm sorry to admit I haven't read the full thread but I just cannot imagine a situation where using |strip_tags without |safe would lead to a security issue, and why would you want to combine these? There's no point to mark a string as safe after stripping all tags. So it's only about the fact that the output sometimes isn't nice, something which may be fixed by converting as many entities to their unicode equivalents as possible and only truncating afterwards?

Last but not least: I haven't benchmarked it ever, but I have the suspicion that running bleach or html-sanitizer during rendering may be wasteful in terms of CPU cycles. I only ever use the sanitizer when saving, never when rendering. |strip_tags is obviously applied when rendering and performs well enough in many situations.

So, to me strip_tags is a clear case of a simple implementation with "worse is better" characteristics.

I truly hope this is helpful and not just a cold shower (sorry for using "just" here)

Thanks,
Matthias

comment:22 by Carlton Gibson, 22 months ago

Hey Matthias — that's a very useful input. Thank you for your insight.

So, to me strip_tags is a clear case of ​a simple implementation with "worse is better" characteristics.

Let, me review what happened here tomorrow (it was a long while ago) but assuming it makes sense, wontfix + We're not accepting any complications to the algorithm — use ... if you need more sophistication may be the neatest way all round.

Last edited 22 months ago by Carlton Gibson (previous) (diff)

comment:23 by David Smith, 22 months ago

PR

I was thinking about Jon's suggestion of using the HTMLParser from the standard library. Since the last comments on this ticket Adam Johnson has also written a blog post on Truncating HTML with Python's HTMLParser which helped inspire my PR, see blog post. (I'd cc Adam as I've mentioned his name, but not sure how to do that?!)

While my PR still needs more work I thought it worth sharing as it may be helpful to Carlton when reviewing tomorrow.

comment:24 by Carlton Gibson, 22 months ago

Hey David — great stuff. With that in play I won't rush to resolve.

Thinking about Matthias' comment:

Last but not least: I haven't benchmarked it ever, but I have the suspicion that...

Could I ask you to do a minimal timeit/pyperf comparison, so we can get a rough measure on the table? (It doesn't need to be perfect.)

comment:25 by David Smith, 21 months ago

Has patch: set

comment:26 by Carlton Gibson, 20 months ago

Patch needs improvement: set

comment:27 by David Smith, 16 months ago

Patch needs improvement: unset

comment:28 by Mariusz Felisiak <felisiak.mariusz@…>, 16 months ago

In 6f1b8c00:

Refs #30686 -- Moved add_truncation_text() helper to a module level.

comment:29 by Mariusz Felisiak <felisiak.mariusz@…>, 16 months ago

In 1d0dfc0:

Refs #30686 -- Moved Parser.SELF_CLOSING_TAGS to django.utils.html.VOID_ELEMENTS

comment:30 by Mariusz Felisiak, 16 months ago

Patch needs improvement: set

Based on comment.

comment:31 by Sage Abdullah, 15 months ago

Patch needs improvement: unset

The PR seems to have been rebased and updated since the last update on this ticket. Unsetting "Patch needs improvement" if that's OK.

comment:32 by Sage Abdullah, 15 months ago

Patch needs improvement: set

Oops, sorry, should've looked more closely. The comment linked above hasn't been resolved.

comment:33 by David Smith, 9 months ago

Patch needs improvement: unset

comment:34 by Mariusz Felisiak <felisiak.mariusz@…>, 9 months ago

In 48a46939:

Refs #30686 -- Improved test coverage of Truncator.

comment:35 by Mariusz Felisiak <felisiak.mariusz@…>, 9 months ago

In 70f39e4:

Refs #30686 -- Fixed text truncation for negative or zero lengths.

comment:36 by Mariusz Felisiak, 9 months ago

Triage Stage: AcceptedReady for checkin

comment:37 by Mariusz Felisiak <felisiak.mariusz@…>, 9 months ago

Resolution: fixed
Status: assignedclosed

In 6ee37ad:

Fixed #30686 -- Used Python HTMLParser in utils.text.Truncator.

comment:38 by GitHub <noreply@…>, 9 months ago

In 3cadeea:

Refs #30686 -- Removed unused regexes in django.utils.text.

Unused since 6ee37ada3241ed263d8d1c2901b030d964cbd161.

comment:39 by GitHub <noreply@…>, 8 months ago

In 95ae3783:

Refs #30686 -- Made django.utils.html.VOID_ELEMENTS a frozenset.

Note: See TracTickets for help on using tickets.
Back to Top