Opened 3 years ago

Closed 2 years ago

#19237 closed Bug (fixed)

The use of > in single or double quoted attributes in strip_tags

Reported by: chris.khoo@… Owned by: khoomeister
Component: Utilities Version: master
Severity: Release blocker Keywords:
Cc: thiderman, simon@… Triage Stage: Ready for checkin
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

At the moment, if you run django.utils.html.strip_tags with '>' in single or double quoted attributes, it won't work.

For example:
In [2]: strip_tags('Some <p onclick="$(\'#id\').html(\'<i>input</i>\')">text</p> here')
Out[2]: u'Some input\')">text here'

Should return "Some text here"

Already patched it with tests in my github repo - khoomeister/django but would like to go through the proper processes!

Chris

Attachments (8)

19237-1.diff (1.4 KB) - added by claudep 3 years ago.
Updated regex
problem_string.py (1.6 KB) - added by simon29 2 years ago.
Problem string which causes an infinite loop
strip_tags_parser.diff (1.2 KB) - added by simon29 2 years ago.
Parser based version of strip_tags filter
19237-parser-2.diff (3.3 KB) - added by claudep 2 years ago.
HTMLParser-based (still failing)
19237-parser-3.diff (3.4 KB) - added by claudep 2 years ago.
HTMLParser: Don't swallow unclosed tags
19237-parser-4.diff (1.7 KB) - added by claudep 2 years ago.
19237-parser-5.diff (2.4 KB) - added by claudep 2 years ago.
19237-parser-6.diff (2.6 KB) - added by claudep 2 years ago.
No content encoding, but warning in docs (more backwards-compatible)

Download all attachments as: .zip

Change History (48)

comment:1 Changed 3 years ago by khoomeister

  • Needs documentation unset
  • Needs tests unset
  • Patch needs improvement unset

comment:2 Changed 3 years ago by anonymous

  • Owner changed from nobody to anonymous
  • Status changed from new to assigned

comment:3 Changed 3 years ago by Chris Khoo <chris.khoo@…>

  • Owner changed from anonymous to Chris Khoo <chris.khoo@…>
  • Status changed from assigned to new

comment:4 Changed 3 years ago by Chris Khoo <chris.khoo@…>

  • Owner changed from Chris Khoo <chris.khoo@…> to anonymous
  • Status changed from new to assigned

comment:5 Changed 3 years ago by khoomeister

  • Owner changed from anonymous to khoomeister
  • Status changed from assigned to new

comment:6 Changed 3 years ago by khoomeister

  • Status changed from new to assigned

comment:7 Changed 3 years ago by khoomeister

Doh - sorry, this is my first time using this!

comment:8 Changed 3 years ago by claudep

  • Component changed from Uncategorized to Core (Other)
  • Triage Stage changed from Unreviewed to Accepted
  • Type changed from Uncategorized to Bug

comment:9 Changed 3 years ago by khoomeister

Current code only works with Python 2.7 - updated it to work with Python 2.5 & 2.6

Link to commit: https://github.com/khoomeister/django/commit/da1ba5fd79c28f0643a1eb0ce18feb51dfe4a499

comment:10 Changed 3 years ago by thiderman

  • Cc thiderman added

I can confirm that this works and the tests pass on Python 2.6.5.

comment:11 Changed 3 years ago by claudep

  • Triage Stage changed from Accepted to Ready for checkin

comment:12 Changed 3 years ago by Claude Paroz <claude@…>

  • Resolution set to fixed
  • Status changed from assigned to closed

In bf1871d874a371ad0ae6c7e098e7665a468dca16:

Fixed #19237 -- Improved strip_tags utility

The previous pattern didn't properly addressed cases where '>'
was present inside quoted tag content.

comment:13 Changed 3 years ago by Claude Paroz <claude@…>

In 9efe1a7210ee161d5688f66a759bcd8d89d33142:

[1.5.x] Fixed #19237 -- Improved strip_tags utility

The previous pattern didn't properly addressed cases where '>'
was present inside quoted tag content.
Backport of bf1871d87 from master.

comment:14 Changed 3 years ago by Pablo Recio <pablo.recioquijano@…>

  • Resolution fixed deleted
  • Status changed from closed to new

I just ran out a really basic case where this fails:

>>> from django.utils.html import strip_tags
>>> strip_tags('<strong>foo</strong><a href="http://example.com">bar</a>')
u'bar'

But the result should be 'foobar'. I think this should be fixed for the 1.5 release, should I mark it as blocker?

comment:15 Changed 3 years ago by claudep

  • Easy pickings unset
  • Patch needs improvement set
  • Severity changed from Normal to Release blocker
  • Triage Stage changed from Ready for checkin to Accepted

Yes, either we quickly find the flaw in the new regex or we should revert the commit.

Changed 3 years ago by claudep

Updated regex

comment:16 Changed 3 years ago by claudep

  • Patch needs improvement unset

I may have found it. Regex gurus needed :-)

comment:17 Changed 3 years ago by Pablo Recio <pablo.recioquijano@…>

I'm not a regex guru, but it seems good for me. Although, I feel that this stuff should be either reverted or adding a really big batch of tests for releasing this into 1.5. Just my humble thought.

comment:18 Changed 3 years ago by claudep

It seems the current agreement among Django devs is to revert the change for 1.5 and switch to a non-regex technique for 1.6 (or maybe fallback to regex when content is not valid HTML).

comment:19 Changed 3 years ago by Claude Paroz <claude@…>

In 20ac33100cd20c17d1ceab075d96698974cc4778:

Partially revert 9efe1a721, strip_tags improvements

The new regex seems not stable enough for being released. Stripping
with regex might need reevaluation for the next release.
Refs #19237.

comment:20 Changed 3 years ago by Claude Paroz <claude@…>

In d7504a3d7b8645bdb979bab7ded0e9a9b6dccd0e:

Improved regex in strip_tags

Thanks Pablo Recio for the report. Refs #19237.

comment:21 Changed 3 years ago by claudep

  • Severity changed from Release blocker to Normal

So I reverted the fix in 1.5, and commit my proposed fix in 1.6.

Surely, the next step is to provide more tests. I have no religion about the use of regex or an HTML parser to strip tags, only tests will convince me!

comment:22 Changed 3 years ago by Claude Paroz <claude@…>

In 20ac33100cd20c17d1ceab075d96698974cc4778:

Partially revert 9efe1a721, strip_tags improvements

The new regex seems not stable enough for being released. Stripping
with regex might need reevaluation for the next release.
Refs #19237.

comment:23 Changed 3 years ago by jacob

FTR, the new regex committed in bf1871d874a371ad0ae6c7e098e7665a468dca16 has severe performance issues on longer strings, please consider this a veto on re-committing it.

We should probably avoid using regexes here entirely; I think HTMLParser might be a better choice.

comment:24 Changed 3 years ago by jacob

As does the "improved" version in d7504a3d7b8645bdb979bab7ded0e9a9b6dccd0e; both are broken on larger strings.

comment:25 Changed 2 years ago by aaugustin

  • Component changed from Core (Other) to Utilities

comment:26 Changed 2 years ago by aaugustin

  • Severity changed from Normal to Release blocker

Reading the comment, it isn't clear to me that the regex currently used in master doesn't suffer from performance issues. Marking as a release blocker until this is sorted out.

comment:27 Changed 2 years ago by Claude Paroz <claude@…>

In a01361b5ae9df5220c0ce7b42c5ff36094d5718c:

Added more tests for strip_tags utility

Refs #19237.

comment:28 Changed 2 years ago by claudep

I just committed tests that show the current implementation has apparently no performance issues on longer content. Anyone is still free to try a parser-based implementation, of course!

comment:29 Changed 2 years ago by simon29

Finally, after nearly two weeks I've found the reason why my servers kept crashing on a high traffic, multi server production site :-(

This new regex is prone to catastrophic backtracking; which, when given the right string, spins the CPU into an infinite loop. Performance issues is an understatement. Sorry, but this regex literally crashes sites and absolutely cannot be included.
http://www.regular-expressions.info/catastrophic.html
http://stackoverflow.com/questions/6230489/100-cpu-usage-with-a-regexp-depending-on-input-length

We need to revert this change until we have a parser-based implementation. I'll attach a sample "problem string" for y'all to play with.

Changed 2 years ago by simon29

Problem string which causes an infinite loop

comment:30 Changed 2 years ago by simon29

  • Cc simon@… added

I've attached a simple patch using HTMLParser from stdlib. It's the only way, as any regex based solution won't properly support multiple attribute tags anyway.

Changed 2 years ago by simon29

Parser based version of strip_tags filter

comment:31 Changed 2 years ago by simon29

  • Easy pickings set

comment:32 Changed 2 years ago by claudep

I wonder which code version you are using, because in your attached test (problem_string.py), you are still using the regex which was clearly broken and which has been fixed in [d7504a3d7b8645].

Thanks for your HTMLParser-based version, I will test it ASAP.

Changed 2 years ago by claudep

HTMLParser-based (still failing)

comment:33 Changed 2 years ago by claudep

  • Easy pickings unset
  • Patch needs improvement set

I just posted an improved version of Simon's patch. So far, I found one important different behaviour with the HTMLParser-based solution: unterminated/not-escaped < are removed with any following content. I'm not sure that this is an acceptable regression.

Tests with both file-based content (84Kb and 8Kb) show that the parsing version is somewhat slower than the regex one, but I think this remains acceptable.

And FWIW, I'm still waiting to receive a string which really fails with the current regex.

Changed 2 years ago by claudep

HTMLParser: Don't swallow unclosed tags

comment:34 Changed 2 years ago by claudep

  • Patch needs improvement unset

I think I found a solution for unterminated/not-escaped < in that latest patch.

comment:35 Changed 2 years ago by simon29

My mistake, I had the previous regex. The new one works flawlessly; and yes, a compiled "C" regex will always outperform a python parser. Unless we can find a string to break the new one I'm now leaning in favour of the regex solution.

comment:36 Changed 2 years ago by aaugustin

  • Triage Stage changed from Accepted to Ready for checkin

The patch looks good to me.

Even though no one managed to break the current regex yet, it's a fragile technique, and bugs could easily escalate into security issues. So let's play safe!

comment:37 Changed 2 years ago by Claude Paroz <claude@…>

  • Resolution set to fixed
  • Status changed from new to closed

In dc51ec8bc214cf60ebb99732363624c23df8005f:

Fixed #19237 -- Used HTML parser to strip tags

The regex method used until now for the strip_tags utility is fast,
but subject to flaws and security issues. Consensus and good
practice lead use to use a slower but safer method.

comment:38 Changed 2 years ago by claudep

  • Resolution fixed deleted
  • Status changed from closed to new

HTMLParser behaviour changed between 2.6-2.7, 3.2-3.3. So now we have probably no other choice than ignoring content after any unclosed tag. That means the behaviour is changed comparing to the regex-based solution.

I'm joining the proposed fix.

Changed 2 years ago by claudep

comment:39 Changed 2 years ago by claudep

Another patch after discussing with Anssi on IRC. Backwards-incompatible due to < and > encoding, but still safer and less risk of content disappearing...

Changed 2 years ago by claudep

Changed 2 years ago by claudep

No content encoding, but warning in docs (more backwards-compatible)

comment:40 Changed 2 years ago by Claude Paroz <claude@…>

  • Resolution set to fixed
  • Status changed from new to closed

In b664cb818d2e5896df2763299ea2c61a9af069a8:

Fixed #19237 (again) - Made strip_tags consistent between Python versions

Note: See TracTickets for help on using tickets.
Back to Top