Opened 11 years ago

Closed 11 years ago

#19237 closed Bug (fixed)

The use of > in single or double quoted attributes in strip_tags

Reported by: chris.khoo@… Owned by: Chris Khoo
Component: Utilities Version: dev
Severity: Release blocker Keywords:
Cc: thiderman, simon@… Triage Stage: Ready for checkin
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

At the moment, if you run django.utils.html.strip_tags with '>' in single or double quoted attributes, it won't work.

For example:
In [2]: strip_tags('Some <p onclick="$(\'#id\').html(\'<i>input</i>\')">text</p> here')
Out[2]: u'Some input\')">text here'

Should return "Some text here"

Already patched it with tests in my github repo - khoomeister/django but would like to go through the proper processes!

Chris

Attachments (8)

19237-1.diff (1.4 KB ) - added by Claude Paroz 11 years ago.
Updated regex
problem_string.py (1.6 KB ) - added by Simon Litchfield 11 years ago.
Problem string which causes an infinite loop
strip_tags_parser.diff (1.2 KB ) - added by Simon Litchfield 11 years ago.
Parser based version of strip_tags filter
19237-parser-2.diff (3.3 KB ) - added by Claude Paroz 11 years ago.
HTMLParser-based (still failing)
19237-parser-3.diff (3.4 KB ) - added by Claude Paroz 11 years ago.
HTMLParser: Don't swallow unclosed tags
19237-parser-4.diff (1.7 KB ) - added by Claude Paroz 11 years ago.
19237-parser-5.diff (2.4 KB ) - added by Claude Paroz 11 years ago.
19237-parser-6.diff (2.6 KB ) - added by Claude Paroz 11 years ago.
No content encoding, but warning in docs (more backwards-compatible)

Download all attachments as: .zip

Change History (48)

comment:1 by Chris Khoo, 11 years ago

comment:2 by anonymous, 11 years ago

Owner: changed from nobody to anonymous
Status: newassigned

comment:3 by Chris Khoo <chris.khoo@…>, 11 years ago

Owner: changed from anonymous to Chris Khoo <chris.khoo@…>
Status: assignednew

comment:4 by Chris Khoo <chris.khoo@…>, 11 years ago

Owner: changed from Chris Khoo <chris.khoo@…> to anonymous
Status: newassigned

comment:5 by Chris Khoo, 11 years ago

Owner: changed from anonymous to Chris Khoo
Status: assignednew

comment:6 by Chris Khoo, 11 years ago

Status: newassigned

comment:7 by Chris Khoo, 11 years ago

Doh - sorry, this is my first time using this!

comment:8 by Claude Paroz, 11 years ago

Component: UncategorizedCore (Other)
Triage Stage: UnreviewedAccepted
Type: UncategorizedBug

comment:9 by Chris Khoo, 11 years ago

Current code only works with Python 2.7 - updated it to work with Python 2.5 & 2.6

Link to commit: https://github.com/khoomeister/django/commit/da1ba5fd79c28f0643a1eb0ce18feb51dfe4a499

comment:10 by thiderman, 11 years ago

Cc: thiderman added

I can confirm that this works and the tests pass on Python 2.6.5.

comment:11 by Claude Paroz, 11 years ago

Triage Stage: AcceptedReady for checkin

comment:12 by Claude Paroz <claude@…>, 11 years ago

Resolution: fixed
Status: assignedclosed

In bf1871d874a371ad0ae6c7e098e7665a468dca16:

Fixed #19237 -- Improved strip_tags utility

The previous pattern didn't properly addressed cases where '>'
was present inside quoted tag content.

comment:13 by Claude Paroz <claude@…>, 11 years ago

In 9efe1a7210ee161d5688f66a759bcd8d89d33142:

[1.5.x] Fixed #19237 -- Improved strip_tags utility

The previous pattern didn't properly addressed cases where '>'
was present inside quoted tag content.
Backport of bf1871d87 from master.

comment:14 by Pablo Recio <pablo.recioquijano@…>, 11 years ago

Resolution: fixed
Status: closednew

I just ran out a really basic case where this fails:

>>> from django.utils.html import strip_tags
>>> strip_tags('<strong>foo</strong><a href="http://example.com">bar</a>')
u'bar'

But the result should be 'foobar'. I think this should be fixed for the 1.5 release, should I mark it as blocker?

comment:15 by Claude Paroz, 11 years ago

Easy pickings: unset
Patch needs improvement: set
Severity: NormalRelease blocker
Triage Stage: Ready for checkinAccepted

Yes, either we quickly find the flaw in the new regex or we should revert the commit.

by Claude Paroz, 11 years ago

Attachment: 19237-1.diff added

Updated regex

comment:16 by Claude Paroz, 11 years ago

Patch needs improvement: unset

I may have found it. Regex gurus needed :-)

comment:17 by Pablo Recio <pablo.recioquijano@…>, 11 years ago

I'm not a regex guru, but it seems good for me. Although, I feel that this stuff should be either reverted or adding a really big batch of tests for releasing this into 1.5. Just my humble thought.

comment:18 by Claude Paroz, 11 years ago

It seems the current agreement among Django devs is to revert the change for 1.5 and switch to a non-regex technique for 1.6 (or maybe fallback to regex when content is not valid HTML).

comment:19 by Claude Paroz <claude@…>, 11 years ago

In 20ac33100cd20c17d1ceab075d96698974cc4778:

Partially revert 9efe1a721, strip_tags improvements

The new regex seems not stable enough for being released. Stripping
with regex might need reevaluation for the next release.
Refs #19237.

comment:20 by Claude Paroz <claude@…>, 11 years ago

In d7504a3d7b8645bdb979bab7ded0e9a9b6dccd0e:

Improved regex in strip_tags

Thanks Pablo Recio for the report. Refs #19237.

comment:21 by Claude Paroz, 11 years ago

Severity: Release blockerNormal

So I reverted the fix in 1.5, and commit my proposed fix in 1.6.

Surely, the next step is to provide more tests. I have no religion about the use of regex or an HTML parser to strip tags, only tests will convince me!

comment:22 by Claude Paroz <claude@…>, 11 years ago

In 20ac33100cd20c17d1ceab075d96698974cc4778:

Partially revert 9efe1a721, strip_tags improvements

The new regex seems not stable enough for being released. Stripping
with regex might need reevaluation for the next release.
Refs #19237.

comment:23 by Jacob, 11 years ago

FTR, the new regex committed in bf1871d874a371ad0ae6c7e098e7665a468dca16 has severe performance issues on longer strings, please consider this a veto on re-committing it.

We should probably avoid using regexes here entirely; I think HTMLParser might be a better choice.

comment:24 by Jacob, 11 years ago

As does the "improved" version in d7504a3d7b8645bdb979bab7ded0e9a9b6dccd0e; both are broken on larger strings.

comment:25 by Aymeric Augustin, 11 years ago

Component: Core (Other)Utilities

comment:26 by Aymeric Augustin, 11 years ago

Severity: NormalRelease blocker

Reading the comment, it isn't clear to me that the regex currently used in master doesn't suffer from performance issues. Marking as a release blocker until this is sorted out.

comment:27 by Claude Paroz <claude@…>, 11 years ago

In a01361b5ae9df5220c0ce7b42c5ff36094d5718c:

Added more tests for strip_tags utility

Refs #19237.

comment:28 by Claude Paroz, 11 years ago

I just committed tests that show the current implementation has apparently no performance issues on longer content. Anyone is still free to try a parser-based implementation, of course!

comment:29 by Simon Litchfield, 11 years ago

Finally, after nearly two weeks I've found the reason why my servers kept crashing on a high traffic, multi server production site :-(

This new regex is prone to catastrophic backtracking; which, when given the right string, spins the CPU into an infinite loop. Performance issues is an understatement. Sorry, but this regex literally crashes sites and absolutely cannot be included.
http://www.regular-expressions.info/catastrophic.html
http://stackoverflow.com/questions/6230489/100-cpu-usage-with-a-regexp-depending-on-input-length

We need to revert this change until we have a parser-based implementation. I'll attach a sample "problem string" for y'all to play with.

by Simon Litchfield, 11 years ago

Attachment: problem_string.py added

Problem string which causes an infinite loop

comment:30 by Simon Litchfield, 11 years ago

Cc: simon@… added

I've attached a simple patch using HTMLParser from stdlib. It's the only way, as any regex based solution won't properly support multiple attribute tags anyway.

by Simon Litchfield, 11 years ago

Attachment: strip_tags_parser.diff added

Parser based version of strip_tags filter

comment:31 by Simon Litchfield, 11 years ago

Easy pickings: set

comment:32 by Claude Paroz, 11 years ago

I wonder which code version you are using, because in your attached test (problem_string.py), you are still using the regex which was clearly broken and which has been fixed in [d7504a3d7b8645].

Thanks for your HTMLParser-based version, I will test it ASAP.

by Claude Paroz, 11 years ago

Attachment: 19237-parser-2.diff added

HTMLParser-based (still failing)

comment:33 by Claude Paroz, 11 years ago

Easy pickings: unset
Patch needs improvement: set

I just posted an improved version of Simon's patch. So far, I found one important different behaviour with the HTMLParser-based solution: unterminated/not-escaped < are removed with any following content. I'm not sure that this is an acceptable regression.

Tests with both file-based content (84Kb and 8Kb) show that the parsing version is somewhat slower than the regex one, but I think this remains acceptable.

And FWIW, I'm still waiting to receive a string which really fails with the current regex.

by Claude Paroz, 11 years ago

Attachment: 19237-parser-3.diff added

HTMLParser: Don't swallow unclosed tags

comment:34 by Claude Paroz, 11 years ago

Patch needs improvement: unset

I think I found a solution for unterminated/not-escaped < in that latest patch.

comment:35 by Simon Litchfield, 11 years ago

My mistake, I had the previous regex. The new one works flawlessly; and yes, a compiled "C" regex will always outperform a python parser. Unless we can find a string to break the new one I'm now leaning in favour of the regex solution.

comment:36 by Aymeric Augustin, 11 years ago

Triage Stage: AcceptedReady for checkin

The patch looks good to me.

Even though no one managed to break the current regex yet, it's a fragile technique, and bugs could easily escalate into security issues. So let's play safe!

comment:37 by Claude Paroz <claude@…>, 11 years ago

Resolution: fixed
Status: newclosed

In dc51ec8bc214cf60ebb99732363624c23df8005f:

Fixed #19237 -- Used HTML parser to strip tags

The regex method used until now for the strip_tags utility is fast,
but subject to flaws and security issues. Consensus and good
practice lead use to use a slower but safer method.

comment:38 by Claude Paroz, 11 years ago

Resolution: fixed
Status: closednew

HTMLParser behaviour changed between 2.6-2.7, 3.2-3.3. So now we have probably no other choice than ignoring content after any unclosed tag. That means the behaviour is changed comparing to the regex-based solution.

I'm joining the proposed fix.

by Claude Paroz, 11 years ago

Attachment: 19237-parser-4.diff added

comment:39 by Claude Paroz, 11 years ago

Another patch after discussing with Anssi on IRC. Backwards-incompatible due to < and > encoding, but still safer and less risk of content disappearing...

by Claude Paroz, 11 years ago

Attachment: 19237-parser-5.diff added

by Claude Paroz, 11 years ago

Attachment: 19237-parser-6.diff added

No content encoding, but warning in docs (more backwards-compatible)

comment:40 by Claude Paroz <claude@…>, 11 years ago

Resolution: fixed
Status: newclosed

In b664cb818d2e5896df2763299ea2c61a9af069a8:

Fixed #19237 (again) - Made strip_tags consistent between Python versions

Note: See TracTickets for help on using tickets.
Back to Top