Opened 12 years ago
Closed 11 years ago
#19237 closed Bug (fixed)
The use of > in single or double quoted attributes in strip_tags
Reported by: | Owned by: | Chris Khoo | |
---|---|---|---|
Component: | Utilities | Version: | dev |
Severity: | Release blocker | Keywords: | |
Cc: | thiderman, simon@… | Triage Stage: | Ready for checkin |
Has patch: | yes | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
At the moment, if you run django.utils.html.strip_tags with '>' in single or double quoted attributes, it won't work.
For example:
In [2]: strip_tags('Some <p onclick="$(\'#id\').html(\'<i>input</i>\')">text</p> here')
Out[2]: u'Some input\')">text here'
Should return "Some text here"
Already patched it with tests in my github repo - khoomeister/django but would like to go through the proper processes!
Chris
Attachments (8)
Change History (48)
comment:1 by , 12 years ago
comment:2 by , 12 years ago
Owner: | changed from | to
---|---|
Status: | new → assigned |
comment:3 by , 12 years ago
Owner: | changed from | to
---|---|
Status: | assigned → new |
comment:4 by , 12 years ago
Owner: | changed from | to
---|---|
Status: | new → assigned |
comment:5 by , 12 years ago
Owner: | changed from | to
---|---|
Status: | assigned → new |
comment:6 by , 12 years ago
Status: | new → assigned |
---|
comment:7 by , 12 years ago
comment:8 by , 12 years ago
Component: | Uncategorized → Core (Other) |
---|---|
Triage Stage: | Unreviewed → Accepted |
Type: | Uncategorized → Bug |
comment:9 by , 12 years ago
Current code only works with Python 2.7 - updated it to work with Python 2.5 & 2.6
Link to commit: https://github.com/khoomeister/django/commit/da1ba5fd79c28f0643a1eb0ce18feb51dfe4a499
comment:10 by , 12 years ago
Cc: | added |
---|
I can confirm that this works and the tests pass on Python 2.6.5.
comment:11 by , 12 years ago
Triage Stage: | Accepted → Ready for checkin |
---|
comment:12 by , 12 years ago
Resolution: | → fixed |
---|---|
Status: | assigned → closed |
comment:14 by , 12 years ago
Resolution: | fixed |
---|---|
Status: | closed → new |
I just ran out a really basic case where this fails:
>>> from django.utils.html import strip_tags >>> strip_tags('<strong>foo</strong><a href="http://example.com">bar</a>') u'bar'
But the result should be 'foobar'. I think this should be fixed for the 1.5 release, should I mark it as blocker?
comment:15 by , 12 years ago
Easy pickings: | unset |
---|---|
Patch needs improvement: | set |
Severity: | Normal → Release blocker |
Triage Stage: | Ready for checkin → Accepted |
Yes, either we quickly find the flaw in the new regex or we should revert the commit.
comment:16 by , 12 years ago
Patch needs improvement: | unset |
---|
I may have found it. Regex gurus needed :-)
comment:17 by , 12 years ago
I'm not a regex guru, but it seems good for me. Although, I feel that this stuff should be either reverted or adding a really big batch of tests for releasing this into 1.5. Just my humble thought.
comment:18 by , 12 years ago
It seems the current agreement among Django devs is to revert the change for 1.5 and switch to a non-regex technique for 1.6 (or maybe fallback to regex when content is not valid HTML).
comment:21 by , 12 years ago
Severity: | Release blocker → Normal |
---|
So I reverted the fix in 1.5, and commit my proposed fix in 1.6.
Surely, the next step is to provide more tests. I have no religion about the use of regex or an HTML parser to strip tags, only tests will convince me!
comment:23 by , 12 years ago
FTR, the new regex committed in bf1871d874a371ad0ae6c7e098e7665a468dca16 has severe performance issues on longer strings, please consider this a veto on re-committing it.
We should probably avoid using regexes here entirely; I think HTMLParser might be a better choice.
comment:24 by , 12 years ago
As does the "improved" version in d7504a3d7b8645bdb979bab7ded0e9a9b6dccd0e; both are broken on larger strings.
comment:25 by , 12 years ago
Component: | Core (Other) → Utilities |
---|
comment:26 by , 12 years ago
Severity: | Normal → Release blocker |
---|
Reading the comment, it isn't clear to me that the regex currently used in master doesn't suffer from performance issues. Marking as a release blocker until this is sorted out.
comment:28 by , 12 years ago
I just committed tests that show the current implementation has apparently no performance issues on longer content. Anyone is still free to try a parser-based implementation, of course!
comment:29 by , 12 years ago
Finally, after nearly two weeks I've found the reason why my servers kept crashing on a high traffic, multi server production site :-(
This new regex is prone to catastrophic backtracking; which, when given the right string, spins the CPU into an infinite loop. Performance issues is an understatement. Sorry, but this regex literally crashes sites and absolutely cannot be included.
http://www.regular-expressions.info/catastrophic.html
http://stackoverflow.com/questions/6230489/100-cpu-usage-with-a-regexp-depending-on-input-length
We need to revert this change until we have a parser-based implementation. I'll attach a sample "problem string" for y'all to play with.
comment:30 by , 12 years ago
Cc: | added |
---|
I've attached a simple patch using HTMLParser from stdlib. It's the only way, as any regex based solution won't properly support multiple attribute tags anyway.
by , 12 years ago
Attachment: | strip_tags_parser.diff added |
---|
Parser based version of strip_tags filter
comment:31 by , 12 years ago
Easy pickings: | set |
---|
comment:32 by , 12 years ago
I wonder which code version you are using, because in your attached test (problem_string.py), you are still using the regex which was clearly broken and which has been fixed in [d7504a3d7b8645].
Thanks for your HTMLParser
-based version, I will test it ASAP.
comment:33 by , 12 years ago
Easy pickings: | unset |
---|---|
Patch needs improvement: | set |
I just posted an improved version of Simon's patch. So far, I found one important different behaviour with the HTMLParser-based solution: unterminated/not-escaped <
are removed with any following content. I'm not sure that this is an acceptable regression.
Tests with both file-based content (84Kb and 8Kb) show that the parsing version is somewhat slower than the regex one, but I think this remains acceptable.
And FWIW, I'm still waiting to receive a string which really fails with the current regex.
comment:34 by , 12 years ago
Patch needs improvement: | unset |
---|
I think I found a solution for unterminated/not-escaped <
in that latest patch.
comment:35 by , 12 years ago
My mistake, I had the previous regex. The new one works flawlessly; and yes, a compiled "C" regex will always outperform a python parser. Unless we can find a string to break the new one I'm now leaning in favour of the regex solution.
comment:36 by , 11 years ago
Triage Stage: | Accepted → Ready for checkin |
---|
The patch looks good to me.
Even though no one managed to break the current regex yet, it's a fragile technique, and bugs could easily escalate into security issues. So let's play safe!
comment:37 by , 11 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
comment:38 by , 11 years ago
Resolution: | fixed |
---|---|
Status: | closed → new |
HTMLParser behaviour changed between 2.6-2.7, 3.2-3.3. So now we have probably no other choice than ignoring content after any unclosed tag. That means the behaviour is changed comparing to the regex-based solution.
I'm joining the proposed fix.
by , 11 years ago
Attachment: | 19237-parser-4.diff added |
---|
comment:39 by , 11 years ago
Another patch after discussing with Anssi on IRC. Backwards-incompatible due to <
and >
encoding, but still safer and less risk of content disappearing...
by , 11 years ago
Attachment: | 19237-parser-5.diff added |
---|
by , 11 years ago
Attachment: | 19237-parser-6.diff added |
---|
No content encoding, but warning in docs (more backwards-compatible)
comment:40 by , 11 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
Doh - sorry, this is my first time using this!