#25401 closed Bug (wontfix)
django.utils.html.strip_tags can insert spurious semicolons
Reported by: | Jon Baldivieso | Owned by: | nobody |
---|---|---|---|
Component: | Utilities | Version: | 1.8 |
Severity: | Normal | Keywords: | |
Cc: | | Triage Stage: | Accepted |
Has patch: | no | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
In limited circumstances, strip_tags mangles legitimate text, inserting a semicolon before underscores.
    from django.utils.html import strip_tags

    # Good
    strip_tags("&first_name")
    >>> '&first_name'

    # Good
    strip_tags("first_name<br>")
    >>> u'first_name'

    # Bad: semicolon introduced before underscore
    strip_tags("&first_name<br>")
    >>> u'&first;_name'

Our use case is allowing rich emails to be drafted using Markdown; completely safe Markdown URLs with query strings can get mangled by this bug.
Change History (7)
comment:3 by , 9 years ago
I haven't looked into this in detail, but I'm not sure this is something we should try to fix. It seems to me the original string isn't valid HTML (the ampersand isn't properly escaped).
comment:4 by , 9 years ago
Resolution: | → wontfix |
---|---|
Status: | new → closed |
The strip_tags documentation now points to the bleach Python lib for a "more robust solution".

    >>> import bleach
    >>> bleach.clean("&first_name<br>", strip=True)
    u'&first_name'

If you have a not-too-hairy patch which would improve strip_tags, it might be accepted (reopen in that case), but we are not pursuing a perfect output for this utility.
comment:5 by , 9 years ago
Indeed, the failing example is not valid HTML, so there is not much sense in trying to guarantee any particular behaviour in cases like this.
The results come directly from Python's HTMLParser. In this particular case it recognizes '&first' as a character reference. In the first case ("&first_name") the entire string is treated as plain text data without any attempt to parse it, because there are no tags at all.
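The behaviour described above can be reproduced with the standard library alone. The following is a minimal sketch, not Django's actual code: `SketchStripper` and `sketch_strip_tags` are hypothetical names, and the class is only assumed to resemble MLStripper (it collects text and re-serializes entity references with a trailing semicolon).

```python
from html.parser import HTMLParser


class SketchStripper(HTMLParser):
    """Hypothetical stand-in for a tag stripper; not Django's MLStripper."""

    def __init__(self):
        # convert_charrefs=False makes the parser report entity references
        # through handle_entityref instead of decoding them itself.
        super().__init__(convert_charrefs=False)
        self.fed = []

    def handle_data(self, data):
        self.fed.append(data)

    def handle_entityref(self, name):
        # The parser passes only the name, so the semicolon is always added,
        # whether or not the input had one.
        self.fed.append("&%s;" % name)

    def handle_charref(self, name):
        self.fed.append("&#%s;" % name)

    def get_data(self):
        return "".join(self.fed)


def sketch_strip_tags(value):
    # Assumed shortcut: skip parsing entirely when nothing looks like a tag.
    if "<" not in value or ">" not in value:
        return value
    parser = SketchStripper()
    parser.feed(value)
    parser.close()
    return parser.get_data()


print(sketch_strip_tags("&first_name"))      # no tags, returned untouched
print(sketch_strip_tags("first_name<br>"))   # tag removed, text kept
print(sketch_strip_tags("&first_name<br>"))  # '&first' reported as an entity
```

The parser stops the entity name at the underscore, so `handle_entityref` receives `first` and the re-serialized output gains a semicolon.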
I'm closing this ticket as invalid since after a closer review it does not look like something that should be addressed within Django.
jbaldivieso, you can try using the bleach library (http://bleach.readthedocs.org/en/latest/); with a combination of its clean and linkify functions it might be possible to resolve your Markdown processing issues.
You can also (at your own risk) try to hack Django's MLStripper.handle_entityref (https://github.com/django/django/blob/master/django/utils/html.py#L135).
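One possible shape for such a hack, as a rough sketch only: add the semicolon back only when the name is a known HTML entity. `LenientStripper` and `lenient_strip` are hypothetical names, the class is a standalone stand-in rather than a patch to Django's MLStripper, and the known-entity check is an assumption of this example, not something proposed in the ticket.

```python
from html.entities import name2codepoint
from html.parser import HTMLParser


class LenientStripper(HTMLParser):
    """Hypothetical stripper that leaves unknown '&name' sequences alone."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.fed = []

    def handle_data(self, data):
        self.fed.append(data)

    def handle_entityref(self, name):
        if name in name2codepoint:
            # Known entity such as 'amp' or 'lt': serialize it normally.
            self.fed.append("&%s;" % name)
        else:
            # Unknown name such as 'first': assume it was a bare ampersand.
            self.fed.append("&%s" % name)

    def handle_charref(self, name):
        self.fed.append("&#%s;" % name)


def lenient_strip(value):
    parser = LenientStripper()
    parser.feed(value)
    parser.close()
    return "".join(parser.fed)


print(lenient_strip("&first_name<br>"))
print(lenient_strip("a &amp; b<br>"))
```

This only narrows the problem: a query parameter whose name starts with a real entity name (for example `&copy_id`) would still gain a semicolon, so escaping ampersands in the input remains the more reliable fix.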
Of course, you can reopen this if you have use cases where Django's strip_tags misbehaves with valid HTML data.