#25401 closed Bug (wontfix)
django.utils.html.strip_tags can insert spurious semicolons
| Reported by: | Jon Baldivieso | Owned by: | nobody |
|---|---|---|---|
| Component: | Utilities | Version: | 1.8 |
| Severity: | Normal | Keywords: | |
| Cc: | | Triage Stage: | Accepted |
| Has patch: | no | Needs documentation: | no |
| Needs tests: | no | Patch needs improvement: | no |
| Easy pickings: | no | UI/UX: | no |
Description
In limited circumstances, strip_tags mangles legitimate text, inserting a semicolon before underscores.
>>> from django.utils.html import strip_tags
>>> # Good
>>> strip_tags("&first_name")
'&first_name'
>>> # Good
>>> strip_tags("first_name<br>")
u'first_name'
>>> # Bad: semicolon introduced before underscore
>>> strip_tags("&first_name<br>")
u'&first;_name'
Our use case is allowing rich emails to be drafted using Markdown; perfectly safe Markdown URLs with query strings can get mangled by this bug.
Change History (7)
comment:1 by , 10 years ago
| Triage Stage: | Unreviewed → Accepted |
|---|
comment:3 by , 10 years ago
I haven't looked into this in detail, but I'm not sure this is something we should try to fix. It seems to me the original string isn't valid HTML (the ampersand isn't properly escaped).
comment:4 by , 10 years ago
| Resolution: | → wontfix |
|---|---|
| Status: | new → closed |
The strip_tags documentation now points to the bleach Python library for a "more robust solution".
>>> import bleach
>>> bleach.clean("&first_name<br>", strip=True)
u'&first_name'
If you have a not-too-hairy patch which would improve strip_tags, it might be accepted (reopen in that case), but we are not aiming for perfect output from this utility.
comment:5 by , 10 years ago
Indeed, the failing example is not valid HTML, so there is not much sense in trying to guarantee any particular behaviour in cases like this.
The results come directly from Python's HTMLParser. In this particular case it recognizes '&first' as an entity reference. In the first case ("&first_name") the entire string is treated as plain text data, without any attempt to parse it, because it contains no tags at all.
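To make the parser behaviour above concrete, here is a minimal sketch that logs the callbacks HTMLParser fires for the failing input. It is standalone Python 3, not Django code: the names EventLogger and parse_events are made up for illustration, and convert_charrefs=False is an assumption chosen so that entity references are reported through handle_entityref, as they were on the Python 2 parser used in this ticket.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: records parser callbacks in order."""

    def __init__(self):
        # convert_charrefs=False keeps entity references as separate events
        # instead of folding them into handle_data().
        super().__init__(convert_charrefs=False)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("starttag", tag))

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_entityref(self, name):
        self.events.append(("entityref", name))


def parse_events(text):
    """Feed text to the parser and return the recorded events."""
    parser = EventLogger()
    parser.feed(text)
    parser.close()
    return parser.events


if __name__ == "__main__":
    print(parse_events("&first_name<br>"))
```

The parser should report an entity reference named "first", followed by the data "_name" and a start tag for "br". A stripper that writes every entity reference back as "&name;" would therefore produce '&first;_name'.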
I'm closing this ticket as invalid since after a closer review it does not look like something that should be addressed within Django.
jbaldivieso, you can try the bleach library (http://bleach.readthedocs.org/en/latest/); a combination of its clean and linkify functions might resolve your Markdown processing issues.
You can also (at your own risk) try to hack Django's MLStripper.handle_entityref (https://github.com/django/django/blob/master/django/utils/html.py#L135).
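One possible shape for such a hack, as a rough sketch rather than a patch: a standalone Python 3 stripper that only re-adds the semicolon for entity names listed in html.entities.name2codepoint. LenientStripper and strip_tags_lenient are hypothetical names, and the class is a stand-in written for illustration, not a copy of Django's MLStripper.

```python
from html.entities import name2codepoint
from html.parser import HTMLParser


class LenientStripper(HTMLParser):
    """Hypothetical stand-in for a tag stripper; not Django code."""

    def __init__(self):
        # Assumption: report entity references via handle_entityref().
        super().__init__(convert_charrefs=False)
        self.fed = []

    def handle_data(self, data):
        self.fed.append(data)

    def handle_entityref(self, name):
        if name in name2codepoint:
            # Known HTML entity: emit it in its canonical "&name;" form.
            self.fed.append("&%s;" % name)
        else:
            # Unknown name: assume it was plain text and add no semicolon.
            self.fed.append("&%s" % name)

    def handle_charref(self, name):
        self.fed.append("&#%s;" % name)

    def get_data(self):
        return "".join(self.fed)


def strip_tags_lenient(value):
    """Strip tags once, leaving unknown '&name' sequences untouched."""
    stripper = LenientStripper()
    stripper.feed(value)
    stripper.close()
    return stripper.get_data()


if __name__ == "__main__":
    print(strip_tags_lenient("&first_name<br>"))
```

This is a trade-off, not a full fix. A query parameter whose name happens to match a real entity (for example "&copy=1") would still gain a semicolon, and an unknown reference written with a semicolon in the input would lose it.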
Of course, you can reopen this if you have use cases where Django's strip_tags misbehaves with valid HTML data.
>>> import django
>>> from django.utils.html import strip_tags
>>> strip_tags("&first_name<br>")
u'&first;_name'
>>> django.get_version()
'1.9.dev20150914162508'