Opened 9 years ago

Closed 9 years ago

Last modified 9 years ago

#25401 closed Bug (wontfix)

django.utils.html.strip_tags can insert spurious semicolons

Reported by: Jon Baldivieso
Owned by: nobody
Component: Utilities
Version: 1.8
Severity: Normal
Keywords:
Cc:
Triage Stage: Accepted
Has patch: no
Needs documentation: no
Needs tests: no
Patch needs improvement: no
Easy pickings: no
UI/UX: no

Description

In limited circumstances, strip_tags mangles legitimate text, inserting a semicolon before underscores.

>>> from django.utils.html import strip_tags

>>> # Good
>>> strip_tags("&first_name")
'&first_name'

>>> # Good
>>> strip_tags("first_name<br>")
u'first_name'

>>> # Bad: semicolon introduced before underscore
>>> strip_tags("&first_name<br>")
u'&first;_name'

Our use case is allowing rich emails to be drafted using Markdown; completely safe Markdown URLs with query strings can get mangled by this bug.

Change History (7)

comment:1 by Anton Baklanov, 9 years ago

Triage Stage: Unreviewed → Accepted
>>> import django
>>> from django.utils.html import strip_tags
>>> strip_tags("&first_name<br>")
u'&first;_name'
>>> django.get_version()
'1.9.dev20150914162508'

comment:2 by Anton Baklanov, 9 years ago

Can be reproduced on 1.8.x as well.

comment:3 by Tim Graham, 9 years ago

I haven't looked into this in detail, but I'm not sure this is something we should try to fix. It seems to me the original string isn't valid HTML (the ampersand isn't properly escaped).

comment:4 by Claude Paroz, 9 years ago

Resolution: → wontfix
Status: new → closed

The strip_tags documentation now points to the bleach Python library for a "more robust solution".

>>> import bleach
>>> bleach.clean("&first_name<br>", strip=True)
u'&amp;first_name'

If you have a not-too-hairy patch which would improve strip_tags, it might be accepted (reopen in that case), but we are not pursuing a perfect output for this utility.

comment:5 by Anton Baklanov, 9 years ago

Indeed, the failing example is not valid HTML, so there is not much sense in trying to guarantee any particular behaviour in cases like this.

The results come directly from Python's HTMLParser. In this particular case it recognizes '&first' as an entity reference. In the first case ("&first_name") strip_tags returns the entire string as plain text data without trying to parse it, because there are no tags at all.
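
The parser behaviour described above can be observed with the standard library alone. The following is a minimal sketch for Python 3, not Django's code: EventLogger and parse_events are hypothetical names, and the rebuild step assumes a stripper that writes every entity reference back as '&name;'.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: records the callbacks HTMLParser fires."""

    def __init__(self):
        # convert_charrefs=False makes the parser report entity
        # references as separate events instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("starttag", tag))

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_entityref(self, name):
        self.events.append(("entityref", name))


def parse_events(text):
    """Feed text to a fresh parser and return the recorded events."""
    parser = EventLogger()
    parser.feed(text)
    parser.close()
    return parser.events


events = parse_events("&first_name<br>")

# Assumption: drop tags and re-serialize entity references as '&name;'.
rebuilt = "".join(
    "&%s;" % value if kind == "entityref" else value
    for kind, value in events
    if kind != "starttag"
)

print(events)
print(rebuilt)
```

The parser reports 'first' as an entity reference because the underscore ends the run of characters it accepts in an entity name; the semicolon only appears when that name is written back out.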

I'm closing this ticket as invalid since after a closer review it does not look like something that should be addressed within Django.

jbaldivieso, you can try the bleach library (http://bleach.readthedocs.org/en/latest/); a combination of its clean and linkify functions might resolve your Markdown processing issues.
You can also (at your own risk) try to hack Django's MLStripper.handle_entityref (https://github.com/django/django/blob/master/django/utils/html.py#L135).
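
One possible shape for such a hack, as a rough standalone sketch rather than a patch to Django: LenientStripper and lenient_strip are hypothetical names, the class uses only Python 3's standard library, and it assumes that an entity name missing from html.entities.name2codepoint was meant as literal text.

```python
from html.entities import name2codepoint
from html.parser import HTMLParser


class LenientStripper(HTMLParser):
    """Hypothetical tag stripper that leaves unknown '&name' text alone."""

    def __init__(self):
        # Keep entity references as separate events so they can be inspected.
        super().__init__(convert_charrefs=False)
        self.fed = []

    def handle_data(self, data):
        self.fed.append(data)

    def handle_entityref(self, name):
        if name in name2codepoint:
            # Known HTML entity: write it back with its semicolon.
            self.fed.append("&%s;" % name)
        else:
            # Assumption: an unknown name was literal text, so add nothing.
            self.fed.append("&%s" % name)

    def handle_charref(self, name):
        self.fed.append("&#%s;" % name)

    def get_data(self):
        return "".join(self.fed)


def lenient_strip(text):
    """Strip tags from text once, using LenientStripper."""
    stripper = LenientStripper()
    stripper.feed(text)
    stripper.close()
    return stripper.get_data()


print(lenient_strip("&first_name<br>"))
print(lenient_strip("&amp;first_name<br>"))
```

This heuristic is not a complete fix: a query parameter whose name happens to match a real entity (for example "&copy=2") would still gain a semicolon, and an unknown name that really was followed by a semicolon would lose it.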

Of course you can reopen this in case you have use cases where django's strip_tags misbehaves with valid html data.

comment:6 by Anton Baklanov, 9 years ago

claudep has faster typing skills.

comment:7 by Claude Paroz, 9 years ago

At least, we concur :-)
