Opened 16 years ago

Closed 13 years ago

#7267 closed Bug (fixed)

clean_html() bug

Reported by: Nikolay <djangoproject@…> Owned by: nobody
Component: Core (Other) Version: dev
Severity: Normal Keywords: html
Cc: aymeric.augustin@…, Preston Timmons Triage Stage: Ready for checkin
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

The following is reproducible scenario for the issue:

[0:28 /home/nik]$ python
Python 2.5.2 (r252:60911, Apr 17 2008, 13:15:05) 
[GCC 4.2.3 (Debian 4.2.3-3)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import django
>>> django.utils.html.clean_html("<p>* * *</p><p>.</p>")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/niksite/lib/site-python/django/utils/functional.py", line 239, in wrapper
    return func(*args, **kwargs)
  File "/home/niksite/lib/site-python/django/utils/html.py", line 158, in clean_html
    text = hard_coded_bullets_re.sub(replace_p_tags, text)
  File "/home/niksite/lib/site-python/django/utils/html.py", line 156, in replace_p_tags
    s = s.replace('<p>%s' % d, '<li>')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 3: ordinal not in range(128)
>>> 

Attachments (4)

clean_html_unicode_r7601.patch (1.3 KB ) - added by George Vilches 16 years ago.
Patch against r7601 to fix crash error with clean_html.
7267.patch (2.3 KB ) - added by Aymeric Augustin 13 years ago.
7267.2.patch (2.3 KB ) - added by Aymeric Augustin 13 years ago.
7267.3.patch (2.3 KB ) - added by Aymeric Augustin 13 years ago.

Download all attachments as: .zip

Change History (18)

comment:1 by George Vilches, 16 years ago

Has patch: set

I've generated a patch for this ticket which prevents the crash error, although the output of your input looks strange to me.

>>> import django
>>> django.utils.html.clean_html(u"<p>* * *</p><p>.</p>")
u'<ul>\n<li> * *</li><p>.</li>\n</ul>'

However, I'm comfortable with saying that if it is invalid, that should be treated as a separate bug.

Also, the only thing that might be questionable with this patch is the change to utils/html.py, line 16:

DOTS = [u'&middot;', u'*', u'\xe2\x80\xa2', u'&#149;', u'&bull;', u'&#8226;']

The 3rd entry might not be a safe Unicode conversion. However, all the unit tests pass. Can someone comment?

by George Vilches, 16 years ago

Patch against r7601 to fix crash error with clean_html.

comment:2 by Simon Greenhill, 16 years ago

Triage Stage: UnreviewedReady for checkin

comment:3 by Adrian Holovaty, 16 years ago

Needs tests: set
Triage Stage: Ready for checkinAccepted

Needs tests!

comment:4 by Adrian Holovaty, 16 years ago

Component: ToolsCore framework

comment:5 by Peter Baumgartner, 13 years ago

Severity: Normal
Type: Bug

by Aymeric Augustin, 13 years ago

Attachment: 7267.patch added

comment:6 by Aymeric Augustin, 13 years ago

Cc: aymeric.augustin@… added
Needs tests: unset

comment:7 by Preston Timmons, 13 years ago

Cc: Preston Timmons added
Easy pickings: unset
Patch needs improvement: set

Thanks for updating the patch. The approach looks good but when I apply it I get the following test failure:

........................................................................F.............................
======================================================================
FAIL: test_clean_html (regressiontests.utils.html.TestUtilsHtml)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/root/pycon2011/djangoenv/src/django/tests/regressiontests/utils/html.py", line 135, in test_clean_html
    self.check_output(f, value, output)
  File "/root/pycon2011/djangoenv/src/django/tests/regressiontests/utils/html.py", line 14, in check_output
    self.assertEqual(function(value), output)
AssertionError: u'<p>I kill<br />whitespace</p>' != u'<p>I kill whitespace</p>'

comment:8 by Aymeric Augustin, 13 years ago

Oops - new patch fixes the test failure.

comment:9 by Aymeric Augustin, 13 years ago

Patch needs improvement: unset

by Aymeric Augustin, 13 years ago

Attachment: 7267.2.patch added

comment:10 by Preston Timmons, 13 years ago

Triage Stage: AcceptedReady for checkin

The new patch looks good. I verified it addresses the bug and passes tests. It applies cleanly against r16094.

comment:11 by Luke Plant, 13 years ago

Patch needs improvement: set
Triage Stage: Ready for checkinAccepted

The issue regarding '\xe2\x80\xa2' does need to be addressed. This is a UTF-8 bytestring sequence for Unicode character 2022. The correct way to convert it to a unicode object is:

    '\xe2\x80\xa2'.decode('utf-8')

which results in u'\u2022', not u'\xe2\x80\xa2'

Version 0, edited 13 years ago by Luke Plant (next)

by Aymeric Augustin, 13 years ago

Attachment: 7267.3.patch added

comment:12 by Aymeric Augustin, 13 years ago

Patch needs improvement: unset

This new patch addresses lukeplant's comment.

comment:13 by Luke Plant, 13 years ago

Triage Stage: AcceptedReady for checkin

comment:14 by Luke Plant, 13 years ago

Resolution: fixed
Status: newclosed

In [16118]:

Fixed #7267 - UnicodeDecodeError in clean_html

Thanks to Nikolay for the report, and gav and aaugustin for the patch.

Note: See TracTickets for help on using tickets.
Back to Top