Code

Opened 6 years ago

Closed 3 years ago

#7267 closed Bug (fixed)

clean_html() bug

Reported by: Nikolay <djangoproject@…> Owned by: nobody
Component: Core (Other) Version: master
Severity: Normal Keywords: html
Cc: aymeric.augustin@…, prestontimmons Triage Stage: Ready for checkin
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX:

Description

The following is reproducible scenario for the issue:

[0:28 /home/nik]$ python
Python 2.5.2 (r252:60911, Apr 17 2008, 13:15:05) 
[GCC 4.2.3 (Debian 4.2.3-3)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import django
>>> django.utils.html.clean_html("<p>* * *</p><p>.</p>")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/niksite/lib/site-python/django/utils/functional.py", line 239, in wrapper
    return func(*args, **kwargs)
  File "/home/niksite/lib/site-python/django/utils/html.py", line 158, in clean_html
    text = hard_coded_bullets_re.sub(replace_p_tags, text)
  File "/home/niksite/lib/site-python/django/utils/html.py", line 156, in replace_p_tags
    s = s.replace('<p>%s' % d, '<li>')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 3: ordinal not in range(128)
>>> 

Attachments (4)

clean_html_unicode_r7601.patch (1.3 KB) - added by gav 6 years ago.
Patch against r7601 to fix crash error with clean_html.
7267.patch (2.3 KB) - added by aaugustin 3 years ago.
7267.2.patch (2.3 KB) - added by aaugustin 3 years ago.
7267.3.patch (2.3 KB) - added by aaugustin 3 years ago.

Download all attachments as: .zip

Change History (18)

comment:1 Changed 6 years ago by gav

  • Has patch set
  • Needs documentation unset
  • Needs tests unset
  • Patch needs improvement unset

I've generated a patch for this ticket which prevents the crash error, although the output of your input looks strange to me.

>>> import django
>>> django.utils.html.clean_html(u"<p>* * *</p><p>.</p>")
u'<ul>\n<li> * *</li><p>.</li>\n</ul>'

However, I'm comfortable with saying that if it is invalid, that should be treated as a separate bug.

Also, the only thing that might be questionable with this patch is the change to utils/html.py, line 16:

DOTS = [u'&middot;', u'*', u'\xe2\x80\xa2', u'&#149;', u'&bull;', u'&#8226;']

The 3rd entry might not be a safe Unicode conversion. However, all the unit tests pass. Can someone comment?

Changed 6 years ago by gav

Patch against r7601 to fix crash error with clean_html.

comment:2 Changed 6 years ago by Simon Greenhill

  • Triage Stage changed from Unreviewed to Ready for checkin

comment:3 Changed 6 years ago by adrian

  • Needs tests set
  • Triage Stage changed from Ready for checkin to Accepted

Needs tests!

comment:4 Changed 6 years ago by adrian

  • Component changed from Tools to Core framework

comment:5 Changed 3 years ago by baumer1122

  • Severity set to Normal
  • Type set to Bug

Changed 3 years ago by aaugustin

comment:6 Changed 3 years ago by aaugustin

  • Cc aymeric.augustin@… added
  • Needs tests unset

comment:7 Changed 3 years ago by prestontimmons

  • Cc prestontimmons added
  • Easy pickings unset
  • Patch needs improvement set

Thanks for updating the patch. The approach looks good but when I apply it I get the following test failure:

........................................................................F.............................
======================================================================
FAIL: test_clean_html (regressiontests.utils.html.TestUtilsHtml)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/root/pycon2011/djangoenv/src/django/tests/regressiontests/utils/html.py", line 135, in test_clean_html
    self.check_output(f, value, output)
  File "/root/pycon2011/djangoenv/src/django/tests/regressiontests/utils/html.py", line 14, in check_output
    self.assertEqual(function(value), output)
AssertionError: u'<p>I kill<br />whitespace</p>' != u'<p>I kill whitespace</p>'

comment:8 Changed 3 years ago by aaugustin

Oops - new patch fixes the test failure.

comment:9 Changed 3 years ago by aaugustin

  • Patch needs improvement unset

Changed 3 years ago by aaugustin

comment:10 Changed 3 years ago by prestontimmons

  • Triage Stage changed from Accepted to Ready for checkin

The new patch looks good. I verified it addresses the bug and passes tests. It applies cleanly against r16094.

comment:11 Changed 3 years ago by lukeplant

  • Patch needs improvement set
  • Triage Stage changed from Ready for checkin to Accepted

The issue regarding '\xe2\x80\xa2' does need to be addressed. This is a UTF-8 bytestring sequence for Unicode character 2022. The correct way to convert it to a unicode object is:

    '\xe2\x80\xa2'.decode('utf-8')

which results in u'\u2022', not u'\xe2\x80\xa2'

Version 0, edited 3 years ago by lukeplant (next)

Changed 3 years ago by aaugustin

comment:12 Changed 3 years ago by aaugustin

  • Patch needs improvement unset

This new patch addresses lukeplant's comment.

comment:13 Changed 3 years ago by lukeplant

  • Triage Stage changed from Accepted to Ready for checkin

comment:14 Changed 3 years ago by lukeplant

  • Resolution set to fixed
  • Status changed from new to closed

In [16118]:

Fixed #7267 - UnicodeDecodeError in clean_html

Thanks to Nikolay for the report, and gav and aaugustin for the patch.

Add Comment

Modify Ticket

Change Properties
<Author field>
Action
as closed
as The resolution will be set. Next status will be 'closed'
The resolution will be deleted. Next status will be 'new'
Author


E-mail address and user name can be saved in the Preferences.

 
Note: See TracTickets for help on using tickets.