Opened 16 years ago
Closed 13 years ago
#7267 closed Bug (fixed)
clean_html() bug
Reported by: | Owned by: | nobody | |
---|---|---|---|
Component: | Core (Other) | Version: | dev |
Severity: | Normal | Keywords: | html |
Cc: | aymeric.augustin@…, Preston Timmons | Triage Stage: | Ready for checkin |
Has patch: | yes | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
The following is reproducible scenario for the issue:
[0:28 /home/nik]$ python Python 2.5.2 (r252:60911, Apr 17 2008, 13:15:05) [GCC 4.2.3 (Debian 4.2.3-3)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> import django >>> django.utils.html.clean_html("<p>* * *</p><p>.</p>") Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/home/niksite/lib/site-python/django/utils/functional.py", line 239, in wrapper return func(*args, **kwargs) File "/home/niksite/lib/site-python/django/utils/html.py", line 158, in clean_html text = hard_coded_bullets_re.sub(replace_p_tags, text) File "/home/niksite/lib/site-python/django/utils/html.py", line 156, in replace_p_tags s = s.replace('<p>%s' % d, '<li>') UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 3: ordinal not in range(128) >>>
Attachments (4)
Change History (18)
comment:1 by , 16 years ago
Has patch: | set |
---|
by , 16 years ago
Attachment: | clean_html_unicode_r7601.patch added |
---|
Patch against r7601 to fix crash error with clean_html.
comment:2 by , 16 years ago
Triage Stage: | Unreviewed → Ready for checkin |
---|
comment:3 by , 16 years ago
Needs tests: | set |
---|---|
Triage Stage: | Ready for checkin → Accepted |
Needs tests!
comment:4 by , 16 years ago
Component: | Tools → Core framework |
---|
comment:5 by , 13 years ago
Severity: | → Normal |
---|---|
Type: | → Bug |
by , 13 years ago
Attachment: | 7267.patch added |
---|
comment:6 by , 13 years ago
Cc: | added |
---|---|
Needs tests: | unset |
comment:7 by , 13 years ago
Cc: | added |
---|---|
Easy pickings: | unset |
Patch needs improvement: | set |
Thanks for updating the patch. The approach looks good but when I apply it I get the following test failure:
........................................................................F............................. ====================================================================== FAIL: test_clean_html (regressiontests.utils.html.TestUtilsHtml) ---------------------------------------------------------------------- Traceback (most recent call last): File "/root/pycon2011/djangoenv/src/django/tests/regressiontests/utils/html.py", line 135, in test_clean_html self.check_output(f, value, output) File "/root/pycon2011/djangoenv/src/django/tests/regressiontests/utils/html.py", line 14, in check_output self.assertEqual(function(value), output) AssertionError: u'<p>I kill<br />whitespace</p>' != u'<p>I kill whitespace</p>'
comment:9 by , 13 years ago
Patch needs improvement: | unset |
---|
by , 13 years ago
Attachment: | 7267.2.patch added |
---|
comment:10 by , 13 years ago
Triage Stage: | Accepted → Ready for checkin |
---|
The new patch looks good. I verified it addresses the bug and passes tests. It applies cleanly against r16094.
comment:11 by , 13 years ago
Patch needs improvement: | set |
---|---|
Triage Stage: | Ready for checkin → Accepted |
The issue regarding '\xe2\x80\xa2'
does need to be addressed. This is a UTF-8 bytestring sequence for Unicode character 2022. The correct way to convert it to a unicode object is:
'\xe2\x80\xa2'.decode('utf-8')
which results in u'\u2022'
, not u'\xe2\x80\xa2'
by , 13 years ago
Attachment: | 7267.3.patch added |
---|
comment:12 by , 13 years ago
Patch needs improvement: | unset |
---|
This new patch addresses lukeplant's comment.
comment:13 by , 13 years ago
Triage Stage: | Accepted → Ready for checkin |
---|
I've generated a patch for this ticket which prevents the crash error, although the output of your input looks strange to me.
However, I'm comfortable with saying that if it is invalid, that should be treated as a separate bug.
Also, the only thing that might be questionable with this patch is the change to utils/html.py, line 16:
The 3rd entry might not be a safe Unicode conversion. However, all the unit tests pass. Can someone comment?