Opened 17 years ago
Closed 15 years ago
#7267 closed Bug (fixed)
clean_html() bug
| Reported by: | Owned by: | nobody | |
|---|---|---|---|
| Component: | Core (Other) | Version: | dev |
| Severity: | Normal | Keywords: | html |
| Cc: | aymeric.augustin@…, Preston Timmons | Triage Stage: | Ready for checkin |
| Has patch: | yes | Needs documentation: | no |
| Needs tests: | no | Patch needs improvement: | no |
| Easy pickings: | no | UI/UX: | no |
Description
The following is reproducible scenario for the issue:
[0:28 /home/nik]$ python
Python 2.5.2 (r252:60911, Apr 17 2008, 13:15:05)
[GCC 4.2.3 (Debian 4.2.3-3)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import django
>>> django.utils.html.clean_html("<p>* * *</p><p>.</p>")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/niksite/lib/site-python/django/utils/functional.py", line 239, in wrapper
return func(*args, **kwargs)
File "/home/niksite/lib/site-python/django/utils/html.py", line 158, in clean_html
text = hard_coded_bullets_re.sub(replace_p_tags, text)
File "/home/niksite/lib/site-python/django/utils/html.py", line 156, in replace_p_tags
s = s.replace('<p>%s' % d, '<li>')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 3: ordinal not in range(128)
>>>
Attachments (4)
Change History (18)
comment:1 by , 17 years ago
| Has patch: | set |
|---|
by , 17 years ago
| Attachment: | clean_html_unicode_r7601.patch added |
|---|
Patch against r7601 to fix crash error with clean_html.
comment:2 by , 17 years ago
| Triage Stage: | Unreviewed → Ready for checkin |
|---|
comment:3 by , 17 years ago
| Needs tests: | set |
|---|---|
| Triage Stage: | Ready for checkin → Accepted |
Needs tests!
comment:4 by , 17 years ago
| Component: | Tools → Core framework |
|---|
comment:5 by , 15 years ago
| Severity: | → Normal |
|---|---|
| Type: | → Bug |
by , 15 years ago
| Attachment: | 7267.patch added |
|---|
comment:6 by , 15 years ago
| Cc: | added |
|---|---|
| Needs tests: | unset |
comment:7 by , 15 years ago
| Cc: | added |
|---|---|
| Easy pickings: | unset |
| Patch needs improvement: | set |
Thanks for updating the patch. The approach looks good but when I apply it I get the following test failure:
........................................................................F............................. ====================================================================== FAIL: test_clean_html (regressiontests.utils.html.TestUtilsHtml) ---------------------------------------------------------------------- Traceback (most recent call last): File "/root/pycon2011/djangoenv/src/django/tests/regressiontests/utils/html.py", line 135, in test_clean_html self.check_output(f, value, output) File "/root/pycon2011/djangoenv/src/django/tests/regressiontests/utils/html.py", line 14, in check_output self.assertEqual(function(value), output) AssertionError: u'<p>I kill<br />whitespace</p>' != u'<p>I kill whitespace</p>'
comment:9 by , 15 years ago
| Patch needs improvement: | unset |
|---|
by , 15 years ago
| Attachment: | 7267.2.patch added |
|---|
comment:10 by , 15 years ago
| Triage Stage: | Accepted → Ready for checkin |
|---|
The new patch looks good. I verified it addresses the bug and passes tests. It applies cleanly against r16094.
comment:11 by , 15 years ago
| Patch needs improvement: | set |
|---|---|
| Triage Stage: | Ready for checkin → Accepted |
The issue regarding '\xe2\x80\xa2' does need to be addressed. This is a UTF-8 bytestring sequence for Unicode character 2022. The correct way to convert it to a unicode object is:
'\xe2\x80\xa2'.decode('utf-8')
which results in u'\u2022', not u'\xe2\x80\xa2'
by , 15 years ago
| Attachment: | 7267.3.patch added |
|---|
comment:12 by , 15 years ago
| Patch needs improvement: | unset |
|---|
This new patch addresses lukeplant's comment.
comment:13 by , 15 years ago
| Triage Stage: | Accepted → Ready for checkin |
|---|
I've generated a patch for this ticket which prevents the crash error, although the output of your input looks strange to me.
However, I'm comfortable with saying that if it is invalid, that should be treated as a separate bug.
Also, the only thing that might be questionable with this patch is the change to utils/html.py, line 16:
The 3rd entry might not be a safe Unicode conversion. However, all the unit tests pass. Can someone comment?