Opened 12 years ago

Closed 12 years ago

Last modified 8 years ago

#18239 closed Bug (fixed)

Only use custom subclass of HTMLParser for Python versions with buggy stdlib HTMLParser

Reported by: Carl Meyer
Owned by: nobody
Component: Core (Other)
Version: 1.3
Severity: Release blocker
Keywords:
Cc: Raphaël Hertzog
Triage Stage: Accepted
Has patch: yes
Needs documentation: no
Needs tests: no
Patch needs improvement: no
Easy pickings: no
UI/UX: no

Description

Django currently has its own subclass of HTMLParser (in django.utils.html_parser.HTMLParser). It exists to patch a bug in the standard library's HTMLParser in Python 2.5 and in older versions of 2.6 and 2.7. The bug has been fixed in Python 2.6.8 and 2.7.3, and will be fixed in the upcoming 3.3 as well. 3.3's HTMLParser also contains other fixes that conflict with the patched version in Django, since Django's subclass relies on numerous undocumented internals.

For better forward-compatibility, we should only use our patched subclass for versions of Python known to contain the bug, and otherwise simply use the standard library's HTMLParser directly.
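A minimal sketch of the version check proposed above, not the code from the attached patch or the final commit. The function name needs_patched_parser and the PatchedHTMLParser stand-in are hypothetical. The 2.x thresholds come from this description and the 3.2.3 threshold from comment:4; treating every 3.x release before 3.2.3 as affected is an assumption the ticket does not state explicitly. The import path is the Python 3 one so that the example runs on a current interpreter.

```python
import sys
from html.parser import HTMLParser as StdlibHTMLParser


def needs_patched_parser(version_info):
    """Return True for Python versions assumed to ship the buggy HTMLParser.

    Thresholds taken from the ticket: fixed in 2.6.8, 2.7.3 and 3.2.3.
    Assumption: all 3.x releases before 3.2.3 are affected.
    """
    v = tuple(version_info[:3])
    return (
        v < (2, 6, 8)                   # 2.5 and early 2.6 releases
        or (2, 7) <= v < (2, 7, 3)      # early 2.7 releases
        or (3, 0) <= v < (3, 2, 3)      # assumed: 3.x before the fix
    )


class PatchedHTMLParser(StdlibHTMLParser):
    """Hypothetical stand-in for Django's patched subclass."""


# Use the patched subclass only where the bug is known to exist;
# otherwise fall back to the standard library's parser directly.
if needs_patched_parser(sys.version_info):
    HTMLParser = PatchedHTMLParser
else:
    HTMLParser = StdlibHTMLParser
```

Comparing tuples keeps the check readable and avoids parsing version strings. On any interpreter at or beyond the fixed releases, HTMLParser is simply the standard library class.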

When we make this change, we can also roll back r17456, as that was simply papering over a breakage due to the modified HTMLParser in 2.6.8 and 2.7.3 - that will no longer be a problem if we don't try to use our subclass with those (and newer) Pythons.

Attachments (1)

01_use_stdlib_htmlparser_when_possible.diff (8.9 KB) - added by Raphaël Hertzog 12 years ago.
Patch for Django 1.4.1


Change History (10)

comment:1 by Carl Meyer, 12 years ago

(Thanks to Vinay Sajip for discovering and raising this issue.)

comment:2 by Raphaël Hertzog, 12 years ago

Cc: Raphaël Hertzog added

For me, the test suite of Django 1.4.1 fails with many invalid HTML parse errors when I run it on Debian Sid with Python 2.7.3. Is this the same issue?

Example of error:

======================================================================
ERROR: test_count (regressiontests.test_utils.tests.HTMLEqualTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/«PKGBUILDDIR»/tests/regressiontests/test_utils/tests.py", line 396, in test_count
    dom2 = parse_html('<p class="bar">foo</p>')
  File "/«PKGBUILDDIR»/django/test/html.py", line 213, in parse_html
    parser.feed(html)
  File "/usr/lib/python2.7/HTMLParser.py", line 114, in feed
    self.goahead(0)
  File "/usr/lib/python2.7/HTMLParser.py", line 160, in goahead
    k = self.parse_endtag(i)
  File "/«PKGBUILDDIR»/django/utils/html_parser.py", line 96, in parse_endtag
    self.handle_endtag(tag.lower())
  File "/«PKGBUILDDIR»/django/test/html.py", line 191, in handle_endtag
    tag, self.format_position()))
  File "/«PKGBUILDDIR»/django/test/html.py", line 153, in error
    raise HTMLParseError(msg, self.getpos())
HTMLParseError: Unexpected end tag `p` (Line 1, Column 18), at line 1, column 19

by Raphaël Hertzog, 12 years ago

Patch for Django 1.4.1

comment:3 by Raphaël Hertzog, 12 years ago

Has patch: set
Severity: Normal → Release blocker

Here's a patch that seems to solve the issue for me by doing what the bug description suggests, i.e. using Django's own HTMLParser only with Python versions that have the problem. It should be straightforward to adapt it for the development version.

I took the liberty of increasing the severity, as Django is effectively broken for me on Debian Sid right now.

comment:4 by Claude Paroz, 12 years ago

Python 3.2.3 has the fix also.

comment:5 by Raphaël Hertzog, 12 years ago

I would appreciate an ack/review from a core developer before I upload this patch to Debian... but it would be even better if I could just cherry-pick the definitive fix from trunk.

comment:6 by Claude Paroz <claude@…>, 12 years ago

Resolution: fixed
Status: new → closed

In [5c79dd586534bc88ce7dc81c2d781c772d28b121]:

Fixed #18239 -- Subclassed HTMLParser only for selected Python versions

Only Python versions affected by http://bugs.python.org/issue670664
should patch HTMLParser.
Thanks Raphaël Hertzog for the initial patch (for 1.4).

comment:7 by Claude Paroz <claude@…>, 12 years ago

In [57d9ccc4aaef0420f6ba60a26e6af4e83b803ae9]:

[1.4.x] Fixed #18239 -- Subclassed HTMLParser only for selected Python versions

Only Python versions affected by http://bugs.python.org/issue670664
should patch HTMLParser.

comment:8 by Claude Paroz, 12 years ago

Applied to all Python 2.6 versions in [fcec904e4f3582a45d4d8e309e71e9f0c4d79a0c]

comment:9 by Tim Graham <timograham@…>, 8 years ago

In 2c125bde:

Refs #18239 -- Removed an obsolete workaround for bugs in HTMLParser.
