Opened 17 years ago

Closed 11 years ago

#5418 closed New feature (wontfix)

Add assertNoBrokenLinks() to test system

Reported by: Adrian Holovaty Owned by: Unai Zalakain
Component: Testing framework Version: dev
Severity: Normal Keywords: feature
Cc: absoludity@… Triage Stage: Accepted
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

It would be convenient and useful if the test system could automatically check every <a href> in a rendered page to make sure the links did not cause 404s.

Obviously, for pages with dozens of links, this could take a long time to run. So I suggest an only_internal keyword argument, which would be True by default and would cause only the *internal* links (i.e., those without an "https?:") to be checked, thereby using Django's internal URL resolving to check the links instead of going over HTTP.

In the documentation for this feature, we should note that people should ensure none of their pages have side effects, because this test will essentially cause all links to be "clicked on."

Attachments (3)

assert_no_broken_links.diff (4.9 KB ) - added by Michael Nelson 17 years ago.
Initial thoughts for feedback.
assert_no_broken_links_with_tests_and_doc.diff (12.7 KB ) - added by Michael Nelson 17 years ago.
New assertNoBrokenLinks (using HTMLParser) with regression tests and docs.
assert_no_broken_links_with_tests_and_doc.2.diff (19.1 KB ) - added by Michael Nelson 17 years ago.
Updated patch that also checks for blank links and internal page links (ie. href="#content")

Download all attachments as: .zip

Change History (35)

comment:1 by Adrian Holovaty, 17 years ago

Triage Stage: UnreviewedAccepted

comment:2 by anonymous, 17 years ago

Owner: changed from nobody to anonymous
Status: newassigned

comment:3 by Michael Nelson, 17 years ago

Owner: changed from anonymous to Michael Nelson
Status: assignednew

comment:4 by Michael Nelson, 17 years ago

Status: newassigned

by Michael Nelson, 17 years ago

Attachment: assert_no_broken_links.diff added

Initial thoughts for feedback.

comment:5 by Michael Nelson, 17 years ago

I've created a patch for an initial solution, but there's a few questions (see doc string for assertNoBrokenLinks()).

The regexes are pretty ugly too. I've tried to use a little indentation to make it clearer, but guessing this goes against PEP08?

Any feedback appreciated.

comment:6 by Michael Radziej <mir@…>, 17 years ago

I haven't reviewed this thoroughly, but line 287

self.fail(u'The URL %s appears to be invalid.')

misses the final % link

comment:7 by Michael Nelson, 17 years ago

Thanks Michael - I'll finish it today (and write the docs/tests) - was just keen to know whether the way I was going about it was ok.

comment:8 by Michael Nelson, 17 years ago

Cc: absoludity@… added
Has patch: set
Resolution: fixed
Status: assignedclosed

I've added a patch with the new assertion and it's own unit tests and docs. For some reason when you click on the attachment, it seems empty, but it's there when you click on the 'raw format' option.

Let me know if there's any improvements to be made!
-Michael

comment:9 by Simon G. <dev@…>, 17 years ago

Resolution: fixed
Status: closedreopened

comment:10 by Fredrik Lundh <fredrik@…>, 17 years ago

Any reason you cannot just use the standard HTMLParser module? It would eliminate all those RE:s, and also get rid of the <a id="">href issue you mentioned. The sample on this page should be helpful, I think:

http://effbot.org/librarybook/htmlparser.htm

comment:11 by Adrian Holovaty, 17 years ago

Patch needs improvement: set

Yes, this patch should be reworked to use HTMLParser instead of regexes. Also, please remove the commented-out print statements in the patch.

comment:12 by Philippe Raoult, 17 years ago

Keywords: feature added

comment:13 by Michael Nelson, 17 years ago

Thanks for the hints and the link. Will try to get the changes in today (au-time).

by Michael Nelson, 17 years ago

New assertNoBrokenLinks (using HTMLParser) with regression tests and docs.

comment:14 by Michael Nelson, 17 years ago

Patch needs improvement: unset

Added new version of assertNoBrokenLinks patch using the HTMLParser, includes regression tests and docs.

I'm not sure what the process is - unchecking patch needs improvement?

comment:15 by Michael Nelson, 17 years ago

Patch needs improvement: set

Actually, hold off on reviewing... just thought of another test case (handling blank href's <a href=""> as the url template tag fails silently and just leaves a blank href.

Not sure if this should actually be an extra arg: ignore_blank_hrefs=False.

by Michael Nelson, 17 years ago

Updated patch that also checks for blank links and internal page links (ie. href="#content")

comment:16 by Michael Nelson, 17 years ago

Patch needs improvement: unset

OK, I've uploaded a further patch that handles blank hrefs as well as verifying that internal page links (href="#content") are not broken. All with their own unit tests.

I didn't overwrite the previous patch in case you don't want the extra functionality.

comment:17 by Kevin Kubasik, 16 years ago

Owner: changed from Michael Nelson to Kevin Kubasik
Status: reopenednew

comment:18 by Eric Holscher, 14 years ago

milestone: 1.3
Triage Stage: AcceptedDesign decision needed

It should be noted that I have implemented this as a third party app. It's useful in Continuous integration and as a quick sanity check of your app.

http://github.com/ericholscher/django-crawler

I'm not sure that this should be in Django itself. It seems like it works fine as an external application, and will advance and be more useful outside of core. Setting this to 1.3 so we can make a decision about it.

comment:19 by Russell Keith-Magee, 14 years ago

Triage Stage: Design decision neededAccepted

To my mind, crawling is a slightly different activity to what this test assertion is proposing.

Crawling is a process of discovering bad links; it requires that there is a tree of links from one page to another.

Link validation as a test case allows you to set up specific test conditions, and check a specific URL as an acceptance test.

Consider the case of an account activation link in an email -- there won't be a link to that URL (so a crawler shouldn't find it), but it should still be validated for invalid links. The URL for the validation link is something that should be different every time it is generated, so writing a test script to check it would be, at the very least, cumbersome.

I think there is a place for both a crawler and a link verifying test assertion.

comment:20 by James Bennett, 14 years ago

milestone: 1.3

1.3 is feature-frozen now.

comment:21 by Gabriel Hurley, 14 years ago

Severity: Normal
Type: New feature

comment:22 by patchhammer, 14 years ago

Easy pickings: unset
Patch needs improvement: set

assert_no_broken_links_with_tests_and_doc.2.diff fails to apply cleanly on to trunk

comment:23 by Aymeric Augustin, 13 years ago

UI/UX: unset

Change UI/UX from NULL to False.

comment:24 by Unai Zalakain, 11 years ago

Owner: changed from Kevin Kubasik to Unai Zalakain
Status: newassigned

comment:25 by Unai Zalakain, 11 years ago

Pull request issued: https://github.com/django/django/pull/1740

Check for broken links in a rendered page. By default only internal links are
checked, if only_internal is False, external links are checked too.

This commit adds the patch, the tests and the docs where the possible side
effects of this test are noted.

comment:26 by Unai Zalakain, 11 years ago

Patch needs improvement: unset
Triage Stage: AcceptedReady for checkin

comment:27 by Tim Graham, 11 years ago

Triage Stage: Ready for checkinAccepted

Someone other than the patch author needs to review and mark as RFC.

comment:28 by Unai Zalakain, 11 years ago

Updated PR with proposed changes.

comment:29 by Aymeric Augustin, 11 years ago

Four years later, I beg to differ with Russell. I don't believe this is sufficiently useful to be added to Django.

Django recommands to generate all URLs in templates with the {% url %}. That guarantees that URLs are valid. (Yes, I can imagine edge cases breaking this guarantee, but it holds for non-pathological URL patterns.) Regarding the "email" example, based on my experience, 95% of problems are going to be in the domain name, not in the path, so this system wouldn't be able to catch them.

Considering the ongoing efforts on Unai's pull request, I'm not going to close the ticket wontfix, but that would have been my choice otherwise.

comment:30 by Unai Zalakain, 11 years ago

I must admit I didn't pick up this issue because I considered it particularly useful but rather because I thought it would be a funny one to work on so I beg you you don't decide whether to accept it or not on the basis of my efforts ;-)

in reply to:  29 comment:31 by Russell Keith-Magee, 11 years ago

Replying to aaugustin:

Four years later, I beg to differ with Russell. I don't believe this is sufficiently useful to be added to Django.

Django recommands to generate all URLs in templates with the {% url %}. That guarantees that URLs are valid. (Yes, I can imagine edge cases breaking this guarantee, but it holds for non-pathological URL patterns.) Regarding the "email" example, based on my experience, 95% of problems are going to be in the domain name, not in the path, so this system wouldn't be able to catch them.

That's a good point - not sure why that didn't occur to me at the time. I'm happy to step back from my "accepted" position on this ticket.

comment:32 by Tim Graham, 11 years ago

Resolution: wontfix
Status: assignedclosed

I'd prefer not to add this as well.

Note: See TracTickets for help on using tickets.
Back to Top