Opened 9 years ago

Closed 2 years ago

#5418 closed New feature (wontfix)

Add assertNoBrokenLinks() to test system

Reported by: Adrian Holovaty Owned by: Unai Zalakain
Component: Testing framework Version: master
Severity: Normal Keywords: feature
Cc: absoludity@… Triage Stage: Accepted
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

It would be convenient and useful if the test system could automatically check every <a href> in a rendered page to make sure the links did not cause 404s.

Obviously, for pages with dozens of links, this could take a long time to run. So I suggest an only_internal keyword argument, which would be True by default and would cause only the *internal* links (i.e., those without an "https?:") to be checked, thereby using Django's internal URL resolving to check the links instead of going over HTTP.

In the documentation for this feature, we should note that people should ensure none of their pages have side effects, because this test will essentially cause all links to be "clicked on."

Attachments (3)

assert_no_broken_links.diff (4.9 KB) - added by Michael Nelson 9 years ago.
Initial thoughts for feedback.
assert_no_broken_links_with_tests_and_doc.diff (12.7 KB) - added by Michael Nelson 9 years ago.
New assertNoBrokenLinks (using HTMLParser) with regression tests and docs.
assert_no_broken_links_with_tests_and_doc.2.diff (19.1 KB) - added by Michael Nelson 9 years ago.
Updated patch that also checks for blank links and internal page links (ie. href="#content")

Download all attachments as: .zip

Change History (35)

comment:1 Changed 9 years ago by Adrian Holovaty

Needs documentation: unset
Needs tests: unset
Patch needs improvement: unset
Triage Stage: UnreviewedAccepted

comment:2 Changed 9 years ago by anonymous

Owner: changed from nobody to anonymous
Status: newassigned

comment:3 Changed 9 years ago by Michael Nelson

Owner: changed from anonymous to Michael Nelson
Status: assignednew

comment:4 Changed 9 years ago by Michael Nelson

Status: newassigned

Changed 9 years ago by Michael Nelson

Attachment: assert_no_broken_links.diff added

Initial thoughts for feedback.

comment:5 Changed 9 years ago by Michael Nelson

I've created a patch for an initial solution, but there's a few questions (see doc string for assertNoBrokenLinks()).

The regexes are pretty ugly too. I've tried to use a little indentation to make it clearer, but guessing this goes against PEP08?

Any feedback appreciated.

comment:6 Changed 9 years ago by Michael Radziej <mir@…>

I haven't reviewed this thoroughly, but line 287

self.fail(u'The URL %s appears to be invalid.')

misses the final % link

comment:7 Changed 9 years ago by Michael Nelson

Thanks Michael - I'll finish it today (and write the docs/tests) - was just keen to know whether the way I was going about it was ok.

comment:8 Changed 9 years ago by Michael Nelson

Cc: absoludity@… added
Has patch: set
Resolution: fixed
Status: assignedclosed

I've added a patch with the new assertion and it's own unit tests and docs. For some reason when you click on the attachment, it seems empty, but it's there when you click on the 'raw format' option.

Let me know if there's any improvements to be made!
-Michael

comment:9 Changed 9 years ago by Simon G. <dev@…>

Resolution: fixed
Status: closedreopened

comment:10 Changed 9 years ago by Fredrik Lundh <fredrik@…>

Any reason you cannot just use the standard HTMLParser module? It would eliminate all those RE:s, and also get rid of the <a id="">href issue you mentioned. The sample on this page should be helpful, I think:

http://effbot.org/librarybook/htmlparser.htm

comment:11 Changed 9 years ago by Adrian Holovaty

Patch needs improvement: set

Yes, this patch should be reworked to use HTMLParser instead of regexes. Also, please remove the commented-out print statements in the patch.

comment:12 Changed 9 years ago by Philippe Raoult

Keywords: feature added

comment:13 Changed 9 years ago by Michael Nelson

Thanks for the hints and the link. Will try to get the changes in today (au-time).

Changed 9 years ago by Michael Nelson

New assertNoBrokenLinks (using HTMLParser) with regression tests and docs.

comment:14 Changed 9 years ago by Michael Nelson

Patch needs improvement: unset

Added new version of assertNoBrokenLinks patch using the HTMLParser, includes regression tests and docs.

I'm not sure what the process is - unchecking patch needs improvement?

comment:15 Changed 9 years ago by Michael Nelson

Patch needs improvement: set

Actually, hold off on reviewing... just thought of another test case (handling blank href's <a href=""> as the url template tag fails silently and just leaves a blank href.

Not sure if this should actually be an extra arg: ignore_blank_hrefs=False.

Changed 9 years ago by Michael Nelson

Updated patch that also checks for blank links and internal page links (ie. href="#content")

comment:16 Changed 9 years ago by Michael Nelson

Patch needs improvement: unset

OK, I've uploaded a further patch that handles blank hrefs as well as verifying that internal page links (href="#content") are not broken. All with their own unit tests.

I didn't overwrite the previous patch in case you don't want the extra functionality.

comment:17 Changed 7 years ago by Kevin Kubasik

Owner: changed from Michael Nelson to Kevin Kubasik
Status: reopenednew

comment:18 Changed 6 years ago by Eric Holscher

milestone: 1.3
Triage Stage: AcceptedDesign decision needed

It should be noted that I have implemented this as a third party app. It's useful in Continuous integration and as a quick sanity check of your app.

http://github.com/ericholscher/django-crawler

I'm not sure that this should be in Django itself. It seems like it works fine as an external application, and will advance and be more useful outside of core. Setting this to 1.3 so we can make a decision about it.

comment:19 Changed 6 years ago by Russell Keith-Magee

Triage Stage: Design decision neededAccepted

To my mind, crawling is a slightly different activity to what this test assertion is proposing.

Crawling is a process of discovering bad links; it requires that there is a tree of links from one page to another.

Link validation as a test case allows you to set up specific test conditions, and check a specific URL as an acceptance test.

Consider the case of an account activation link in an email -- there won't be a link to that URL (so a crawler shouldn't find it), but it should still be validated for invalid links. The URL for the validation link is something that should be different every time it is generated, so writing a test script to check it would be, at the very least, cumbersome.

I think there is a place for both a crawler and a link verifying test assertion.

comment:20 Changed 6 years ago by James Bennett

milestone: 1.3

1.3 is feature-frozen now.

comment:21 Changed 5 years ago by Gabriel Hurley

Severity: Normal
Type: New feature

comment:22 Changed 5 years ago by patchhammer

Easy pickings: unset
Patch needs improvement: set

assert_no_broken_links_with_tests_and_doc.2.diff fails to apply cleanly on to trunk

comment:23 Changed 5 years ago by Aymeric Augustin

UI/UX: unset

Change UI/UX from NULL to False.

comment:24 Changed 3 years ago by Unai Zalakain

Owner: changed from Kevin Kubasik to Unai Zalakain
Status: newassigned

comment:25 Changed 3 years ago by Unai Zalakain

Pull request issued: https://github.com/django/django/pull/1740

Check for broken links in a rendered page. By default only internal links are
checked, if only_internal is False, external links are checked too.

This commit adds the patch, the tests and the docs where the possible side
effects of this test are noted.

comment:26 Changed 3 years ago by Unai Zalakain

Patch needs improvement: unset
Triage Stage: AcceptedReady for checkin

comment:27 Changed 3 years ago by Tim Graham

Triage Stage: Ready for checkinAccepted

Someone other than the patch author needs to review and mark as RFC.

comment:28 Changed 3 years ago by Unai Zalakain

Updated PR with proposed changes.

comment:29 Changed 2 years ago by Aymeric Augustin

Four years later, I beg to differ with Russell. I don't believe this is sufficiently useful to be added to Django.

Django recommands to generate all URLs in templates with the {% url %}. That guarantees that URLs are valid. (Yes, I can imagine edge cases breaking this guarantee, but it holds for non-pathological URL patterns.) Regarding the "email" example, based on my experience, 95% of problems are going to be in the domain name, not in the path, so this system wouldn't be able to catch them.

Considering the ongoing efforts on Unai's pull request, I'm not going to close the ticket wontfix, but that would have been my choice otherwise.

comment:30 Changed 2 years ago by Unai Zalakain

I must admit I didn't pick up this issue because I considered it particularly useful but rather because I thought it would be a funny one to work on so I beg you you don't decide whether to accept it or not on the basis of my efforts ;-)

comment:31 in reply to:  29 Changed 2 years ago by Russell Keith-Magee

Replying to aaugustin:

Four years later, I beg to differ with Russell. I don't believe this is sufficiently useful to be added to Django.

Django recommands to generate all URLs in templates with the {% url %}. That guarantees that URLs are valid. (Yes, I can imagine edge cases breaking this guarantee, but it holds for non-pathological URL patterns.) Regarding the "email" example, based on my experience, 95% of problems are going to be in the domain name, not in the path, so this system wouldn't be able to catch them.

That's a good point - not sure why that didn't occur to me at the time. I'm happy to step back from my "accepted" position on this ticket.

comment:32 Changed 2 years ago by Tim Graham

Resolution: wontfix
Status: assignedclosed

I'd prefer not to add this as well.

Note: See TracTickets for help on using tickets.
Back to Top