Opened 8 years ago

Closed 15 months ago

#5418 closed New feature (wontfix)

Add assertNoBrokenLinks() to test system

Reported by: adrian Owned by: unaizalakain
Component: Testing framework Version: master
Severity: Normal Keywords: feature
Cc: absoludity@… Triage Stage: Accepted
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

It would be convenient and useful if the test system could automatically check every <a href> in a rendered page to make sure the links did not cause 404s.

Obviously, for pages with dozens of links, this could take a long time to run. So I suggest an only_internal keyword argument, which would be True by default and would cause only the *internal* links (i.e., those without an "https?:") to be checked, thereby using Django's internal URL resolving to check the links instead of going over HTTP.

In the documentation for this feature, we should note that people should ensure none of their pages have side effects, because this test will essentially cause all links to be "clicked on."

Attachments (3)

assert_no_broken_links.diff (4.9 KB) - added by absoludity 8 years ago.
Initial thoughts for feedback.
assert_no_broken_links_with_tests_and_doc.diff (12.7 KB) - added by absoludity 8 years ago.
New assertNoBrokenLinks (using HTMLParser) with regression tests and docs.
assert_no_broken_links_with_tests_and_doc.2.diff (19.1 KB) - added by absoludity 8 years ago.
Updated patch that also checks for blank links and internal page links (ie. href="#content")

Download all attachments as: .zip

Change History (35)

comment:1 Changed 8 years ago by adrian

  • Needs documentation unset
  • Needs tests unset
  • Patch needs improvement unset
  • Triage Stage changed from Unreviewed to Accepted

comment:2 Changed 8 years ago by anonymous

  • Owner changed from nobody to anonymous
  • Status changed from new to assigned

comment:3 Changed 8 years ago by absoludity

  • Owner changed from anonymous to absoludity
  • Status changed from assigned to new

comment:4 Changed 8 years ago by absoludity

  • Status changed from new to assigned

Changed 8 years ago by absoludity

Initial thoughts for feedback.

comment:5 Changed 8 years ago by absoludity

I've created a patch for an initial solution, but there's a few questions (see doc string for assertNoBrokenLinks()).

The regexes are pretty ugly too. I've tried to use a little indentation to make it clearer, but guessing this goes against PEP08?

Any feedback appreciated.

comment:6 Changed 8 years ago by Michael Radziej <mir@…>

I haven't reviewed this thoroughly, but line 287

self.fail(u'The URL %s appears to be invalid.')

misses the final % link

comment:7 Changed 8 years ago by absoludity

Thanks Michael - I'll finish it today (and write the docs/tests) - was just keen to know whether the way I was going about it was ok.

comment:8 Changed 8 years ago by absoludity

  • Cc absoludity@… added
  • Has patch set
  • Resolution set to fixed
  • Status changed from assigned to closed

I've added a patch with the new assertion and it's own unit tests and docs. For some reason when you click on the attachment, it seems empty, but it's there when you click on the 'raw format' option.

Let me know if there's any improvements to be made!
-Michael

comment:9 Changed 8 years ago by Simon G. <dev@…>

  • Resolution fixed deleted
  • Status changed from closed to reopened

comment:10 Changed 8 years ago by Fredrik Lundh <fredrik@…>

Any reason you cannot just use the standard HTMLParser module? It would eliminate all those RE:s, and also get rid of the <a id="">href issue you mentioned. The sample on this page should be helpful, I think:

http://effbot.org/librarybook/htmlparser.htm

comment:11 Changed 8 years ago by adrian

  • Patch needs improvement set

Yes, this patch should be reworked to use HTMLParser instead of regexes. Also, please remove the commented-out print statements in the patch.

comment:12 Changed 8 years ago by PhiR

  • Keywords feature added

comment:13 Changed 8 years ago by absoludity

Thanks for the hints and the link. Will try to get the changes in today (au-time).

Changed 8 years ago by absoludity

New assertNoBrokenLinks (using HTMLParser) with regression tests and docs.

comment:14 Changed 8 years ago by absoludity

  • Patch needs improvement unset

Added new version of assertNoBrokenLinks patch using the HTMLParser, includes regression tests and docs.

I'm not sure what the process is - unchecking patch needs improvement?

comment:15 Changed 8 years ago by absoludity

  • Patch needs improvement set

Actually, hold off on reviewing... just thought of another test case (handling blank href's <a href=""> as the url template tag fails silently and just leaves a blank href.

Not sure if this should actually be an extra arg: ignore_blank_hrefs=False.

Changed 8 years ago by absoludity

Updated patch that also checks for blank links and internal page links (ie. href="#content")

comment:16 Changed 8 years ago by absoludity

  • Patch needs improvement unset

OK, I've uploaded a further patch that handles blank hrefs as well as verifying that internal page links (href="#content") are not broken. All with their own unit tests.

I didn't overwrite the previous patch in case you don't want the extra functionality.

comment:17 Changed 6 years ago by kkubasik

  • Owner changed from absoludity to kkubasik
  • Status changed from reopened to new

comment:18 Changed 5 years ago by ericholscher

  • milestone set to 1.3
  • Triage Stage changed from Accepted to Design decision needed

It should be noted that I have implemented this as a third party app. It's useful in Continuous integration and as a quick sanity check of your app.

http://github.com/ericholscher/django-crawler

I'm not sure that this should be in Django itself. It seems like it works fine as an external application, and will advance and be more useful outside of core. Setting this to 1.3 so we can make a decision about it.

comment:19 Changed 5 years ago by russellm

  • Triage Stage changed from Design decision needed to Accepted

To my mind, crawling is a slightly different activity to what this test assertion is proposing.

Crawling is a process of discovering bad links; it requires that there is a tree of links from one page to another.

Link validation as a test case allows you to set up specific test conditions, and check a specific URL as an acceptance test.

Consider the case of an account activation link in an email -- there won't be a link to that URL (so a crawler shouldn't find it), but it should still be validated for invalid links. The URL for the validation link is something that should be different every time it is generated, so writing a test script to check it would be, at the very least, cumbersome.

I think there is a place for both a crawler and a link verifying test assertion.

comment:20 Changed 5 years ago by ubernostrum

  • milestone 1.3 deleted

1.3 is feature-frozen now.

comment:21 Changed 4 years ago by gabrielhurley

  • Severity set to Normal
  • Type set to New feature

comment:22 Changed 4 years ago by patchhammer

  • Easy pickings unset
  • Patch needs improvement set

assert_no_broken_links_with_tests_and_doc.2.diff fails to apply cleanly on to trunk

comment:23 Changed 4 years ago by aaugustin

  • UI/UX unset

Change UI/UX from NULL to False.

comment:24 Changed 23 months ago by unaizalakain

  • Owner changed from kkubasik to unaizalakain
  • Status changed from new to assigned

comment:25 Changed 23 months ago by unaizalakain

Pull request issued: https://github.com/django/django/pull/1740

Check for broken links in a rendered page. By default only internal links are
checked, if only_internal is False, external links are checked too.

This commit adds the patch, the tests and the docs where the possible side
effects of this test are noted.

comment:26 Changed 23 months ago by unaizalakain

  • Patch needs improvement unset
  • Triage Stage changed from Accepted to Ready for checkin

comment:27 Changed 23 months ago by timo

  • Triage Stage changed from Ready for checkin to Accepted

Someone other than the patch author needs to review and mark as RFC.

comment:28 Changed 23 months ago by unaizalakain

Updated PR with proposed changes.

comment:29 follow-up: Changed 15 months ago by aaugustin

Four years later, I beg to differ with Russell. I don't believe this is sufficiently useful to be added to Django.

Django recommands to generate all URLs in templates with the {% url %}. That guarantees that URLs are valid. (Yes, I can imagine edge cases breaking this guarantee, but it holds for non-pathological URL patterns.) Regarding the "email" example, based on my experience, 95% of problems are going to be in the domain name, not in the path, so this system wouldn't be able to catch them.

Considering the ongoing efforts on Unai's pull request, I'm not going to close the ticket wontfix, but that would have been my choice otherwise.

comment:30 Changed 15 months ago by unaizalakain

I must admit I didn't pick up this issue because I considered it particularly useful but rather because I thought it would be a funny one to work on so I beg you you don't decide whether to accept it or not on the basis of my efforts ;-)

comment:31 in reply to: ↑ 29 Changed 15 months ago by russellm

Replying to aaugustin:

Four years later, I beg to differ with Russell. I don't believe this is sufficiently useful to be added to Django.

Django recommands to generate all URLs in templates with the {% url %}. That guarantees that URLs are valid. (Yes, I can imagine edge cases breaking this guarantee, but it holds for non-pathological URL patterns.) Regarding the "email" example, based on my experience, 95% of problems are going to be in the domain name, not in the path, so this system wouldn't be able to catch them.

That's a good point - not sure why that didn't occur to me at the time. I'm happy to step back from my "accepted" position on this ticket.

comment:32 Changed 15 months ago by timo

  • Resolution set to wontfix
  • Status changed from assigned to closed

I'd prefer not to add this as well.

Note: See TracTickets for help on using tickets.
Back to Top