Opened 11 years ago

Closed 11 years ago

#20099 closed New feature (fixed)

Allow filtering 404 reports by user agent

Reported by: Ram Rachum Owned by: nobody
Component: HTTP handling Version: dev
Severity: Normal Keywords:
Cc: Simon Charette Triage Stage: Ready for checkin
Has patch: yes Needs documentation: yes
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

I get a lot of 404 reports to my inbox that are caused by various spiders. I'd like to block these. The same way we have a mechanism for blocking 404 reports for certain URLs, we should be able to block 404 reports with user agents that are easily discernible as spiders.

Attachments (5)

patch.patch (1.8 KB ) - added by Ram Rachum 11 years ago.
Patch
0001-Tests-for-user-agent-input-to-is_ignorable_404.patch (1.3 KB ) - added by Ram Rachum 11 years ago.
Tests
20099-test2.diff (1.2 KB ) - added by Claude Paroz 11 years ago.
is_ignorable_404 test
0001-Implement-BrokenLinkEmailsMiddleware.is_request_we_s.patch (5.4 KB ) - added by Ram Rachum 11 years ago.
0001-Implement-BrokenLinkEmailsMiddleware.is_request_we_s.2.patch (6.1 KB ) - added by Ram Rachum 11 years ago.

Download all attachments as: .zip

Change History (28)

comment:1 by Claude Paroz, 11 years ago

Component: UncategorizedHTTP handling
Triage Stage: UnreviewedAccepted

Globally, I'm not thrilled by the proposal. Firstly, redirecting 404 to inboxes is not a best practice, as generally it is better to redirect to a service like Sentry. Secondly, this filtering is depending on projects, there might be people wanting to make stats about 404 generated by spiders. And thirdly, the spider's UA are changing and hard-coding a list in Django would be subject to frequent updates.

However, we should ease middleware customization so as filtering is easier. I would like to counsel people to subclass BrokenLinkEmailsMiddleware and write their own is_ignorable_404 method. Unfortunately, the user agent is not accessible for that method. So I'm accepting this ticket with two goals:

comment:2 by Ram Rachum, 11 years ago

I agree regarding email, it's better to redirect to services like Sentry.

Regarding the first item, do we need anything beyond just adding , user_agent=None to the definition of is_ignorable_404 and sending in the user agent in process_response?

in reply to:  2 comment:3 by Claude Paroz, 11 years ago

Replying to coolRR:

Regarding the first item, do we need anything beyond just adding , user_agent=None to the definition of is_ignorable_404 and sending in the user agent in process_response?

Yes, something like that.

by Ram Rachum, 11 years ago

Attachment: patch.patch added

Patch

comment:4 by Ram Rachum, 11 years ago

Has patch: set

Patch attached. Let me know whether it's good. (I don't work often with patches, this one was generated by Git, I don't know whether you can use it in hg.)

comment:5 by Claude Paroz, 11 years ago

Needs documentation: set
Needs tests: set

Django is in Git (Python is in hg), so yes, the format is right.
Now we still miss tests and docs.

comment:6 by Ram Rachum, 11 years ago

Here's a test. It's pretty lame; I would have preferred a test that creates a subclass but I'm very clueless about Django's test architecture, so I don't know how to do it in a way that won't bother other Django developers.

by Claude Paroz, 11 years ago

Attachment: 20099-test2.diff added

is_ignorable_404 test

comment:7 by Claude Paroz, 11 years ago

Needs tests: unset

Here's how I would have tested that functionality. Of course, testing is often a matter of personal taste (unit-oriented or functionality-oriented). But it's a little more in line with other tests in the same class.

comment:8 by Aymeric Augustin, 11 years ago

At this point, it would make more sense to pass the whole request to is_ignorable_404.

Please provide a single patch including both code, tests and docs changes.

comment:9 by Ram Rachum, 11 years ago

aaugustin: I considered giving the request as an argument so the user could do anything they want, but then I thought that if the user needs so much freedom, maybe they're better off replacing BrokenLinkEmailsMiddleware completely.

comment:10 by Aymeric Augustin, 11 years ago

I'd just like to avoid further tickets asking to add more keyword arguments.

comment:11 by Ram Rachum, 11 years ago

aaugustin: I've submitted a patch which changes the method to take the request. It also moves the existing request-checking logic into that method. I didn't do docs yet, I want to first know whether this approach is acceptable.

comment:12 by Aymeric Augustin, 11 years ago

This is entering bikeshedding territory, the only thing I don't like in this patch is the name is_request_we_should_notify_for; is_ignorable_request would be much more straightforward.

comment:13 by Ram Rachum, 11 years ago

I'm not attached to the name (I felt it was too long but couldn't come up with a shorter one.)

Note though that is_ignorable_request is a negative name while the name I introduced is positive. (I flipped the logic of the method, I think that all those double negatives are really annoying.)

If you can come up with a short and positive name, I'll be happy to hear it.

comment:14 by Aymeric Augustin, 11 years ago

I thought about the double negative, but I felt it was still worth emphasizing the relation with IGNORABLE_404_URLS.

comment:15 by Ram Rachum, 11 years ago

I'm not sure what you're suggesting. If you're suggesting to keep the old name and negativity so it'll be easier for people to understand that this method is related to IGNORABLE_404_URLS, I disagree.

comment:16 by Aymeric Augustin, 11 years ago

#10214 was closed as a duplicate. The patch for this ticket should add a note in the docs, explaining that subclassing the middleware is the recommended way to alter its behavior.

comment:17 by Ram Rachum, 11 years ago

New patch with docs addition attached.

comment:18 by Ram Rachum, 11 years ago

Can anyone review this or merge it in?

comment:19 by Ram Rachum, 11 years ago

Tumbleweed

comment:20 by Ram Rachum, 11 years ago

Added pull request, if that helps:

https://github.com/django/django/pull/1014

If you want, I can rename is_request_we_should_notify_for to is_relevant_request.

comment:21 by Claude Paroz, 11 years ago

Revisiting this issue after some time, here's an alternative approach: https://github.com/django/django/pull/1211

We need a referee now :-)

comment:22 by Simon Charette, 11 years ago

Cc: Simon Charette added
Triage Stage: AcceptedReady for checkin

Left some minor comment on @claudep's PR. Apart from those the patch looks RFC.

comment:23 by Claude Paroz <claude@…>, 11 years ago

Resolution: fixed
Status: newclosed

In f940e564e4623d531eb97a2cf1b116851003f9fd:

Fixed #20099 -- Eased subclassing of BrokenLinkEmailsMiddleware

Thanks Ram Rachum for the report and the initial patch, and Simon
Charette for the review.

Note: See TracTickets for help on using tickets.
Back to Top