Opened 11 years ago

Closed 11 years ago

#20099 closed New feature (fixed)

Allow filtering 404 reports by user agent

Reported by: Ram Rachum Owned by: nobody
Component: HTTP handling Version: dev
Severity: Normal Keywords:
Cc: Simon Charette Triage Stage: Ready for checkin
Has patch: yes Needs documentation: yes
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

I get a lot of 404 reports to my inbox that are caused by various spiders. I'd like to block these. The same way we have a mechanism for blocking 404 reports for certain URLs, we should be able to block 404 reports with user agents that are easily discernible as spiders.

Attachments (5)

patch.patch (1.8 KB) - added by Ram Rachum 11 years ago.
Patch
0001-Tests-for-user-agent-input-to-is_ignorable_404.patch (1.3 KB) - added by Ram Rachum 11 years ago.
Tests
20099-test2.diff (1.2 KB) - added by Claude Paroz 11 years ago.
is_ignorable_404 test
0001-Implement-BrokenLinkEmailsMiddleware.is_request_we_s.patch (5.4 KB) - added by Ram Rachum 11 years ago.
0001-Implement-BrokenLinkEmailsMiddleware.is_request_we_s.2.patch (6.1 KB) - added by Ram Rachum 11 years ago.

Download all attachments as: .zip

Change History (28)

comment:1 Changed 11 years ago by Claude Paroz

Component: UncategorizedHTTP handling
Triage Stage: UnreviewedAccepted

Globally, I'm not thrilled by the proposal. Firstly, redirecting 404 to inboxes is not a best practice, as generally it is better to redirect to a service like Sentry. Secondly, this filtering is depending on projects, there might be people wanting to make stats about 404 generated by spiders. And thirdly, the spider's UA are changing and hard-coding a list in Django would be subject to frequent updates.

However, we should ease middleware customization so as filtering is easier. I would like to counsel people to subclass BrokenLinkEmailsMiddleware and write their own is_ignorable_404 method. Unfortunately, the user agent is not accessible for that method. So I'm accepting this ticket with two goals:

comment:2 Changed 11 years ago by Ram Rachum

I agree regarding email, it's better to redirect to services like Sentry.

Regarding the first item, do we need anything beyond just adding , user_agent=None to the definition of is_ignorable_404 and sending in the user agent in process_response?

comment:3 in reply to:  2 Changed 11 years ago by Claude Paroz

Replying to coolRR:

Regarding the first item, do we need anything beyond just adding , user_agent=None to the definition of is_ignorable_404 and sending in the user agent in process_response?

Yes, something like that.

Changed 11 years ago by Ram Rachum

Attachment: patch.patch added

Patch

comment:4 Changed 11 years ago by Ram Rachum

Has patch: set

Patch attached. Let me know whether it's good. (I don't work often with patches, this one was generated by Git, I don't know whether you can use it in hg.)

comment:5 Changed 11 years ago by Claude Paroz

Needs documentation: set
Needs tests: set

Django is in Git (Python is in hg), so yes, the format is right.
Now we still miss tests and docs.

Changed 11 years ago by Ram Rachum

Tests

comment:6 Changed 11 years ago by Ram Rachum

Here's a test. It's pretty lame; I would have preferred a test that creates a subclass but I'm very clueless about Django's test architecture, so I don't know how to do it in a way that won't bother other Django developers.

Changed 11 years ago by Claude Paroz

Attachment: 20099-test2.diff added

is_ignorable_404 test

comment:7 Changed 11 years ago by Claude Paroz

Needs tests: unset

Here's how I would have tested that functionality. Of course, testing is often a matter of personal taste (unit-oriented or functionality-oriented). But it's a little more in line with other tests in the same class.

comment:8 Changed 11 years ago by Aymeric Augustin

At this point, it would make more sense to pass the whole request to is_ignorable_404.

Please provide a single patch including both code, tests and docs changes.

comment:9 Changed 11 years ago by Ram Rachum

aaugustin: I considered giving the request as an argument so the user could do anything they want, but then I thought that if the user needs so much freedom, maybe they're better off replacing BrokenLinkEmailsMiddleware completely.

comment:10 Changed 11 years ago by Aymeric Augustin

I'd just like to avoid further tickets asking to add more keyword arguments.

comment:11 Changed 11 years ago by Ram Rachum

aaugustin: I've submitted a patch which changes the method to take the request. It also moves the existing request-checking logic into that method. I didn't do docs yet, I want to first know whether this approach is acceptable.

comment:12 Changed 11 years ago by Aymeric Augustin

This is entering bikeshedding territory, the only thing I don't like in this patch is the name is_request_we_should_notify_for; is_ignorable_request would be much more straightforward.

comment:13 Changed 11 years ago by Ram Rachum

I'm not attached to the name (I felt it was too long but couldn't come up with a shorter one.)

Note though that is_ignorable_request is a negative name while the name I introduced is positive. (I flipped the logic of the method, I think that all those double negatives are really annoying.)

If you can come up with a short and positive name, I'll be happy to hear it.

comment:14 Changed 11 years ago by Aymeric Augustin

I thought about the double negative, but I felt it was still worth emphasizing the relation with IGNORABLE_404_URLS.

comment:15 Changed 11 years ago by Ram Rachum

I'm not sure what you're suggesting. If you're suggesting to keep the old name and negativity so it'll be easier for people to understand that this method is related to IGNORABLE_404_URLS, I disagree.

comment:16 Changed 11 years ago by Aymeric Augustin

#10214 was closed as a duplicate. The patch for this ticket should add a note in the docs, explaining that subclassing the middleware is the recommended way to alter its behavior.

comment:17 Changed 11 years ago by Ram Rachum

New patch with docs addition attached.

comment:18 Changed 11 years ago by Ram Rachum

Can anyone review this or merge it in?

comment:19 Changed 11 years ago by Ram Rachum

Tumbleweed

comment:20 Changed 11 years ago by Ram Rachum

Added pull request, if that helps:

https://github.com/django/django/pull/1014

If you want, I can rename is_request_we_should_notify_for to is_relevant_request.

comment:21 Changed 11 years ago by Claude Paroz

Revisiting this issue after some time, here's an alternative approach: https://github.com/django/django/pull/1211

We need a referee now :-)

comment:22 Changed 11 years ago by Simon Charette

Cc: Simon Charette added
Triage Stage: AcceptedReady for checkin

Left some minor comment on @claudep's PR. Apart from those the patch looks RFC.

comment:23 Changed 11 years ago by Claude Paroz <claude@…>

Resolution: fixed
Status: newclosed

In f940e564e4623d531eb97a2cf1b116851003f9fd:

Fixed #20099 -- Eased subclassing of BrokenLinkEmailsMiddleware

Thanks Ram Rachum for the report and the initial patch, and Simon
Charette for the review.

Note: See TracTickets for help on using tickets.
Back to Top