Opened 2 years ago

Closed 21 months ago

#20099 closed New feature (fixed)

Allow filtering 404 reports by user agent

Reported by: coolRR Owned by: nobody
Component: HTTP handling Version: master
Severity: Normal Keywords:
Cc: charettes Triage Stage: Ready for checkin
Has patch: yes Needs documentation: yes
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

I get a lot of 404 reports to my inbox that are caused by various spiders. I'd like to block these. The same way we have a mechanism for blocking 404 reports for certain URLs, we should be able to block 404 reports with user agents that are easily discernible as spiders.

Attachments (5)

patch.patch (1.8 KB) - added by coolRR 2 years ago.
Patch
0001-Tests-for-user-agent-input-to-is_ignorable_404.patch (1.3 KB) - added by coolRR 2 years ago.
Tests
20099-test2.diff (1.2 KB) - added by claudep 2 years ago.
is_ignorable_404 test
0001-Implement-BrokenLinkEmailsMiddleware.is_request_we_s.patch (5.4 KB) - added by coolRR 2 years ago.
0001-Implement-BrokenLinkEmailsMiddleware.is_request_we_s.2.patch (6.1 KB) - added by coolRR 2 years ago.

Download all attachments as: .zip

Change History (28)

comment:1 Changed 2 years ago by claudep

  • Component changed from Uncategorized to HTTP handling
  • Needs documentation unset
  • Needs tests unset
  • Patch needs improvement unset
  • Triage Stage changed from Unreviewed to Accepted

Globally, I'm not thrilled by the proposal. Firstly, redirecting 404 to inboxes is not a best practice, as generally it is better to redirect to a service like Sentry. Secondly, this filtering is depending on projects, there might be people wanting to make stats about 404 generated by spiders. And thirdly, the spider's UA are changing and hard-coding a list in Django would be subject to frequent updates.

However, we should ease middleware customization so as filtering is easier. I would like to counsel people to subclass BrokenLinkEmailsMiddleware and write their own is_ignorable_404 method. Unfortunately, the user agent is not accessible for that method. So I'm accepting this ticket with two goals:

comment:2 follow-up: Changed 2 years ago by coolRR

I agree regarding email, it's better to redirect to services like Sentry.

Regarding the first item, do we need anything beyond just adding , user_agent=None to the definition of is_ignorable_404 and sending in the user agent in process_response?

comment:3 in reply to: ↑ 2 Changed 2 years ago by claudep

Replying to coolRR:

Regarding the first item, do we need anything beyond just adding , user_agent=None to the definition of is_ignorable_404 and sending in the user agent in process_response?

Yes, something like that.

Changed 2 years ago by coolRR

Patch

comment:4 Changed 2 years ago by coolRR

  • Has patch set

Patch attached. Let me know whether it's good. (I don't work often with patches, this one was generated by Git, I don't know whether you can use it in hg.)

comment:5 Changed 2 years ago by claudep

  • Needs documentation set
  • Needs tests set

Django is in Git (Python is in hg), so yes, the format is right.
Now we still miss tests and docs.

comment:6 Changed 2 years ago by coolRR

Here's a test. It's pretty lame; I would have preferred a test that creates a subclass but I'm very clueless about Django's test architecture, so I don't know how to do it in a way that won't bother other Django developers.

Changed 2 years ago by claudep

is_ignorable_404 test

comment:7 Changed 2 years ago by claudep

  • Needs tests unset

Here's how I would have tested that functionality. Of course, testing is often a matter of personal taste (unit-oriented or functionality-oriented). But it's a little more in line with other tests in the same class.

comment:8 Changed 2 years ago by aaugustin

At this point, it would make more sense to pass the whole request to is_ignorable_404.

Please provide a single patch including both code, tests and docs changes.

comment:9 Changed 2 years ago by coolRR

aaugustin: I considered giving the request as an argument so the user could do anything they want, but then I thought that if the user needs so much freedom, maybe they're better off replacing BrokenLinkEmailsMiddleware completely.

comment:10 Changed 2 years ago by aaugustin

I'd just like to avoid further tickets asking to add more keyword arguments.

comment:11 Changed 2 years ago by coolRR

aaugustin: I've submitted a patch which changes the method to take the request. It also moves the existing request-checking logic into that method. I didn't do docs yet, I want to first know whether this approach is acceptable.

comment:12 Changed 2 years ago by aaugustin

This is entering bikeshedding territory, the only thing I don't like in this patch is the name is_request_we_should_notify_for; is_ignorable_request would be much more straightforward.

comment:13 Changed 2 years ago by coolRR

I'm not attached to the name (I felt it was too long but couldn't come up with a shorter one.)

Note though that is_ignorable_request is a negative name while the name I introduced is positive. (I flipped the logic of the method, I think that all those double negatives are really annoying.)

If you can come up with a short and positive name, I'll be happy to hear it.

comment:14 Changed 2 years ago by aaugustin

I thought about the double negative, but I felt it was still worth emphasizing the relation with IGNORABLE_404_URLS.

comment:15 Changed 2 years ago by coolRR

I'm not sure what you're suggesting. If you're suggesting to keep the old name and negativity so it'll be easier for people to understand that this method is related to IGNORABLE_404_URLS, I disagree.

comment:16 Changed 2 years ago by aaugustin

#10214 was closed as a duplicate. The patch for this ticket should add a note in the docs, explaining that subclassing the middleware is the recommended way to alter its behavior.

comment:17 Changed 2 years ago by coolRR

New patch with docs addition attached.

comment:18 Changed 2 years ago by coolRR

Can anyone review this or merge it in?

comment:19 Changed 23 months ago by coolRR

Tumbleweed

comment:20 Changed 23 months ago by coolRR

Added pull request, if that helps:

https://github.com/django/django/pull/1014

If you want, I can rename is_request_we_should_notify_for to is_relevant_request.

comment:21 Changed 22 months ago by claudep

Revisiting this issue after some time, here's an alternative approach: https://github.com/django/django/pull/1211

We need a referee now :-)

comment:22 Changed 22 months ago by charettes

  • Cc charettes added
  • Triage Stage changed from Accepted to Ready for checkin

Left some minor comment on @claudep's PR. Apart from those the patch looks RFC.

comment:23 Changed 21 months ago by Claude Paroz <claude@…>

  • Resolution set to fixed
  • Status changed from new to closed

In f940e564e4623d531eb97a2cf1b116851003f9fd:

Fixed #20099 -- Eased subclassing of BrokenLinkEmailsMiddleware

Thanks Ram Rachum for the report and the initial patch, and Simon
Charette for the review.

Note: See TracTickets for help on using tickets.
Back to Top