Opened 6 years ago

Closed 5 years ago

Last modified 5 years ago

#11565 closed (wontfix)

Rate Limit Error Emails

Reported by: andrewmintel Owned by: nobody
Component: Core (Other) Version: master
Severity: Keywords:
Cc: simon29 Triage Stage: Design decision needed
Has patch: yes Needs documentation: no
Needs tests: yes Patch needs improvement: no
Easy pickings: UI/UX:

Description

We like the emails we receive when an exception occurs on the live site. However, if an error repeatedly occurs (e.g. the database goes down) we get spammed with a huge number of emails.

I'm attaching a patch which will only send an email once every settings.ERROR_EMAIL_RATE_LIMIT minutes for each error. It determines whether it has sent an email about a particular error by taking an MD5 of the traceback and storing it in whatever cache is configured. If the same MD5 already exists in the cache, then it doesn't send an email.

The big problem with this approach is what if your cache starts to produce errors? Certainly memcached is designed so that if the server goes away all keys appear not to be set, rather than an error. I'm not sure about how the database cache backend would work in this situation.

Any suggestions on how to improve the patch are appreciated. If it's ok then I'll produce a patch for the documentation too.

Cheers,
Andrew

Attachments (2)

error_emails_rate_limit.patch (3.6 KB) - added by andrewmintel 6 years ago.
error_email_rate_limit_v2.diff (3.6 KB) - added by simon29 5 years ago.
Falls back to memory if cache isn't available

Download all attachments as: .zip

Change History (10)

comment:1 Changed 6 years ago by lukeplant

  • Needs documentation unset
  • Needs tests unset
  • Patch needs improvement unset

Surely the answer to your question about the cache throwing errors is to use the cache very defensively? Just put broad try/catch around both calls to the cache, and if anything fails, ignore those errors and make sure you send the e-mail.

comment:2 Changed 6 years ago by andrewmintel

That's a fair point, I'm just wary about using an except block which catches all exceptions. I guess in this case it's the lesser of two evils. I'll produce a better patch for the code and docs today.

Changed 6 years ago by andrewmintel

comment:3 Changed 6 years ago by andrewmintel

I've attached an improved patch including documentation. Tests will take me a bit longer to create as I don't think the current error emailing behavior is tested.

comment:4 Changed 6 years ago by Alex

  • Triage Stage changed from Unreviewed to Design decision needed

comment:5 Changed 5 years ago by piramida

a year passed, design decision still needed.
this feature would make one nice django feature (error emails) actually useful for production without hacks. While the proposed solution is not perfect, a way to limit total error mail count to prevent mail overload on a busy site would be great.

comment:6 Changed 5 years ago by simon29

  • Cc simon29 added
  • Needs tests set

See discussion here

Python logging support- way to go. Arecibo- sounds cool.

Before it even gets there though, I need my errors rate limited. Why bother email, Python logging, Arecibo, or anything else with thousands+ of errors a second when every single one of those errors are identical.

I like Andrew's approach, so I decided to improve on it. Such that if cache is down, we use local memory (limited, of course). Updated patch attached.

Changed 5 years ago by simon29

Falls back to memory if cache isn't available

comment:7 Changed 5 years ago by lukeplant

  • Resolution set to wontfix
  • Status changed from new to closed

Is there any reason you couldn't do this within a custom logging handler, as suggested by Russell on that thread? Why do you need your errors rate limited at that level of the code, rather than at another level? Something is dealing with thousands of errors a second, and with your implementation it is forced to be in-process Python code, while with a custom logging handler even the duplicate detection could be passed off to another process. (or not, whatever you decide).

If we added this, we would also need to decide on the definition of 'duplicate' - you have decided that it is based solely on the traceback, which is fine. But I can see other possibilities - like wanting to know how many times the error is happening per user, for sites with fewer users for example.

To me it seems this is better done within a custom logging handler, so that it can be developed and improved independently of Django itself. That's what Russell clearly thought too, in the thread on django-developers, so I'm going to close WONTFIX.

comment:8 Changed 5 years ago by simon29

Oops, sorry Luke. Somehow I failed to realise the logging stuff had hit trunk already- which, by the way, is superb!

Here's the rate limiter as a logging filter

Note: See TracTickets for help on using tickets.
Back to Top