Opened 7 years ago

Closed 6 years ago

Last modified 6 years ago

#11565 closed (wontfix)

Rate Limit Error Emails

Reported by: Andrew Wilkinson Owned by: nobody
Component: Core (Other) Version: master
Severity: Keywords:
Cc: Simon Litchfield Triage Stage: Design decision needed
Has patch: yes Needs documentation: no
Needs tests: yes Patch needs improvement: no
Easy pickings: UI/UX:

Description

We like the emails we receive when an exception occurs on the live site. However, if an error repeatedly occurs (e.g. the database goes down) we get spammed with a huge number of emails.

I'm attaching a patch which will only send an email once every settings.ERROR_EMAIL_RATE_LIMIT minutes for each error. It determines whether it has sent an email about a particular error by taking an MD5 of the traceback and storing it in whatever cache is configured. If the same MD5 already exists in the cache, then it doesn't send an email.

The big problem with this approach is what if your cache starts to produce errors? Certainly memcached is designed so that if the server goes away all keys appear not to be set, rather than an error. I'm not sure about how the database cache backend would work in this situation.

Any suggestions on how to improve the patch are appreciated. If it's ok then I'll produce a patch for the documentation too.

Cheers,
Andrew

Attachments (2)

error_emails_rate_limit.patch (3.6 KB) - added by Andrew Wilkinson 7 years ago.
error_email_rate_limit_v2.diff (3.6 KB) - added by Simon Litchfield 6 years ago.
Falls back to memory if cache isn't available

Download all attachments as: .zip

Change History (10)

comment:1 Changed 7 years ago by Luke Plant

Needs documentation: unset
Needs tests: unset
Patch needs improvement: unset

Surely the answer to your question about the cache throwing errors is to use the cache very defensively? Just put broad try/catch around both calls to the cache, and if anything fails, ignore those errors and make sure you send the e-mail.

comment:2 Changed 7 years ago by Andrew Wilkinson

That's a fair point, I'm just wary about using an except block which catches all exceptions. I guess in this case it's the lesser of two evils. I'll produce a better patch for the code and docs today.

Changed 7 years ago by Andrew Wilkinson

comment:3 Changed 7 years ago by Andrew Wilkinson

I've attached an improved patch including documentation. Tests will take me a bit longer to create as I don't think the current error emailing behavior is tested.

comment:4 Changed 7 years ago by Alex Gaynor

Triage Stage: UnreviewedDesign decision needed

comment:5 Changed 6 years ago by piramida

a year passed, design decision still needed.
this feature would make one nice django feature (error emails) actually useful for production without hacks. While the proposed solution is not perfect, a way to limit total error mail count to prevent mail overload on a busy site would be great.

comment:6 Changed 6 years ago by Simon Litchfield

Cc: Simon Litchfield added
Needs tests: set

See discussion here

Python logging support- way to go. Arecibo- sounds cool.

Before it even gets there though, I need my errors rate limited. Why bother email, Python logging, Arecibo, or anything else with thousands+ of errors a second when every single one of those errors are identical.

I like Andrew's approach, so I decided to improve on it. Such that if cache is down, we use local memory (limited, of course). Updated patch attached.

Changed 6 years ago by Simon Litchfield

Falls back to memory if cache isn't available

comment:7 Changed 6 years ago by Luke Plant

Resolution: wontfix
Status: newclosed

Is there any reason you couldn't do this within a custom logging handler, as suggested by Russell on that thread? Why do you need your errors rate limited at that level of the code, rather than at another level? Something is dealing with thousands of errors a second, and with your implementation it is forced to be in-process Python code, while with a custom logging handler even the duplicate detection could be passed off to another process. (or not, whatever you decide).

If we added this, we would also need to decide on the definition of 'duplicate' - you have decided that it is based solely on the traceback, which is fine. But I can see other possibilities - like wanting to know how many times the error is happening per user, for sites with fewer users for example.

To me it seems this is better done within a custom logging handler, so that it can be developed and improved independently of Django itself. That's what Russell clearly thought too, in the thread on django-developers, so I'm going to close WONTFIX.

comment:8 Changed 6 years ago by Simon Litchfield

Oops, sorry Luke. Somehow I failed to realise the logging stuff had hit trunk already- which, by the way, is superb!

Here's the rate limiter as a logging filter

Note: See TracTickets for help on using tickets.
Back to Top