Opened 11 years ago
Closed 11 years ago
#21952 closed Bug (fixed)
Deadlock while dispatching signals
Reported by: | André Cruz | Owned by: | nobody |
---|---|---|---|
Component: | Core (Other) | Version: | 1.6 |
Severity: | Release blocker | Keywords: | signal deadlock |
Cc: | Florian Apolloner | Triage Stage: | Accepted |
Has patch: | yes | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
I'm testing an upgrade from Django 1.5.2 to 1.6.1, and came across this deadlock:
File "/servers/python-environments/discosite/local/lib/python2.7/site-packages/gevent/greenlet.py", line 327, in run result = self._run(*self.args, **self.kwargs) File "/servers/python-environments/discosite/local/lib/python2.7/site-packages/django/http/response.py", line 308, in close signals.request_finished.send(sender=self._handler_class) File "/servers/python-environments/discosite/local/lib/python2.7/site-packages/django/dispatch/dispatcher.py", line 184, in send for receiver in self._live_receivers(sender): File "/servers/python-environments/discosite/local/lib/python2.7/site-packages/django/dispatch/dispatcher.py", line 245, in _live_receivers for (receiverkey, r_senderkey), receiver in self.receivers: File "/servers/python-environments/discosite/local/lib/python2.7/site-packages/django/dispatch/saferef.py", line 121, in remove function( self ) File "/servers/python-environments/discosite/local/lib/python2.7/site-packages/django/dispatch/dispatcher.py", line 270, in _remove_receiver with self.lock: File "/servers/python-environments/discosite/local/lib/python2.7/site-packages/gevent/hub.py", line 331, in switch return greenlet.switch(self)
In _live_receivers we have already acquired the lock, and then we try to acquire it again in _remove_receiver, which is called while the lock is still held. I don't know whether the Gevent interaction is what triggers the issue here, but its locks are supposed to behave the same as the stdlib ones. I'm using uWSGI with the Gevent loop engine, so it spawns one greenlet per request. After this deadlock I ended up with thousands of greenlets stuck in:
File "/servers/python-environments/discosite/local/lib/python2.7/site-packages/gevent/greenlet.py", line 327, in run result = self._run(*self.args, **self.kwargs) File "/servers/python-environments/discosite/local/lib/python2.7/site-packages/django/core/handlers/wsgi.py", line 206, in __call__ response = self.get_response(request) File "/servers/python-environments/discosite/local/lib/python2.7/site-packages/django/core/handlers/base.py", line 90, in get_response response = middleware_method(request) File "/servers/python-environments/discosite/local/lib/python2.7/site-packages/django/contrib/sessions/middleware.py", line 12, in process_request request.session = engine.SessionStore(session_key) File "/servers/python-environments/discosite/local/lib/python2.7/site-packages/django/contrib/sessions/backends/cache.py", line 14, in __init__ self._cache = get_cache(settings.SESSION_CACHE_ALIAS) File "/servers/python-environments/discosite/local/lib/python2.7/site-packages/django/core/cache/__init__.py", line 135, in get_cache signals.request_finished.connect(cache.close) File "/servers/python-environments/discosite/local/lib/python2.7/site-packages/django/dispatch/dispatcher.py", line 116, in connect with self.lock: File "/servers/python-environments/discosite/local/lib/python2.7/site-packages/gevent/hub.py", line 331, in switch return greenlet.switch(self)
and
File "/servers/python-environments/discosite/local/lib/python2.7/site-packages/gevent/greenlet.py", line 327, in run result = self._run(*self.args, **self.kwargs) File "/servers/python-environments/discosite/local/lib/python2.7/site-packages/django/http/response.py", line 308, in close signals.request_finished.send(sender=self._handler_class) File "/servers/python-environments/discosite/local/lib/python2.7/site-packages/django/dispatch/dispatcher.py", line 184, in send for receiver in self._live_receivers(sender): File "/servers/python-environments/discosite/local/lib/python2.7/site-packages/django/dispatch/dispatcher.py", line 242, in _live_receivers with self.lock: File "/servers/python-environments/discosite/local/lib/python2.7/site-packages/gevent/hub.py", line 331, in switch return greenlet.switch(self)
Change History (15)
comment:1 by , 11 years ago
I think we have to go with RLock (a re-entrant lock) instead of Lock. What is happening here is that for some reason the GC decides to clear a weakrefed receiver while we are looping over self.receivers in _live_receivers. Whether this is caused by greenlets or is something that could also happen on pure CPython isn't clear to me.
The basic problem is that we are running somewhat complicated code (and worse yet, code that requires the signal's lock) as a side effect of garbage collection. No good ideas on how to fix that...
This seems impossible to test, and RLocks should be safe here. So maybe the easiest way is to just change self.lock = threading.Lock() to self.lock = threading.RLock().
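For what it's worth, a minimal illustration (not from the ticket) of the difference being proposed here: a threading.RLock can be re-acquired by the thread that already holds it, while a plain threading.Lock cannot.

```python
import threading

def can_reacquire(lock):
    # Simulate the callback path: the same thread that already holds the
    # lock tries to take it again (non-blocking, so the demo terminates).
    with lock:
        got = lock.acquire(False)
        if got:
            lock.release()
        return got

print(can_reacquire(threading.Lock()))   # False: a second blocking acquire would deadlock
print(can_reacquire(threading.RLock()))  # True: RLock is re-entrant for the owning thread
```

Note that re-entrancy only helps when the callback runs in the same thread (or greenlet) that holds the lock; comment 2 below reports that simply switching to RLock made other dispatch tests fail.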
comment:2 by , 11 years ago
OK, this seems hairy. RLock doesn't work (I don't know why, but tests fail all over the place).
Still, the basic problem is that just looping over weakreffed receivers can end up clearing a weakref, and that in turn leads to trying to take self.lock again. We must use some sort of concurrency control here, as otherwise we could end up caching receivers that have already been removed. Maybe there is some lock-free algorithm for this...
I'll mark this accepted. I haven't reproduced the error, but in any case taking a lock in an action that is fired by the garbage collector seems like a seriously bad idea.
comment:3 by , 11 years ago
Triage Stage: | Unreviewed → Accepted |
---|
comment:4 by , 11 years ago
Proposed patch at https://github.com/akaariai/django/compare/ticket_21952
The approach is to not clean up dead weakref receivers during garbage collection; instead, just set a flag noting that the signal has dead receivers and do the cleanup later, when self.receivers is accessed while holding self.lock.
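A rough sketch of that pattern (illustrative only; this simplification is not copied from the linked patch, and attribute names such as _dead_receivers follow the discussion in this ticket):

```python
import threading
import weakref

class Signal(object):
    # Simplified sketch of the deferred-cleanup idea, not the actual patch.
    def __init__(self, use_caching=False):
        self.receivers = []               # (lookup_key, receiver_or_weakref) pairs
        self.lock = threading.Lock()
        self.use_caching = use_caching
        self.sender_receivers_cache = {}
        self._dead_receivers = False      # flag set from the weakref callback

    def _remove_receiver(self, receiver=None):
        # Runs as a weakref callback during garbage collection: touch neither
        # self.receivers nor self.lock here, just note that cleanup is pending.
        self._dead_receivers = True

    def _clear_dead_receivers(self):
        # Only called while self.lock is already held.
        if self._dead_receivers:
            self._dead_receivers = False
            self.receivers = [
                (key, r) for key, r in self.receivers
                if not (isinstance(r, weakref.ref) and r() is None)
            ]

    def _live_receivers(self, sender):
        with self.lock:
            self._clear_dead_receivers()
            live = []
            for key, r in self.receivers:
                receiver = r() if isinstance(r, weakref.ref) else r
                if receiver is not None:
                    live.append(receiver)
            return live
```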
comment:6 by , 11 years ago
@akaariai, looking good overall, I believe this is the right approach.
The only potential issue I can spot is that send() and send_robust() bypass _live_receivers() when a given sender is found in the cache.
comment:7 by , 11 years ago
Hmmh, yeah, it could be possible that we leak dead weakref receivers there. We could just add an "and not self._dead_receivers" check to that path.
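In other words, extending the sketch from comment 4 (again a simplification, not the committed code), the cached fast path would only be trusted when no dead receivers are pending:

```python
    def send(self, sender, **named):
        # Part of the illustrative Signal sketch above. Reuse the cached
        # receiver list only if no weakref has died since it was built;
        # otherwise go through _live_receivers(), which prunes dead entries
        # while holding self.lock.
        receivers = None
        if self.use_caching and not self._dead_receivers:
            receivers = self.sender_receivers_cache.get(sender)
        if receivers is None:
            receivers = self._live_receivers(sender)
            if self.use_caching:
                self.sender_receivers_cache[sender] = receivers
        return [(receiver, receiver(signal=self, sender=sender, **named))
                for receiver in receivers]
```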
comment:8 by , 11 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
comment:12 by , 11 years ago
Cc: | added |
---|
I just tried backporting it, but it caused quite a few failures in the dispatch tests, which I really don't want to debug again :). Since this only appears to affect gevent (which Django is completely untested on anyway -- although in theory it could affect CPython too), I won't put any further work into this. If you want to give the backport a shot and get the tests passing, I'll happily commit it.
comment:13 by , 11 years ago
We are experiencing this issue with uWSGI, Django 1.6.2 and CPython. +1 for a backport to 1.6.
Frame 0x3a73f30, for file /opt/apk-signer/apk-signer/venv/lib/python2.6/site-packages/django/dispatch/dispatcher.py, line 270, in _remove_receiver (self=<Signal(use_caching=False, lock=<thread.lock at remote 0x7fc6c20bf6a8>, providing_args=set([]), sender_receivers_cache={}, receivers=[((38228048, 7195216), <weakref at remote 0x247db50>), (((49751120, 49833936), 7195216), <BoundMethodWeakref(weakFunc=<weakref at remote 0x2f87050>, deletionMethods=[<instancemethod at remote 0x2f6c460>], weakSelf=<weakref at remote 0x2f87158>, funcName='close', selfName='<django.core.cache.backends.memcached.MemcachedCache object at 0x2f72450>', key=(49751120, 49833936)) at remote 0x2f25750>), (((57907344, 49833936), 7195216), <BoundMethodWeakref(weakFunc=<weakref at remote 0x3735a48>, deletionMethods=[<instancemethod at remote 0x3731500>], weakSelf=<weakref at remote 0x3735af8>, funcName='close', selfName='<django.core.cache.backends.memcached.MemcachedCache object at 0x3739890>', key=(57907344, 49833936)) at remote 0x37398d0>...(truncated)
Full backtrace located here https://bug976729.bugzilla.mozilla.org/attachment.cgi?id=8391540
comment:14 by , 11 years ago
Resolution: | fixed |
---|---|
Status: | closed → new |
Since it is reproducible in CPython, maybe 1.6 (the current latest version) should be patched as well.
comment:15 by , 11 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
Since the original report is fixed, this should stay as fixed. Please open a new ticket with a patch; as I said before, I ran into too many issues while backporting it…