#21952 closed Bug (fixed)

Deadlock while dispatching signals

Reported by: edevil Owned by: nobody
Component: Core (Other) Version: 1.6
Severity: Release blocker Keywords: signal deadlock
Cc: apollo13 Triage Stage: Accepted
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

I'm testing an upgrade from Django 1.5.2 to 1.6.1, and came across this deadlock:

File "/servers/python-environments/discosite/local/lib/python2.7/site-packages/gevent/greenlet.py", line 327, in run
  result = self._run(*self.args, **self.kwargs)
File "/servers/python-environments/discosite/local/lib/python2.7/site-packages/django/http/response.py", line 308, in close
  signals.request_finished.send(sender=self._handler_class)
File "/servers/python-environments/discosite/local/lib/python2.7/site-packages/django/dispatch/dispatcher.py", line 184, in send
  for receiver in self._live_receivers(sender):
File "/servers/python-environments/discosite/local/lib/python2.7/site-packages/django/dispatch/dispatcher.py", line 245, in _live_receivers
  for (receiverkey, r_senderkey), receiver in self.receivers:
File "/servers/python-environments/discosite/local/lib/python2.7/site-packages/django/dispatch/saferef.py", line 121, in remove
  function( self )
File "/servers/python-environments/discosite/local/lib/python2.7/site-packages/django/dispatch/dispatcher.py", line 270, in _remove_receiver
  with self.lock:
File "/servers/python-environments/discosite/local/lib/python2.7/site-packages/gevent/hub.py", line 331, in switch
  return greenlet.switch(self)

While in _live_receivers we have acquired the lock, and we try to acquire it again in _remove_receiver which is called with the lock. I don't know if the Gevent interaction is triggering the issue here, but its locks are supposed to exhibit the same behaviour as the ones from stdlib. I'm using uWSGI with the Gevent loop engine so it spawns one greenlet per request. After this deadlock I ended up with thousands of greenlets in:

File "/servers/python-environments/discosite/local/lib/python2.7/site-packages/gevent/greenlet.py", line 327, in run
  result = self._run(*self.args, **self.kwargs)
File "/servers/python-environments/discosite/local/lib/python2.7/site-packages/django/core/handlers/wsgi.py", line 206, in __call__
  response = self.get_response(request)
File "/servers/python-environments/discosite/local/lib/python2.7/site-packages/django/core/handlers/base.py", line 90, in get_response
  response = middleware_method(request)
File "/servers/python-environments/discosite/local/lib/python2.7/site-packages/django/contrib/sessions/middleware.py", line 12, in process_request
  request.session = engine.SessionStore(session_key)
File "/servers/python-environments/discosite/local/lib/python2.7/site-packages/django/contrib/sessions/backends/cache.py", line 14, in __init__
  self._cache = get_cache(settings.SESSION_CACHE_ALIAS)
File "/servers/python-environments/discosite/local/lib/python2.7/site-packages/django/core/cache/__init__.py", line 135, in get_cache
  signals.request_finished.connect(cache.close)
File "/servers/python-environments/discosite/local/lib/python2.7/site-packages/django/dispatch/dispatcher.py", line 116, in connect
  with self.lock:
File "/servers/python-environments/discosite/local/lib/python2.7/site-packages/gevent/hub.py", line 331, in switch
  return greenlet.switch(self)

and

File "/servers/python-environments/discosite/local/lib/python2.7/site-packages/gevent/greenlet.py", line 327, in run
  result = self._run(*self.args, **self.kwargs)
File "/servers/python-environments/discosite/local/lib/python2.7/site-packages/django/http/response.py", line 308, in close
  signals.request_finished.send(sender=self._handler_class)
File "/servers/python-environments/discosite/local/lib/python2.7/site-packages/django/dispatch/dispatcher.py", line 184, in send
  for receiver in self._live_receivers(sender):
File "/servers/python-environments/discosite/local/lib/python2.7/site-packages/django/dispatch/dispatcher.py", line 242, in _live_receivers
  with self.lock:
File "/servers/python-environments/discosite/local/lib/python2.7/site-packages/gevent/hub.py", line 331, in switch
  return greenlet.switch(self)

Change History (15)

comment:1 Changed 17 months ago by akaariai

  • Needs documentation unset
  • Needs tests unset
  • Patch needs improvement unset

I think we have to go with RLock (re-entrant lock) instead of Lock. What is happening here is that for some reason GC decides to clear the weakref version of the signal when looping over self.receivers in _live_receivers. If this is something that is caused by greenlets or something that could happen on pure CPython isn't clear to me.

The basic problem is that we are running somewhat complicated code (and worse yet, code that requires the signal's lock) as side effect of garbage collection. No good ideas of how to fix that...

This seems impossible to test, and RLocks should be safe here. So, maybe the easiest way is to just change self.lock = threading.Lock() to self.lock = threading.RLock().

comment:2 Changed 17 months ago by akaariai

OK, this seems hairy. RLock doesn't work (I don't know why, but tests fail all over the place).

Still, the basic problem is that it seems that just looping over weakreffed receivers might end up cleaning the weakref, and that again will lead to trying to take self.lock. We must use some sort of concurrency control here, as otherwise we could end up caching receivers that have been removed. Maybe there is some lock-free algorithm for this...

I'll mark this accepted. I haven't reproduced the error, but in any case tacking a lock in an action that is fired by garbage collector seems like a seriously bad idea.

comment:3 Changed 17 months ago by akaariai

  • Triage Stage changed from Unreviewed to Accepted

comment:4 Changed 17 months ago by akaariai

Proposed patch at https://github.com/akaariai/django/compare/ticket_21952

The approach is to not clean up dead weakref receivers during garbage collection, instead just set a flag that the signal has dead receivers and do the cleanup later on when self.receivers is accessed while holding self.lock.

comment:5 Changed 17 months ago by edevil

  • Has patch set

Thanks for your work.

comment:6 Changed 17 months ago by loic84

@akaariai, looking good overall, I believe this is the right approach.

The only potential issue I can spot is that send() and send_robust() bypass _live_receivers() when a given sender is found in the cache.

comment:7 Changed 17 months ago by akaariai

hmmh, yeah, it could be possible that we leak dead weakref receivers. We could just add a and not self._dead_receivers check in there.

comment:8 Changed 17 months ago by Florian Apolloner <florian@…>

  • Resolution set to fixed
  • Status changed from new to closed

In c29d6f767691cceb9964c0d212e01281ac6721d3:

Fixed #21952 -- signals deadlock due to locking + weakref interaction

comment:9 Changed 17 months ago by Florian Apolloner <florian@…>

In 297c009cf22b14b0124b790303404ee06a289999:

Simplified signal code.

Refs #21952

comment:10 Changed 17 months ago by edevil

Will this be backported to 1.6?

comment:11 Changed 17 months ago by edevil

Ping.

comment:12 Changed 17 months ago by apollo13

  • Cc apollo13 added

I just tried backporting it, but it caused quite a few failures in the dispatch tests which I really don't want to debug again :). Since this only appears to affect gevent (which Django is completely untested on anyways -- although in theory this should affect CPython too) I'll not put any further work into this. If you want to give the backporting a shot and get the tests passing, I'll happily commit it.

comment:13 Changed 16 months ago by jason@…

We are experiencing this issue with uwsgi, django 1.6.2 and CPython. +1 for backport to 1.6

Frame 0x3a73f30, for file /opt/apk-signer/apk-signer/venv/lib/python2.6/site-packages/django/dispatch/dispatcher.py, line 270, in _remove_receiver (self=<Signal(use_caching=False, lock=<thread.lock at remote 0x7fc6c20bf6a8>, providing_args=set([]), sender_receivers_cache={}, receivers=[((38228048, 7195216), <weakref at remote 0x247db50>), (((49751120, 49833936), 7195216), <BoundMethodWeakref(weakFunc=<weakref at remote 0x2f87050>, deletionMethods=[<instancemethod at remote 0x2f6c460>], weakSelf=<weakref at remote 0x2f87158>, funcName='close', selfName='<django.core.cache.backends.memcached.MemcachedCache object at 0x2f72450>', key=(49751120, 49833936)) at remote 0x2f25750>), (((57907344, 49833936), 7195216), <BoundMethodWeakref(weakFunc=<weakref at remote 0x3735a48>, deletionMethods=[<instancemethod at remote 0x3731500>], weakSelf=<weakref at remote 0x3735af8>, funcName='close', selfName='<django.core.cache.backends.memcached.MemcachedCache object at 0x3739890>', key=(57907344, 49833936)) at remote 0x37398d0>...(truncated)

Full backtrace located here https://bug976729.bugzilla.mozilla.org/attachment.cgi?id=8391540

comment:14 Changed 16 months ago by edevil

  • Resolution fixed deleted
  • Status changed from closed to new

Since it is reproducible in CPython, maybe 1.6 (current latest version) should be patched as well.

comment:15 Changed 16 months ago by apollo13

  • Resolution set to fixed
  • Status changed from new to closed

Since the original report is fixed, this should stay as fixed. Please open a new ticket with a patch, as I said before I ran into to many issues while backporting it…

Note: See TracTickets for help on using tickets.
Back to Top