Opened 3 years ago

Closed 3 years ago

#21952 closed Bug (fixed)

Deadlock while dispatching signals

Reported by: André Cruz Owned by: nobody
Component: Core (Other) Version: 1.6
Severity: Release blocker Keywords: signal deadlock
Cc: Florian Apolloner Triage Stage: Accepted
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

I'm testing an upgrade from Django 1.5.2 to 1.6.1, and came across this deadlock:

File "/servers/python-environments/discosite/local/lib/python2.7/site-packages/gevent/greenlet.py", line 327, in run
  result = self._run(*self.args, **self.kwargs)
File "/servers/python-environments/discosite/local/lib/python2.7/site-packages/django/http/response.py", line 308, in close
  signals.request_finished.send(sender=self._handler_class)
File "/servers/python-environments/discosite/local/lib/python2.7/site-packages/django/dispatch/dispatcher.py", line 184, in send
  for receiver in self._live_receivers(sender):
File "/servers/python-environments/discosite/local/lib/python2.7/site-packages/django/dispatch/dispatcher.py", line 245, in _live_receivers
  for (receiverkey, r_senderkey), receiver in self.receivers:
File "/servers/python-environments/discosite/local/lib/python2.7/site-packages/django/dispatch/saferef.py", line 121, in remove
  function( self )
File "/servers/python-environments/discosite/local/lib/python2.7/site-packages/django/dispatch/dispatcher.py", line 270, in _remove_receiver
  with self.lock:
File "/servers/python-environments/discosite/local/lib/python2.7/site-packages/gevent/hub.py", line 331, in switch
  return greenlet.switch(self)

While in _live_receivers we have acquired the lock, and we try to acquire it again in _remove_receiver which is called with the lock. I don't know if the Gevent interaction is triggering the issue here, but its locks are supposed to exhibit the same behaviour as the ones from stdlib. I'm using uWSGI with the Gevent loop engine so it spawns one greenlet per request. After this deadlock I ended up with thousands of greenlets in:

File "/servers/python-environments/discosite/local/lib/python2.7/site-packages/gevent/greenlet.py", line 327, in run
  result = self._run(*self.args, **self.kwargs)
File "/servers/python-environments/discosite/local/lib/python2.7/site-packages/django/core/handlers/wsgi.py", line 206, in __call__
  response = self.get_response(request)
File "/servers/python-environments/discosite/local/lib/python2.7/site-packages/django/core/handlers/base.py", line 90, in get_response
  response = middleware_method(request)
File "/servers/python-environments/discosite/local/lib/python2.7/site-packages/django/contrib/sessions/middleware.py", line 12, in process_request
  request.session = engine.SessionStore(session_key)
File "/servers/python-environments/discosite/local/lib/python2.7/site-packages/django/contrib/sessions/backends/cache.py", line 14, in __init__
  self._cache = get_cache(settings.SESSION_CACHE_ALIAS)
File "/servers/python-environments/discosite/local/lib/python2.7/site-packages/django/core/cache/__init__.py", line 135, in get_cache
  signals.request_finished.connect(cache.close)
File "/servers/python-environments/discosite/local/lib/python2.7/site-packages/django/dispatch/dispatcher.py", line 116, in connect
  with self.lock:
File "/servers/python-environments/discosite/local/lib/python2.7/site-packages/gevent/hub.py", line 331, in switch
  return greenlet.switch(self)

and

File "/servers/python-environments/discosite/local/lib/python2.7/site-packages/gevent/greenlet.py", line 327, in run
  result = self._run(*self.args, **self.kwargs)
File "/servers/python-environments/discosite/local/lib/python2.7/site-packages/django/http/response.py", line 308, in close
  signals.request_finished.send(sender=self._handler_class)
File "/servers/python-environments/discosite/local/lib/python2.7/site-packages/django/dispatch/dispatcher.py", line 184, in send
  for receiver in self._live_receivers(sender):
File "/servers/python-environments/discosite/local/lib/python2.7/site-packages/django/dispatch/dispatcher.py", line 242, in _live_receivers
  with self.lock:
File "/servers/python-environments/discosite/local/lib/python2.7/site-packages/gevent/hub.py", line 331, in switch
  return greenlet.switch(self)

Change History (15)

comment:1 Changed 3 years ago by Anssi Kääriäinen

I think we have to go with RLock (re-entrant lock) instead of Lock. What is happening here is that for some reason GC decides to clear the weakref version of the signal when looping over self.receivers in _live_receivers. If this is something that is caused by greenlets or something that could happen on pure CPython isn't clear to me.

The basic problem is that we are running somewhat complicated code (and worse yet, code that requires the signal's lock) as side effect of garbage collection. No good ideas of how to fix that...

This seems impossible to test, and RLocks should be safe here. So, maybe the easiest way is to just change self.lock = threading.Lock() to self.lock = threading.RLock().

comment:2 Changed 3 years ago by Anssi Kääriäinen

OK, this seems hairy. RLock doesn't work (I don't know why, but tests fail all over the place).

Still, the basic problem is that it seems that just looping over weakreffed receivers might end up cleaning the weakref, and that again will lead to trying to take self.lock. We must use some sort of concurrency control here, as otherwise we could end up caching receivers that have been removed. Maybe there is some lock-free algorithm for this...

I'll mark this accepted. I haven't reproduced the error, but in any case tacking a lock in an action that is fired by garbage collector seems like a seriously bad idea.

comment:3 Changed 3 years ago by Anssi Kääriäinen

Triage Stage: UnreviewedAccepted

comment:4 Changed 3 years ago by Anssi Kääriäinen

Proposed patch at https://github.com/akaariai/django/compare/ticket_21952

The approach is to not clean up dead weakref receivers during garbage collection, instead just set a flag that the signal has dead receivers and do the cleanup later on when self.receivers is accessed while holding self.lock.

comment:5 Changed 3 years ago by André Cruz

Has patch: set

Thanks for your work.

comment:6 Changed 3 years ago by loic84

@akaariai, looking good overall, I believe this is the right approach.

The only potential issue I can spot is that send() and send_robust() bypass _live_receivers() when a given sender is found in the cache.

comment:7 Changed 3 years ago by Anssi Kääriäinen

hmmh, yeah, it could be possible that we leak dead weakref receivers. We could just add a and not self._dead_receivers check in there.

comment:8 Changed 3 years ago by Florian Apolloner <florian@…>

Resolution: fixed
Status: newclosed

In c29d6f767691cceb9964c0d212e01281ac6721d3:

Fixed #21952 -- signals deadlock due to locking + weakref interaction

comment:9 Changed 3 years ago by Florian Apolloner <florian@…>

In 297c009cf22b14b0124b790303404ee06a289999:

Simplified signal code.

Refs #21952

comment:10 Changed 3 years ago by André Cruz

Will this be backported to 1.6?

comment:11 Changed 3 years ago by André Cruz

Ping.

comment:12 Changed 3 years ago by Florian Apolloner

Cc: Florian Apolloner added

I just tried backporting it, but it caused quite a few failures in the dispatch tests which I really don't want to debug again :). Since this only appears to affect gevent (which Django is completely untested on anyways -- although in theory this should affect CPython too) I'll not put any further work into this. If you want to give the backporting a shot and get the tests passing, I'll happily commit it.

comment:13 Changed 3 years ago by jason@…

We are experiencing this issue with uwsgi, django 1.6.2 and CPython. +1 for backport to 1.6

Frame 0x3a73f30, for file /opt/apk-signer/apk-signer/venv/lib/python2.6/site-packages/django/dispatch/dispatcher.py, line 270, in _remove_receiver (self=<Signal(use_caching=False, lock=<thread.lock at remote 0x7fc6c20bf6a8>, providing_args=set([]), sender_receivers_cache={}, receivers=[((38228048, 7195216), <weakref at remote 0x247db50>), (((49751120, 49833936), 7195216), <BoundMethodWeakref(weakFunc=<weakref at remote 0x2f87050>, deletionMethods=[<instancemethod at remote 0x2f6c460>], weakSelf=<weakref at remote 0x2f87158>, funcName='close', selfName='<django.core.cache.backends.memcached.MemcachedCache object at 0x2f72450>', key=(49751120, 49833936)) at remote 0x2f25750>), (((57907344, 49833936), 7195216), <BoundMethodWeakref(weakFunc=<weakref at remote 0x3735a48>, deletionMethods=[<instancemethod at remote 0x3731500>], weakSelf=<weakref at remote 0x3735af8>, funcName='close', selfName='<django.core.cache.backends.memcached.MemcachedCache object at 0x3739890>', key=(57907344, 49833936)) at remote 0x37398d0>...(truncated)

Full backtrace located here https://bug976729.bugzilla.mozilla.org/attachment.cgi?id=8391540

comment:14 Changed 3 years ago by André Cruz

Resolution: fixed
Status: closednew

Since it is reproducible in CPython, maybe 1.6 (current latest version) should be patched as well.

comment:15 Changed 3 years ago by Florian Apolloner

Resolution: fixed
Status: newclosed

Since the original report is fixed, this should stay as fixed. Please open a new ticket with a patch, as I said before I ran into to many issues while backporting it…

Note: See TracTickets for help on using tickets.
Back to Top