Opened 13 years ago

Closed 13 years ago

Last modified 9 years ago

#18251 closed Bug (fixed)

multithreading deadlock in django.models.loading.get_apps

Reported by: harm Owned by: nobody
Component: Core (Other) Version: 1.3
Severity: Normal Keywords:
Cc: hjwp2@… Triage Stage: Ready for checkin
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

On a production site we have encountered deadlocks, which after fierce investigation pointed to the django internals.

We use a deployment: apache/mod_wsgi
apache config:

WSGIDaemonProcess <name_here> processes=1 threads=10 umask=0002 display-name=%{GROUP} stack-size=524288

details of deadlock

During the bootstrap of django, so the first requests it handles, a classic ABBA deadlock can occur.
This results in that django never boots, and that NO requests are handled. Just restarting apache works, (hoping that the same deadlock does not appear again).

description of deadlock

thread A                                                   thread B

 request foo      
 import foo.views.py 
 import foo.models.py                                      request 'bar'
                                                           import bar.models.py (acquires import lock B)
 bootstraps django
 calls django.models.loading.get_apps()                      
    self.write_lock.acquire()  (lock A!)
    load app apps
    . . .                                                  django.models.loading.get_apps()
    import bar.models.py  (takes import lock. lock B!)     self.write_lock.acquire()  (lock A!)
    
    <Blocked>                                              <Blocked>
    

NB: this exact deadlock was actually seen in stracktraces. This is not a guess.

How to reproduce

its hard to reproduce, you need high volume traffic + multithreaded deployement + a big enough application that different urls trigger compelete different code paths in your project.

I am not able to produce a simple test project that demonstrates this problem

We had a 100% reproducible setup, where the site would lock up (almost) every time when we touched wsgi.py during peak hours of the day.

  1. When you have a projects that is not multithreaded deployed -> this problem does not happen
  2. When your application has a limited set of urls/views so that all initial calls follow same code paths -> this problem does not happen
  3. When application has low traffic (the very first request can finish, without other requests beeing made) -> this problem does not happen.

Workaround

We use the attached DjangoWSGIHandler wrapper as a workaround. That solved the problem 100 % for us.
Its a drop-in replacement for DjangoWSGIHandler and effectively disables multithreading the first few request (allowing django to bootstrap properly), and only afther the first requests act multitheading again.

Please consider if this workaround should be applied to django itself. (Put the Lock HIGHER the chain, to the wsgi handler level)

Fix

Some init locking should be adding to DjangoWSGIHandler around the complete request.
django.core.handlers.wsgi.WSGIHandler() has such lock around middleware loading.

   if self._request_middleware is None:
               self.initLock.acquire()

Something similar should be around the complete request, not just middleware loading.
But only the very first request.

Attachments (2)

workaround.py (3.0 KB ) - added by harm 13 years ago.
workaround (not a fix) wrapper around DjangoWSGIHandler
threading_lock.tar.gz (4.9 KB ) - added by Anssi Kääriäinen 13 years ago.

Download all attachments as: .zip

Change History (21)

by harm, 13 years ago

Attachment: workaround.py added

workaround (not a fix) wrapper around DjangoWSGIHandler

comment:1 by Jannis Leidel, 13 years ago

Component: UncategorizedCore (Other)
Triage Stage: UnreviewedAccepted
Type: UncategorizedBug

comment:2 by Anssi Kääriäinen, 13 years ago

Could this work: in WSGIHandler, when serving request do:

from django.db.models.loading import app_cache_ready, get_apps
if not app_cache_ready():
    get_apps()

I didn't put any lock in there - I think the get_apps call should be safe as it pretty much immediately acquires the write_lock of app_cache. You don't even need the if not app_cache_ready(), as get_apps() returns immediately if apps are already loaded. It reads better to me that way..

Can you report if this works?

comment:3 by harm, 13 years ago

Hm. smart. Yes, that would work.
I cannot test it, as the system that reproduced this problem is our production system.

Note: you do need something like if not app_cache_ready(), as get_apps() actually does a load of work (even when called later), it returns the apps list, you don't want that every call. (at least in django 1.3)

micro optimization:
Als this is really really in a hotspot of the code, storing this state locally in WSGIHandler, might be faster than calling app_cache_ready() every request.
so something like this in WSGIHandler :

if not self._apps_initialized:
    get_apps()    # if entered more that once thats ok, as long as we dont hit this every request
    self._apps_initialized = True
Last edited 13 years ago by harm (previous) (diff)

comment:4 by Anssi Kääriäinen, 13 years ago

Good point about the additional work done by get_apps().

I don't believe self._apps_initialized is needed, a method call which returns self.loaded should be light enough. It is hard to imagine a situation where one method call is crucial to performance.

Maybe the best way would be to just make a load() method in apps, which returns immediately if already loaded.

Another way would be to first issue imp.acquire_lock() before taking the write_lock in app-loading. This should probably be done in any case. Although explicitly initializing Django's app-state feels right to me.

I checked the app-loading refactor work (#3591) and it seems it does not have any solution to this problem.

It would be very good to be able to write a test case for this, but it seems hard...

comment:5 by harm, 13 years ago

I'm not sure aquiring the imp.acquire_lock() would help in this case. Aren't we too late then ? (the lock B would already be taken bij thread B in the sequance diagram) leaving the same ABBA deadlock. I think this needs to be done in the wsgihandler.

comment:6 by Anssi Kääriäinen, 13 years ago

I think it is safe. The reason is that if B has the import lock, then A's imp.acquire_lock() will block. This is fine, as B can still continue working. If B goes into loading._populate, it has already the import lock, and can bypass A, take the write_lock, and proceed with app loading. If B just finishes its import, it will release the import lock, A can continue and again all fine.

Actually, to me it seems the write_lock is redundant in this case.

comment:7 by Anssi Kääriäinen, 13 years ago

To be precise: the import lock is always acquired before the write lock. Thus there can be no deadlocks. Also, the import lock is always held when the write lock is held. Thus, the write lock isn't needed.

comment:8 by harm, 13 years ago

I can't follow you.. If that is the case then there would not have been a problem in the first place, no ?

I'm not 100% sure how the import locks works, but it seemed to me, that when doing an import of a file foo.py you get an other lock than an other thread importing bar.py, no ?)

I mean you can't fix a ABBA deadlock by adding a lock C between A and B, only around it.

Last edited 13 years ago by harm (previous) (diff)

comment:9 by Anssi Kääriäinen, 13 years ago

All of the above is assuming there is just a single import lock. The import documentation from Python's docs suggests this, and this test case confirms it:

from threading import Thread
import time
import imp
class thread1(Thread):
    def run(self, *args, **kwargs):
        imp.acquire_lock()
        print "locked"
        time.sleep(10)
        print "unlocking"
        imp.release_lock()

class thread2(Thread):
    def run(self, *args, **kwargs):
        time.sleep(5)
        print "trying to import"
        import sys
        print "import done"
t1 = thread1()
t1.start()
t2 = thread2()
t2.start()

So, what is happening in your case is that apps are loaded from one thread outside of importing. Maybe as a side product of executing a query or something like that. The app loading will then import multiple other modules, and importing these modules needs the import lock. While this is happening, another thread imports a module, and as a side product of this the apps are loaded.

In this situation the first thread does not have the import lock when it enters app loading, it however needs it later on for initializing the applications. The second thread already has the import lock when it tries to load the applications. It tries to take the write_lock, and now you have a deadlock.

I will try to write a small test-application confirming the above, as at the moment the above is just speculation.

comment:10 by Anssi Kääriäinen, 13 years ago

Attached is a test project showing the problem. It deadlocks reliably on my machine if the writelock is used in app loading, but succeeds if it is changed to imp.acquire_lock() / imp.release_lock().

Unpack the project, run DJANGO_SETTINGS_MODULE=threading_lock.settings python locker.py

Last edited 13 years ago by Anssi Kääriäinen (previous) (diff)

by Anssi Kääriäinen, 13 years ago

Attachment: threading_lock.tar.gz added

comment:11 by Anssi Kääriäinen, 13 years ago

Triage Stage: AcceptedReady for checkin

A RFC patch in pull request https://github.com/django/django/pull/57.

As this is somewhat complex issue and getting this wrong would have serious consequences I hope to get a review from other core committers before commit.

When this gets committed, a note of this issue must be added to #3591, as there are no tests and the current patch in that ticket would reintroduce this issue.

comment:12 by harm, 13 years ago

patch looks sane. I think this is a clean an correct fix.

I can't validate your testcase though. It seems the attachment is corrupt ?

harm@dev3:~$ wget -q --no-check-certificate https://code.djangoproject.com/attachment/ticket/18251/threading_lock.tar.gz
harm@dev3:~$ tar vxzf threading_lock.tar.gz 

gzip: stdin: not in gzip format
tar: Child returned status 1
tar: Exiting with failure status due to previous errors

Last edited 13 years ago by harm (previous) (diff)

comment:13 by Anssi Kääriäinen, 13 years ago

It seems you can't download the file like that - I tried that and when I looked into the file it start with this:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
...

downloading it through the direct download link works fine.

comment:14 by harm, 13 years ago

Ah... right.
Good testcase. I can't think of a way how to put it in a unittest though...

comment:15 by Anssi Kääriäinen, 13 years ago

I will commit this one soon. I haven't gotten a review for this, but I am pretty confident this is in fact the right approach.

comment:16 by Anssi Kääriäinen <akaariai@…>, 13 years ago

Resolution: fixed
Status: newclosed

In [2b9fb2e6443c04e4415b17083d727bd80047b6e5]:

Fixed #18251 -- Removed a deadlock possibility in apploading

Thanks to harm for the report and comments.

comment:17 by Anssi Kääriäinen, 13 years ago

Im not going to backpatch this, at least not to 1.4.1. To me the risk that something subtle breaks is too great compared to the likelihood of users hitting this situation. If it turns out this deadlock is more common than it currently seems then lets revisit the backpatching.

comment:18 by Aymeric Augustin <aymeric.augustin@…>, 11 years ago

In d674fe6dee16735dd2670243153326806b7e6cb0:

Used a regular lock for app registry population.

Since the app registry is always populated before the first request is
processed, the situation described in #18251 for the old app cache
cannot happen any more.

Refs #18251, #21628.

comment:19 by Harry Percival, 9 years ago

Cc: hjwp2@… added
Note: See TracTickets for help on using tickets.
Back to Top