Code

Opened 2 years ago

Closed 22 months ago

Last modified 3 months ago

#18251 closed Bug (fixed)

multithreading deadlock in django.models.loading.get_apps

Reported by: harm Owned by: nobody
Component: Core (Other) Version: 1.3
Severity: Normal Keywords:
Cc: Triage Stage: Ready for checkin
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

On a production site we have encountered deadlocks, which after fierce investigation pointed to the django internals.

We use a deployment: apache/mod_wsgi
apache config:

WSGIDaemonProcess <name_here> processes=1 threads=10 umask=0002 display-name=%{GROUP} stack-size=524288

details of deadlock

During the bootstrap of django, so the first requests it handles, a classic ABBA deadlock can occur.
This results in that django never boots, and that NO requests are handled. Just restarting apache works, (hoping that the same deadlock does not appear again).

description of deadlock

thread A                                                   thread B

 request foo      
 import foo.views.py 
 import foo.models.py                                      request 'bar'
                                                           import bar.models.py (acquires import lock B)
 bootstraps django
 calls django.models.loading.get_apps()                      
    self.write_lock.acquire()  (lock A!)
    load app apps
    . . .                                                  django.models.loading.get_apps()
    import bar.models.py  (takes import lock. lock B!)     self.write_lock.acquire()  (lock A!)
    
    <Blocked>                                              <Blocked>
    

NB: this exact deadlock was actually seen in stracktraces. This is not a guess.

How to reproduce

its hard to reproduce, you need high volume traffic + multithreaded deployement + a big enough application that different urls trigger compelete different code paths in your project.

I am not able to produce a simple test project that demonstrates this problem

We had a 100% reproducible setup, where the site would lock up (almost) every time when we touched wsgi.py during peak hours of the day.

  1. When you have a projects that is not multithreaded deployed -> this problem does not happen
  2. When your application has a limited set of urls/views so that all initial calls follow same code paths -> this problem does not happen
  3. When application has low traffic (the very first request can finish, without other requests beeing made) -> this problem does not happen.

Workaround

We use the attached DjangoWSGIHandler wrapper as a workaround. That solved the problem 100 % for us.
Its a drop-in replacement for DjangoWSGIHandler and effectively disables multithreading the first few request (allowing django to bootstrap properly), and only afther the first requests act multitheading again.

Please consider if this workaround should be applied to django itself. (Put the Lock HIGHER the chain, to the wsgi handler level)

Fix

Some init locking should be adding to DjangoWSGIHandler around the complete request.
django.core.handlers.wsgi.WSGIHandler() has such lock around middleware loading.

   if self._request_middleware is None:
               self.initLock.acquire()

Something similar should be around the complete request, not just middleware loading.
But only the very first request.

Attachments (2)

workaround.py (3.0 KB) - added by harm 2 years ago.
workaround (not a fix) wrapper around DjangoWSGIHandler
threading_lock.tar.gz (4.9 KB) - added by akaariai 2 years ago.

Download all attachments as: .zip

Change History (20)

Changed 2 years ago by harm

workaround (not a fix) wrapper around DjangoWSGIHandler

comment:1 Changed 2 years ago by jezdez

  • Component changed from Uncategorized to Core (Other)
  • Needs documentation unset
  • Needs tests unset
  • Patch needs improvement unset
  • Triage Stage changed from Unreviewed to Accepted
  • Type changed from Uncategorized to Bug

comment:2 Changed 2 years ago by akaariai

Could this work: in WSGIHandler, when serving request do:

from django.db.models.loading import app_cache_ready, get_apps
if not app_cache_ready():
    get_apps()

I didn't put any lock in there - I think the get_apps call should be safe as it pretty much immediately acquires the write_lock of app_cache. You don't even need the if not app_cache_ready(), as get_apps() returns immediately if apps are already loaded. It reads better to me that way..

Can you report if this works?

comment:3 Changed 2 years ago by harm

Hm. smart. Yes, that would work.
I cannot test it, as the system that reproduced this problem is our production system.

Note: you do need something like if not app_cache_ready(), as get_apps() actually does a load of work (even when called later), it returns the apps list, you don't want that every call. (at least in django 1.3)

micro optimization:
Als this is really really in a hotspot of the code, storing this state locally in WSGIHandler, might be faster than calling app_cache_ready() every request.
so an

if not self._apps_initialized:
    get_apps()    # if entered more that once thats ok, as long as we dont hit this every request
    self._apps_initialized = True
Version 0, edited 2 years ago by harm (next)

comment:4 Changed 2 years ago by akaariai

Good point about the additional work done by get_apps().

I don't believe self._apps_initialized is needed, a method call which returns self.loaded should be light enough. It is hard to imagine a situation where one method call is crucial to performance.

Maybe the best way would be to just make a load() method in apps, which returns immediately if already loaded.

Another way would be to first issue imp.acquire_lock() before taking the write_lock in app-loading. This should probably be done in any case. Although explicitly initializing Django's app-state feels right to me.

I checked the app-loading refactor work (#3591) and it seems it does not have any solution to this problem.

It would be very good to be able to write a test case for this, but it seems hard...

comment:5 Changed 2 years ago by harm

I'm not sure aquiring the imp.acquire_lock() would help in this case. Aren't we too late then ? (the lock B would already be taken bij thread B in the sequance diagram) leaving the same ABBA deadlock. I think this needs to be done in the wsgihandler.

comment:6 Changed 2 years ago by akaariai

I think it is safe. The reason is that if B has the import lock, then A's imp.acquire_lock() will block. This is fine, as B can still continue working. If B goes into loading._populate, it has already the import lock, and can bypass A, take the write_lock, and proceed with app loading. If B just finishes its import, it will release the import lock, A can continue and again all fine.

Actually, to me it seems the write_lock is redundant in this case.

comment:7 Changed 2 years ago by akaariai

To be precise: the import lock is always acquired before the write lock. Thus there can be no deadlocks. Also, the import lock is always held when the write lock is held. Thus, the write lock isn't needed.

comment:8 Changed 2 years ago by harm

I can't follow you.. If that is the case then there would not have been a problem in the first place, no ?

I'm not 100% sure how the import locks works, but it seemed to me, that when doing an import of a file foo.py you get an other lock than an other thread importing bar.py, no ?)

I mean you can't fix a ABBA deadlock by adding a lock C between A and B, only around it.

Last edited 2 years ago by harm (previous) (diff)

comment:9 Changed 2 years ago by akaariai

All of the above is assuming there is just a single import lock. The import documentation from Python's docs suggests this, and this test case confirms it:

from threading import Thread
import time
import imp
class thread1(Thread):
    def run(self, *args, **kwargs):
        imp.acquire_lock()
        print "locked"
        time.sleep(10)
        print "unlocking"
        imp.release_lock()

class thread2(Thread):
    def run(self, *args, **kwargs):
        time.sleep(5)
        print "trying to import"
        import sys
        print "import done"
t1 = thread1()
t1.start()
t2 = thread2()
t2.start()

So, what is happening in your case is that apps are loaded from one thread outside of importing. Maybe as a side product of executing a query or something like that. The app loading will then import multiple other modules, and importing these modules needs the import lock. While this is happening, another thread imports a module, and as a side product of this the apps are loaded.

In this situation the first thread does not have the import lock when it enters app loading, it however needs it later on for initializing the applications. The second thread already has the import lock when it tries to load the applications. It tries to take the write_lock, and now you have a deadlock.

I will try to write a small test-application confirming the above, as at the moment the above is just speculation.

comment:10 Changed 2 years ago by akaariai

Attached is a test project showing the problem. It deadlocks reliably on my machine if the writelock is used in app loading, but succeeds if it is changed to imp.acquire_lock() / imp.release_lock().

Unpack the project, run DJANGO_SETTINGS_MODULE=threading_lock.settings python locker.py

Last edited 2 years ago by akaariai (previous) (diff)

Changed 2 years ago by akaariai

comment:11 Changed 2 years ago by akaariai

  • Triage Stage changed from Accepted to Ready for checkin

A RFC patch in pull request https://github.com/django/django/pull/57.

As this is somewhat complex issue and getting this wrong would have serious consequences I hope to get a review from other core committers before commit.

When this gets committed, a note of this issue must be added to #3591, as there are no tests and the current patch in that ticket would reintroduce this issue.

comment:12 Changed 2 years ago by harm

patch looks sane. I think this is a clean an correct fix.

I can't validate your testcase though. It seems the attachment is corrupt ?

harm@dev3:~$ wget -q --no-check-certificate https://code.djangoproject.com/attachment/ticket/18251/threading_lock.tar.gz
harm@dev3:~$ tar vxzf threading_lock.tar.gz 

gzip: stdin: not in gzip format
tar: Child returned status 1
tar: Exiting with failure status due to previous errors

Last edited 2 years ago by harm (previous) (diff)

comment:13 Changed 2 years ago by akaariai

It seems you can't download the file like that - I tried that and when I looked into the file it start with this:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
...

downloading it through the direct download link works fine.

comment:14 Changed 23 months ago by harm

Ah... right.
Good testcase. I can't think of a way how to put it in a unittest though...

comment:15 Changed 22 months ago by akaariai

I will commit this one soon. I haven't gotten a review for this, but I am pretty confident this is in fact the right approach.

comment:16 Changed 22 months ago by Anssi Kääriäinen <akaariai@…>

  • Resolution set to fixed
  • Status changed from new to closed

In [2b9fb2e6443c04e4415b17083d727bd80047b6e5]:

Fixed #18251 -- Removed a deadlock possibility in apploading

Thanks to harm for the report and comments.

comment:17 Changed 22 months ago by akaariai

Im not going to backpatch this, at least not to 1.4.1. To me the risk that something subtle breaks is too great compared to the likelihood of users hitting this situation. If it turns out this deadlock is more common than it currently seems then lets revisit the backpatching.

comment:18 Changed 3 months ago by Aymeric Augustin <aymeric.augustin@…>

In d674fe6dee16735dd2670243153326806b7e6cb0:

Used a regular lock for app registry population.

Since the app registry is always populated before the first request is
processed, the situation described in #18251 for the old app cache
cannot happen any more.

Refs #18251, #21628.

Add Comment

Modify Ticket

Change Properties
<Author field>
Action
as closed
as The resolution will be set. Next status will be 'closed'
The resolution will be deleted. Next status will be 'new'
Author


E-mail address and user name can be saved in the Preferences.

 
Note: See TracTickets for help on using tickets.