#18251 closed Bug (fixed)
multithreading deadlock in django.models.loading.get_apps
Reported by: | harm | Owned by: | nobody |
---|---|---|---|
Component: | Core (Other) | Version: | 1.3 |
Severity: | Normal | Keywords: | |
Cc: | hjwp2@… | Triage Stage: | Ready for checkin |
Has patch: | no | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
On a production site we have encountered deadlocks, which after fierce investigation pointed to the django internals.
We use a deployment: apache/mod_wsgi
apache config:
WSGIDaemonProcess <name_here> processes=1 threads=10 umask=0002 display-name=%{GROUP} stack-size=524288
details of deadlock
During the bootstrap of django, so the first requests it handles, a classic ABBA deadlock can occur.
This results in that django never boots, and that NO requests are handled. Just restarting apache works, (hoping that the same deadlock does not appear again).
description of deadlock
thread A thread B request foo import foo.views.py import foo.models.py request 'bar' import bar.models.py (acquires import lock B) bootstraps django calls django.models.loading.get_apps() self.write_lock.acquire() (lock A!) load app apps . . . django.models.loading.get_apps() import bar.models.py (takes import lock. lock B!) self.write_lock.acquire() (lock A!) <Blocked> <Blocked>
NB: this exact deadlock was actually seen in stracktraces. This is not a guess.
How to reproduce
its hard to reproduce, you need high volume traffic + multithreaded deployement + a big enough application that different urls trigger compelete different code paths in your project.
I am not able to produce a simple test project that demonstrates this problem
We had a 100% reproducible setup, where the site would lock up (almost) every time when we touched wsgi.py during peak hours of the day.
- When you have a projects that is not multithreaded deployed -> this problem does not happen
- When your application has a limited set of urls/views so that all initial calls follow same code paths -> this problem does not happen
- When application has low traffic (the very first request can finish, without other requests beeing made) -> this problem does not happen.
Workaround
We use the attached DjangoWSGIHandler
wrapper as a workaround. That solved the problem 100 % for us.
Its a drop-in replacement for DjangoWSGIHandler
and effectively disables multithreading the first few request (allowing django to bootstrap properly), and only afther the first requests act multitheading again.
Please consider if this workaround should be applied to django itself. (Put the Lock HIGHER the chain, to the wsgi handler level)
Fix
Some init locking should be adding to DjangoWSGIHandler
around the complete request.
django.core.handlers.wsgi.WSGIHandler() has such lock around middleware loading.
if self._request_middleware is None: self.initLock.acquire()
Something similar should be around the complete request, not just middleware loading.
But only the very first request.
Attachments (2)
Change History (21)
by , 13 years ago
Attachment: | workaround.py added |
---|
comment:1 by , 13 years ago
Component: | Uncategorized → Core (Other) |
---|---|
Triage Stage: | Unreviewed → Accepted |
Type: | Uncategorized → Bug |
comment:2 by , 13 years ago
Could this work: in WSGIHandler, when serving request do:
from django.db.models.loading import app_cache_ready, get_apps if not app_cache_ready(): get_apps()
I didn't put any lock in there - I think the get_apps call should be safe as it pretty much immediately acquires the write_lock of app_cache. You don't even need the if not app_cache_ready(), as get_apps() returns immediately if apps are already loaded. It reads better to me that way..
Can you report if this works?
comment:3 by , 13 years ago
Hm. smart. Yes, that would work.
I cannot test it, as the system that reproduced this problem is our production system.
Note: you do need something like if not app_cache_ready()
, as get_apps()
actually does a load of work (even when called later), it returns the apps list, you don't want that every call. (at least in django 1.3)
micro optimization:
Als this is really really in a hotspot of the code, storing this state locally in WSGIHandler, might be faster than calling app_cache_ready()
every request.
so something like this in WSGIHandler
:
if not self._apps_initialized: get_apps() # if entered more that once thats ok, as long as we dont hit this every request self._apps_initialized = True
comment:4 by , 13 years ago
Good point about the additional work done by get_apps().
I don't believe self._apps_initialized is needed, a method call which returns self.loaded should be light enough. It is hard to imagine a situation where one method call is crucial to performance.
Maybe the best way would be to just make a load() method in apps, which returns immediately if already loaded.
Another way would be to first issue imp.acquire_lock() before taking the write_lock in app-loading. This should probably be done in any case. Although explicitly initializing Django's app-state feels right to me.
I checked the app-loading refactor work (#3591) and it seems it does not have any solution to this problem.
It would be very good to be able to write a test case for this, but it seems hard...
comment:5 by , 13 years ago
I'm not sure aquiring the imp.acquire_lock()
would help in this case. Aren't we too late then ? (the lock B would already be taken bij thread B in the sequance diagram) leaving the same ABBA deadlock. I think this needs to be done in the wsgihandler.
comment:6 by , 13 years ago
I think it is safe. The reason is that if B has the import lock, then A's imp.acquire_lock() will block. This is fine, as B can still continue working. If B goes into loading._populate, it has already the import lock, and can bypass A, take the write_lock, and proceed with app loading. If B just finishes its import, it will release the import lock, A can continue and again all fine.
Actually, to me it seems the write_lock is redundant in this case.
comment:7 by , 13 years ago
To be precise: the import lock is always acquired before the write lock. Thus there can be no deadlocks. Also, the import lock is always held when the write lock is held. Thus, the write lock isn't needed.
comment:8 by , 13 years ago
I can't follow you.. If that is the case then there would not have been a problem in the first place, no ?
I'm not 100% sure how the import locks works, but it seemed to me, that when doing an import of a file foo.py you get an other lock than an other thread importing bar.py, no ?)
I mean you can't fix a ABBA deadlock by adding a lock C between A and B, only around it.
comment:9 by , 13 years ago
All of the above is assuming there is just a single import lock. The import documentation from Python's docs suggests this, and this test case confirms it:
from threading import Thread import time import imp class thread1(Thread): def run(self, *args, **kwargs): imp.acquire_lock() print "locked" time.sleep(10) print "unlocking" imp.release_lock() class thread2(Thread): def run(self, *args, **kwargs): time.sleep(5) print "trying to import" import sys print "import done" t1 = thread1() t1.start() t2 = thread2() t2.start()
So, what is happening in your case is that apps are loaded from one thread outside of importing. Maybe as a side product of executing a query or something like that. The app loading will then import multiple other modules, and importing these modules needs the import lock. While this is happening, another thread imports a module, and as a side product of this the apps are loaded.
In this situation the first thread does not have the import lock when it enters app loading, it however needs it later on for initializing the applications. The second thread already has the import lock when it tries to load the applications. It tries to take the write_lock, and now you have a deadlock.
I will try to write a small test-application confirming the above, as at the moment the above is just speculation.
comment:10 by , 13 years ago
Attached is a test case showing the problem. It deadlocks reliably on my machine if the writelock is used in app loading, but succeeds if it is changed to imp.acquire_lock() / imp.release_lock().
by , 13 years ago
Attachment: | threading_lock.tar.gz added |
---|
comment:11 by , 13 years ago
Triage Stage: | Accepted → Ready for checkin |
---|
A RFC patch in pull request https://github.com/django/django/pull/57.
As this is somewhat complex issue and getting this wrong would have serious consequences I hope to get a review from other core committers before commit.
When this gets committed, a note of this issue must be added to #3591, as there are no tests and the current patch in that ticket would reintroduce this issue.
comment:12 by , 13 years ago
patch looks sane. I think this is a clean an correct fix.
I can't validate your testcase though. It seems the attachment is corrupt ?
harm@dev3:~$ wget -q --no-check-certificate https://code.djangoproject.com/attachment/ticket/18251/threading_lock.tar.gz harm@dev3:~$ tar vxzf threading_lock.tar.gz gzip: stdin: not in gzip format tar: Child returned status 1 tar: Exiting with failure status due to previous errors
comment:13 by , 13 years ago
It seems you can't download the file like that - I tried that and when I looked into the file it start with this:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> <html xmlns="http://www.w3.org/1999/xhtml"> ...
downloading it through the direct download link works fine.
comment:14 by , 13 years ago
Ah... right.
Good testcase. I can't think of a way how to put it in a unittest though...
comment:15 by , 12 years ago
I will commit this one soon. I haven't gotten a review for this, but I am pretty confident this is in fact the right approach.
comment:16 by , 12 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
comment:17 by , 12 years ago
Im not going to backpatch this, at least not to 1.4.1. To me the risk that something subtle breaks is too great compared to the likelihood of users hitting this situation. If it turns out this deadlock is more common than it currently seems then lets revisit the backpatching.
comment:19 by , 9 years ago
Cc: | added |
---|
workaround (not a fix) wrapper around DjangoWSGIHandler