Opened 3 months ago
Last modified 2 weeks ago
#36770 assigned Cleanup/optimization
SQLite threading tests are flaky when parallel test suite runs in forkserver mode
| Reported by: | Jacob Walls | Owned by: | Kundan Yadav |
|---|---|---|---|
| Component: | Testing framework | Version: | 5.2 |
| Severity: | Normal | Keywords: | 3.14, forkserver, parallel |
| Cc: | | Triage Stage: | Accepted |
| Has patch: | no | Needs documentation: | no |
| Needs tests: | no | Patch needs improvement: | no |
| Easy pickings: | no | UI/UX: | no |
Description
Two tests involving threading and SQLite in-memory databases often fail on GitHub Actions CI runs under the parallel test runner:

- `backends.sqlite.tests.ThreadSharing.test_database_sharing_in_threads`
- `servers.tests.LiveServerInMemoryDatabaseLockTest.test_in_memory_database_lock`
As of now, the failures are most common on the byte-compiled Django workflow, but we've seen the test_in_memory_database_lock failure on other workflows as well.
Tracebacks:
```
======================================================================
FAIL: test_database_sharing_in_threads (backends.sqlite.tests.ThreadSharing.test_database_sharing_in_threads)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.14.0/x64/lib/python3.14/unittest/case.py", line 58, in testPartExecutor
    yield
  File "/opt/hostedtoolcache/Python/3.14.0/x64/lib/python3.14/unittest/case.py", line 669, in run
    self._callTestMethod(testMethod)
  File "/opt/hostedtoolcache/Python/3.14.0/x64/lib/python3.14/unittest/case.py", line 615, in _callTestMethod
    result = method()
  File "/home/runner/work/django/django/tests/backends/sqlite/tests.py", line 282, in test_database_sharing_in_threads
    self.assertEqual(Object.objects.count(), 2)
  File "/opt/hostedtoolcache/Python/3.14.0/x64/lib/python3.14/unittest/case.py", line 925, in assertEqual
    assertion_func(first, second, msg=msg)
  File "/opt/hostedtoolcache/Python/3.14.0/x64/lib/python3.14/unittest/case.py", line 918, in _baseAssertEqual
    raise self.failureException(msg)
AssertionError: 1 != 2
----------------------------------------------------------------------
```
test_in_memory_database_lock (servers.tests.LiveServerInMemoryDatabaseLockTest.test_in_memory_database_lock) failed: AssertionError('Unexpected error due to a database lock.')
Other times, the workflow deadlocks, so we don't know which test failed; this led us to defensively add timeout-minutes: 60 everywhere in e48527f91d341c85a652499a5baaf725d36ae54f.
This failure started manifesting after we upgraded more CI jobs to Python 3.14, which defaults POSIX systems to the forkserver multiprocessing mode. See #36531.
I haven't been able to reproduce this locally.

I have a low-confidence hypothesis that it might be something to do with calls to setup_worker_connection inside _init_worker occurring mid-test-run when there is resource contention. Is it possible that a late worker init is clobbering these particular tests' setup, overwriting database connections so that they end up being the same?
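As a diagnostic aside (not part of the ticket): Python 3.14 changed the default multiprocessing start method on POSIX platforms from "fork" to "forkserver", which is what makes worker initialization run fresh in each worker instead of being inherited. A minimal sketch for checking or forcing the start method while debugging, using only the stdlib API:

```python
import multiprocessing

# On Python 3.14, the default start method on POSIX systems is
# "forkserver" rather than "fork": workers start with a fresh
# interpreter and re-run their setup (e.g. Django's _init_worker)
# instead of inheriting the parent process's state.
print(multiprocessing.get_start_method())

# One way to rule the start method in or out while debugging is to
# request a specific context explicitly ("fork" is POSIX-only):
ctx = multiprocessing.get_context("fork")
print(ctx.get_start_method())  # "fork"
```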
Change History (4)
comment:1 by , 3 months ago
| Triage Stage: | Unreviewed → Accepted |
|---|
comment:2 by , 2 months ago
| Owner: | set to |
|---|---|
| Status: | new → assigned |
comment:3 by , 2 months ago
| Keywords: | spawn removed |
|---|---|
| Summary: | SQLite threading tests are flaky when parallel test suite runs in forkserver/spawn → SQLite threading tests are flaky when parallel test suite runs in forkserver mode |
I haven't verified that this affects "spawn", so retitling.
comment:4 by , 2 weeks ago
I've seen this intermittently locally. Leaving aside the assertion failures for SQLite, we shouldn't have a hang in LiveServerTestCase when tests fail. You can engineer such a hang by setting a minuscule timeout that will always raise:
```diff
diff --git a/django/test/testcases.py b/django/test/testcases.py
index 5f83612fe5..622b938dd6 100644
--- a/django/test/testcases.py
+++ b/django/test/testcases.py
@@ -1844,7 +1844,8 @@ class LiveServerTestCase(TransactionTestCase):
         cls.addClassCleanup(cls._terminate_thread)

         # Wait for the live server to be ready
-        cls.server_thread.is_ready.wait()
+        if not cls.server_thread.is_ready.wait(timeout=0.001):
+            raise Exception("Live server never became ready.")
         if cls.server_thread.error:
             raise cls.server_thread.error
```
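The patch above hinges on `threading.Event.wait()` returning `False` when the timeout expires before the event is set; a minimal stdlib sketch of that behavior:

```python
import threading

never_set = threading.Event()

# wait(timeout=...) returns False if the timeout elapses before the
# event is set -- this is what lets the diff above turn a silent hang
# into a loud exception.
print(never_set.wait(timeout=0.001))  # False

ready = threading.Event()
ready.set()
print(ready.wait(timeout=0.001))  # True
```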
Then, after interrupting it with KeyboardInterrupt, you get a stack trace from doClassCleanups suggesting that the termination code is waiting forever, even though the live server never started:
```
  File "/Library/Frameworks/Python.framework/Versions/3.14/lib/python3.14/unittest/suite.py", line 181, in _handleClassSetUp
    doClassCleanups()
  File "/Library/Frameworks/Python.framework/Versions/3.14/lib/python3.14/unittest/case.py", line 720, in doClassCleanups
    function(*args, **kwargs)
  File "/Users/jwalls/django/django/test/testcases.py", line 1864, in _terminate_thread
    cls.server_thread.terminate()
  File "/Users/jwalls/django/django/test/testcases.py", line 1788, in terminate
    self.join()
```
Something like this fixes it:
```diff
diff --git a/django/test/testcases.py b/django/test/testcases.py
index 5f83612fe5..9cbeeeca25 100644
--- a/django/test/testcases.py
+++ b/django/test/testcases.py
@@ -1781,8 +1781,9 @@ class LiveServerThread(threading.Thread):
         )

     def terminate(self):
-        if hasattr(self, "httpd"):
-            # Stop the WSGI server
-            self.httpd.shutdown()
-            self.httpd.server_close()
-        self.join()
+        if self.is_ready.is_set():
+            if hasattr(self, "httpd"):
+                # Stop the WSGI server
+                self.httpd.shutdown()
+                self.httpd.server_close()
+            self.join()
```
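A toy reproduction of the guarded terminate() idea (the class and names below are illustrative stand-ins, not Django's actual LiveServerThread):

```python
import threading


class ServerThreadSketch(threading.Thread):
    """Hypothetical stand-in for a server thread with an is_ready event."""

    def __init__(self):
        super().__init__(daemon=True)
        self.is_ready = threading.Event()

    def run(self):
        # A real server thread would set up its server here before
        # signalling readiness.
        self.is_ready.set()

    def terminate(self):
        # Only join() if the thread got far enough to signal readiness;
        # otherwise join() could block forever on a thread stuck before
        # (or during) its startup code.
        if self.is_ready.is_set():
            self.join()


t = ServerThreadSketch()
# terminate() before start(): is_ready was never set, so join() is
# skipped and we return immediately instead of hanging.
t.terminate()

t.start()
t.is_ready.wait()
t.terminate()  # thread is ready, so join() is safe
```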
My theory is that the "live server never became ready" situation I simulated above is similar to the situation we're seeing on CI where a database lock entails a failure to start a live server thread.
As for one of the underlying assertion failures: I'm wary of masking a real problem, but we could probably reduce the chance of failing jobs by adjusting test_in_memory_database_lock() to use the "other" database instead of the default. It would still cover the code under test, just with a much smaller chance of interacting poorly with other tests.
Thank you!