Opened 4 months ago
Last modified 10 days ago
#36770 new Cleanup/optimization
SQLite threading tests are flaky when parallel test suite runs in forkserver mode
| Reported by: | Jacob Walls | Owned by: | |
|---|---|---|---|
| Component: | Testing framework | Version: | 5.2 |
| Severity: | Normal | Keywords: | 3.14, forkserver, parallel |
| Cc: | | Triage Stage: | Accepted |
| Has patch: | no | Needs documentation: | no |
| Needs tests: | no | Patch needs improvement: | no |
| Easy pickings: | no | UI/UX: | no |
Description
We have two tests involving threading and SQLite in-memory databases that often fail on GitHub Actions CI runs under the parallel test runner:

- backends.sqlite.tests.ThreadSharing.test_database_sharing_in_threads
- servers.tests.LiveServerInMemoryDatabaseLockTest.test_in_memory_database_lock
As of now, the failures are most common on the byte-compiled Django workflow, but we've at least seen the test_in_memory_database_lock failure on other workflows.
Tracebacks:
```
======================================================================
FAIL: test_database_sharing_in_threads (backends.sqlite.tests.ThreadSharing.test_database_sharing_in_threads)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.14.0/x64/lib/python3.14/unittest/case.py", line 58, in testPartExecutor
    yield
  File "/opt/hostedtoolcache/Python/3.14.0/x64/lib/python3.14/unittest/case.py", line 669, in run
    self._callTestMethod(testMethod)
  File "/opt/hostedtoolcache/Python/3.14.0/x64/lib/python3.14/unittest/case.py", line 615, in _callTestMethod
    result = method()
  File "/home/runner/work/django/django/tests/backends/sqlite/tests.py", line 282, in test_database_sharing_in_threads
    self.assertEqual(Object.objects.count(), 2)
  File "/opt/hostedtoolcache/Python/3.14.0/x64/lib/python3.14/unittest/case.py", line 925, in assertEqual
    assertion_func(first, second, msg=msg)
  File "/opt/hostedtoolcache/Python/3.14.0/x64/lib/python3.14/unittest/case.py", line 918, in _baseAssertEqual
    raise self.failureException(msg)
AssertionError: 1 != 2
----------------------------------------------------------------------
```
test_in_memory_database_lock (servers.tests.LiveServerInMemoryDatabaseLockTest.test_in_memory_database_lock) failed: AssertionError('Unexpected error due to a database lock.')
Other times, the workflow deadlocks, so we don't know which test failed; this led us to defensively add `timeout-minutes: 60` everywhere in e48527f91d341c85a652499a5baaf725d36ae54f.
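For reference, the defensive per-job timeout looks like this in a workflow file. This is a hypothetical minimal excerpt, not Django's actual workflow; the job name and steps are invented for illustration:

```yaml
# Hypothetical minimal GitHub Actions job showing the defensive timeout
# described above; Django's real workflow files contain more steps.
jobs:
  tests:
    runs-on: ubuntu-latest
    timeout-minutes: 60  # fail the job instead of deadlocking indefinitely
    steps:
      - uses: actions/checkout@v4
      - run: python -m pip install -e .
      - run: python tests/runtests.py --parallel
```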
This failure started manifesting after we upgraded more CI jobs to Python 3.14, which defaults POSIX systems to the forkserver multiprocessing mode. See #36531.
I haven't been able to reproduce this locally.
I have a low-confidence hypothesis that it might be something to do with calls to setup_worker_connection inside _init_worker that occur in the middle of the test run when there is resource contention. Is it possible that a late worker init is clobbering these particular tests' setup, overwriting database connections so that they point at the same database?
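For context on what the flaky test asserts, here is a standalone sketch (not Django's actual test code) of the underlying SQLite mechanism: an in-memory database opened with a shared-cache URI is visible to connections opened from other threads, so rows inserted by two worker threads should both be counted afterwards. The database name is made up for illustration:

```python
import sqlite3
import threading

# A shared-cache in-memory database; the name "ticket36770" is arbitrary.
URI = "file:ticket36770?mode=memory&cache=shared"

# Keep one connection open for the whole run; a shared in-memory database
# is destroyed once its last connection closes.
keeper = sqlite3.connect(URI, uri=True)
keeper.execute("CREATE TABLE obj (id INTEGER PRIMARY KEY)")
keeper.commit()

def insert():
    # Each thread opens its own connection to the same shared database.
    conn = sqlite3.connect(URI, uri=True)
    conn.execute("INSERT INTO obj DEFAULT VALUES")
    conn.commit()
    conn.close()

# Run the writers one after another to sidestep shared-cache table locks;
# the point here is visibility across threads, not write concurrency.
for _ in range(2):
    t = threading.Thread(target=insert)
    t.start()
    t.join()

print(keeper.execute("SELECT COUNT(*) FROM obj").fetchone()[0])  # 2
```

If a late worker init redirected one thread's connection to a different database, the final count would come up short, which is the shape of the `1 != 2` failure above.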
Change History (10)
comment:1 by , 4 months ago
| Triage Stage: | Unreviewed → Accepted |
|---|
comment:2 by , 4 months ago
| Owner: | set to |
|---|---|
| Status: | new → assigned |
follow-up: 6 comment:3 by , 4 months ago
| Keywords: | spawn removed |
|---|---|
| Summary: | SQLite threading tests are flaky when parallel test suite runs in forkserver/spawn → SQLite threading tests are flaky when parallel test suite runs in forkserver mode |
I haven't verified that this affects "spawn", so retitling.
comment:4 by , 2 months ago
I've seen this intermittently locally. Leaving aside the assertion failures for SQLite, we shouldn't have a hang in LiveServerTestCase when tests fail. You can engineer a hang like this, setting a minuscule timeout that will always raise:
```diff
diff --git a/django/test/testcases.py b/django/test/testcases.py
index 5f83612fe5..622b938dd6 100644
--- a/django/test/testcases.py
+++ b/django/test/testcases.py
@@ -1844,7 +1844,8 @@ class LiveServerTestCase(TransactionTestCase):
         cls.addClassCleanup(cls._terminate_thread)
 
         # Wait for the live server to be ready
-        cls.server_thread.is_ready.wait()
+        if not cls.server_thread.is_ready.wait(timeout=0.001):
+            raise Exception("Live server never became ready.")
         if cls.server_thread.error:
             raise cls.server_thread.error
```
Then when KeyboardInterrupting out of it, you get a stack trace from doClassCleanups, suggesting that the termination code is waiting forever, even though the live server never started:
```
  File "/Library/Frameworks/Python.framework/Versions/3.14/lib/python3.14/unittest/suite.py", line 181, in _handleClassSetUp
    doClassCleanups()
  File "/Library/Frameworks/Python.framework/Versions/3.14/lib/python3.14/unittest/case.py", line 720, in doClassCleanups
    function(*args, **kwargs)
  File "/Users/jwalls/django/django/test/testcases.py", line 1864, in _terminate_thread
    cls.server_thread.terminate()
  File "/Users/jwalls/django/django/test/testcases.py", line 1788, in terminate
    self.join()
```
Something like this fixes it:
```diff
diff --git a/django/test/testcases.py b/django/test/testcases.py
index 5f83612fe5..9cbeeeca25 100644
--- a/django/test/testcases.py
+++ b/django/test/testcases.py
@@ -1781,8 +1781,9 @@ class LiveServerThread(threading.Thread):
         )
 
     def terminate(self):
-        if hasattr(self, "httpd"):
-            # Stop the WSGI server
-            self.httpd.shutdown()
-            self.httpd.server_close()
-        self.join()
+        if self.is_ready.is_set():
+            if hasattr(self, "httpd"):
+                # Stop the WSGI server
+                self.httpd.shutdown()
+                self.httpd.server_close()
+            self.join()
```
My theory is that the "live server never became ready" situation I simulated above is similar to the situation we're seeing on CI where a database lock entails a failure to start a live server thread.
As for one of the underlying assertion failures: I'm wary of masking a real problem, but we could probably reduce the chance of failing jobs by adjusting test_in_memory_database_lock() to use the other database instead of the default. It would still cover the code it's testing, while having a much smaller chance of interacting poorly with other tests.
comment:6 by , 10 days ago
Replying to Jacob Walls:
we could probably reduce the chance of failing jobs by adjusting test_in_memory_database_lock() to use the other database instead of the default.
As Simon surmised on the above PR, this didn't help. (The PR still saw occasional failures in this test after trying it, so I removed the speculative change.)
comment:10 by , 10 days ago
| Has patch: | unset |
|---|---|
| Owner: | removed |
| Status: | assigned → new |
Thank you!