Opened 3 months ago

Last modified 2 weeks ago

#36770 assigned Cleanup/optimization

SQLite threading tests are flaky when parallel test suite runs in forkserver mode

Reported by: Jacob Walls Owned by: Kundan Yadav
Component: Testing framework Version: 5.2
Severity: Normal Keywords: 3.14, forkserver, parallel
Cc: Triage Stage: Accepted
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

We have two tests that often fail on GitHub Actions CI runs under the parallel test runner, both involving threading and SQLite in-memory databases.

  • backends.sqlite.tests.ThreadSharing.test_database_sharing_in_threads
  • servers.tests.LiveServerInMemoryDatabaseLockTest.test_in_memory_database_lock

As of now, the failures are most common on the byte-compiled Django workflow, but we've at least seen the test_in_memory_database_lock failure on other workflows.

Tracebacks:

======================================================================
FAIL: test_database_sharing_in_threads (backends.sqlite.tests.ThreadSharing.test_database_sharing_in_threads)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.14.0/x64/lib/python3.14/unittest/case.py", line 58, in testPartExecutor
    yield
  File "/opt/hostedtoolcache/Python/3.14.0/x64/lib/python3.14/unittest/case.py", line 669, in run
    self._callTestMethod(testMethod)
  File "/opt/hostedtoolcache/Python/3.14.0/x64/lib/python3.14/unittest/case.py", line 615, in _callTestMethod
    result = method()
    ^^^^^^^^^^^^^^^
  File "/home/runner/work/django/django/tests/backends/sqlite/tests.py", line 282, in test_database_sharing_in_threads
    self.assertEqual(Object.objects.count(), 2)
    ^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.14.0/x64/lib/python3.14/unittest/case.py", line 925, in assertEqual
    assertion_func(first, second, msg=msg)
    ^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.14.0/x64/lib/python3.14/unittest/case.py", line 918, in _baseAssertEqual
    raise self.failureException(msg)
    ^^^^^^^^^^^
AssertionError: 1 != 2

----------------------------------------------------------------------
test_in_memory_database_lock (servers.tests.LiveServerInMemoryDatabaseLockTest.test_in_memory_database_lock) failed:

    AssertionError('Unexpected error due to a database lock.')

Other times, the workflow deadlocks, so we don't know which test failed, which caused us to add timeout-minutes: 60 everywhere defensively in e48527f91d341c85a652499a5baaf725d36ae54f.

This failure started manifesting after we upgraded more CI jobs to Python 3.14, which defaults POSIX systems to the forkserver multiprocessing mode. See #36531.
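For reference, the start method in effect can be checked directly. This is a minimal standalone sketch (not part of Django's test runner) showing how to query and, if needed, override the multiprocessing start method:

```python
# Query the multiprocessing start method the parallel test runner will
# inherit. On Python 3.14+, POSIX platforms other than macOS default to
# "forkserver"; macOS and Windows default to "spawn".
import multiprocessing

method = multiprocessing.get_start_method()
print(method)

# To reproduce CI behavior on an older Python, the method can be forced
# once, early in the entry point (before any pools/processes are created):
#     multiprocessing.set_start_method("forkserver", force=True)
```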

I haven't been able to reproduce this locally.

I have a low-confidence hypothesis that it has something to do with the calls to setup_worker_connection inside _init_worker that occur in the middle of test runs when there is resource contention. Is it possible that a late worker init is clobbering these particular tests' setup, overwriting their database connections so that they point at the same database?
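For context, here is a standalone sketch (not Django's actual test, and the database name `memdb_demo` is arbitrary) of what test_database_sharing_in_threads exercises: multiple threads writing to the same shared-cache in-memory SQLite database. If a late worker init swapped the connection out from under one thread, the final count would be 1 rather than 2, matching the "AssertionError: 1 != 2" above.

```python
# Two threads each open their own connection to the same shared-cache
# in-memory SQLite database and insert one row apiece.
import sqlite3
import threading

URI = "file:memdb_demo?mode=memory&cache=shared"

# Keep one connection open so the shared in-memory DB isn't destroyed
# when the worker threads close theirs.
keeper = sqlite3.connect(URI, uri=True)
keeper.execute("CREATE TABLE obj (id INTEGER PRIMARY KEY)")
keeper.commit()

def insert_row():
    conn = sqlite3.connect(URI, uri=True)
    with conn:  # commits on success
        conn.execute("INSERT INTO obj DEFAULT VALUES")
    conn.close()

# Run the threads one after the other to sidestep shared-cache table
# locks; the point here is sharing, not concurrency.
for _ in range(2):
    t = threading.Thread(target=insert_row)
    t.start()
    t.join()

count = keeper.execute("SELECT COUNT(*) FROM obj").fetchone()[0]
print(count)  # 2: both threads saw the same database
```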

Change History (4)

comment:1 by Natalia Bidart, 3 months ago

Triage Stage: Unreviewed → Accepted

Thank you!

comment:2 by Kundan Yadav, 2 months ago

Owner: set to Kundan Yadav
Status: new → assigned

comment:3 by Jacob Walls, 2 months ago

Keywords: spawn removed
Summary: SQLite threading tests are flaky when parallel test suite runs in forkserver/spawn → SQLite threading tests are flaky when parallel test suite runs in forkserver mode

I haven't verified that this affects "spawn", so retitling.

comment:4 by Jacob Walls, 2 weeks ago

I've seen this intermittently locally. Leaving aside the assertion failures for SQLite, we shouldn't get a hang in LiveServerTestCase when tests fail. You can engineer a hang like this by setting a minuscule timeout that will always raise:

diff --git a/django/test/testcases.py b/django/test/testcases.py
index 5f83612fe5..622b938dd6 100644
--- a/django/test/testcases.py
+++ b/django/test/testcases.py
@@ -1844,7 +1844,8 @@ class LiveServerTestCase(TransactionTestCase):
         cls.addClassCleanup(cls._terminate_thread)
 
         # Wait for the live server to be ready
-        cls.server_thread.is_ready.wait()
+        if not cls.server_thread.is_ready.wait(timeout=0.001):
+            raise Exception("Live server never became ready.")
         if cls.server_thread.error:
             raise cls.server_thread.error

Then, when you KeyboardInterrupt out of it, you get a stack trace from doClassCleanups suggesting that the termination code is waiting forever, even though the live server never started:

  File "/Library/Frameworks/Python.framework/Versions/3.14/lib/python3.14/unittest/suite.py", line 181, in _handleClassSetUp
    doClassCleanups()
    ~~~~~~~~~~~~~~~^^
  File "/Library/Frameworks/Python.framework/Versions/3.14/lib/python3.14/unittest/case.py", line 720, in doClassCleanups
    function(*args, **kwargs)
    ~~~~~~~~^^^^^^^^^^^^^^^^^
  File "/Users/jwalls/django/django/test/testcases.py", line 1864, in _terminate_thread
    cls.server_thread.terminate()
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~^^
  File "/Users/jwalls/django/django/test/testcases.py", line 1788, in terminate
    self.join()
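The hang reduces to a general pattern: joining a worker thread that is blocked forever waits forever, whereas Event.wait(timeout=...) returns False on timeout so the caller can bail out. A minimal illustration (names here are hypothetical, not Django's):

```python
# A stub "server" thread that blocks before ever signalling readiness,
# e.g. stuck on a database lock. Joining it unconditionally would hang.
import threading

ready = threading.Event()
stop = threading.Event()

def server_stub():
    stop.wait()  # blocks; never sets `ready`

t = threading.Thread(target=server_stub, daemon=True)
t.start()

# Event.wait(timeout=...) returns False if the event was never set,
# letting the caller avoid an unconditional t.join().
became_ready = ready.wait(timeout=0.05)
print(became_ready)  # False

# Cleanup for the sketch: unblock the stub and join now that it's safe.
stop.set()
t.join()
```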

Something like this fixes it:

  • django/test/testcases.py

    diff --git a/django/test/testcases.py b/django/test/testcases.py
    index 5f83612fe5..9cbeeeca25 100644
    --- a/django/test/testcases.py
    +++ b/django/test/testcases.py
    @@ -1781,8 +1781,9 @@ class LiveServerThread(threading.Thread):
             )
     
         def terminate(self):
    -        if hasattr(self, "httpd"):
    -            # Stop the WSGI server
    -            self.httpd.shutdown()
    -            self.httpd.server_close()
    -        self.join()
    +        if self.is_ready.is_set():
    +            if hasattr(self, "httpd"):
    +                # Stop the WSGI server
    +                self.httpd.shutdown()
    +                self.httpd.server_close()
    +            self.join()

My theory is that the "live server never became ready" situation I simulated above is similar to what we're seeing on CI, where a database lock prevents the live server thread from ever starting.


Then, for one of the underlying assertion failures: I'm wary of masking a real problem, but we could probably reduce the chance of failing jobs by adjusting test_in_memory_database_lock() to use the other database instead of the default. It would still cover the code under test, just with a much smaller chance of interacting poorly with other tests.
