Opened 3 months ago

Last modified 2 weeks ago

#36770 assigned Cleanup/optimization

SQLite threading tests are flaky when parallel test suite runs in forkserver mode

Reported by: Jacob Walls Owned by: Kundan Yadav
Component: Testing framework Version: 5.2
Severity: Normal Keywords: 3.14, forkserver, parallel
Cc: Triage Stage: Accepted
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

We have two tests that often fail on GitHub Actions CI runs under the parallel test runner, both involving threading and SQLite in-memory databases.

  • backends.sqlite.tests.ThreadSharing.test_database_sharing_in_threads
  • servers.tests.LiveServerInMemoryDatabaseLockTest.test_in_memory_database_lock

As of now, the failures are most common on the byte-compiled Django workflow, but we've at least seen the test_in_memory_database_lock failure on other workflows.

Tracebacks:

======================================================================
FAIL: test_database_sharing_in_threads (backends.sqlite.tests.ThreadSharing.test_database_sharing_in_threads)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.14.0/x64/lib/python3.14/unittest/case.py", line 58, in testPartExecutor
    yield
  File "/opt/hostedtoolcache/Python/3.14.0/x64/lib/python3.14/unittest/case.py", line 669, in run
    self._callTestMethod(testMethod)
  File "/opt/hostedtoolcache/Python/3.14.0/x64/lib/python3.14/unittest/case.py", line 615, in _callTestMethod
    result = method()
    ^^^^^^^^^^^^^^^
  File "/home/runner/work/django/django/tests/backends/sqlite/tests.py", line 282, in test_database_sharing_in_threads
    self.assertEqual(Object.objects.count(), 2)
    ^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.14.0/x64/lib/python3.14/unittest/case.py", line 925, in assertEqual
    assertion_func(first, second, msg=msg)
    ^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.14.0/x64/lib/python3.14/unittest/case.py", line 918, in _baseAssertEqual
    raise self.failureException(msg)
    ^^^^^^^^^^^
AssertionError: 1 != 2

----------------------------------------------------------------------
test_in_memory_database_lock (servers.tests.LiveServerInMemoryDatabaseLockTest.test_in_memory_database_lock) failed:

    AssertionError('Unexpected error due to a database lock.')

Other times, the workflow deadlocks, so we don't know which test failed, which caused us to add timeout-minutes: 60 everywhere defensively in e48527f91d341c85a652499a5baaf725d36ae54f.

This failure started manifesting after we upgraded more CI jobs to Python 3.14, which defaults POSIX systems to the forkserver multiprocessing mode. See #36531.
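For reference, the start method in effect can be checked directly. This is a minimal standalone sketch (not part of Django's test runner) showing how to query and, if needed, override the multiprocessing start method:

```python
# Query the multiprocessing start method the parallel test runner will
# inherit. On Python 3.14+, POSIX platforms other than macOS default to
# "forkserver"; macOS and Windows default to "spawn".
import multiprocessing

method = multiprocessing.get_start_method()
print(method)

# To reproduce CI behavior on an older Python, the method can be forced
# once, early in the entry point (before any pools/processes are created):
#     multiprocessing.set_start_method("forkserver", force=True)
```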

I haven't been able to reproduce this locally.

I have a low-confidence hypothesis that it has something to do with the calls to setup_worker_connection inside _init_worker that occur in the middle of test runs when there is resource contention. Is it possible that a late worker init is clobbering these particular tests' setup, overwriting their database connections so that they point at the same database?
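For context, here is a standalone sketch (not Django's actual test, and the database name `memdb_demo` is arbitrary) of what test_database_sharing_in_threads exercises: multiple threads writing to the same shared-cache in-memory SQLite database. If a late worker init swapped the connection out from under one thread, the final count would be 1 rather than 2, matching the "AssertionError: 1 != 2" above.

```python
# Two threads each open their own connection to the same shared-cache
# in-memory SQLite database and insert one row apiece.
import sqlite3
import threading

URI = "file:memdb_demo?mode=memory&cache=shared"

# Keep one connection open so the shared in-memory DB isn't destroyed
# when the worker threads close theirs.
keeper = sqlite3.connect(URI, uri=True)
keeper.execute("CREATE TABLE obj (id INTEGER PRIMARY KEY)")
keeper.commit()

def insert_row():
    conn = sqlite3.connect(URI, uri=True)
    with conn:  # commits on success
        conn.execute("INSERT INTO obj DEFAULT VALUES")
    conn.close()

# Run the threads one after the other to sidestep shared-cache table
# locks; the point here is sharing, not concurrency.
for _ in range(2):
    t = threading.Thread(target=insert_row)
    t.start()
    t.join()

count = keeper.execute("SELECT COUNT(*) FROM obj").fetchone()[0]
print(count)  # 2: both threads saw the same database
```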

Change History (4)

comment:1 by Natalia Bidart, 3 months ago

Triage Stage: Unreviewed → Accepted

Thank you!

comment:2 by Kundan Yadav, 2 months ago

Owner: set to Kundan Yadav
Status: new → assigned

comment:3 by Jacob Walls, 2 months ago

Keywords: spawn removed
Summary: SQLite threading tests are flaky when parallel test suite runs in forkserver/spawn → SQLite threading tests are flaky when parallel test suite runs in forkserver mode

I haven't verified that this affects "spawn", so retitling.

comment:4 by Jacob Walls, 2 weeks ago

I've seen this intermittently locally. Leaving aside the assertion failures for SQLite, we shouldn't get a hang in LiveServerTestCase when tests fail. You can engineer a hang like this by setting a minuscule timeout that will always raise:

diff --git a/django/test/testcases.py b/django/test/testcases.py
index 5f83612fe5..622b938dd6 100644
--- a/django/test/testcases.py
+++ b/django/test/testcases.py
@@ -1844,7 +1844,8 @@ class LiveServerTestCase(TransactionTestCase):
         cls.addClassCleanup(cls._terminate_thread)
 
         # Wait for the live server to be ready
-        cls.server_thread.is_ready.wait()
+        if not cls.server_thread.is_ready.wait(timeout=0.001):
+            raise Exception("Live server never became ready.")
         if cls.server_thread.error:
             raise cls.server_thread.error

Then, when you KeyboardInterrupt out of it, you get a stack trace from doClassCleanups suggesting that the termination code is waiting forever, even though the live server never started:

  File "/Library/Frameworks/Python.framework/Versions/3.14/lib/python3.14/unittest/suite.py", line 181, in _handleClassSetUp
    doClassCleanups()
    ~~~~~~~~~~~~~~~^^
  File "/Library/Frameworks/Python.framework/Versions/3.14/lib/python3.14/unittest/case.py", line 720, in doClassCleanups
    function(*args, **kwargs)
    ~~~~~~~~^^^^^^^^^^^^^^^^^
  File "/Users/jwalls/django/django/test/testcases.py", line 1864, in _terminate_thread
    cls.server_thread.terminate()
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~^^
  File "/Users/jwalls/django/django/test/testcases.py", line 1788, in terminate
    self.join()
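The hang reduces to a general pattern: joining a worker thread that is blocked forever waits forever, whereas Event.wait(timeout=...) returns False on timeout so the caller can bail out. A minimal illustration (names here are hypothetical, not Django's):

```python
# A stub "server" thread that blocks before ever signalling readiness,
# e.g. stuck on a database lock. Joining it unconditionally would hang.
import threading

ready = threading.Event()
stop = threading.Event()

def server_stub():
    stop.wait()  # blocks; never sets `ready`

t = threading.Thread(target=server_stub, daemon=True)
t.start()

# Event.wait(timeout=...) returns False if the event was never set,
# letting the caller avoid an unconditional t.join().
became_ready = ready.wait(timeout=0.05)
print(became_ready)  # False

# Cleanup for the sketch: unblock the stub and join now that it's safe.
stop.set()
t.join()
```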

Something like this fixes it:

  • django/test/testcases.py

    diff --git a/django/test/testcases.py b/django/test/testcases.py
    index 5f83612fe5..9cbeeeca25 100644
    --- a/django/test/testcases.py
    +++ b/django/test/testcases.py
    @@ -1781,8 +1781,9 @@ class LiveServerThread(threading.Thread):
             )
     
         def terminate(self):
    -        if hasattr(self, "httpd"):
    -            # Stop the WSGI server
    -            self.httpd.shutdown()
    -            self.httpd.server_close()
    -        self.join()
    +        if self.is_ready.is_set():
    +            if hasattr(self, "httpd"):
    +                # Stop the WSGI server
    +                self.httpd.shutdown()
    +                self.httpd.server_close()
    +            self.join()

My theory is that the "live server never became ready" situation I simulated above is similar to what we're seeing on CI, where a database lock prevents the live server thread from ever starting.


Then, for one of the underlying assertion failures: I'm wary of masking a real problem, but we could probably reduce the chance of failing jobs by adjusting test_in_memory_database_lock() to use the other database instead of the default. It would still cover the code under test, just with a much smaller chance of interacting poorly with other tests.
