Opened 4 months ago

Last modified 10 days ago

#36770 new Cleanup/optimization

SQLite threading tests are flaky when parallel test suite runs in forkserver mode

Reported by: Jacob Walls Owned by:
Component: Testing framework Version: 5.2
Severity: Normal Keywords: 3.14, forkserver, parallel
Cc: Triage Stage: Accepted
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

We have two tests that often fail on GitHub Actions CI runs under the parallel test runner, both involving threading and SQLite in-memory databases:

  • backends.sqlite.tests.ThreadSharing.test_database_sharing_in_threads
  • servers.tests.LiveServerInMemoryDatabaseLockTest.test_in_memory_database_lock

As of now, the failures are most common on the byte-compiled Django workflow, but we've seen the test_in_memory_database_lock failure on other workflows as well.

Tracebacks:

======================================================================
FAIL: test_database_sharing_in_threads (backends.sqlite.tests.ThreadSharing.test_database_sharing_in_threads)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.14.0/x64/lib/python3.14/unittest/case.py", line 58, in testPartExecutor
    yield
  File "/opt/hostedtoolcache/Python/3.14.0/x64/lib/python3.14/unittest/case.py", line 669, in run
    self._callTestMethod(testMethod)
  File "/opt/hostedtoolcache/Python/3.14.0/x64/lib/python3.14/unittest/case.py", line 615, in _callTestMethod
    result = method()
  File "/home/runner/work/django/django/tests/backends/sqlite/tests.py", line 282, in test_database_sharing_in_threads
    self.assertEqual(Object.objects.count(), 2)
  File "/opt/hostedtoolcache/Python/3.14.0/x64/lib/python3.14/unittest/case.py", line 925, in assertEqual
    assertion_func(first, second, msg=msg)
  File "/opt/hostedtoolcache/Python/3.14.0/x64/lib/python3.14/unittest/case.py", line 918, in _baseAssertEqual
    raise self.failureException(msg)
AssertionError: 1 != 2

----------------------------------------------------------------------
test_in_memory_database_lock (servers.tests.LiveServerInMemoryDatabaseLockTest.test_in_memory_database_lock) failed:

    AssertionError('Unexpected error due to a database lock.')

Other times, the workflow deadlocks, so we don't know which test failed; this is what led us to defensively add timeout-minutes: 60 everywhere in e48527f91d341c85a652499a5baaf725d36ae54f.

This failure started manifesting after we upgraded more CI jobs to Python 3.14, which changes the default multiprocessing start method on POSIX systems to forkserver. See #36531.
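For reference, the current start method can be inspected and a specific one pinned via a context; this is a generic stdlib sketch, not Django's test-runner code:

```python
import multiprocessing

if __name__ == "__main__":
    # Platform- and version-dependent: Python 3.14 changed the POSIX
    # default from "fork" to "forkserver".
    print(multiprocessing.get_start_method())

    # "spawn" is available on every platform; "fork" and "forkserver"
    # depend on the OS.
    print(multiprocessing.get_all_start_methods())

    # A context pins a start method without changing the global default.
    ctx = multiprocessing.get_context("spawn")
```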

I haven't been able to reproduce this locally.

I have a low-confidence hypothesis that it might be something to do with calls to setup_worker_connection inside _init_worker that occur in the middle of the test run when there is resource contention. Is it possible that a late worker init is clobbering these particular tests' setup, with database connections being overwritten to point at the same database?
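For background on what these tests exercise: a SQLite in-memory database can be shared between threads through a shared-cache URI, so rows inserted by worker threads are visible to the main thread's connection. A minimal stdlib sketch (the database name demo_memdb is arbitrary, not Django's; this is not the backend's actual wiring):

```python
import sqlite3
import threading

# Shared-cache in-memory database: all connections using this URI see
# the same data. Django's SQLite backend uses a similar URI so test
# threads can share one in-memory database.
URI = "file:demo_memdb?mode=memory&cache=shared"

# Keep one connection open so the in-memory database isn't destroyed.
keep_alive = sqlite3.connect(URI, uri=True)
keep_alive.execute("CREATE TABLE object (id INTEGER PRIMARY KEY)")
keep_alive.commit()

def insert_row():
    # Each thread opens (and uses) its own connection to the shared DB.
    conn = sqlite3.connect(URI, uri=True)
    conn.execute("INSERT INTO object DEFAULT VALUES")
    conn.commit()
    conn.close()

# Run the two writers sequentially to keep the sketch free of lock
# contention; the point is cross-thread visibility, not concurrency.
for _ in range(2):
    t = threading.Thread(target=insert_row)
    t.start()
    t.join()

count = keep_alive.execute("SELECT COUNT(*) FROM object").fetchone()[0]
print(count)  # 2: both threads' inserts are visible to the main thread
```

If the worker-init clobbering described above replaced one of these connections with one pointing at a different (fresh) in-memory database, the main connection would see fewer rows, which matches the "1 != 2" assertion failure.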

Change History (10)

comment:1 by Natalia Bidart, 4 months ago

Triage Stage: Unreviewed → Accepted

Thank you!

comment:2 by Kundan Yadav, 4 months ago

Owner: set to Kundan Yadav
Status: new → assigned

comment:3 by Jacob Walls, 4 months ago

Keywords: spawn removed
Summary: SQLite threading tests are flaky when parallel test suite runs in forkserver/spawn → SQLite threading tests are flaky when parallel test suite runs in forkserver mode

I haven't verified that this affects "spawn", so retitling.

comment:4 by Jacob Walls, 2 months ago

I've seen this intermittently locally. Leaving aside the assertion failures for SQLite, we shouldn't have a hang in LiveServerTestCase when tests fail. You can engineer a hang like this, by setting a minuscule timeout that will always raise:

diff --git a/django/test/testcases.py b/django/test/testcases.py
index 5f83612fe5..622b938dd6 100644
--- a/django/test/testcases.py
+++ b/django/test/testcases.py
@@ -1844,7 +1844,8 @@ class LiveServerTestCase(TransactionTestCase):
         cls.addClassCleanup(cls._terminate_thread)
 
         # Wait for the live server to be ready
-        cls.server_thread.is_ready.wait()
+        if not cls.server_thread.is_ready.wait(timeout=0.001):
+            raise Exception("Live server never became ready.")
         if cls.server_thread.error:
             raise cls.server_thread.error

Then, when interrupting it with Ctrl-C (KeyboardInterrupt), you get a stack trace from doClassCleanups, suggesting that the termination code is waiting forever, even though the live server never started:

  File "/Library/Frameworks/Python.framework/Versions/3.14/lib/python3.14/unittest/suite.py", line 181, in _handleClassSetUp
    doClassCleanups()
    ~~~~~~~~~~~~~~~^^
  File "/Library/Frameworks/Python.framework/Versions/3.14/lib/python3.14/unittest/case.py", line 720, in doClassCleanups
    function(*args, **kwargs)
    ~~~~~~~~^^^^^^^^^^^^^^^^^
  File "/Users/jwalls/django/django/test/testcases.py", line 1864, in _terminate_thread
    cls.server_thread.terminate()
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~^^
  File "/Users/jwalls/django/django/test/testcases.py", line 1788, in terminate
    self.join()

Something like this fixes it:

  • django/test/testcases.py

    diff --git a/django/test/testcases.py b/django/test/testcases.py
    index 5f83612fe5..9cbeeeca25
    --- a/django/test/testcases.py
    +++ b/django/test/testcases.py
    @@ class LiveServerThread(threading.Thread):
             )

         def terminate(self):
    -        if hasattr(self, "httpd"):
    -            # Stop the WSGI server
    -            self.httpd.shutdown()
    -            self.httpd.server_close()
    -        self.join()
    +        if self.is_ready.is_set():
    +            if hasattr(self, "httpd"):
    +                # Stop the WSGI server
    +                self.httpd.shutdown()
    +                self.httpd.server_close()
    +            self.join()

My theory is that the "live server never became ready" situation I simulated above is similar to the situation we're seeing on CI where a database lock entails a failure to start a live server thread.
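The guard generalizes to any cleanup that must not join() a thread that never finished starting. A toy illustration of the pattern (class and attribute names are illustrative, not Django's actual LiveServerThread):

```python
import threading

class ServerThread(threading.Thread):
    def __init__(self, fail_startup=False):
        super().__init__(daemon=True)
        self.is_ready = threading.Event()
        self._stuck = threading.Event()
        self.fail_startup = fail_startup

    def run(self):
        if self.fail_startup:
            # Simulate a server blocked during startup: is_ready is
            # never set and the thread never exits on its own.
            self._stuck.wait()
            return
        self.is_ready.set()

    def terminate(self):
        # Only join() if the server actually became ready; otherwise
        # we would wait forever on a thread stuck in startup.
        if self.is_ready.is_set():
            self.join()

t = ServerThread(fail_startup=True)
t.start()
print(t.is_ready.wait(timeout=0.1))  # False: server never became ready
t.terminate()                        # returns immediately, no hang
print("terminated")
```

Without the is_ready guard, the final terminate() call would block forever on join(), which is the hang simulated with the tiny timeout above.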


Then, for one of the underlying assertion failures: I'm wary of masking a real problem, but we could probably reduce the chance of failing jobs by adjusting test_in_memory_database_lock() to use the "other" database instead of the default. It would still cover the code under test, just with a much smaller chance of interacting poorly with other tests.

comment:5 by Jacob Walls, 10 days ago

Has patch: set
Owner: changed from Kundan Yadav to Jacob Walls

in reply to comment:3: comment:6 by Jacob Walls, 10 days ago

Replying to Jacob Walls:

we could probably reduce the chance of failing jobs by adjusting test_in_memory_database_lock() to use the other database instead of the default.

As Simon surmised on the above PR, this didn't help: the PR still suffered occasional failures in this test after trying this trick, so I removed the speculative change.

comment:7 by Jacob Walls <jacobtylerwalls@…>, 10 days ago

In 6c9ef62:

Refs #36770 -- Guarded against an endless wait in LiveServerThread.terminate().

terminate() shouldn't assume the main server was started. (A deadlock
from mishandling of in-memory SQLite databases may have occurred.)

comment:8 by Jacob Walls <jacobtylerwalls@…>, 10 days ago

In 9c9a43b4:

Refs #36770 -- Preferred addCleanup() in live server tests.

comment:9 by Jacob Walls <jacobtylerwalls@…>, 10 days ago

In afa026c:

Refs #36770 -- Skipped test_in_memory_database_lock().

Skip pending some investigation.

comment:10 by Jacob Walls, 10 days ago

Has patch: unset
Owner: Jacob Walls removed
Status: assigned → new