Opened 3 years ago

Closed 3 years ago

Last modified 3 years ago

#32769 closed Bug (duplicate)

Fix for sporadically crashing parallel test runner

Reported by: Ceesjan Luiten Owned by: nobody
Component: Testing framework Version: 2.2
Severity: Normal Keywords:
Cc: Chris Jerdonek Triage Stage: Unreviewed
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description (last modified by Ceesjan Luiten)

G'day,

I'd like to report a fix for a bug I found with the parallel test runner. I'm filing it against 2.2 since that's what I use myself, but looking at the sourcecode 3.2 also has this bug.

Tickets that sound related but also not entirely the same: #27734

I have about 25 active Django projects, all in CI with testcases, some going back 10 years. About 6 months ago I started using --parallel to speed up testing. The speedup is great, obviously I had to fix some tests that were not compatible but that has been sorted out by now.

Since using --parallel I've also encountered crashes that really can't be blamed on my own code. They happen quite infrequent, so up to now I've been ignoring them. Today I ran into it again and decided to do some digging.

The crash

They always look like this:

Traceback (most recent call last):
  File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6/site-packages/django/db/backends/base/base.py", line 218, in ensure_connection
    self.connect()
  File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6/site-packages/django/db/backends/base/base.py", line 196, in connect
    self.connection = self.get_new_connection(conn_params)
  File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6/site-packages/django/db/backends/postgresql/base.py", line 182, in get_new_connection
    connection = Database.connect(**conn_params)
  File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6/site-packages/psycopg2/__init__.py", line 127, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: FATAL:  database "test_3" does not exist


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3.6/unittest/suite.py", line 163, in _handleClassSetUp
    setUpClass()
  File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6/site-packages/django/test/testcases.py", line 1126, in setUpClass
    cls.cls_atomics = cls._enter_atomics()
  File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6/site-packages/django/test/testcases.py", line 1107, in _enter_atomics
    atomics[db_name].__enter__()
  File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6/site-packages/django/db/transaction.py", line 175, in __enter__
    if not connection.get_autocommit():
  File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6/site-packages/django/db/backends/base/base.py", line 380, in get_autocommit
    self.ensure_connection()
  File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6/site-packages/django/db/backends/base/base.py", line 218, in ensure_connection
    self.connect()
  File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6/site-packages/django/db/utils.py", line 89, in __exit__
    raise dj_exc_value.with_traceback(traceback) from exc_value
  File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6/site-packages/django/db/backends/base/base.py", line 218, in ensure_connection
    self.connect()
  File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6/site-packages/django/db/backends/base/base.py", line 196, in connect
    self.connection = self.get_new_connection(conn_params)
  File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6/site-packages/django/db/backends/postgresql/base.py", line 182, in get_new_connection
    connection = Database.connect(**conn_params)
  File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6/site-packages/psycopg2/__init__.py", line 127, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
django.db.utils.OperationalError: FATAL:  database "test_3" does not exist

My observations

It has been happening across projects. Today I zoomed in on 1 specific project. The facts summarized:

  • My CPU has 4 cores (this is important for the story)
  • When I only test 2 separate classes at once I can trigger it once every 30 times
  • The crashes always mention the DBs test_3 or test_4, never test_1
  • The command line used: python manage.py test my_app_with_two_testcase_classes --parallel
  • I haven't set DJANGO_TEST_PROCESSES

I hacked the Django code with additional print() statements. Things I noticed:

  • Django starts out with assuming it can do 4 parallel processess, which is correct for my CPU
  • When checking the Postgresql DB log 3 DBs are made:
    $ grep 'CREATE DATABASE' postgres.log
    2021-05-19 14:29:11.473 CEST [23952] LOG:  duration: 216.274 ms  statement: CREATE DATABASE test ENCODING 'UNICODE';
    2021-05-19 14:29:12.918 CEST [24024] LOG:  duration: 226.477 ms  statement: CREATE DATABASE "test_1" WITH TEMPLATE "test"
    2021-05-19 14:29:13.147 CEST [24025] LOG:  duration: 225.881 ms  statement: CREATE DATABASE "test_2" WITH TEMPLATE "test"
    
  • db.backends.base.creation BaseDatabaseCreation.get_test_db_clone_settings is called four times, with suffix 1, 2, 3 and 4
  • test.runner _init_worker is called four times, with _worker_id 1, 2, 3 and 4.
  • db.backends.postgresql.creation DatabaseCreation._clone_test_db is called twice, always with test_1 and test_2 as DBs
  • db.backends.postgresql.creation DatabaseWrapper.get_connection_params is called twice:
    • most of time time with test_1 and test_2. In these cases the testcases do not crash
    • once in a while with fe. test_1 and test_3. In this case Django will crash

My fix

Based on the prints I'm assuming Django creates 4 processes, then only initializes 2 because it only needs 2, and as long as the first 2 processes get used everything works out but if, by randomness, either the 3rd or 4th process is activated things break?

Because it happens so infrequently I can't 100% guarantee my fix works, but it appears to: after applying my fix _init_worker is called only twice, get_test_db_clone_settings is called only twice, and I haven't seen it crash after 200 consecutive runs.

--- old_runner.py	2021-05-19 15:01:32.756268287 +0200
+++ new_runner.py	2021-05-19 15:01:18.844150140 +0200
@@ -541,6 +541,7 @@
             # basis, there's no need for more processes than TestCases.
             parallel_units = len(parallel_suite.subsuites)
             self.parallel = min(self.parallel, parallel_units)
+            parallel_suite.processes = self.parallel
 
             # If there's only one TestCase, parallelization isn't needed.
             if self.parallel > 1:

I can make a PR in Github if you want (against the main branch then?)

Change History (3)

comment:1 by Ceesjan Luiten, 3 years ago

Description: modified (diff)

comment:2 by Mariusz Felisiak, 3 years ago

Cc: Chris Jerdonek added
Resolution: duplicate
Status: newclosed

Thanks for the report. I think it's a duplicate of #27734, it's not an issue with the number of workers, but an issue with a shared counter (see comment). Reducing the number of workers just make it less likely. Can you reproduce it with Python 3.8+ and on the main branch? Django 2.2 is in extended support so it doesn't receive bugfixes anymore (except security patches).

I agree that we could reduce the number of workers but it's only a small cleanup that will not fix the main issue described in #27734. I would reduce the number of processes directly in the ParallelTestSuite, e.g.

diff --git a/django/test/runner.py b/django/test/runner.py
index ab5512e628..50f66d3d87 100644
--- a/django/test/runner.py
+++ b/django/test/runner.py
@@ -405,7 +405,9 @@ class ParallelTestSuite(unittest.TestSuite):
 
     def __init__(self, suite, processes, failfast=False, buffer=False):
         self.subsuites = partition_suite_by_case(suite)
-        self.processes = processes
+        # Since tests are distributed across processes on a per-TestCase basis,
+        # there's no need for more processes than TestCases.
+        self.processes = min(processes, len(self.subsuites))
         self.failfast = failfast
         self.buffer = buffer
         super().__init__()
@@ -662,14 +664,8 @@ class DiscoverRunner:
                 self.failfast,
                 self.buffer,
             )
-
-            # Since tests are distributed across processes on a per-TestCase
-            # basis, there's no need for more processes than TestCases.
-            parallel_units = len(parallel_suite.subsuites)
-            self.parallel = min(self.parallel, parallel_units)
-
             # If there's only one TestCase, parallelization isn't needed.
-            if self.parallel > 1:
+            if parallel_suite.processes > 1:
                 suite = parallel_suite
 
         return suite

Feel-free to prepare a patch with Refs #27734 -- ....

Duplicate of #27734.

comment:3 by Ceesjan Luiten, 3 years ago

Hey Mariusz,

Thanks for taking this up.

I've verified that using Python 3.8.5 and the main branch:

  • without the patch the crashes still occur
  • with the patch the crashes no longer occur

I see that with the patch it will create 4 DBs again, even though there are only 2 needed (since it can only run 2 classes in parallel). I'm fine with this: having Django not crash is far, far more important than the ~250ms spent on creating a not-needed DB.

I'll make a PR using your code.

Note: See TracTickets for help on using tickets.
Back to Top