﻿id	summary	reporter	owner	description	type	status	component	version	severity	resolution	keywords	cc	stage	has_patch	needs_docs	needs_tests	needs_better_patch	easy	ui_ux
32769	Fix for sporadically crashing parallel test runner	Ceesjan Luiten	nobody	"G'day,

I'd like to report a fix for a bug I found with the parallel test runner. I'm filing it against 2.2 since that's what I use myself, but looking at the sourcecode 3.2 also has this bug.

Tickets that sound related but also not entirely the same: #27734 

I have about 25 active Django projects, all in CI with testcases, some going back 10 years. About 6 months ago I started using `--parallel` to speed up testing. The speedup is great, obviously I had to fix some tests that were not compatible but that has been sorted out by now.

Since using `--parallel` I've also encountered crashes that really can't be blamed on my own code. They happen quite infrequent, so up to now I've been ignoring them. Today I ran into it again and decided to do some digging.

== The crash

They always look like this:

{{{
#!python
Traceback (most recent call last):
  File ""/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6/site-packages/django/db/backends/base/base.py"", line 218, in ensure_connection
    self.connect()
  File ""/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6/site-packages/django/db/backends/base/base.py"", line 196, in connect
    self.connection = self.get_new_connection(conn_params)
  File ""/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6/site-packages/django/db/backends/postgresql/base.py"", line 182, in get_new_connection
    connection = Database.connect(**conn_params)
  File ""/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6/site-packages/psycopg2/__init__.py"", line 127, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: FATAL:  database ""test_3"" does not exist


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File ""/usr/lib/python3.6/unittest/suite.py"", line 163, in _handleClassSetUp
    setUpClass()
  File ""/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6/site-packages/django/test/testcases.py"", line 1126, in setUpClass
    cls.cls_atomics = cls._enter_atomics()
  File ""/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6/site-packages/django/test/testcases.py"", line 1107, in _enter_atomics
    atomics[db_name].__enter__()
  File ""/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6/site-packages/django/db/transaction.py"", line 175, in __enter__
    if not connection.get_autocommit():
  File ""/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6/site-packages/django/db/backends/base/base.py"", line 380, in get_autocommit
    self.ensure_connection()
  File ""/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6/site-packages/django/db/backends/base/base.py"", line 218, in ensure_connection
    self.connect()
  File ""/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6/site-packages/django/db/utils.py"", line 89, in __exit__
    raise dj_exc_value.with_traceback(traceback) from exc_value
  File ""/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6/site-packages/django/db/backends/base/base.py"", line 218, in ensure_connection
    self.connect()
  File ""/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6/site-packages/django/db/backends/base/base.py"", line 196, in connect
    self.connection = self.get_new_connection(conn_params)
  File ""/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6/site-packages/django/db/backends/postgresql/base.py"", line 182, in get_new_connection
    connection = Database.connect(**conn_params)
  File ""/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6/site-packages/psycopg2/__init__.py"", line 127, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
django.db.utils.OperationalError: FATAL:  database ""test_3"" does not exist
}}}

== My observations

It has been happening across projects. Today I zoomed in on 1 specific project. The facts summarized:

* My CPU has 4 cores (this is important for the story)
* When I only test 2 separate classes at once I can trigger it once every 30 times
* The crashes always mention the DBs `test_3` or `test_4`, never `test_1`
* The command line used: `python manage.py test my_app_with_two_testcase_classes --parallel`
* I haven't set `DJANGO_TEST_PROCESSES`

I hacked the Django code with additional `print()` statements. Things I noticed:
* Django starts out with assuming it can do 4 parallel processess, which is correct for my CPU
* When checking the Postgresql DB log 3 DBs are made:
{{{
$ grep 'CREATE DATABASE' postgres.log
2021-05-19 14:29:11.473 CEST [23952] LOG:  duration: 216.274 ms  statement: CREATE DATABASE test ENCODING 'UNICODE';
2021-05-19 14:29:12.918 CEST [24024] LOG:  duration: 226.477 ms  statement: CREATE DATABASE ""test_1"" WITH TEMPLATE ""test""
2021-05-19 14:29:13.147 CEST [24025] LOG:  duration: 225.881 ms  statement: CREATE DATABASE ""test_2"" WITH TEMPLATE ""test""
}}}
* `db.backends.base.creation` `BaseDatabaseCreation.get_test_db_clone_settings` is called four times, with suffix `1`, `2`, `3` and `4`
* `test.runner` `_init_worker` is called four times, with `_worker_id` `1`, `2`, `3` and `4`.
* `db.backends.postgresql.creation` `DatabaseCreation._clone_test_db` is called twice, always with `test_1` and `test_2` as DBs
* `db.backends.postgresql.creation` `DatabaseWrapper.get_connection_params` is called twice:
  * most of time time with `test_1` and `test_2`. In these cases the testcases do not crash
  * once in a while with fe. `test_1` and `test_3`. In this case Django will crash

== My fix

Based on the prints I'm assuming Django creates 4 processes, then only initializes 2 because it only needs 2, and as long as the first 2 processes get used everything works out but if, by randomness, either the 3rd or 4th process is activated things break?

Because it happens so infrequently I can't 100% guarantee my fix works, but it appears to: after applying my fix `_init_worker` is called only twice, `get_test_db_clone_settings` is called only twice, and I haven't seen it crash after 200 consecutive runs.

{{{
--- old_runner.py	2021-05-19 15:01:32.756268287 +0200
+++ new_runner.py	2021-05-19 15:01:18.844150140 +0200
@@ -541,6 +541,7 @@
             # basis, there's no need for more processes than TestCases.
             parallel_units = len(parallel_suite.subsuites)
             self.parallel = min(self.parallel, parallel_units)
+            parallel_suite.processes = self.parallel
 
             # If there's only one TestCase, parallelization isn't needed.
             if self.parallel > 1:
}}}

I can make a PR in Github if you want (against the main branch then?)"	Bug	closed	Testing framework	2.2	Normal	duplicate		Chris Jerdonek	Unreviewed	0	0	0	0	0	0
