Opened 4 years ago

Last modified 4 years ago

#32769 closed Bug

Fix for sporadically crashing parallel test runner — at Initial Version

Reported by: Ceesjan Luiten Owned by: nobody
Component: Testing framework Version: 2.2
Severity: Normal Keywords:
Cc: Chris Jerdonek Triage Stage: Unreviewed
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

G'day,

I'd like to report a fix for a bug I found with the parallel test runner. I'm filing it against 2.2 since that's what I use myself, but looking at the sourcecode 3.2 also has this bug.

Tickets that sound related but also not entirely the same: #27734

I have about 25 active Django projects, all in CI with testcases, some going back 10 years. About 6 months ago I started using --parallel to speed up testing. The speedup is great, obviously I had to fix some tests that were not compatible but that has been sorted out by now.

Since using --parallel I've also encountered crashes that really can't be blamed on my own code. They happen quite infrequent, so up to now I've been ignoring them. Today I ran into it again and decided to do some digging.

The crash

They always look like this:

Traceback (most recent call last):
  File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6/site-packages/django/db/backends/base/base.py", line 218, in ensure_connection
    self.connect()
  File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6/site-packages/django/db/backends/base/base.py", line 196, in connect
    self.connection = self.get_new_connection(conn_params)
  File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6/site-packages/django/db/backends/postgresql/base.py", line 182, in get_new_connection
    connection = Database.connect(**conn_params)
  File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6/site-packages/psycopg2/__init__.py", line 127, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: FATAL:  database "test_3" does not exist


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3.6/unittest/suite.py", line 163, in _handleClassSetUp
    setUpClass()
  File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6/site-packages/django/test/testcases.py", line 1126, in setUpClass
    cls.cls_atomics = cls._enter_atomics()
  File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6/site-packages/django/test/testcases.py", line 1107, in _enter_atomics
    atomics[db_name].__enter__()
  File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6/site-packages/django/db/transaction.py", line 175, in __enter__
    if not connection.get_autocommit():
  File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6/site-packages/django/db/backends/base/base.py", line 380, in get_autocommit
    self.ensure_connection()
  File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6/site-packages/django/db/backends/base/base.py", line 218, in ensure_connection
    self.connect()
  File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6/site-packages/django/db/utils.py", line 89, in __exit__
    raise dj_exc_value.with_traceback(traceback) from exc_value
  File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6/site-packages/django/db/backends/base/base.py", line 218, in ensure_connection
    self.connect()
  File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6/site-packages/django/db/backends/base/base.py", line 196, in connect
    self.connection = self.get_new_connection(conn_params)
  File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6/site-packages/django/db/backends/postgresql/base.py", line 182, in get_new_connection
    connection = Database.connect(**conn_params)
  File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6/site-packages/psycopg2/__init__.py", line 127, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
django.db.utils.OperationalError: FATAL:  database "test_3" does not exist

My observations

It has been happening across projects. Today I zoomed in on 1 specific project. The facts summarized:

  • My CPU has 4 cores (this is important for the story)
  • When I only test 2 separate classes at once I can trigger it once every 30 times
  • The crashes always mention the DBs test_3 or test_4, never test_1
  • The precise command line: python test my_app_with_two_testcase_classes --parallel
  • I haven't set DJANGO_TEST_PROCESSES

I hacked the Django code with additional print() statements. Things I noticed:

  • Django starts out with assuming it can do 4 parallel processess, which is correct for my CPU
  • When checking the Postgresql DB log 3 DBs are made:
    $ grep 'CREATE DATABASE' postgres.log
    2021-05-19 14:29:11.473 CEST [23952] LOG:  duration: 216.274 ms  statement: CREATE DATABASE test ENCODING 'UNICODE';
    2021-05-19 14:29:12.918 CEST [24024] LOG:  duration: 226.477 ms  statement: CREATE DATABASE "test_1" WITH TEMPLATE "test"
    2021-05-19 14:29:13.147 CEST [24025] LOG:  duration: 225.881 ms  statement: CREATE DATABASE "test_2" WITH TEMPLATE "test"
    
  • db.backends.base.creation BaseDatabaseCreation.get_test_db_clone_settings is called four times, with suffix 1, 2, 3 and 4
  • test.runner _init_worker is called four times, with _worker_id 1, 2, 3 and 4.
  • db.backends.postgresql.creation DatabaseCreation._clone_test_db is called twice, always with test_1 and test_2 as DBs
  • db.backends.postgresql.creation DatabaseWrapper.get_connection_params is called twice:
    • most of time time with test_1 and test_2. In these cases the testcases do not crash
    • once in a while with fe. test_1 and test_3. In this case Django will crash

My fix

Based on the prints I'm assuming Django creates 4 processes, then only initializes 2 because it only needs 2, and as long as the first 2 processes get used everything works out but if, by randomness, either the 3rd or 4th process is activated things break?

Because it happens so infrequently I can't 100% guarantee my fix works, but it appears to: after applying my fix _init_worker is called only twice, get_test_db_clone_settings is called only twice, and I haven't seen it crash after 200 consecutive runs.

--- old_runner.py	2021-05-19 15:01:32.756268287 +0200
+++ new_runner.py	2021-05-19 15:01:18.844150140 +0200
@@ -541,6 +541,7 @@
             # basis, there's no need for more processes than TestCases.
             parallel_units = len(parallel_suite.subsuites)
             self.parallel = min(self.parallel, parallel_units)
+            parallel_suite.processes = self.parallel
 
             # If there's only one TestCase, parallelization isn't needed.
             if self.parallel > 1:

I can make a PR in Github if you want (against the main branch then?)

Change History (0)

Note: See TracTickets for help on using tickets.
Back to Top