#32769 closed Bug (duplicate)
Fix for sporadically crashing parallel test runner
| Reported by: | Ceesjan Luiten | Owned by: | nobody |
|---|---|---|---|
| Component: | Testing framework | Version: | 2.2 |
| Severity: | Normal | Keywords: | |
| Cc: | Chris Jerdonek | Triage Stage: | Unreviewed |
| Has patch: | no | Needs documentation: | no |
| Needs tests: | no | Patch needs improvement: | no |
| Easy pickings: | no | UI/UX: | no |
Description (last modified by )
G'day,
I'd like to report a fix for a bug I found with the parallel test runner. I'm filing it against 2.2 since that's what I use myself, but looking at the sourcecode 3.2 also has this bug.
Tickets that sound related but also not entirely the same: #27734
I have about 25 active Django projects, all in CI with testcases, some going back 10 years. About 6 months ago I started using --parallel to speed up testing. The speedup is great, obviously I had to fix some tests that were not compatible but that has been sorted out by now.
Since using --parallel I've also encountered crashes that really can't be blamed on my own code. They happen quite infrequent, so up to now I've been ignoring them. Today I ran into it again and decided to do some digging.
The crash
They always look like this:
Traceback (most recent call last): File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6/site-packages/django/db/backends/base/base.py", line 218, in ensure_connection self.connect() File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6/site-packages/django/db/backends/base/base.py", line 196, in connect self.connection = self.get_new_connection(conn_params) File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6/site-packages/django/db/backends/postgresql/base.py", line 182, in get_new_connection connection = Database.connect(**conn_params) File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6/site-packages/psycopg2/__init__.py", line 127, in connect conn = _connect(dsn, connection_factory=connection_factory, **kwasync) psycopg2.OperationalError: FATAL: database "test_3" does not exist The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/usr/lib/python3.6/unittest/suite.py", line 163, in _handleClassSetUp setUpClass() File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6/site-packages/django/test/testcases.py", line 1126, in setUpClass cls.cls_atomics = cls._enter_atomics() File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6/site-packages/django/test/testcases.py", line 1107, in _enter_atomics atomics[db_name].__enter__() File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6/site-packages/django/db/transaction.py", line 175, in __enter__ if not connection.get_autocommit(): File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6/site-packages/django/db/backends/base/base.py", line 380, in get_autocommit self.ensure_connection() File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6/site-packages/django/db/backends/base/base.py", line 218, in ensure_connection self.connect() File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6/site-packages/django/db/utils.py", line 89, in __exit__ raise dj_exc_value.with_traceback(traceback) from exc_value File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6/site-packages/django/db/backends/base/base.py", line 218, in ensure_connection self.connect() File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6/site-packages/django/db/backends/base/base.py", line 196, in connect self.connection = self.get_new_connection(conn_params) File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6/site-packages/django/db/backends/postgresql/base.py", line 182, in get_new_connection connection = Database.connect(**conn_params) File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6/site-packages/psycopg2/__init__.py", line 127, in connect conn = _connect(dsn, connection_factory=connection_factory, **kwasync) django.db.utils.OperationalError: FATAL: database "test_3" does not exist
My observations
It has been happening across projects. Today I zoomed in on 1 specific project. The facts summarized:
- My CPU has 4 cores (this is important for the story)
- When I only test 2 separate classes at once I can trigger it once every 30 times
- The crashes always mention the DBs
test_3ortest_4, nevertest_1 - The command line used:
python manage.py test my_app_with_two_testcase_classes --parallel - I haven't set
DJANGO_TEST_PROCESSES
I hacked the Django code with additional print() statements. Things I noticed:
- Django starts out with assuming it can do 4 parallel processess, which is correct for my CPU
- When checking the Postgresql DB log 3 DBs are made:
$ grep 'CREATE DATABASE' postgres.log 2021-05-19 14:29:11.473 CEST [23952] LOG: duration: 216.274 ms statement: CREATE DATABASE test ENCODING 'UNICODE'; 2021-05-19 14:29:12.918 CEST [24024] LOG: duration: 226.477 ms statement: CREATE DATABASE "test_1" WITH TEMPLATE "test" 2021-05-19 14:29:13.147 CEST [24025] LOG: duration: 225.881 ms statement: CREATE DATABASE "test_2" WITH TEMPLATE "test"
db.backends.base.creationBaseDatabaseCreation.get_test_db_clone_settingsis called four times, with suffix1,2,3and4test.runner_init_workeris called four times, with_worker_id1,2,3and4.db.backends.postgresql.creationDatabaseCreation._clone_test_dbis called twice, always withtest_1andtest_2as DBsdb.backends.postgresql.creationDatabaseWrapper.get_connection_paramsis called twice:- most of time time with
test_1andtest_2. In these cases the testcases do not crash - once in a while with fe.
test_1andtest_3. In this case Django will crash
- most of time time with
My fix
Based on the prints I'm assuming Django creates 4 processes, then only initializes 2 because it only needs 2, and as long as the first 2 processes get used everything works out but if, by randomness, either the 3rd or 4th process is activated things break?
Because it happens so infrequently I can't 100% guarantee my fix works, but it appears to: after applying my fix _init_worker is called only twice, get_test_db_clone_settings is called only twice, and I haven't seen it crash after 200 consecutive runs.
--- old_runner.py 2021-05-19 15:01:32.756268287 +0200
+++ new_runner.py 2021-05-19 15:01:18.844150140 +0200
@@ -541,6 +541,7 @@
# basis, there's no need for more processes than TestCases.
parallel_units = len(parallel_suite.subsuites)
self.parallel = min(self.parallel, parallel_units)
+ parallel_suite.processes = self.parallel
# If there's only one TestCase, parallelization isn't needed.
if self.parallel > 1:
I can make a PR in Github if you want (against the main branch then?)
Change History (3)
comment:1 by , 4 years ago
| Description: | modified (diff) |
|---|
comment:2 by , 4 years ago
| Cc: | added |
|---|---|
| Resolution: | → duplicate |
| Status: | new → closed |
comment:3 by , 4 years ago
Hey Mariusz,
Thanks for taking this up.
I've verified that using Python 3.8.5 and the main branch:
- without the patch the crashes still occur
- with the patch the crashes no longer occur
I see that with the patch it will create 4 DBs again, even though there are only 2 needed (since it can only run 2 classes in parallel). I'm fine with this: having Django not crash is far, far more important than the ~250ms spent on creating a not-needed DB.
I'll make a PR using your code.
Thanks for the report. I think it's a duplicate of #27734, it's not an issue with the number of workers, but an issue with a shared counter (see comment). Reducing the number of workers just make it less likely. Can you reproduce it with Python 3.8+ and on the main branch? Django 2.2 is in extended support so it doesn't receive bugfixes anymore (except security patches).
I agree that we could reduce the number of workers but it's only a small cleanup that will not fix the main issue described in #27734. I would reduce the number of processes directly in the
ParallelTestSuite, e.g.diff --git a/django/test/runner.py b/django/test/runner.py index ab5512e628..50f66d3d87 100644 --- a/django/test/runner.py +++ b/django/test/runner.py @@ -405,7 +405,9 @@ class ParallelTestSuite(unittest.TestSuite): def __init__(self, suite, processes, failfast=False, buffer=False): self.subsuites = partition_suite_by_case(suite) - self.processes = processes + # Since tests are distributed across processes on a per-TestCase basis, + # there's no need for more processes than TestCases. + self.processes = min(processes, len(self.subsuites)) self.failfast = failfast self.buffer = buffer super().__init__() @@ -662,14 +664,8 @@ class DiscoverRunner: self.failfast, self.buffer, ) - - # Since tests are distributed across processes on a per-TestCase - # basis, there's no need for more processes than TestCases. - parallel_units = len(parallel_suite.subsuites) - self.parallel = min(self.parallel, parallel_units) - # If there's only one TestCase, parallelization isn't needed. - if self.parallel > 1: + if parallel_suite.processes > 1: suite = parallel_suite return suiteFeel-free to prepare a patch with
Refs #27734 -- ....Duplicate of #27734.