#32769 closed Bug (duplicate)
Fix for sporadically crashing parallel test runner
Reported by: | Ceesjan Luiten | Owned by: | nobody |
---|---|---|---|
Component: | Testing framework | Version: | 2.2 |
Severity: | Normal | Keywords: | |
Cc: | Chris Jerdonek | Triage Stage: | Unreviewed |
Has patch: | no | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description (last modified by )
G'day,
I'd like to report a fix for a bug I found with the parallel test runner. I'm filing it against 2.2 since that's what I use myself, but looking at the sourcecode 3.2 also has this bug.
Tickets that sound related but also not entirely the same: #27734
I have about 25 active Django projects, all in CI with testcases, some going back 10 years. About 6 months ago I started using --parallel
to speed up testing. The speedup is great, obviously I had to fix some tests that were not compatible but that has been sorted out by now.
Since using --parallel
I've also encountered crashes that really can't be blamed on my own code. They happen quite infrequent, so up to now I've been ignoring them. Today I ran into it again and decided to do some digging.
The crash
They always look like this:
Traceback (most recent call last): File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6/site-packages/django/db/backends/base/base.py", line 218, in ensure_connection self.connect() File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6/site-packages/django/db/backends/base/base.py", line 196, in connect self.connection = self.get_new_connection(conn_params) File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6/site-packages/django/db/backends/postgresql/base.py", line 182, in get_new_connection connection = Database.connect(**conn_params) File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6/site-packages/psycopg2/__init__.py", line 127, in connect conn = _connect(dsn, connection_factory=connection_factory, **kwasync) psycopg2.OperationalError: FATAL: database "test_3" does not exist The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/usr/lib/python3.6/unittest/suite.py", line 163, in _handleClassSetUp setUpClass() File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6/site-packages/django/test/testcases.py", line 1126, in setUpClass cls.cls_atomics = cls._enter_atomics() File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6/site-packages/django/test/testcases.py", line 1107, in _enter_atomics atomics[db_name].__enter__() File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6/site-packages/django/db/transaction.py", line 175, in __enter__ if not connection.get_autocommit(): File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6/site-packages/django/db/backends/base/base.py", line 380, in get_autocommit self.ensure_connection() File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6/site-packages/django/db/backends/base/base.py", line 218, in ensure_connection self.connect() File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6/site-packages/django/db/utils.py", line 89, in __exit__ raise dj_exc_value.with_traceback(traceback) from exc_value File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6/site-packages/django/db/backends/base/base.py", line 218, in ensure_connection self.connect() File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6/site-packages/django/db/backends/base/base.py", line 196, in connect self.connection = self.get_new_connection(conn_params) File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6/site-packages/django/db/backends/postgresql/base.py", line 182, in get_new_connection connection = Database.connect(**conn_params) File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6/site-packages/psycopg2/__init__.py", line 127, in connect conn = _connect(dsn, connection_factory=connection_factory, **kwasync) django.db.utils.OperationalError: FATAL: database "test_3" does not exist
My observations
It has been happening across projects. Today I zoomed in on 1 specific project. The facts summarized:
- My CPU has 4 cores (this is important for the story)
- When I only test 2 separate classes at once I can trigger it once every 30 times
- The crashes always mention the DBs
test_3
ortest_4
, nevertest_1
- The command line used:
python manage.py test my_app_with_two_testcase_classes --parallel
- I haven't set
DJANGO_TEST_PROCESSES
I hacked the Django code with additional print()
statements. Things I noticed:
- Django starts out with assuming it can do 4 parallel processess, which is correct for my CPU
- When checking the Postgresql DB log 3 DBs are made:
$ grep 'CREATE DATABASE' postgres.log 2021-05-19 14:29:11.473 CEST [23952] LOG: duration: 216.274 ms statement: CREATE DATABASE test ENCODING 'UNICODE'; 2021-05-19 14:29:12.918 CEST [24024] LOG: duration: 226.477 ms statement: CREATE DATABASE "test_1" WITH TEMPLATE "test" 2021-05-19 14:29:13.147 CEST [24025] LOG: duration: 225.881 ms statement: CREATE DATABASE "test_2" WITH TEMPLATE "test"
db.backends.base.creation
BaseDatabaseCreation.get_test_db_clone_settings
is called four times, with suffix1
,2
,3
and4
test.runner
_init_worker
is called four times, with_worker_id
1
,2
,3
and4
.db.backends.postgresql.creation
DatabaseCreation._clone_test_db
is called twice, always withtest_1
andtest_2
as DBsdb.backends.postgresql.creation
DatabaseWrapper.get_connection_params
is called twice:- most of time time with
test_1
andtest_2
. In these cases the testcases do not crash - once in a while with fe.
test_1
andtest_3
. In this case Django will crash
- most of time time with
My fix
Based on the prints I'm assuming Django creates 4 processes, then only initializes 2 because it only needs 2, and as long as the first 2 processes get used everything works out but if, by randomness, either the 3rd or 4th process is activated things break?
Because it happens so infrequently I can't 100% guarantee my fix works, but it appears to: after applying my fix _init_worker
is called only twice, get_test_db_clone_settings
is called only twice, and I haven't seen it crash after 200 consecutive runs.
--- old_runner.py 2021-05-19 15:01:32.756268287 +0200 +++ new_runner.py 2021-05-19 15:01:18.844150140 +0200 @@ -541,6 +541,7 @@ # basis, there's no need for more processes than TestCases. parallel_units = len(parallel_suite.subsuites) self.parallel = min(self.parallel, parallel_units) + parallel_suite.processes = self.parallel # If there's only one TestCase, parallelization isn't needed. if self.parallel > 1:
I can make a PR in Github if you want (against the main branch then?)
Change History (3)
comment:1 by , 3 years ago
Description: | modified (diff) |
---|
comment:2 by , 3 years ago
Cc: | added |
---|---|
Resolution: | → duplicate |
Status: | new → closed |
comment:3 by , 3 years ago
Hey Mariusz,
Thanks for taking this up.
I've verified that using Python 3.8.5 and the main branch:
- without the patch the crashes still occur
- with the patch the crashes no longer occur
I see that with the patch it will create 4 DBs again, even though there are only 2 needed (since it can only run 2 classes in parallel). I'm fine with this: having Django not crash is far, far more important than the ~250ms spent on creating a not-needed DB.
I'll make a PR using your code.
Thanks for the report. I think it's a duplicate of #27734, it's not an issue with the number of workers, but an issue with a shared counter (see comment). Reducing the number of workers just make it less likely. Can you reproduce it with Python 3.8+ and on the main branch? Django 2.2 is in extended support so it doesn't receive bugfixes anymore (except security patches).
I agree that we could reduce the number of workers but it's only a small cleanup that will not fix the main issue described in #27734. I would reduce the number of processes directly in the
ParallelTestSuite
, e.g.Feel-free to prepare a patch with
Refs #27734 -- ...
.Duplicate of #27734.