Opened 9 years ago

Last modified 73 minutes ago

#27734 new Bug

Parallel test workers can be assigned an unexpected index when recovering from errors

Reported by: Adam Wróbel Owned by: nobody
Component: Testing framework Version: 1.10
Severity: Normal Keywords: parallel
Cc: Chris Jerdonek, Aarav Sharma Triage Stage: Accepted
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

I've had this problem since upgrading to Python 3.6 on macOS Sierra. When I run my tests in parallel, many of them blow up trying to use a database whose index is higher than the number of parallel test processes. On my MacBook the default process count is 4 and tests fail trying to connect to test_my_db_5; on my iMac the default is 8 and tests fail connecting to test_my_db_9. The same thing happens if I set the number of processes explicitly, e.g. --parallel=2. A higher number of processes yields fewer failures, and if I set the process count higher than the number of my test files, the tests succeed.

Traceback (most recent call last):
  File "/Users/amw/.virtualenvs/envname/lib/python3.6/site-packages/django/test/testcases.py", line 216, in __call__
    self._post_teardown()
  File "/Users/amw/.virtualenvs/envname/lib/python3.6/site-packages/django/test/testcases.py", line 908, in _post_teardown
    self._fixture_teardown()
  File "/Users/amw/.virtualenvs/envname/lib/python3.6/site-packages/django/test/testcases.py", line 943, in _fixture_teardown
    inhibit_post_migrate=inhibit_post_migrate)
  File "/Users/amw/.virtualenvs/envname/lib/python3.6/site-packages/django/core/management/__init__.py", line 130, in call_command
    return command.execute(*args, **defaults)
  File "/Users/amw/.virtualenvs/envname/lib/python3.6/site-packages/django/core/management/base.py", line 345, in execute
    output = self.handle(*args, **options)
  File "/Users/amw/.virtualenvs/envname/lib/python3.6/site-packages/django/core/management/commands/flush.py", line 54, in handle
    allow_cascade=allow_cascade)
  File "/Users/amw/.virtualenvs/envname/lib/python3.6/site-packages/django/core/management/sql.py", line 15, in sql_flush
    tables = connection.introspection.django_table_names(only_existing=True, include_views=False)
  File "/Users/amw/.virtualenvs/envname/lib/python3.6/site-packages/django/db/backends/base/introspection.py", line 88, in django_table_names
    existing_tables = self.table_names(include_views=include_views)
  File "/Users/amw/.virtualenvs/envname/lib/python3.6/site-packages/django/db/backends/base/introspection.py", line 55, in table_names
    with self.connection.cursor() as cursor:
  File "/Users/amw/.virtualenvs/envname/lib/python3.6/site-packages/django/db/backends/base/base.py", line 233, in cursor
    cursor = self.make_cursor(self._cursor())
  File "/Users/amw/.virtualenvs/envname/lib/python3.6/site-packages/django/db/backends/base/base.py", line 204, in _cursor
    self.ensure_connection()
  File "/Users/amw/.virtualenvs/envname/lib/python3.6/site-packages/django/db/backends/base/base.py", line 199, in ensure_connection
    self.connect()
  File "/Users/amw/.virtualenvs/envname/lib/python3.6/site-packages/django/db/utils.py", line 94, in __exit__
    six.reraise(dj_exc_type, dj_exc_value, traceback)
  File "/Users/amw/.virtualenvs/envname/lib/python3.6/site-packages/django/utils/six.py", line 685, in reraise
    raise value.with_traceback(tb)
  File "/Users/amw/.virtualenvs/envname/lib/python3.6/site-packages/django/db/backends/base/base.py", line 199, in ensure_connection
    self.connect()
  File "/Users/amw/.virtualenvs/envname/lib/python3.6/site-packages/django/db/backends/base/base.py", line 171, in connect
    self.connection = self.get_new_connection(conn_params)
  File "/Users/amw/.virtualenvs/envname/lib/python3.6/site-packages/django/db/backends/postgresql/base.py", line 176, in get_new_connection
    connection = Database.connect(**conn_params)
  File "/Users/amw/.virtualenvs/envname/lib/python3.6/site-packages/psycopg2/__init__.py", line 164, in connect
    conn = _connect(dsn, connection_factory=connection_factory, async=async)
django.db.utils.OperationalError: FATAL:  database "test_my_db_5" does not exist

My requirements.txt file:

boto3==1.4.3
daemonize==2.4.7
Django==1.10.5
django-appconf==1.0.2
django-compressor==2.1
django-debug-toolbar==1.6
django-dotenv==1.4.1
django-extensions==1.7.5
django-ipware==1.1.6
djangorestframework==3.5.3
filemagic==1.6
inflection==0.3.1
jmespath==0.9.0
psycopg2==2.6.2
python-dateutil==2.6.0
rcssmin==1.0.6
rjsmin==1.0.12
s3transfer==0.1.10
six==1.10.0
sqlparse==0.2.2
uWSGI==2.0.14

Change History (15)

comment:1 by Tim Graham, 9 years ago

I haven't seen that issue in my Python 3.6 testing. Could you investigate and tell us why Django is at fault?

comment:2 by Adam Wróbel, 9 years ago

I don't know if it's Django's fault, but I do know that it's Django that creates the test databases and configures the connection parameters. If you look at the traceback, you'll see that the only other Python package in it is psycopg2.

The parallel nature of this problem makes it hard to investigate with my debugger. The exception occurs in the child process, but I don't know where the incorrect information is generated. I was hoping for a hint about where to start the search.

comment:3 by Tim Graham, 9 years ago

The parallel test runner is in django/test/runner.py. Can you provide a small sample project (without any third-party dependencies) that reproduces the issue?

comment:4 by Tim Graham, 9 years ago

Resolution: needsinfo
Status: new → closed

comment:5 by David Breeden, 8 years ago

I'm having this problem. It's not Django's fault, although the error reporting could be better.

--parallel creates N databases for the N workers to use. Each worker claims the database it will use by incrementing a shared counter. When the multiprocessing.Pool is initialized with N processes, the workers are assigned databases 1 through N.

The problem happens when a worker exits. Normally, the workers in a pool will stay alive for the whole life of the pool. In my case, workers are exiting because of seg faults. When this happens, the pool repopulates with new workers. When the first new worker is initialized, it increments the counter to N + 1 and tries to use that database, and this error results.

Only some of my tests cause the seg fault, so if I run --parallel with other tests, it works fine.

It would have saved some time diagnosing the issue if Django raised an error when a worker incremented the counter over N, which seems like a reasonable thing for the initializer to do.
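The counter overrun described above can be reproduced with a plain multiprocessing.Pool — a minimal sketch, not Django's actual code; init_worker and which_worker are illustrative stand-ins for the runner's worker initialization:

```python
import multiprocessing

# Worker-local ID assigned by the pool initializer, loosely mirroring
# how Django's parallel runner picks a test database per worker.
_worker_id = 0

def init_worker(counter):
    """Each newly spawned worker claims the next ID from a shared counter."""
    global _worker_id
    with counter.get_lock():
        counter.value += 1
        _worker_id = counter.value  # would map to test_my_db_<_worker_id>

def which_worker(_task):
    return _worker_id

# "fork" keeps this single-file demo simple; macOS defaults to "spawn",
# which would require an `if __name__ == "__main__"` guard.
ctx = multiprocessing.get_context("fork")
n = 2
counter = ctx.Value("i", 0)
# maxtasksperchild=1 forces every worker to exit after one task, standing
# in for a crashed worker; each replacement bumps the counter further.
with ctx.Pool(n, initializer=init_worker, initargs=(counter,),
              maxtasksperchild=1) as pool:
    ids = sorted(pool.map(which_worker, range(4)))
print(ids)  # e.g. [1, 2, 3, 4]: IDs above n=2 have no matching database
```

With only n databases created up front, any worker whose ID exceeds n fails exactly as in the reported traceback.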

comment:6 by Aymeric Augustin, 8 years ago

Resolution: needsinfo (removed)
Status: closed → new
Triage Stage: Unreviewed → Accepted

I agree it would be nice to have a better error message in this case.

comment:7 by Tim Graham, 8 years ago

Summary: Parallel tests try to use database with too high indices → Add a helpful error message when a parallel test worker is assigned an unexpected index
Type: Bug → Cleanup/optimization

comment:8 by Adam Wróbel, 8 years ago

I don't think my issue was caused by segfaulting processes, but something certainly caused a worker process to be restarted. A better error message might be nice to have, but it won't help unless it also helps determine what caused the segfault/restart.

Maybe there would be a way to reuse a connection of a dead process?

comment:9 by Hameed Gifford, 6 years ago

To anyone else who's still having this problem: it seems to be specific to macOS, or at least this cause and fix are.

https://stackoverflow.com/questions/50168647/multiprocessing-causes-python-to-crash-and-gives-an-error-may-have-been-in-progr

Putting this somewhere in your code that runs before the test runner is initialized (settings worked well for me) fixes the issue:

import os
import platform
import sys

# Disable the macOS Objective-C runtime's fork-safety check, which can
# kill forked worker processes.
if 'test' in sys.argv and platform.system().lower() == 'darwin':
    os.environ.setdefault('OBJC_DISABLE_INITIALIZE_FORK_SAFETY', 'YES')

comment:10 by Mariusz Felisiak, 5 years ago

#32769 was closed as a duplicate, see comment.

comment:11 by Chris Jerdonek, 5 years ago

Cc: Chris Jerdonek added

comment:12 by Mariusz Felisiak <felisiak.mariusz@…>, 5 years ago

In cb6c197:

Refs #27734 -- Prevented creation of more parallel workers than TestCases.

The parallel test runner uses multiple workers to distribute the
workload. These workers are assigned a worker ID using a globally
incremented variable, which determines what test database to connect
to. When the worker ID surpasses the test database IDs Django will
crash.

This reduces the likelihood of crashing parallel tests because
ParallelTestSuite will no longer create more workers than TestCases.

It won't eliminate the problem completely though because there are
other circumstances in which new workers can be created which can then
be assigned an "illegal" worker ID.

comment:13 by Aarav Sharma, 7 hours ago

A patch has been submitted for this issue:
https://github.com/django/django/pull/20971
This improves the error message shown when a parallel test worker fails to initialize.

comment:14 by Aarav Sharma, 7 hours ago

Cc: Aarav Sharma added
Has patch: set

comment:15 by Jacob Walls, 73 minutes ago

Has patch: unset
Summary: Add a helpful error message when a parallel test worker is assigned an unexpected index → Parallel test workers can be assigned an unexpected index when recovering from errors
Type: Cleanup/optimization → Bug

Thanks Aarav Sharma for alerting me that we had an issue for this. I've been meaning to open this exact issue for a while.

It's easiest to simulate the occasional failure described here by just forcing workers to respawn with maxtasksperchild:

  • django/test/runner.py

    diff --git a/django/test/runner.py b/django/test/runner.py
    index 56eda9d98d..ddbbeee510 100644
    --- a/django/test/runner.py
    +++ b/django/test/runner.py
    @@ class ParallelTestSuite(unittest.TestSuite):
                     self.debug_mode,
                     self.used_aliases,
                 ],
    +            # For testing only,
    +            # Emulate the occasional failures that happen (under resource contention?)
    +            maxtasksperchild=1,
             ) as pool:
                 test_results = pool.imap_unordered(self.run_subsuite.__func__, args)

Then run a subsuite of tests that make heavy use of the database:

./runtests.py queries

  File "/Users/jwalls/django/django/db/backends/sqlite3/creation.py", line 146, in setup_worker_connection
    source_db = self.connection.Database.connect(
        f"file:{source_db_name}?mode=ro", uri=True
    )
sqlite3.OperationalError: unable to open database file

I'd like to reframe this ticket to actually fix the underlying issue instead of adjusting the error message. We could do that by making the database clones a pool that workers pull from, instead of trying to map ephemeral workers 1:1 to fixed resources. All of this functionality in the parallel runner is private, so we shouldn't be shy about fixing the design.
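A rough sketch of that pooled-clone idea — illustrative only, not a patch; run_task and the queue wiring are hypothetical. Each task borrows a database index from a shared queue and hands it back when done, so worker churn can never run past the n clones:

```python
import multiprocessing

def run_task(args):
    """Borrow a clone index for the duration of one task, then return it."""
    db_pool, task = args
    index = db_pool.get()       # blocks until one of the n clones is free
    try:
        return (task, index)    # a real runner would connect to test_db_<index>
    finally:
        db_pool.put(index)      # hand the clone back even if the task failed

n = 2
manager = multiprocessing.Manager()
db_pool = manager.Queue()
for i in range(1, n + 1):       # clones test_db_1 .. test_db_n
    db_pool.put(i)

ctx = multiprocessing.get_context("fork")
# maxtasksperchild=1 again simulates constant worker churn.
with ctx.Pool(n, maxtasksperchild=1) as pool:
    results = pool.map(run_task, [(db_pool, t) for t in range(6)])

print(sorted({i for _, i in results}))  # stays within [1, 2] despite respawns
```

Borrowing per task (rather than per worker) also sidesteps the question of reclaiming an index from a crashed worker mid-run, at the cost of a queue round-trip per subsuite.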
