Opened 4 months ago
Last modified 4 months ago
#29243 new Cleanup/optimization
Improve efficiency of migration graph algorithm
Reported by: | Krzysztof Gogolewski | Owned by: | nobody |
---|---|---|---|
Component: | Migrations | Version: | 2.0 |
Severity: | Normal | Keywords: | |
Cc: | Andrew Godwin | Triage Stage: | Accepted |
Has patch: | yes | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
It seems to me that the current migration graph code is quadratic or even exponential in some cases. I would expect computing the forward or backward plan to be linear in the size of the graph. I might write a patch but I would appreciate a sanity check first. Everything here concerns django/db/migrations/graph.py.
- Consider this function:
def iterative_dfs(self, start, forwards=True): """Iterative depth-first search for finding dependencies.""" visited = [] stack = [start] while stack: node = stack.pop() visited.append(node) stack += sorted(node.parents if forwards else node.children) return list(OrderedSet(reversed(visited)))
We are visiting a node even if it was already visited before. Example where this goes bad:
1 / \ 2 3 \ / 4 / \ 5 6 \ / 7
Starting from 1, we'll visit node 4 two times, node 7 four times etc. - this is exponential.
Instead, we should keep a set of nodes that were already visited, and do not process already visited nodes. This will also get rid of OrderedSet that is used to remove duplicates. Here's a possible implementation:
def iterative_dfs(self, start, forwards=True): visited = [] visited_set = set() stack = [(start, False)] while stack: node, processed = stack.pop() if node in visited_set: pass elif processed: visited_set.add(node) visited.append(node) else: stack.append((node, True)) stack += [(n, False) for n in sorted(node.parents if forwards else node.children)] return visited
2.
Node.ancestors()
and Node.descendants()
compute the list of ancestors and descendants in a graph for caching - this can have O(N) items each, which means it's quadratic.
I think that once the issue in iterative_dfs
is fixed, we can remove the recursive logic for computing DFS altogether - this means identifiers ancestors
, descendants
, cached
, clear_cache
and RECURSION_DEPTH_WARNING
. The cache was needed to avoid exponential behavior (like the one in iterative_dfs
), but after the fix, the algorithm will be linear and should be faster than quadratic Node.ancestors()
.
3.
After the recursive DFS logic is removed, I think we can merge ensure_not_cyclic
and iterative_dfs
into a single function; it could be renamed to topological_sort
.
Change History (6)
comment:1 Changed 4 months ago by
Cc: | Andrew Godwin added |
---|---|
Type: | Uncategorized → Cleanup/optimization |
comment:2 Changed 4 months ago by
comment:4 Changed 4 months ago by
There have been attempts of implementing a topological sorting of the migration dependency graph before. IIRC (it was about 3 years ago), the issue was an unstable plan that could drastically change is some apps added more migrations.
comment:5 Changed 4 months ago by
I haven't combined ensure_not_cyclic
and iterative_dfs
in the end, as we are checking for existence of cycles in the whole graph, but are interested in a plan only for some of the nodes. It's likely we could reuse the already existing topological sort in ensure_not_cyclic
but I'd like to merge the existing branch first.
comment:6 Changed 4 months ago by
Has patch: | set |
---|---|
Summary: | Migration graph code is inefficient → Improve efficiency of migration graph algorithm |
Triage Stage: | Unreviewed → Accepted |
Tentatively accepting for further evaluation of the patch. If the patch isn't viable, it might be useful to add a test to show why not, if possible.
Just FYI there is already a topological sort implementation in the DB migration code: https://github.com/django/django/blob/master/django/db/migrations/topological_sort.py