Opened 4 months ago

Last modified 4 months ago

#29243 new Cleanup/optimization

Improve efficiency of migration graph algorithm

Reported by: Krzysztof Gogolewski Owned by: nobody
Component: Migrations Version: 2.0
Severity: Normal Keywords:
Cc: Andrew Godwin Triage Stage: Accepted
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

It seems to me that the current migration graph code is quadratic or even exponential in some cases. I would expect computing the forward or backward plan to be linear in the size of the graph. I might write a patch but I would appreciate a sanity check first. Everything here concerns django/db/migrations/graph.py.

  1. Consider this function:
    def iterative_dfs(self, start, forwards=True):
        """Iterative depth-first search for finding dependencies."""
        visited = []
        stack = [start]
        while stack:
            node = stack.pop()
            visited.append(node)
            stack += sorted(node.parents if forwards else node.children)
        return list(OrderedSet(reversed(visited)))

We are visiting a node even if it was already visited before. Example where this goes bad:

 1
/ \
2 3
\ /
 4
/ \
5 6
\ /
 7

Starting from 1, we'll visit node 4 two times, node 7 four times etc. - this is exponential.

Instead, we should keep a set of nodes that were already visited, and do not process already visited nodes. This will also get rid of OrderedSet that is used to remove duplicates. Here's a possible implementation:

    def iterative_dfs(self, start, forwards=True):
        visited = []
        visited_set = set()
        stack = [(start, False)]
        while stack:
             node, processed = stack.pop()
             if node in visited_set:
                 pass
             elif processed:
                 visited_set.add(node)
                 visited.append(node)
             else:
                 stack.append((node, True))
                 stack += [(n, False) for n in sorted(node.parents if forwards else node.children)]
        return visited

2.
Node.ancestors() and Node.descendants() compute the list of ancestors and descendants in a graph for caching - this can have O(N) items each, which means it's quadratic.

I think that once the issue in iterative_dfs is fixed, we can remove the recursive logic for computing DFS altogether - this means identifiers ancestors, descendants, cached, clear_cache and RECURSION_DEPTH_WARNING. The cache was needed to avoid exponential behavior (like the one in iterative_dfs), but after the fix, the algorithm will be linear and should be faster than quadratic Node.ancestors().

3.
After the recursive DFS logic is removed, I think we can merge ensure_not_cyclic and iterative_dfs into a single function; it could be renamed to topological_sort.

Change History (6)

comment:1 Changed 4 months ago by Tim Graham

Cc: Andrew Godwin added
Type: UncategorizedCleanup/optimization

comment:2 Changed 4 months ago by Ramiro Morales

Just FYI there is already a topological sort implementation in the DB migration code: https://github.com/django/django/blob/master/django/db/migrations/topological_sort.py

comment:4 Changed 4 months ago by Markus Holtermann

There have been attempts of implementing a topological sorting of the migration dependency graph before. IIRC (it was about 3 years ago), the issue was an unstable plan that could drastically change is some apps added more migrations.

comment:5 Changed 4 months ago by Krzysztof Gogolewski

I haven't combined ensure_not_cyclic and iterative_dfs in the end, as we are checking for existence of cycles in the whole graph, but are interested in a plan only for some of the nodes. It's likely we could reuse the already existing topological sort in ensure_not_cyclic but I'd like to merge the existing branch first.

comment:6 Changed 4 months ago by Tim Graham

Has patch: set
Summary: Migration graph code is inefficientImprove efficiency of migration graph algorithm
Triage Stage: UnreviewedAccepted

Tentatively accepting for further evaluation of the patch. If the patch isn't viable, it might be useful to add a test to show why not, if possible.

Note: See TracTickets for help on using tickets.
Back to Top