﻿id	summary	reporter	owner	description	type	status	component	version	severity	resolution	keywords	cc	stage	has_patch	needs_docs	needs_tests	needs_better_patch	easy	ui_ux
32948	Optimise Q combination and inversion	Keryn Knight	Nick Pope	"This is a corollary to #32946 and #32940.

Q is currently inconsistent with its friends `WhereNode` and `Node` in that it doesn't use the `_new_instance` trick. Even using the `_new_instance` trick leaves some performance on the table compared to inlining the `__class__` switch, because the extra method call affects both `_combine()` and `__invert__()`.
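
For illustration, here is a minimal sketch of the `_new_instance` pattern on a simplified, hypothetical `Node`/`Q` pair (not Django's actual code): the base class is instantiated directly and `__class__` is switched afterwards, so a subclass's `__init__` with a different signature never runs.

```python
class Node:
    def __init__(self, children=None, connector="AND", negated=False):
        self.children = children or []
        self.connector = connector
        self.negated = negated

    @classmethod
    def _new_instance(cls, children=None, connector=None, negated=False):
        # Build a plain Node, then switch __class__ afterwards, so a
        # subclass's (possibly incompatible) __init__ never runs.
        obj = Node(children, connector or "AND", negated)
        obj.__class__ = cls
        return obj


class Q(Node):
    def __init__(self, **kwargs):
        # A subclass __init__ with a different signature to Node's:
        # exactly why _new_instance bypasses it.
        super().__init__(children=sorted(kwargs.items()))
```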

The `_combine` method also has conditionals for handling an ''empty'' node on either side (lhs or rhs) of the combination. One side uses `deconstruct`; the other (only since c8b659430556dca0b2fe27cf2ea0f8290dbafecd) uses the shallow copy protocol, which Q doesn't implement.

If `__copy__` is not implemented, `copy.copy()` ultimately falls back (after some branching checks) to the builtin `__reduce_ex__(4)` + `copy._reconstruct`, which gives:
{{{
In [3]: x = Q()
In [4]: %timeit copy.copy(x)
2.2 µs ± 70.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [5]: %timeit copy.copy(Q())
3.52 µs ± 264 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
}}}
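
The fallback path can be observed on any class with no copy hooks at all; this toy class is purely illustrative:

```python
import copy


class Plain:
    """No __copy__, __deepcopy__ or __reduce__ hooks defined."""

    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)


x = Plain(a=1, b=[2, 3])
y = copy.copy(x)   # dispatches via __reduce_ex__ + copy._reconstruct
assert y is not x and y.a == 1
assert y.b is x.b  # still shallow: the list is shared between copies
```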

If we implement the necessary method like so:
{{{
def __copy__(self):
    obj = self._new_instance()
    obj.__dict__ = self.__dict__.copy()
    return obj
}}}
we can reduce those numbers to:
{{{
In [3]: x = Q()
In [4]: %timeit copy.copy(x)
1.27 µs ± 6.19 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [5]: %timeit copy.copy(Q())
2.37 µs ± 28.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
}}}

We can then reduce the work further by not invoking `copy.copy()` at all, by setting `copy = __copy__` as an attribute on the Q class.
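
A sketch of that alias on a simplified, hypothetical `Q` (not the real class): `copy = __copy__` makes `obj.copy()` reach the same method without going through `copy.copy()`'s type-dispatch machinery.

```python
class Q:
    def __init__(self, **kwargs):
        self.children = sorted(kwargs.items())
        self.connector = "AND"
        self.negated = False

    @classmethod
    def _new_instance(cls):
        # Skip __init__; the caller fills in the attributes.
        return cls.__new__(cls)

    def __copy__(self):
        obj = self._new_instance()
        obj.__dict__ = self.__dict__.copy()
        return obj

    # q.copy() now reaches __copy__ directly, with no copy.copy()
    # dispatch in between; copy.copy(q) still works too.
    copy = __copy__
```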

From there, we can avoid calling `self.deconstruct()` at all, instead calling `self.copy()`, knowing that `self` has values but `other` does not. Both are basically on par with each other speed-wise, with deconstruct being faster on empty nodes (which `self` isn't) and copy being marginally faster when there's a different connector (e.g. `OR`).
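
The resulting `_combine` shape might look like the following, again on a simplified, hypothetical `Q` (the real method also does type validation and more besides):

```python
class Q:
    AND, OR = "AND", "OR"

    def __init__(self, **kwargs):
        self.children = sorted(kwargs.items())
        self.connector = self.AND
        self.negated = False

    def copy(self):
        obj = self.__class__.__new__(self.__class__)
        obj.__dict__ = self.__dict__.copy()
        return obj

    def _combine(self, other, conn):
        if not other.children:
            # `other` is empty: the result is just a copy of self;
            # no deconstruct()/reconstruct round-trip needed.
            return self.copy()
        if not self.children:
            return other.copy()
        obj = self.__class__.__new__(self.__class__)
        obj.children = [self, other]
        obj.connector = conn
        obj.negated = False
        return obj

    def __and__(self, other):
        return self._combine(other, self.AND)

    def __or__(self, other):
        return self._combine(other, self.OR)
```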

For inverting, we can similarly change it to avoid the `Node.add()` call:
{{{
    def __invert__(self):
        obj = self.copy()
        obj.negate()
        return obj
}}}
which would allow it to go from:
{{{
In [2]: %timeit ~Q()
2.89 µs ± 18.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [3]: %timeit ~Q(a=1, b=2, c=3, d=4)
3.77 µs ± 57.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
}}}
to:
{{{
In [2]: %timeit ~Q()
2.34 µs ± 9.49 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [3]: %timeit ~Q(a=1, b=2, c=3, d=4)
3.14 µs ± 72.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
}}}
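
Put together on the same kind of simplified, hypothetical `Q`, the optimised inversion is just a copy plus an in-place flag flip:

```python
class Q:
    def __init__(self, **kwargs):
        self.children = sorted(kwargs.items())
        self.connector = "AND"
        self.negated = False

    def copy(self):
        # Shallow copy via __dict__, as described above.
        obj = self.__class__.__new__(self.__class__)
        obj.__dict__ = self.__dict__.copy()
        return obj

    def negate(self):
        # Flip the negation flag in place.
        self.negated = not self.negated

    def __invert__(self):
        # Copy, then negate the copy: no intermediate node, no Node.add().
        obj = self.copy()
        obj.negate()
        return obj
```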


Altogether, then, the baselines:
{{{
In [2]: full = Q(a=1, b=2, c=3)
In [3]: full2 = Q(d=4, e=5, f=6)
In [4]: empty = Q()
In [5]: %timeit full & full2
        2.65 µs ± 17.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [6]: %timeit full | full2
        3 µs ± 39.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [7]: %timeit ~(full | full2)
        5.09 µs ± 122 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [8]: %timeit ~(full & full2)
        4.67 µs ± 58.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [9]: %timeit empty & full
        2.81 µs ± 18.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [10]: %timeit full & empty
        3.16 µs ± 43.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [11]: %timeit empty | full
        2.8 µs ± 22.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [12]: %timeit full | empty
        3.13 µs ± 20.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [13]: values = (Q(a=1), Q(b=2), Q(c=3), Q(d=4), Q(e__in=[1,2,3,4]))
In [14]: %timeit reduce(or_, values)
        12 µs ± 212 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
}}}

and after the changes:
{{{
In [5]: %timeit full & full2
        2.11 µs ± 20.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [6]: %timeit full | full2
        2.39 µs ± 37.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [7]: %timeit ~(full | full2)
        3.62 µs ± 47.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [8]: %timeit ~(full & full2)
        3.34 µs ± 28.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [9]: %timeit empty & full
        1.57 µs ± 14.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [10]: %timeit full & empty
        1.68 µs ± 18.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [11]: %timeit empty | full
        1.66 µs ± 24.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [12]: %timeit full | empty
        1.8 µs ± 23 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [13]: values = (Q(a=1), Q(b=2), Q(c=3), Q(d=4), Q(e__in=[1,2,3,4]))
In [14]: %timeit reduce(or_, values)
        9.59 µs ± 56.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
}}}

A final note: if the `_new_instance` code were inlined into the copy method, my quick tests suggest it would shave off another 100-120ns per copy. So there is still performance on the table to put it fully ahead of deconstruct (and another possibility exists: that the deconstruction for migrations becomes more complex in some way, negatively affecting runtime performance for self copies).
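
That inlining might look like this, once more on a hypothetical simplified `Q`:

```python
class Q:
    def __init__(self, **kwargs):
        self.children = sorted(kwargs.items())
        self.connector = "AND"
        self.negated = False

    def __copy__(self):
        # _new_instance inlined: call cls.__new__(cls) directly, saving
        # one classmethod call per copy.
        cls = self.__class__
        obj = cls.__new__(cls)
        obj.__dict__ = self.__dict__.copy()
        return obj

    copy = __copy__
```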

I have a patch series which will need a little tidying (don't they always?) but should pass CI when I open the PR for discussion."	Cleanup/optimization	closed	Database layer (models, ORM)	dev	Normal	fixed			Ready for checkin	1	0	0	0	0	0
