id,summary,reporter,owner,description,type,status,component,version,severity,resolution,keywords,cc,stage,has_patch,needs_docs,needs_tests,needs_better_patch,easy,ui_ux
32948,Optimise Q combination and inversion,Keryn Knight,Nick Pope,"This is a corollary to #32946 and #32940. Q is currently inconsistent with its friends `WhereNode` and `Node` in that it doesn't use the `_new_instance` trick. Even using the `_new_instance` trick leaves some performance on the table compared to just inlining the `__class__` switch, because it's an extra method call which affects both `_combine()` and `__invert__()`.

The `_combine` method also has conditionals for what to do when an ''empty'' node is being combined, on either the lhs or the rhs. One side uses `deconstruct`, the other uses the shallow copy protocol (only since c8b659430556dca0b2fe27cf2ea0f8290dbafecd), which is unimplemented. Because `__copy__` is not implemented, it ultimately falls back (after some branching checks) to the built-in `__reduce_ex__(4)` + `copy._reconstruct`, which gives:

{{{
In [3]: x = Q()

In [4]: %timeit copy.copy(x)
2.2 µs ± 70.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [5]: %timeit copy.copy(Q())
3.52 µs ± 264 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
}}}

If we implement the necessary method like so:

{{{
def __copy__(self):
    obj = self._new_instance()
    obj.__dict__ = self.__dict__.copy()
    return obj
}}}

we can reduce those numbers to:

{{{
In [3]: x = Q()

In [4]: %timeit copy.copy(x)
1.27 µs ± 6.19 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [5]: %timeit copy.copy(Q())
2.37 µs ± 28.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
}}}

We can then reduce the work further by not invoking `copy.copy()` at all, by setting the `copy = __copy__` attribute on the Q class.

From there, we can avoid calling `self.deconstruct()` at all, instead calling `self.copy()`, knowing that `self` has values but `other` does not. Both are basically on a par with each other speed-wise, with `deconstruct` being faster on empty nodes (which `self` isn't) and copy being marginally faster when there's a different connector (e.g. `OR`).

For inverting, we can similarly change it to avoid the `Node.add()` call:

{{{
def __invert__(self):
    obj = self.copy()
    obj.negate()
    return obj
}}}

which would allow it to go from:

{{{
In [2]: %timeit ~Q()
2.89 µs ± 18.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [3]: %timeit ~Q(a=1, b=2, c=3, d=4)
3.77 µs ± 57.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
}}}

to:

{{{
In [2]: %timeit ~Q()
2.34 µs ± 9.49 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [3]: %timeit ~Q(a=1, b=2, c=3, d=4)
3.14 µs ± 72.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
}}}
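To make the shape of the proposal concrete, here is a rough, standalone sketch of how the pieces above could hang together. It is written as a hypothetical `FastQ` subclass purely for illustration (the actual patch edits `Q` itself and may differ in detail), it assumes the existing `Node` helpers from `django.utils.tree` (`_new_instance()`, `add()`, `negate()`), and it mirrors the guard clause already present in `_combine`:

{{{
import copy

from django.db.models import Q


class FastQ(Q):
    # Hypothetical subclass sketching the combined changes; the real patch
    # would modify Q (and possibly Node) in place rather than subclass it.

    def __copy__(self):
        # _new_instance() comes from django.utils.tree.Node and sidesteps
        # __init__ by swapping __class__ on a plain Node.
        obj = self._new_instance()
        obj.__dict__ = self.__dict__.copy()
        return obj

    # Exposing the dunder under a plain name lets callers skip the
    # dispatch overhead of copy.copy().
    copy = __copy__

    def _combine(self, other, conn):
        if not (isinstance(other, Q) or getattr(other, 'conditional', False) is True):
            raise TypeError(other)
        if not self:
            # self is empty: hand back a (shallow) copy of other.
            return other.copy() if hasattr(other, 'copy') else copy.copy(other)
        if isinstance(other, Q) and not other:
            # other is empty: copy self directly instead of round-tripping
            # through self.deconstruct().
            return self.copy()
        obj = type(self)()
        obj.connector = conn
        obj.add(self, conn)
        obj.add(other, conn)
        return obj

    def __invert__(self):
        # Copy-and-negate avoids wrapping self in a fresh node via Node.add().
        obj = self.copy()
        obj.negate()
        return obj
}}}

The intent is that `&`, `|` and `~` behave exactly as before (e.g. `FastQ(a=1) & ~FastQ(b=2)` resolves the same way as with plain `Q`); only the construction of the intermediate nodes gets cheaper, which is what the timings here are measuring.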
In totality, then, baselines:

{{{
In [2]: full = Q(a=1, b=2, c=3)

In [3]: full2 = Q(d=4, e=5, f=6)

In [4]: empty = Q()

In [5]: %timeit full & full2
2.65 µs ± 17.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [6]: %timeit full | full2
3 µs ± 39.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [7]: %timeit ~(full | full2)
5.09 µs ± 122 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [8]: %timeit ~(full & full2)
4.67 µs ± 58.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [9]: %timeit empty & full
2.81 µs ± 18.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [10]: %timeit full & empty
3.16 µs ± 43.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [11]: %timeit empty | full
2.8 µs ± 22.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [12]: %timeit full | empty
3.13 µs ± 20.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [13]: values = (Q(a=1), Q(b=2), Q(c=3), Q(d=4), Q(e__in=[1,2,3,4]))

In [14]: %timeit reduce(or_, values)
12 µs ± 212 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
}}}

and after the changes:

{{{
In [5]: %timeit full & full2
2.11 µs ± 20.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [6]: %timeit full | full2
2.39 µs ± 37.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [7]: %timeit ~(full | full2)
3.62 µs ± 47.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [8]: %timeit ~(full & full2)
3.34 µs ± 28.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [9]: %timeit empty & full
1.57 µs ± 14.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [10]: %timeit full & empty
1.68 µs ± 18.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [11]: %timeit empty | full
1.66 µs ± 24.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [12]: %timeit full | empty
1.8 µs ± 23 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [13]: values = (Q(a=1), Q(b=2), Q(c=3), Q(d=4), Q(e__in=[1,2,3,4]))

In [14]: %timeit reduce(or_, values)
9.59 µs ± 56.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
}}}

A final note: if the `_new_instance` code were inlined into the copy method, it looks like that shaves off a further 100-120ns per copy, from my quick tests. So there is still performance on the table to put it fully ahead of `deconstruct` (and another possibility exists: that the deconstruction for migrations in some way becomes more complex, negatively affecting runtime performance for self copies).

I have a patch series which will need a little tidying (don't they always?) but should pass CI when I open the PR for discussion.",Cleanup/optimization,closed,"Database layer (models, ORM)",dev,Normal,fixed,,,Ready for checkin,1,0,0,0,0,0