Opened 11 months ago

Last modified 7 months ago

#32948 assigned Cleanup/optimization

Optimise Q combination and inversion

Reported by: Keryn Knight Owned by: Nick Pope
Component: Database layer (models, ORM) Version: dev
Severity: Normal Keywords:
Cc: Triage Stage: Accepted
Has patch: yes Needs documentation: no
Needs tests: yes Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

This is corollary to #32946 and #32940.

Q is currently inconsistent with its friends WhereNode and Node in that it doesn't use the _new_instance trick. Even using the _new_instance trick leaves some performance on the table compared with just inlining the __class__ switch, because the extra method call affects both _combine() and __invert__().
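For context, the _new_instance pattern amounts to building a base instance and swapping its __class__, so the subclass __init__ is never run. A minimal self-contained sketch (MiniNode/MiniQ are stand-ins for illustration, not Django's actual classes):

```python
class MiniNode:
    def __init__(self, children=None, connector="AND", negated=False):
        self.children = children[:] if children else []
        self.connector = connector
        self.negated = negated

    @classmethod
    def _new_instance(cls, children=None, connector=None, negated=False):
        # Build a plain MiniNode, then switch __class__ so the
        # subclass __init__ is bypassed entirely.
        obj = MiniNode(children, connector or "AND", negated)
        obj.__class__ = cls
        return obj


class MiniQ(MiniNode):
    def __init__(self, **kwargs):
        # Subclass __init__ does extra work we want to skip when cloning.
        super().__init__(children=sorted(kwargs.items()))


clone = MiniQ._new_instance(children=[("a", 1)], connector="OR")
```

The clone is a genuine MiniQ instance, but MiniQ.__init__ never ran; that saved call is the performance the ticket refers to.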

The _combine method also has conditionals for handling an empty node being combined, on either the lhs or the rhs. One side uses deconstruct(), the other uses the shallow copy protocol (only since c8b659430556dca0b2fe27cf2ea0f8290dbafecd), which Q does not implement.

If __copy__ is not implemented, copy.copy() ultimately falls back (after some branching checks) to the built-in __reduce_ex__(4) plus copy._reconstruct, which gives:

In [3]: x = Q()
In [4]: %timeit copy.copy(x)
2.2 µs ± 70.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [5]: %timeit copy.copy(Q())
3.52 µs ± 264 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

If we implement the necessary method like so:

def __copy__(self):
    obj = self._new_instance()
    obj.__dict__ = self.__dict__.copy()
    return obj

we can reduce those numbers to:

In [3]: x = Q()
In [4]: %timeit copy.copy(x)
1.27 µs ± 6.19 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [5]: %timeit copy.copy(Q())
2.37 µs ± 28.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

We can then reduce the work further by not invoking copy.copy() at all, by aliasing copy = __copy__ on the Q class.
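The aliasing trick looks like this in isolation (a hedged sketch with a stand-in Node class, not Django's): calling node.copy() invokes __copy__ directly, skipping copy.copy()'s dispatch machinery (type lookup, copier registry checks, the __reduce_ex__ fallback).

```python
import copy


class Node:
    def __init__(self, children=None):
        self.children = list(children or [])

    def __copy__(self):
        # Bypass __init__ via __new__ (standing in for _new_instance),
        # then shallow-copy the instance dict.
        obj = self.__class__.__new__(self.__class__)
        obj.__dict__ = self.__dict__.copy()
        return obj

    # Alias: node.copy() is a direct bound-method call, while
    # copy.copy(node) goes through the module's dispatch first.
    copy = __copy__


n = Node(children=[1, 2, 3])
m = n.copy()
```

Note the copy is shallow: the clone's __dict__ is new, but the children list is still shared between the two instances.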

From there, we can avoid calling self.deconstruct() entirely, instead calling self.copy() knowing that self has values but other does not. Both are roughly on par with each other speed-wise, with deconstruct being faster on empty nodes (which self isn't) and copy being marginally faster when there's a different connector (e.g. OR).
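The resulting empty-node fast paths in _combine can be sketched as follows (again a self-contained toy MiniQ, assuming truthiness based on children and the cheap shallow copy() described above; not the exact Django implementation):

```python
class MiniQ:
    AND, OR = "AND", "OR"

    def __init__(self, **kwargs):
        self.children = sorted(kwargs.items())
        self.connector = self.AND

    def __bool__(self):
        return bool(self.children)

    def __copy__(self):
        obj = self.__class__.__new__(self.__class__)
        obj.__dict__ = self.__dict__.copy()
        return obj

    copy = __copy__

    def _combine(self, other, conn):
        if not isinstance(other, MiniQ):
            raise TypeError(other)
        if not other:
            # Empty rhs: return a shallow copy of self, avoiding the
            # old deconstruct()/reconstruct round trip.
            return self.copy()
        if not self:
            # Empty lhs: a shallow copy of the other side suffices.
            return other.copy()
        obj = MiniQ()
        obj.connector = conn
        obj.children = [self, other]
        return obj

    def __or__(self, other):
        return self._combine(other, self.OR)


full = MiniQ(a=1)
empty = MiniQ()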

For inverting, we can similarly change __invert__ to avoid the Node.add() call:

def __invert__(self):
    obj = self.copy()
    obj.negate()
    return obj

which would allow it to go from:

In [2]: %timeit ~Q()
2.89 µs ± 18.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [3]: %timeit ~Q(a=1, b=2, c=3, d=4)
3.77 µs ± 57.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

to:

In [2]: %timeit ~Q()
2.34 µs ± 9.49 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [3]: %timeit ~Q(a=1, b=2, c=3, d=4)
3.14 µs ± 72.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Putting it all together, the baselines:

In [2]: full = Q(a=1, b=2, c=3)
In [3]: full2 = Q(d=4, e=5, f=6)
In [4]: empty = Q()
In [5]: %timeit full & full2
        2.65 µs ± 17.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [6]: %timeit full | full2
        3 µs ± 39.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [7]: %timeit ~(full | full2)
        5.09 µs ± 122 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [8]: %timeit ~(full & full2)
        4.67 µs ± 58.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [9]: %timeit empty & full
        2.81 µs ± 18.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [10]: %timeit full & empty
        3.16 µs ± 43.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [11]: %timeit empty | full
        2.8 µs ± 22.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [12]: %timeit full | empty
        3.13 µs ± 20.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [13]: values = (Q(a=1), Q(b=2), Q(c=3), Q(d=4), Q(e__in=[1,2,3,4]))
In [14]: %timeit reduce(or_, values)
        12 µs ± 212 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

and after the changes:

In [5]: %timeit full & full2
        2.11 µs ± 20.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [6]: %timeit full | full2
        2.39 µs ± 37.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [7]: %timeit ~(full | full2)
        3.62 µs ± 47.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [8]: %timeit ~(full & full2)
        3.34 µs ± 28.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [9]: %timeit empty & full
        1.57 µs ± 14.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [10]: %timeit full & empty
        1.68 µs ± 18.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [11]: %timeit empty | full
        1.66 µs ± 24.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [12]: %timeit full | empty
        1.8 µs ± 23 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [13]: values = (Q(a=1), Q(b=2), Q(c=3), Q(d=4), Q(e__in=[1,2,3,4]))
In [14]: %timeit reduce(or_, values)
        9.59 µs ± 56.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

A final note: if the _new_instance code were inlined into the copy method, it looks like it would shave off another 100-120ns per copy, from my quick tests. So there is still performance on the table to put it fully ahead of deconstruct. (Another possibility exists: the deconstruction for migrations may become more complex in some way, negatively affecting runtime performance for self copies.)

I have a patch series which will need a little tidying (don't they always?) but should pass CI when I open the PR for discussion.

Change History (6)

comment:1 Changed 11 months ago by Mariusz Felisiak

Triage Stage: Unreviewed → Accepted

Tentatively accepted, readability with the proposed patch is crucial to me.

comment:2 Changed 11 months ago by Claude Paroz

+1 to privileging readability over micro-optimization, if such a choice has to be made.

comment:3 Changed 11 months ago by Keryn Knight

Has patch: set

I don't think readability has been hurt by the change; I could even argue that it's been improved. Do judge for yourselves, naturally.

At this point there are two PRs to look at:
my original one: https://github.com/django/django/pull/14673
Nick P's one based on mine above: https://github.com/django/django/pull/14677

Please note that the discussion about Nick's variant happened inline in my PR, as regrettably I didn't see the other PR until after we'd started discussing the changes.

comment:4 Changed 9 months ago by Mariusz Felisiak

Patch needs improvement: set

Marking as "needs improvement" because we need to reach a consensus before the next review.

comment:5 Changed 9 months ago by Nick Pope

Owner: changed from Keryn Knight to Nick Pope
Patch needs improvement: unset

comment:6 Changed 7 months ago by Mariusz Felisiak

Needs tests: set

From the discussion it seems that extra test coverage is needed.
