Opened 3 years ago

Closed 21 months ago

#32948 closed Cleanup/optimization (fixed)

Optimise Q combination and inversion

Reported by: Keryn Knight Owned by: Nick Pope
Component: Database layer (models, ORM) Version: dev
Severity: Normal Keywords:
Cc: Triage Stage: Ready for checkin
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

This is corollary to #32946 and #32940.

Q is currently inconsistent with it's friends WhereNode and Node in that it doesn't use the _new_instance trick. Even using the _new_instance trick leaves some performance on the table vs just inlining the __class__ switch, because it's an extra method call which affects both _combine() and __invert__().

The _combine method also has conditionals for what to do about an empty node being combined, either lhs or rhs. One side uses deconstruct, the other uses the shallow copy protocol (only since c8b659430556dca0b2fe27cf2ea0f8290dbafecd), which is unimplemented.

If __copy__ is not implemented, it ultimately falls back (after some branching checks) to the builtin __reduce_ex__(4) + copy._reconstruct which gives:

In [3]: x = Q()
In [4]: %timeit copy.copy(x)
2.2 µs ± 70.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [5]: %timeit copy.copy(Q())
3.52 µs ± 264 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

If we implement the necessary method like so:

def __copy__(self):
    obj = self._new_instance()
    obj.__dict__ = self.__dict__.copy()
    return obj

we can reduce those numbers to:

In [3]: x = Q()
In [4]: %timeit copy.copy(x)
1.27 µs ± 6.19 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [5]: %timeit copy.copy(Q())
2.37 µs ± 28.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

we can then reduce the work further by not invoking copy.copy() at all, by setting the copy = __copy__ attribute on the Q class.

From there, we can avoid calling self.deconstruct() at all, instead calling self.copy() knowing that self has values, but other does not. Both are basically on-par with eachother speedwise, with deconstruct being faster on empty nodes (which self isn't) and copy being minimally faster when there's a different connector (eg: OR).

For inverting, we can similarly change it to avoid the Node.add() call:

    def __invert__(self):
        obj = self.copy()
        obj.negate()
        return obj

which would allow it to go from:

In [2]: %timeit ~Q()
2.89 µs ± 18.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [3]: %timeit ~Q(a=1, b=2, c=3, d=4)
3.77 µs ± 57.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

to:

In [2]: %timeit ~Q()
2.34 µs ± 9.49 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [3]: %timeit ~Q(a=1, b=2, c=3, d=4)
3.14 µs ± 72.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In totality, then, baselines:

In [2]: full = Q(a=1, b=2, c=3)
In [3]: full2 = Q(d=4, e=5, f=6)
In [4]: empty = Q()
In [5]: %timeit full & full2
        2.65 µs ± 17.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [6]: %timeit full | full2
        3 µs ± 39.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [7]: %timeit ~(full | full2)
        5.09 µs ± 122 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [8]: %timeit ~(full & full2)
        4.67 µs ± 58.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [9]: %timeit empty & full
        2.81 µs ± 18.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [10]: %timeit full & empty
        3.16 µs ± 43.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [11]: %timeit empty | full
        2.8 µs ± 22.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [12]: %timeit full | empty
        3.13 µs ± 20.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [13]: values = (Q(a=1), Q(b=2), Q(c=3), Q(d=4), Q(e__in=[1,2,3,4]))
In [14]: %timeit reduce(or_, values)
        12 µs ± 212 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

and after the changes:

In [5]: %timeit full & full2
        2.11 µs ± 20.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [6]: %timeit full | full2
        2.39 µs ± 37.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [7]: %timeit ~(full | full2)
        3.62 µs ± 47.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [8]: %timeit ~(full & full2)
        3.34 µs ± 28.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [9]: %timeit empty & full
        1.57 µs ± 14.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [10]: %timeit full & empty
        1.68 µs ± 18.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [11]: %timeit empty | full
        1.66 µs ± 24.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [12]: %timeit full | empty
        1.8 µs ± 23 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [13]: values = (Q(a=1), Q(b=2), Q(c=3), Q(d=4), Q(e__in=[1,2,3,4]))
In [14]: %timeit reduce(or_, values)
        9.59 µs ± 56.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

A final note then, if inlining the _new_instance code into the copy method were done, it looks like it shaves off 100-120ns per copy, from my quick tests. So there is still performance on the table to put it fully ahead of deconstruct (and another possibility exists - that the deconstruction for migrations in some way gets more complex, negatively affecting runtime performance for self copies)

I have a patch series which will need a little tidying (don't they always?) but should pass CI when I open the PR for discussion.

Change History (15)

comment:1 by Mariusz Felisiak, 3 years ago

Triage Stage: UnreviewedAccepted

Tentatively accepted, readability with the proposed patch is crucial to me.

comment:2 by Claude Paroz, 3 years ago

+1 to privilege readability over micro-optimization, if such a choice has to be made.

comment:3 by Keryn Knight, 3 years ago

Has patch: set

I don't think readability has been hurt by the change; I could argue that it's been improved even, perhaps. Do judge for yourselves, naturally.

At this point there's 2 PRs to look at:
my original one: https://github.com/django/django/pull/14673
Nick P's one based on mine above: https://github.com/django/django/pull/14677

Please note that the discussion about Nick's variant happened in-line in my PR, as regrettably I didn't see the other PR until after we'd started discussing the changes.

comment:4 by Mariusz Felisiak, 3 years ago

Patch needs improvement: set

Marking as "needs improvement" because we need to reach a consensus before the next review.

comment:5 by Nick Pope, 3 years ago

Owner: changed from Keryn Knight to Nick Pope
Patch needs improvement: unset

comment:6 by Mariusz Felisiak, 2 years ago

Needs tests: set

From the discussion it seems that extra test coverage is needed.

comment:7 by Nick Pope, 21 months ago

Needs tests: unset

Unflagging "needs tests" as some have been added to cover the various approaches to copying/cloning which should provide confidence that the behaviour hasn't changed.

comment:8 by Mariusz Felisiak <felisiak.mariusz@…>, 21 months ago

In cc52e02c:

Refs #32948 -- Added more tests for django.utils.tree.Node.

The tests for creating new instances or copying instances of Node and
its subclasses didn't fully capture the behaviour of the implementation,
particularly around whether the children list or is contents were the
same as the source.

comment:9 by Mariusz Felisiak, 21 months ago

Triage Stage: AcceptedReady for checkin

comment:10 by Mariusz Felisiak <felisiak.mariusz@…>, 21 months ago

In ddf0002:

Refs #32948 -- Renamed Node._new_instance() to Node.create().

Node._new_instance() was added in
6dd2b5468fa275d53aa60fdcaff8c28bdc5e9c25 to work around Q.init()
having an incompatible signature with Node.init().

It was intended as a hook that could be overridden if subclasses needed
to change the behaviour of instantiation of their specialised form of
Node. In practice this doesn't ever seem to have been used for this
purpose and there are very few calls to Node._new_instance() with other
code, e.g. Node.deepcopy() calling Node and overriding class as
required.

Rename this to Node.create() to make it a more "official" piece of
private API that we can use to simplify a lot of other areas internally.

The docstring and nearby comment have been reworded to read more
clearly.

comment:11 by Mariusz Felisiak <felisiak.mariusz@…>, 21 months ago

In ed9eca84:

Refs #32948 -- Simplified WhereNode and Node.deepcopy()/add().

We can use copy() in Node.add() instead of create() as we don't need the
children to be cloned via [:] subscript in init().

comment:12 by Mariusz Felisiak <felisiak.mariusz@…>, 21 months ago

In 19b866c2:

Refs #32948 -- Added Node.copy().

This allows the copy.copy() usage in the Q._combine() method to finish
sooner, instead of having to fallback to using the reduce_ex(4)
method.

Thia also avoids having to fall into copy.copy() at in Q._combine(),
when combining a Q() with another Q().

Co-authored-by: Keryn Knight <keryn@…>

comment:13 by Mariusz Felisiak <felisiak.mariusz@…>, 21 months ago

In 845667f2:

Refs #32948 -- Simplified and optimized Q._combine() and invert().

  • Removed use of Q.deconstruct() in Q._combine().
  • Simplified and optimized Q.invert() by taking a shallow copy and swapping the negated attribute only.
  • Simplified construction in Q._combine().
  • Simplified conditions in Q._combine() as Q.conditional = True the first isinstance() check is unnecessary.
  • Removed copy.copy() branch in Q._combine().

Co-authored-by: Keryn Knight <keryn@…>

comment:14 by Mariusz Felisiak <felisiak.mariusz@…>, 21 months ago

In 9dff316b:

Refs #32948, Refs #32946 -- Used Q.create() internally for dynamic Q() objects.

Node.create() which has a compatible signature with Node.init()
takes in a single children argument rather than relying in unpacking
*args in Q.init() which calls Node.init().

In addition, we were often needing to unpack iterables into *args and
can instead pass a list direct to Node.create().

comment:15 by Mariusz Felisiak, 21 months ago

Resolution: fixed
Status: assignedclosed
Note: See TracTickets for help on using tickets.
Back to Top