Opened 3 years ago

Last modified 3 years ago

#32970 closed Cleanup/optimization

Investigate feasibility of improving WhereNode clone performance — at Version 1

Reported by: Keryn Knight Owned by: Keryn Knight
Component: Database layer (models, ORM) Version: dev
Severity: Normal Keywords:
Cc: Triage Stage: Ready for checkin
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description (last modified by Keryn Knight)

This relates to #32948 insofar as it's about figuring out how shallow or deep a copy is necessary for cloning Node instances/subclasses. And a bit to #28455 which acknowledges the need to avoid cloning sometimes due to it being costly.

The PR I made for optimising Q combining/inverting (https://github.com/django/django/pull/14673) wants to introduce a __copy__ method onto Q objects to indicate what level of operation is needed. Unlike that PR, it's decidedly not possible to take the same approach with WhereNode and just do clone.__dict__ = self.__dict__.copy() (things broke horribly) because of attributes like contains_aggregate and output_field which are cached properties on the WhereNode instance. Also resolved gets set after the fact by resolve_expression. But, as far as I can tell, the looping/testing/child cloning may be removable. Perhaps once upon a time they were required and other components changed such that it's now safe to consider. I can't readily say, but initial experimentation suggests it's OK to look at, at least in theory.

There is a PR already for this, here: https://github.com/django/django/pull/14709 which is currently marked draft/WIP because I couldn't easily run the whole test suite locally and needed to see if the more exotic parts caused problems. They didn't, which was unexpected, so here I am. The PR currently replaces the WhereNode.clone method like so:

clone = self.__class__._new_instance(children=[], connector=self.connector, negated=self.negated)
for child in self.children:
    if hasattr(child, 'clone'):
        clone.children.append(child.clone())
    else:
        clone.children.append(child)

with:

clone = self.__class__._new_instance(children=None, connector=self.connector, negated=self.negated)
clone.children = self.children[:]

But I think that Q and WhereNode can both just return a shallow copy by only aliasing the children property, at which point I think the __copy__ method could be implemented on Node directly and mirror the existing __deepcopy__ method. By preferring those stdlib names it draws a line in the sand over what level of copy should be expected. The existing clone and copy methods can become aliases to same.

Before any changes

If we now begin to examine the before and after with as much data as I can easily gather. First a bit of prelude:

In [1]: from django.db.models.sql.where import WhereNode
In [2]: from django.db.models import QuerySet
In [3]: from django.db.models.sql import Query
In [4]: from django.contrib.auth.models import User
In [5]: x = User.objects.all()
In [6]: y = User.objects.filter(username='test', email='test@test.test').exclude(username='test').filter(email='nottest')
In [7]: %load_ext line_profiler

Now let's establish a ballpark for how much time is spent where, I've stripped the line profiling down to just the relevant bits for clarity.

Queryset with no clauses (x)

In [8]: %lprun -f QuerySet._chain -f QuerySet._clone -f Query.chain -f Query.clone -f WhereNode.clone for _ in range(1000): x._chain()

Function: QuerySet._chain
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  1325                                               def _chain(self, **kwargs):
  ...
  1330      1000      88169.0     88.2     97.8          obj = self._clone()

Function: QuerySet._clone
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  1337                                               def _clone(self):
  ...
  1342      1000      80796.0     80.8     86.3          chained = self.query.chain()
  1343      1000       7240.0      7.2      7.7          c = self.__class__(model=self.model, query=chained, using=self._db, hints=self._hints))

Function: Query.chain
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   341                                               def chain(self, klass=None):
   ...
   346      2000     116961.0     58.5     94.8          obj = self.clone()

Function: Query.clone
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   290                                               def clone(self):
   ...
   304      2000      18294.0      9.1     27.0          obj.where = self.where.clone()

Function: WhereNode.clone
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   142                                               def clone(self):
   ...
   148      4000       6227.0      1.6     63.7          clone = self.__class__._new_instance(
   149      2000       1170.0      0.6     12.0              children=[], connector=self.connector, negated=self.negated)
   150      2000       1407.0      0.7     14.4          for child in self.children:
   151                                                       if hasattr(child, 'clone'):
   152                                                           clone.children.append(child.clone())
   153                                                       else:
   154                                                           clone.children.append(child)
   155      2000        969.0      0.5      9.9          return clone

Queryset with where clauses (y)

Youu can see the cost begin to go up:

In [9]: %lprun -f QuerySet._chain -f QuerySet._clone -f Query.chain -f Query.clone -f WhereNode.clone for _ in range(1000): y._chain()

Function: Query.clone
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   290                                               def clone(self):
   ...
   304      2000      60065.0     30.0     59.1          obj.where = self.where.clone()

Function: WhereNode.clone
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   142                                               def clone(self):
   148      8000      11473.0      1.4     33.1          clone = self.__class__._new_instance(
   149      4000       2379.0      0.6      6.9              children=[], connector=self.connector, negated=self.negated)
   150     14000       7241.0      0.5     20.9          for child in self.children:
   151     10000       5553.0      0.6     16.0              if hasattr(child, 'clone'):
   152      2000       1255.0      0.6      3.6                  clone.children.append(child.clone())
   153                                                       else:
   154      8000       5069.0      0.6     14.6                  clone.children.append(child)
   155      4000       1715.0      0.4      4.9          return clone

and the timings for those 2 (x and y) querysets, plus an example of prefetching related data:

In [10]: %timeit x._chain()
6.43 µs ± 52 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [11]: %timeit y._chain()
8.24 µs ± 35.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each))
In [12]: %timeit y.query.where.clone()
2.79 µs ± 38.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [13]: %prun for _ in range(100000): y._chain()
3400003 function calls (3300003 primitive calls) in 1.506 seconds
       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       100000    0.357    0.000    1.007    0.000 query.py:290(clone)
200000/100000    0.301    0.000    0.539    0.000 where.py:142(clone) <-- WhereNode
       100000    0.172    0.000    1.344    0.000 query.py:1337(_clone)
       100000    0.068    0.000    1.086    0.000 query.py:341(chain)
       100000    0.058    0.000    1.415    0.000 query.py:1325(_chain)
In [14]: %timeit tuple(User.objects.prefetch_related('groups__permissions', 'user_permissions'))
18.2 ms ± 117 µs per loop (mean ± std. dev. of 7 runs, 100 loops each))

After replacing the clone method

Given those data points, here are the same bits of line profiling, showing only those where the needle has moved:

Queryset with no clauses (x)

Function: Query.clone
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   290                                               def clone(self):
   ...
   304      2000      14134.0      7.1     22.8          obj.where = self.where.clone()

Function: WhereNode.clone
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   142                                               def clone(self):
   ...
   148      2000       5586.0      2.8     71.0          clone = self.__class__._new_instance(children=None, connector=self.connector, negated=self.negated)
   149      2000       1412.0      0.7     18.0          clone.children = self.children[:]
   150      2000        865.0      0.4     11.0          return clone

Queryset with where clauses (y)

Function: Query.clone at line 290
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   290                                               def clone(self):
   ...
   304      2000      13730.0      6.9     24.7          obj.where = self.where.clone()

Function: WhereNode.clone
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   142                                               def clone(self):
   ...
   148      2000       5273.0      2.6     69.0          clone = self.__class__._new_instance(children=None, connector=self.connector, negated=self.negated)
   149      2000       1498.0      0.7     19.6          clone.children = self.children[:]
   150      2000        871.0      0.4     11.4          return clone

And the timings; in the empty case (x), it's basically just paying the cost of class creation, but where there are clauses it is more consistent (ie: nearer a constant time vs linear growth).

In [10]: %timeit x._chain()
6.51 µs ± 47.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each))
In [11]: %timeit y._chain()
6.25 µs ± 36.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [12]: %timeit y.query.where.clone()
1.16 µs ± 9.09 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [13]: %prun for _ in range(100000): y._chain()
2100003 function calls in 1.087 seconds
ncalls  tottime  percall  cumtime  percall filename:lineno(function)
100000    0.355    0.000    0.606    0.000 query.py:290(clone)
100000    0.169    0.000    0.939    0.000 query.py:1337(_clone)
100000    0.068    0.000    0.139    0.000 where.py:142(clone) <--- WhereNode
100000    0.067    0.000    0.683    0.000 query.py:341(chain)
100000    0.058    0.000    1.009    0.000 query.py:1325(_chain)
In [14]: %timeit tuple(User.objects.prefetch_related('groups__permissions', 'user_permissions'))
17.9 ms ± 345 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Gut feeling says the "final" implementation could be something like the following, directly on the Node class.

def __copy__(self):
    clone = self._new_instance(children=None, connector=self.connector, negated=self.negated)
    clone.children = self.children[:]
    return clone

copy = __copy__
clone = __copy__

def __deepcopy__(self, memo):
    ...
    clone.children = copy.deepcopy(self.children, memo)
    ...

which would remove the need for methods on either subclass. Note also that aliasing the methods that way rather than doing def copy(): return self.clone() as WhereNode currently does is also slightly quicker. We're talking 0.10µs - small potatoes, but it doesn't hurt to have + it's 1 fewer stack frame.

A final note then: if this properly shallow clone is done, I suspect resolve_expression can be altered too, to avoid calling _resolve_node if it's already been done on self - the shallow copy would ensure that anything which had previously been resolved is kept around and doesn't need doing again, I think.

Other stats: Of the 14,000 tests I can run easily, only 1 time was resolved in self.__dict__ when clone was called (the value was True). contains_aggregate turned up 209 times (207 with a value of False, 2 with True), and output_field only appeared 2 times (both BooleanField). Those are the only extra attribute I spotted.

Change History (1)

comment:1 by Keryn Knight, 3 years ago

Description: modified (diff)
Note: See TracTickets for help on using tickets.
Back to Top