id	summary	reporter	owner	description	type	status	component	version	severity	resolution	keywords	cc	stage	has_patch	needs_docs	needs_tests	needs_better_patch	easy	ui_ux
32970	Investigate feasibility of improving WhereNode clone performance	Keryn Knight	Keryn Knight	"This relates to #32948 insofar as it's about figuring out how shallow or deep a copy is necessary for cloning `Node` instances/subclasses. And a bit to #28455 which acknowledges the need to avoid cloning sometimes due to it being costly.

The PR I made for optimising Q combining/inverting (https://github.com/django/django/pull/14673) wants to introduce a `__copy__` method onto `Q` objects to indicate what level of operation is needed. Unlike that PR, it's decidedly not possible to take the same approach with WhereNode and just do `clone.__dict__ = self.__dict__.copy()` (things broke horribly) because of attributes like `contains_aggregate` and `output_field` which are cached properties on the WhereNode instance. Also `resolved` gets set after the fact by `resolve_expression`. But, as far as I can tell, the looping/testing/child cloning ''may be removable''. Perhaps once upon a time they were required and other components changed such that it's now safe to consider. I can't readily say, but initial experimentation suggests it's OK to look at, at least in theory.

There is a PR already for this, here: https://github.com/django/django/pull/14709 which is currently marked draft/WIP because I couldn't easily run the whole test suite locally and needed to see if the more exotic parts caused problems. They didn't, which was unexpected, so here I am. The PR currently replaces the `WhereNode.clone` method like so:
{{{
clone = self.__class__._new_instance(children=[], connector=self.connector, negated=self.negated)
for child in self.children:
    if hasattr(child, 'clone'):
        clone.children.append(child.clone())
    else:
        clone.children.append(child)
}}}
with:
{{{
clone = self.__class__._new_instance(children=None, connector=self.connector, negated=self.negated)
clone.children = self.children[:]
}}}


But I ''think'' that Q and WhereNode ''can'' both just return a shallow copy by only aliasing the children property, at which point I think the `__copy__` method could be implemented on `Node` directly and mirror the existing `__deepcopy__` method. By preferring those stdlib names it draws a line in the sand over what level of copy should be expected. The existing `clone` and `copy` methods can become aliases to same.


=== Before any changes

If we now begin to examine the before and after with as much data as I can easily gather. First a bit of prelude:

{{{
In [1]: from django.db.models.sql.where import WhereNode
In [2]: from django.db.models import QuerySet
In [3]: from django.db.models.sql import Query
In [4]: from django.contrib.auth.models import User
In [5]: x = User.objects.all()
In [6]: y = User.objects.filter(username='test', email='test@test.test').exclude(username='test').filter(email='nottest')
In [7]: %load_ext line_profiler
}}}

Now let's establish a ballpark for how much time is spent where, I've stripped the line profiling down to just the relevant bits for clarity.

==== Queryset with no clauses (`x`)

{{{
In [8]: %lprun -f QuerySet._chain -f QuerySet._clone -f Query.chain -f Query.clone -f WhereNode.clone for _ in range(1000): x._chain()

Function: QuerySet._chain
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  1325                                               def _chain(self, **kwargs):
  ...
  1330      1000      88169.0     88.2     97.8          obj = self._clone()

Function: QuerySet._clone
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  1337                                               def _clone(self):
  ...
  1342      1000      80796.0     80.8     86.3          chained = self.query.chain()
  1343      1000       7240.0      7.2      7.7          c = self.__class__(model=self.model, query=chained, using=self._db, hints=self._hints))

Function: Query.chain
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   341                                               def chain(self, klass=None):
   ...
   346      2000     116961.0     58.5     94.8          obj = self.clone()

Function: Query.clone
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   290                                               def clone(self):
   ...
   304      2000      18294.0      9.1     27.0          obj.where = self.where.clone()

Function: WhereNode.clone
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   142                                               def clone(self):
   ...
   148      4000       6227.0      1.6     63.7          clone = self.__class__._new_instance(
   149      2000       1170.0      0.6     12.0              children=[], connector=self.connector, negated=self.negated)
   150      2000       1407.0      0.7     14.4          for child in self.children:
   151                                                       if hasattr(child, 'clone'):
   152                                                           clone.children.append(child.clone())
   153                                                       else:
   154                                                           clone.children.append(child)
   155      2000        969.0      0.5      9.9          return clone
}}}

==== Queryset with where clauses (`y`)

Youu can see the cost begin to go up:
{{{
In [9]: %lprun -f QuerySet._chain -f QuerySet._clone -f Query.chain -f Query.clone -f WhereNode.clone for _ in range(1000): y._chain()

Function: Query.clone
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   290                                               def clone(self):
   ...
   304      2000      60065.0     30.0     59.1          obj.where = self.where.clone()

Function: WhereNode.clone
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   142                                               def clone(self):
   148      8000      11473.0      1.4     33.1          clone = self.__class__._new_instance(
   149      4000       2379.0      0.6      6.9              children=[], connector=self.connector, negated=self.negated)
   150     14000       7241.0      0.5     20.9          for child in self.children:
   151     10000       5553.0      0.6     16.0              if hasattr(child, 'clone'):
   152      2000       1255.0      0.6      3.6                  clone.children.append(child.clone())
   153                                                       else:
   154      8000       5069.0      0.6     14.6                  clone.children.append(child)
   155      4000       1715.0      0.4      4.9          return clone

}}}

and the timings for those 2 (`x` and `y`) querysets, plus an example of prefetching related data:

{{{
In [10]: %timeit x._chain()
6.43 µs ± 52 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [11]: %timeit y._chain()
8.24 µs ± 35.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each))
In [12]: %timeit y.query.where.clone()
2.79 µs ± 38.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [13]: %prun for _ in range(100000): y._chain()
3400003 function calls (3300003 primitive calls) in 1.506 seconds
       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       100000    0.357    0.000    1.007    0.000 query.py:290(clone)
200000/100000    0.301    0.000    0.539    0.000 where.py:142(clone) <-- WhereNode
       100000    0.172    0.000    1.344    0.000 query.py:1337(_clone)
       100000    0.068    0.000    1.086    0.000 query.py:341(chain)
       100000    0.058    0.000    1.415    0.000 query.py:1325(_chain)
In [14]: %timeit tuple(User.objects.prefetch_related('groups__permissions', 'user_permissions'))
18.2 ms ± 117 µs per loop (mean ± std. dev. of 7 runs, 100 loops each))
}}}

=== After replacing the clone method

Given those data points, here are the same bits of line profiling, showing only those where the needle has moved:

==== Queryset with no clauses (`x`)

{{{
Function: Query.clone
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   290                                               def clone(self):
   ...
   304      2000      14134.0      7.1     22.8          obj.where = self.where.clone()

Function: WhereNode.clone
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   142                                               def clone(self):
   ...
   148      2000       5586.0      2.8     71.0          clone = self.__class__._new_instance(children=None, connector=self.connector, negated=self.negated)
   149      2000       1412.0      0.7     18.0          clone.children = self.children[:]
   150      2000        865.0      0.4     11.0          return clone
}}}

==== Queryset with where clauses (`y`)

{{{
Function: Query.clone at line 290
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   290                                               def clone(self):
   ...
   304      2000      13730.0      6.9     24.7          obj.where = self.where.clone()

Function: WhereNode.clone
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   142                                               def clone(self):
   ...
   148      2000       5273.0      2.6     69.0          clone = self.__class__._new_instance(children=None, connector=self.connector, negated=self.negated)
   149      2000       1498.0      0.7     19.6          clone.children = self.children[:]
   150      2000        871.0      0.4     11.4          return clone
}}}

And the timings; in the empty case (`x`), it's ''basically'' just paying the cost of class creation, but where there are clauses it is more consistent (ie: ''nearer'' a constant time vs linear growth).
{{{
In [10]: %timeit x._chain()
6.51 µs ± 47.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each))
In [11]: %timeit y._chain()
6.25 µs ± 36.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [12]: %timeit y.query.where.clone()
1.16 µs ± 9.09 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [13]: %prun for _ in range(100000): y._chain()
2100003 function calls in 1.087 seconds
ncalls  tottime  percall  cumtime  percall filename:lineno(function)
100000    0.355    0.000    0.606    0.000 query.py:290(clone)
100000    0.169    0.000    0.939    0.000 query.py:1337(_clone)
100000    0.068    0.000    0.139    0.000 where.py:142(clone) <--- WhereNode
100000    0.067    0.000    0.683    0.000 query.py:341(chain)
100000    0.058    0.000    1.009    0.000 query.py:1325(_chain)
In [14]: %timeit tuple(User.objects.prefetch_related('groups__permissions', 'user_permissions'))
17.9 ms ± 345 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
}}}

Gut feeling says the ""final"" implementation could be something like the following, directly on the `Node` class.
{{{
def __copy__(self):
    clone = self._new_instance(children=None, connector=self.connector, negated=self.negated)
    clone.children = self.children[:]
    return clone

copy = __copy__
clone = __copy__

def __deepcopy__(self, memo):
    ...
    clone.children = copy.deepcopy(self.children, memo)
    ...
}}}

which would remove the need for methods on either subclass. Note also that aliasing the methods that way rather than doing `def copy(): return self.clone()` as `WhereNode` currently does is also ''slightly'' quicker. We're talking `0.10µs` - small potatoes, but it doesn't hurt to have + it's 1 fewer stack frame.

A final note then: if this properly shallow clone is done, I suspect `resolve_expression` can be altered too, to avoid calling `_resolve_node` if it's already been done on `self` - the shallow copy would ensure that anything which had previously been resolved is kept around and doesn't need doing again, I think.

Other stats: Of the 14,000 tests I can run easily, only 1 time was `resolved` in `self.__dict__` when `clone` was called (the value was `True`). `contains_aggregate` turned up 209 times (207 with a value of `False`, 2 with `True`), and `output_field` only appeared 2 times (both `BooleanField`). Those are the only extra attribute I spotted."	Cleanup/optimization	closed	Database layer (models, ORM)	dev	Normal	wontfix			Ready for checkin	1	0	0	0	0	0