Opened 3 years ago
Last modified 3 years ago
#32970 closed Cleanup/optimization
Investigate feasibility of improving WhereNode clone performance — at Version 1
Reported by: | Keryn Knight | Owned by: | Keryn Knight
---|---|---|---
Component: | Database layer (models, ORM) | Version: | dev
Severity: | Normal | Keywords: |
Cc: | | Triage Stage: | Ready for checkin
Has patch: | yes | Needs documentation: | no
Needs tests: | no | Patch needs improvement: | no
Easy pickings: | no | UI/UX: | no
Description (last modified by )
This relates to #32948 insofar as it's about figuring out how shallow or deep a copy needs to be when cloning `Node` instances/subclasses. And a bit to #28455, which acknowledges the need to sometimes avoid cloning because it is costly.

The PR I made for optimising `Q` combining/inverting (https://github.com/django/django/pull/14673) wants to introduce a `__copy__` method onto `Q` objects to indicate what level of operation is needed. Unlike that PR, it's decidedly not possible to take the same approach with `WhereNode` and just do `clone.__dict__ = self.__dict__.copy()` (things broke horribly), because attributes like `contains_aggregate` and `output_field` are cached properties on the `WhereNode` instance. Also, `resolved` gets set after the fact by `resolve_expression`. But, as far as I can tell, the looping/testing/child cloning may be removable. Perhaps once upon a time it was required, and other components have since changed such that it's now safe to reconsider. I can't readily say, but initial experimentation suggests it's OK to look at, at least in theory.
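To illustrate why the naive `__dict__` copy is risky here, a minimal sketch: `cached_property` stores its computed value in the instance `__dict__`, so copying the `__dict__` wholesale carries the cached value into the clone, where it can go stale. The `ToyNode` class and `is_aggregate` attribute below are hypothetical stand-ins, not Django code.

```python
from functools import cached_property

class ToyNode:
    """Hypothetical stand-in for WhereNode with a cached property."""
    def __init__(self, children=None):
        self.children = children or []

    @cached_property
    def contains_aggregate(self):
        # cached_property writes its result into self.__dict__
        return any(getattr(c, "is_aggregate", False) for c in self.children)

node = ToyNode()
node.contains_aggregate  # populates node.__dict__["contains_aggregate"]

clone = ToyNode.__new__(ToyNode)
clone.__dict__ = node.__dict__.copy()  # the naive shallow-dict approach
# The cached value travels with the clone, so it won't be recomputed even
# if the clone's children are later mutated:
assert "contains_aggregate" in clone.__dict__
```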
There is a PR already for this, here: https://github.com/django/django/pull/14709. It is currently marked draft/WIP because I couldn't easily run the whole test suite locally and needed to see whether the more exotic parts caused problems. They didn't, which was unexpected, so here I am. The PR currently replaces the body of the `WhereNode.clone` method, like so:
```python
clone = self.__class__._new_instance(
    children=[], connector=self.connector, negated=self.negated)
for child in self.children:
    if hasattr(child, 'clone'):
        clone.children.append(child.clone())
    else:
        clone.children.append(child)
```
with:
```python
clone = self.__class__._new_instance(
    children=None, connector=self.connector, negated=self.negated)
clone.children = self.children[:]
```
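For clarity, `self.children[:]` copies only the list, not its elements: the clone gets its own list that can safely be mutated, while the child objects themselves remain shared. A quick generic-Python illustration (not Django code):

```python
# children[:] makes a new list whose elements alias the originals
original = [object(), object()]
shallow = original[:]

assert shallow is not original     # the list itself is copied...
assert shallow[0] is original[0]   # ...but its elements are aliased

shallow.append(object())           # growing the copy...
assert len(original) == 2          # ...does not touch the original
```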
But I think that `Q` and `WhereNode` can both just return a shallow copy by only aliasing the children property, at which point the `__copy__` method could be implemented on `Node` directly and mirror the existing `__deepcopy__` method. Preferring those stdlib names draws a line in the sand over what level of copy should be expected. The existing `clone` and `copy` methods can become aliases to the same.
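To make that line in the sand concrete: once `__copy__` and `__deepcopy__` are defined on the node class, the stdlib `copy` module dispatches to them, and the shallow/deep distinction falls out of which function you call. A minimal sketch with a toy `Node` (not Django's actual class):

```python
import copy

class Node:
    """Toy node: __copy__ aliases children, __deepcopy__ recurses into them."""
    def __init__(self, children=None):
        self.children = list(children or [])

    def __copy__(self):
        clone = type(self)()
        clone.children = self.children[:]   # shallow: children are aliased
        return clone

    def __deepcopy__(self, memo):
        clone = type(self)()
        clone.children = copy.deepcopy(self.children, memo)
        return clone

inner = Node()
outer = Node([inner])

assert copy.copy(outer).children[0] is inner          # shallow keeps aliases
assert copy.deepcopy(outer).children[0] is not inner  # deep recurses
```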
Before any changes
Let's examine the before and after with as much data as I can easily gather. First, a bit of prelude:
```
In [1]: from django.db.models.sql.where import WhereNode
In [2]: from django.db.models import QuerySet
In [3]: from django.db.models.sql import Query
In [4]: from django.contrib.auth.models import User
In [5]: x = User.objects.all()
In [6]: y = User.objects.filter(username='test', email='test@test.test').exclude(username='test').filter(email='nottest')
In [7]: %load_ext line_profiler
```
Now let's establish a ballpark for how much time is spent where; I've stripped the line profiling down to just the relevant bits for clarity.
Queryset with no clauses (`x`)
```
In [8]: %lprun -f QuerySet._chain -f QuerySet._clone -f Query.chain -f Query.clone -f WhereNode.clone for _ in range(1000): x._chain()

Function: QuerySet._chain
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  1325                                           def _chain(self, **kwargs):
   ...
  1330      1000      88169.0     88.2     97.8      obj = self._clone()

Function: QuerySet._clone
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  1337                                           def _clone(self):
   ...
  1342      1000      80796.0     80.8     86.3      chained = self.query.chain()
  1343      1000       7240.0      7.2      7.7      c = self.__class__(model=self.model, query=chained, using=self._db, hints=self._hints)

Function: Query.chain
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   341                                           def chain(self, klass=None):
   ...
   346      2000     116961.0     58.5     94.8      obj = self.clone()

Function: Query.clone
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   290                                           def clone(self):
   ...
   304      2000      18294.0      9.1     27.0      obj.where = self.where.clone()

Function: WhereNode.clone
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   142                                           def clone(self):
   ...
   148      4000       6227.0      1.6     63.7      clone = self.__class__._new_instance(
   149      2000       1170.0      0.6     12.0          children=[], connector=self.connector, negated=self.negated)
   150      2000       1407.0      0.7     14.4      for child in self.children:
   151                                                   if hasattr(child, 'clone'):
   152                                                       clone.children.append(child.clone())
   153                                                   else:
   154                                                       clone.children.append(child)
   155      2000        969.0      0.5      9.9      return clone
```
Queryset with where clauses (`y`)
You can see the cost begin to go up:
```
In [9]: %lprun -f QuerySet._chain -f QuerySet._clone -f Query.chain -f Query.clone -f WhereNode.clone for _ in range(1000): y._chain()

Function: Query.clone
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   290                                           def clone(self):
   ...
   304      2000      60065.0     30.0     59.1      obj.where = self.where.clone()

Function: WhereNode.clone
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   142                                           def clone(self):
   148      8000      11473.0      1.4     33.1      clone = self.__class__._new_instance(
   149      4000       2379.0      0.6      6.9          children=[], connector=self.connector, negated=self.negated)
   150     14000       7241.0      0.5     20.9      for child in self.children:
   151     10000       5553.0      0.6     16.0          if hasattr(child, 'clone'):
   152      2000       1255.0      0.6      3.6              clone.children.append(child.clone())
   153                                                   else:
   154      8000       5069.0      0.6     14.6              clone.children.append(child)
   155      4000       1715.0      0.4      4.9      return clone
```
and the timings for those two querysets (`x` and `y`), plus an example of prefetching related data:
```
In [10]: %timeit x._chain()
6.43 µs ± 52 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [11]: %timeit y._chain()
8.24 µs ± 35.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [12]: %timeit y.query.where.clone()
2.79 µs ± 38.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [13]: %prun for _ in range(100000): y._chain()
         3400003 function calls (3300003 primitive calls) in 1.506 seconds

   ncalls        tottime  percall  cumtime  percall  filename:lineno(function)
   100000          0.357    0.000    1.007    0.000  query.py:290(clone)
200000/100000       0.301    0.000    0.539    0.000  where.py:142(clone)  <-- WhereNode
   100000          0.172    0.000    1.344    0.000  query.py:1337(_clone)
   100000          0.068    0.000    1.086    0.000  query.py:341(chain)
   100000          0.058    0.000    1.415    0.000  query.py:1325(_chain)

In [14]: %timeit tuple(User.objects.prefetch_related('groups__permissions', 'user_permissions'))
18.2 ms ± 117 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
After replacing the clone method
Given those data points, here are the same bits of line profiling, showing only those where the needle has moved:
Queryset with no clauses (`x`)
```
Function: Query.clone
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   290                                           def clone(self):
   ...
   304      2000      14134.0      7.1     22.8      obj.where = self.where.clone()

Function: WhereNode.clone
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   142                                           def clone(self):
   ...
   148      2000       5586.0      2.8     71.0      clone = self.__class__._new_instance(children=None, connector=self.connector, negated=self.negated)
   149      2000       1412.0      0.7     18.0      clone.children = self.children[:]
   150      2000        865.0      0.4     11.0      return clone
```
Queryset with where clauses (`y`)
```
Function: Query.clone
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   290                                           def clone(self):
   ...
   304      2000      13730.0      6.9     24.7      obj.where = self.where.clone()

Function: WhereNode.clone
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   142                                           def clone(self):
   ...
   148      2000       5273.0      2.6     69.0      clone = self.__class__._new_instance(children=None, connector=self.connector, negated=self.negated)
   149      2000       1498.0      0.7     19.6      clone.children = self.children[:]
   150      2000        871.0      0.4     11.4      return clone
```
And the timings; in the empty case (`x`), it's basically just paying the cost of class creation, but where there are clauses it is more consistent (i.e. nearer constant time vs. linear growth):
```
In [10]: %timeit x._chain()
6.51 µs ± 47.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [11]: %timeit y._chain()
6.25 µs ± 36.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [12]: %timeit y.query.where.clone()
1.16 µs ± 9.09 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [13]: %prun for _ in range(100000): y._chain()
         2100003 function calls in 1.087 seconds

   ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
   100000    0.355    0.000    0.606    0.000  query.py:290(clone)
   100000    0.169    0.000    0.939    0.000  query.py:1337(_clone)
   100000    0.068    0.000    0.139    0.000  where.py:142(clone)  <-- WhereNode
   100000    0.067    0.000    0.683    0.000  query.py:341(chain)
   100000    0.058    0.000    1.009    0.000  query.py:1325(_chain)

In [14]: %timeit tuple(User.objects.prefetch_related('groups__permissions', 'user_permissions'))
17.9 ms ± 345 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```
Gut feeling says the "final" implementation could be something like the following, directly on the `Node` class:
```python
def __copy__(self):
    clone = self._new_instance(
        children=None, connector=self.connector, negated=self.negated)
    clone.children = self.children[:]
    return clone

copy = __copy__
clone = __copy__

def __deepcopy__(self, memo):
    ...
    clone.children = copy.deepcopy(self.children, memo)
    ...
```
which would remove the need for methods on either subclass. Note also that aliasing the methods that way, rather than doing `def copy(self): return self.clone()` as `WhereNode` currently does, is also slightly quicker. We're talking 0.10µs - small potatoes, but it doesn't hurt to have, and it's one fewer stack frame.
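The difference between the two spellings can be seen directly: a class-body alias shares one function object and one call, while the delegating wrapper adds an extra method call and stack frame. A small sketch (toy classes, not Django code):

```python
class Aliased:
    def __copy__(self):
        return "copied"
    copy = __copy__    # class-body alias: same function object, one frame
    clone = __copy__

class Delegating:
    def __copy__(self):
        return "copied"
    def copy(self):
        return self.__copy__()   # extra method call and stack frame

# The alias is literally the same function; the wrapper is not.
assert Aliased.copy is Aliased.__copy__
assert Delegating.copy is not Delegating.__copy__
assert Aliased().clone() == Delegating().copy() == "copied"
```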
A final note, then: if this properly shallow clone is done, I suspect `resolve_expression` can be altered too, to avoid calling `_resolve_node` if it's already been done on `self` - the shallow copy would ensure that anything which had previously been resolved is kept around and doesn't need doing again, I think.
Other stats: of the 14,000 tests I can run easily, only once was `resolved` in `self.__dict__` when `clone` was called (the value was `True`). `contains_aggregate` turned up 209 times (207 with a value of `False`, 2 with `True`), and `output_field` appeared only twice (both `BooleanField`). Those are the only extra attributes I spotted.