Opened 3 years ago
Last modified 3 years ago
#32940 closed Cleanup/optimization
Consider removing unused/unnecessary functionality in tree.Node.add — at Initial Version
Reported by: | Keryn Knight | Owned by: | nobody |
---|---|---|---|
Component: | Utilities | Version: | dev |
Severity: | Normal | Keywords: | |
Cc: | Triage Stage: | Accepted | |
Has patch: | yes | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
I'm proposing for discussion the removal of 2 things within the Node.add
method, or at least further investigation on why it's not possible to do so; those 2 bits are outlined immediately below.
If we instrument the Node.add
method, used by query_utils.Q
and where.WhereNode
extensively, like so:
... if data in self.children: print('v- data existed') traceback.print_stack() if self.connector == conn_type and data in self.children: print('^- data existed and connector matched') return data if not squash: print('<- squash is not Truthy') self.children.append(data) return data ...
and run the test suite (or as much as I'm able easily):
... Ran 14834 tests in 453.388s OK (skipped=1188, expected failures=4)
We will find that the data in self.children
barely registers as tested/used, and squash
is entirely unused (unless those 1188 tests hide the truth).
Having separately instrumented the add method previously to check the number of calls by class type, it appears as below:
1493
calls toQ.add
of which0
have adata in children
match.2
calls toNode.add
of which1
is adata in children
match.210862
calls toWhereNode.add
of which2
are a `data in children match.
But of all of those matches, the connector is never the same, so it never does that return early expected optimisation.
The tests which exercise those 3 matches are:
tests.queries.tests.DisjunctiveFilterTests.test_ticket8283
tests.queries.tests.Queries4Tests.test_combine_or_filter_reuse
tests.utils_tests.test_tree.NodeTests.test_add_eq_child_mixed_connector
The last of those is intentionally making sure that it does get added due to the connector difference.
The first two, as they could affect where clause pushdown & merging, or cause different query plans due to repeats, we should check to make sure that the queries (additionally to the results) remain the same before and after by doing the following back of the napkin job for each of them:
self.assertEqual( str(x.query), "???" )
and confirm that they remain the same (they do), which are:
'SELECT "queries_extrainfo"."id", "queries_extrainfo"."info", "queries_extrainfo"."note_id", "queries_extrainfo"."value", "queries_extrainfo"."date_id", "queries_extrainfo"."filterable" FROM "queries_extrainfo" WHERE (("queries_extrainfo"."note_id" = 1 OR "queries_extrainfo"."info" = e2) AND "queries_extrainfo"."note_id" = 1) ORDER BY "queries_extrainfo"."info" ASC'; 'SELECT "queries_extrainfo"."id", "queries_extrainfo"."info", "queries_extrainfo"."note_id", "queries_extrainfo"."value", "queries_extrainfo"."date_id", "queries_extrainfo"."filterable" FROM "queries_extrainfo" WHERE (("queries_extrainfo"."info" = e2 OR "queries_extrainfo"."note_id" = 1) AND "queries_extrainfo"."note_id" = 1) ORDER BY "queries_extrainfo"."info" ASC'; 'SELECT "queries_author"."id", "queries_author"."name", "queries_author"."num", "queries_author"."extra_id" FROM "queries_author" WHERE (("queries_author"."name" = a1 OR "queries_author"."name" = a3) AND "queries_author"."name" = a1)';
Removing the data in self.children
check should be a small optimisation win because it's an O(n)
scan across the entire list of direct children, and uses the Node.__eq__
method which does an equality scan on it's own children (so grandchildren).
In [2]: %%timeit ...: x = Q(('a', 1), ('b', 2), c=3, d=4, b=1) ...: y = Q(aa=1, bb=2, cc=3) ...: x.add(y, Q.OR) 4.81 µs ± 16.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
after removing:
4.66 µs ± 21.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
It's worth noting that the concept of checking data in self.children
is semi-flawed anyway, because Q(('a', 1), a=2)
or Q(('a', 1), ('a', 2))
very simply allow duplicate entries to occur immediately and up-front.
I have a patch which I'll push up shortly once I've tidied it of all my instrumentation and pdb
usage. The hope is that the entire CI passes, if it doesn't this is all moot :)