Opened 4 years ago

Last modified 6 months ago

#28455 assigned Cleanup/optimization

Create "inplace" QuerySets to speed up certain operations

Reported by: Anssi Kääriäinen Owned by: Keryn Knight
Component: Database layer (models, ORM) Version: dev
Severity: Normal Keywords:
Cc: Simon Charette Triage Stage: Accepted
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: yes
Easy pickings: no UI/UX: no

Description

Part two of #20880: Inplace querysets work precisely like full-clone querysets. A queryset is called inplace if it doesn't do cloning at all between operations. This again allows speeding up certain internal operations like model.save() and prefetch_related(), and also allows users to optimise heavy-to-create querysets by avoiding clone() overhead.

Change History (5)

comment:1 Changed 4 years ago by Tim Graham

I rebased work from Anssi's branch. I guess the plan was to keep "inplace" as a private API initially (hence the underscore prefix) but I'm not sure. Someone else is welcome to continue this work. I'm not planning to continue it soon.

comment:2 Changed 2 years ago by Simon Charette

Cc: Simon Charette added

comment:3 Changed 6 months ago by Keryn Knight

Owner: changed from nobody to Keryn Knight
Status: new → assigned

I don't think I'm stepping on anyone's toes at this juncture, but do shout if so!

I'd accidentally been wading into similar waters when I stumbled across this ticket, which has ended up at roughly the same place/conclusions, and I have a rough sketch based on the prior branches for an MVP to resume discussion on this. I'll tidy up what I have, try and gather some simple benchmarks, and throw it into the PR queue for sanity checking.

I guarantee there's more nuance to this than I've anticipated so far, but we've got to restart somewhere, right? If it works out, all the better :)

comment:4 Changed 6 months ago by Keryn Knight

PR is https://github.com/django/django/pull/14675

Based heavily on Anssi's original implementation idea. This introduces 2 ways of opting out of cloning, both ostensibly private API. Firstly and primarily, a context manager which temporarily makes the current QuerySet (+ Query, implicitly) mutable:

with User.objects.all()._avoid_cloning() as queryset1:
    queryset1.filter(x=1).exclude(y=2).filter(z__in=[1,2,3])
# OR 
queryset2 = User.objects.all()
with queryset2._avoid_cloning():
    queryset2.filter(x=1).exclude(y=2).filter(z__in=[1,2,3])

And secondly, an imperative form, for use cases where there might be multiple filters across a long list of lines, where it's not desirable to muddy the VCS history and/or the indent depth by using the context manager:

queryset = User.objects.all()._disable_cloning()
queryset.filter(x=1).exclude(y=2).filter(z__in=[1,2,3])
queryset._enable_cloning()
queryset2 = queryset.filter(abc='test')  # This should be a new clone, as usual.

There's currently no support for any sort of nesting or ensured balancing (same as the original implementation, as I understand it), so ._disable_cloning()._disable_cloning()._disable_cloning() would be undone by a single ._enable_cloning() call, because it's a toggle rather than a stack of pushes/pops; hence the preference for the context manager.
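To illustrate the difference, here's a minimal sketch (plain Python stand-ins, not Django's actual implementation; the class and attribute names are hypothetical) contrasting the boolean toggle described above with a depth-counter alternative that would survive nesting:

```python
class ToggleCloning:
    """Boolean flag: a single enable() undoes any number of disable() calls."""
    def __init__(self):
        self._cloning = True  # hypothetical flag name

    def disable(self):
        self._cloning = False

    def enable(self):
        self._cloning = True


class CountedCloning:
    """Depth counter: cloning resumes only once every disable() is balanced."""
    def __init__(self):
        self._disable_depth = 0

    def disable(self):
        self._disable_depth += 1

    def enable(self):
        self._disable_depth = max(0, self._disable_depth - 1)

    @property
    def cloning(self):
        return self._disable_depth == 0


toggle = ToggleCloning()
toggle.disable(); toggle.disable(); toggle.disable()
toggle.enable()
print(toggle._cloning)  # True: one enable() undid all three disables

counted = CountedCloning()
counted.disable(); counted.disable(); counted.disable()
counted.enable()
print(counted.cloning)  # False: two unbalanced disables still outstanding
```

The context manager sidesteps this entirely because entry and exit are guaranteed to balance.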

Given the following SQLite data:

In [1]: from django.contrib.auth.models import User, Group, Permission
In [2]: Group.objects.count()
Out[2]: 101
In [3]: User.objects.count()
Out[3]: 101
In [4]: Permission.objects.count()
Out[4]: 24
In [5]: User.user_permissions.through.count()
Out[5]: 0
In [6]: for user in User.objects.all():
   ...:     print(f'{user.groups.count()}, {user.groups.first().pk}, {user.user_permissions.count()}')
1, 1, 0
1, 2, 0
1, 3, 0
1, 4, 0
1, 5, 0
... they all have 1 group and zero permissions
In [7]: from django.db import connection
In [8]: connection.queries[-2:]
[{'sql': 'SELECT "auth_user"."id", ... FROM "auth_user"',
  'time': '0.001'},
 {'sql': 'SELECT ("auth_user_groups"."user_id") AS "_prefetch_related_val_user_id", "auth_group"."id", "auth_group"."name" FROM "auth_group" INNER JOIN "auth_user_groups" ON ("auth_group"."id" = "auth_user_groups"."group_id") WHERE "auth_user_groups"."user_id" IN (1, 2, 3, [...] 99, 100, 101)',
  'time': '0.000'}]

Before applying any of the changes:

In [2]: %timeit tuple(User.objects.all())
1.26 ms ± 9.06 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
(0 calls to ._clone)
In [3]: %timeit tuple(User.objects.prefetch_related('groups'))
6.52 ms ± 62.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
(105 calls to ._clone)
In [4]: %timeit tuple(User.objects.prefetch_related('groups', 'user_permissions'))
12.1 ms ± 226 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
(213 calls to ._clone)

After enabling the avoidance cloning technique for prefetch_related only:

In [2]: %timeit tuple(User.objects.all())
1.28 ms ± 16.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
(0 calls to ._clone)
In [3]: %timeit tuple(User.objects.prefetch_related('groups'))
5.93 ms ± 53.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
(4 calls to ._clone)
In [4]: %timeit tuple(User.objects.prefetch_related('groups', 'user_permissions'))
10.2 ms ± 59.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
(7 calls to ._clone)

So for the single prefetch it's a decrease of roughly 9%, and for the two prefetches roughly 15%. I make no pretense of being able to convert improvements between factors and percentages without risk of getting it wrong.

The number of calls to _clone was counted by re-running the queries in fresh ipython sessions, with _clone() incrementing a global integer each time it's called. That instrumentation isn't part of the actual timeit numbers.
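For anyone wanting to reproduce that style of counting, here's one way to do it sketched on a stand-in class (FakeQuerySet and the wrapper are hypothetical; the real measurement patched Django's QuerySet._clone directly):

```python
import functools

CLONE_CALLS = 0  # global counter, incremented on every _clone() call


class FakeQuerySet:
    """Stand-in for django.db.models.QuerySet, for illustration only."""
    def _clone(self):
        return FakeQuerySet()


def count_calls(method):
    """Wrap a method so each invocation bumps the global counter."""
    @functools.wraps(method)
    def wrapper(*args, **kwargs):
        global CLONE_CALLS
        CLONE_CALLS += 1
        return method(*args, **kwargs)
    return wrapper


# Monkey-patch the method in place, as one might for QuerySet._clone:
FakeQuerySet._clone = count_calls(FakeQuerySet._clone)

qs = FakeQuerySet()
qs._clone()._clone()
print(CLONE_CALLS)  # 2
```

Because the counting happens in a separate session from the %timeit runs, the wrapper's overhead never contaminates the benchmark figures.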

These improved prefetch timings only apply when hitting the manager.get_queryset() of prefetch_one_level, because we can assume that returns a fresh QuerySet instance (or subclass) that holds valid data and doesn't need cloning.
I don't think the same can be said of the if leaf and lookup.queryset [...] branch, because that queryset is shared and in an unknown state, so at least for the initial MVP it remains cloned. It may do one fewer clone operation if the database would be changed by using().

Calling chain directly, without disabling cloning:

In [1]: from django.contrib.auth.models import User, Group, Permission
In [2]: x = User.objects.all()
In [4]: %timeit y = x._chain()
7.05 µs ± 62.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

With cloning disabled:

In [6]: %timeit y = x._chain()
275 ns ± 3.15 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

As I write this, I have not implemented the other updates Anssi made, like updating get_contenttypes_and_models or create_permissions et al., so that the proposed API changes can be decided on and reviewed before moving forward with other potential usages.

Additionally, I've added no new tests to properly demonstrate the correctness of it, for the same reasons. To kick the discussion off, it should be enough to see whether the test suite passes or fails and iterate from there.

The number of QuerySet._clone operations done by the test suite (Ran 14876 tests, skipped=1192, expected failures=4) drops from 144851 to 142305, which doesn't inspire a huge amount of confidence, but it's a start.

comment:5 Changed 6 months ago by Keryn Knight

Has patch: set
Patch needs improvement: set

Updated slightly, and now I've sat down to get cProfile information using %prun for _ in range(100): tuple(User.objects.prefetch_related('groups__permissions', 'user_permissions'))

First, the baseline, showing only operations related to the change, for brevity (so no Model.__init__ etc):

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    31200    0.165    0.000    0.303    0.000 query.py:290(clone)
      300    0.098    0.000    3.134    0.010 query.py:1860(prefetch_one_level)
    31200    0.066    0.000    0.444    0.000 query.py:1337(_clone)
    30200    0.055    0.000    0.655    0.000 related_descriptors.py:883(_apply_rel_filters)
    40600    0.052    0.000    1.231    0.000 query.py:45(__iter__)
    30200    0.051    0.000    0.897    0.000 related_descriptors.py:899(get_queryset)
    31200    0.046    0.000    0.500    0.000 query.py:1325(_chain)
    40200    0.041    0.000    0.327    0.000 base.py:511(from_db)
    30500    0.039    0.000    0.931    0.000 query.py:982(_filter_or_exclude)
    30600    0.030    0.000    0.194    0.000 manager.py:142(get_queryset)
    31200    0.029    0.000    0.337    0.000 query.py:341(chain)
    31200    0.028    0.000    0.070    0.000 where.py:142(clone)

Using the @contextmanager decorator. Lines are shown in the same order as the above, so they're technically ordered by the baseline's internal time:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1000    0.007    0.000    0.013    0.000 query.py:290(clone)
      300    0.082    0.000    2.932    0.010 query.py:1881(prefetch_one_level)
     1000    0.003    0.000    0.019    0.000 query.py:1358(_clone)
    30200    0.097    0.000    0.444    0.000 related_descriptors.py:884(_apply_rel_filters)
    40600    0.052    0.000    0.947    0.000 query.py:46(__iter__)
    30200    0.054    0.000    1.039    0.000 related_descriptors.py:901(get_queryset)
    31200    0.029    0.000    0.056    0.000 query.py:1343(_chain)
    40200    0.041    0.000    0.376    0.000 base.py:511(from_db)
    30500    0.041    0.000    0.496    0.000 query.py:984(_filter_or_exclude)
    30600    0.032    0.000    0.545    0.000 manager.py:142(get_queryset)
     1000    0.001    0.000    0.014    0.000 query.py:341(chain)
     1000    0.002    0.000    0.003    0.000 where.py:142(clone)
      ...
    30600    0.081    0.000    0.086    0.000 contextlib.py:86(__init__)
    60400    0.024    0.000    0.034    0.000 query.py:1335(_avoid_cloning)
    30600    0.017    0.000    0.067    0.000 contextlib.py:121(__exit__)
    30600    0.016    0.000    0.102    0.000 contextlib.py:242(helper)
    30600    0.016    0.000    0.042    0.000 contextlib.py:112(__enter__)

And then finally with a custom context manager class, again using the same ordering as the baseline:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1000    0.010    0.000    0.016    0.000 query.py:290(clone)
      300    0.095    0.000    3.293    0.011 query.py:1888(prefetch_one_level)
     1000    0.004    0.000    0.023    0.000 query.py:1365(_clone)
    30200    0.102    0.000    0.412    0.000 related_descriptors.py:884(_apply_rel_filters)
    40600    0.062    0.000    1.133    0.000 query.py:46(__iter__)
    30200    0.063    0.000    1.032    0.000 related_descriptors.py:901(get_queryset)
    31200    0.077    0.000    0.111    0.000 query.py:1350(_chain)
    40200    0.049    0.000    0.391    0.000 base.py:511(from_db)
    30500    0.070    0.000    0.650    0.000 query.py:996(_filter_or_exclude)
    30600    0.038    0.000    0.563    0.000 manager.py:142(get_queryset)
     1000    0.002    0.000    0.018    0.000 query.py:341(chain)
     1000    0.002    0.000    0.004    0.000 where.py:142(clone)
      ...
    30200    0.006    0.000    0.006    0.000 query.py:178(__init__)
    30200    0.016    0.000    0.022    0.000 query.py:1347(_avoid_cloning)
    30200    0.014    0.000    0.020    0.000 query.py:184(__exit__)
    30200    0.015    0.000    0.021    0.000 query.py:181(__enter__)

Using the custom context manager is substantially faster than the contextlib decorator, at least in the simple form I currently have it. Here's the decorator version:

In [1]: from django.contrib.auth.models import User
In [2]: x = User.objects.all()
In [3]: %timeit x._avoid_cloning()
1.01 µs ± 18.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [4]: %timeit with x._avoid_cloning(): pass
1.98 µs ± 63.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

and then using the custom class:

In [3]: %timeit x._avoid_cloning()
276 ns ± 2.14 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [4]: %timeit with x._avoid_cloning(): pass
819 ns ± 39.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
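The gap comes down to contextlib's generator machinery: every entry creates a generator object and routes through helper frames, whereas a plain class does two attribute writes. A minimal sketch of the two styles on a stand-in object (FakeQuerySet and _AvoidCloning are hypothetical names, not the patch's actual code):

```python
import contextlib


class FakeQuerySet:
    """Stand-in for QuerySet; _cloning is a hypothetical flag name."""
    def __init__(self):
        self._cloning = True

    @contextlib.contextmanager
    def avoid_cloning_decorator(self):
        # Generator-based: pays for generator creation + contextlib
        # helper frames on every entry and exit.
        self._cloning = False
        try:
            yield self
        finally:
            self._cloning = True

    def avoid_cloning_class(self):
        # Class-based: entry/exit are plain method calls.
        return _AvoidCloning(self)


class _AvoidCloning:
    __slots__ = ("qs",)

    def __init__(self, qs):
        self.qs = qs

    def __enter__(self):
        self.qs._cloning = False
        return self.qs

    def __exit__(self, *exc):
        self.qs._cloning = True
        return False  # never swallow exceptions


qs = FakeQuerySet()
with qs.avoid_cloning_decorator():
    assert qs._cloning is False
assert qs._cloning is True

with qs.avoid_cloning_class():
    assert qs._cloning is False
assert qs._cloning is True
```

Both behave identically, including restoring the flag on exception; with ~30k entries per profiled run, the per-entry overhead difference is what shows up in the %prun tables above.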

py-spy (a sampling profiler) also suggests sql.Query.clone accounts for about 3% of the time spent in the baseline, whereas once cloning avoidance is introduced it doesn't even get sampled often enough to be relevant.

Meanwhile, linters are the bane of my life, and so the patch remains 'needs improvement'.
