Opened 12 years ago

Closed 12 years ago

#19029 closed Cleanup/optimization (wontfix)

Reuse QuerySet._result_cache during IN clause

Reported by: Matt Robenolt
Owned by: Matt Robenolt
Component: Database layer (models, ORM)
Version: 1.4
Severity: Normal
Keywords: orm, queryset, result_cache, optimization, subquery
Cc: Florian Apolloner, charette.s@…
Triage Stage: Design decision needed
Has patch: yes
Needs documentation: no
Needs tests: no
Patch needs improvement: no
Easy pickings: no
UI/UX: no

Description

If I have a QuerySet that has already been evaluated and I pass it into an __in clause, a subquery is still generated.

Expected result:

  • Reuse a list of primary keys instead, to prevent the database from performing the subquery.
cities = City.objects.all()
list(cities)  # Force evaluation of cities to populate the result cache
print(cities._result_cache is not None)  # Verify that the cache is filled
Event.objects.filter(venue__city__in=cities)  # This should use just the pks instead of a subquery

I began digging through the ORM and found the source of the problem: the original QuerySet object gets clone()'d, so the clone's _result_cache is None.

The simple workaround for now is to not pass a QuerySet into an IN clause when we already know the set of ids; this can be done with something like [m.pk for m in qs], or just list(qs).

If the workaround is the recommended approach, I think the documentation should be updated to reflect this distinction, to avoid unexpected results and performance degradation.
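To make the workaround concrete, here is a minimal sketch using the City/Event models from the example above (the models and fields are assumed; this only illustrates the two forms):

cities = City.objects.all()
list(cities)  # evaluate once; cities._result_cache is now populated

# Current behaviour: the cache is ignored and a subquery is emitted.
Event.objects.filter(venue__city__in=cities)

# Workaround: pass the already-fetched objects (or their pks) so the
# IN clause receives literal values instead of a subquery.
Event.objects.filter(venue__city__in=list(cities))
Event.objects.filter(venue__city__in=[c.pk for c in cities])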

Change History (7)

comment:1 by Matt Robenolt, 12 years ago

Has patch: set
Owner: changed from nobody to Matt Robenolt
Triage Stage: Unreviewed → Fixed on a branch

comment:2 by Florian Apolloner, 12 years ago

Triage Stage: Fixed on a branch → Design decision needed

I think we should leave the behavior as it is; you can use Event.objects.filter(venue__city__in=cities.values_list(…)) if you want to prevent a subquery. If we change this, there is no easy way to get back to the current behavior without manually resetting the _result_cache. I also think that a subquery will usually perform better than putting a large list of numbers into the IN clause.
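One plausible concrete form of that suggestion (the 'pk' field, flat=True, and the explicit list() call are assumptions here; without forcing evaluation, the values_list queryset would itself still be run as a subquery):

city_pks = list(cities.values_list('pk', flat=True))  # one query; a flat list of ids
Event.objects.filter(venue__city__in=city_pks)  # IN (...) with literal values, no subquery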

Last edited 12 years ago by Florian Apolloner (previous) (diff)

comment:3 by Florian Apolloner, 12 years ago

Cc: Florian Apolloner added

comment:4 by Anssi Kääriäinen, 12 years ago

Resolution: wontfix
Status: new → closed

I did a test using postgres. For 1000 Base objects, each having one Related object, doing a subquery is roughly 3x faster:

Base.objects.filter(related__in=Related.objects.all())
vs
Base.objects.filter(related__in=list_of_related_pks)

There are certainly cases where the numbers favour passing the list of pk values, too. But we should not change the current behavior, as we would just be favouring another use case and penalizing cases where we are currently faster.

Django can't know when it is better to use a subquery and when to use the already-fetched values. The user can choose by doing related__in=[o.pk for o in qs] when needed.

Closing as wontfix, as I don't believe this to be an optimization in general.

If there are benchmarks showing that this is an optimization in common real-world cases, then let's reconsider. However, this will boil down to what the common real-world cases are...
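For anyone comparing the two forms, a quick way to check which SQL each one produces is to print the compiled query (Base and Related are the models from the comment above; the column and table names in the comments are only approximate):

subquery_qs = Base.objects.filter(related__in=Related.objects.all())
print(subquery_qs.query)  # roughly: ... WHERE related_id IN (SELECT id FROM app_related)

literal_qs = Base.objects.filter(related__in=[o.pk for o in Related.objects.all()])
print(literal_qs.query)  # roughly: ... WHERE related_id IN (1, 2, 3, ...)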

comment:5 by Matt Robenolt, 12 years ago

I believe the examples that were posted are a very poor representation of the actual problem. Of course a subquery of Related.objects.all() isn't going to be much of a gain. But it's very easy to look beyond that and imagine we had something like Related.objects.filter(another_related__and_another__field__contains="foo"). We can definitely agree that preventing a complex query like that from running over and over again as a subquery is a good thing.
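To illustrate that point with the made-up field names from the comment, the expensive filter can be evaluated once and only its pks reused (a sketch, not a measured claim):

expensive = Related.objects.filter(another_related__and_another__field__contains="foo")

# Passed directly, the whole filter is embedded as a subquery:
Base.objects.filter(related__in=expensive)

# Evaluated once up front, only the fetched pks end up in the IN clause:
Base.objects.filter(related__in=[r.pk for r in expensive])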

And with that said, I agree that Django can't determine which is better and make that choice for the user. So the ultimate solution, in my mind, is documentation explaining what is happening with __in. I wrote a tiny blog post about it: http://mattrobenolt.com/the-django-orm-and-subqueries/

Should I create another ticket just for the documentation update? Or is it sufficient to use this current one?

comment:6 by Matt Robenolt, 12 years ago

Resolution: wontfix
Status: closed → reopened

comment:7 by Simon Charette, 12 years ago

Cc: charette.s@… added
Resolution: wontfix
Status: reopened → closed

Please don't reopen tickets closed by a core committer; if you disagree with the decision, the right place to have this discussion is the django-dev mailing list.
