Code

#19029 closed Cleanup/optimization (wontfix)

Reuse QuerySet._result_cache during IN clause

Reported by: mattrobenolt Owned by: mattrobenolt
Component: Database layer (models, ORM) Version: 1.4
Severity: Normal Keywords: orm, queryset, result_cache, optimization, subquery
Cc: apollo13, charette.s@… Triage Stage: Design decision needed
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

If I have a QuerySet that has been evaluated previously, and shove it into an __in clause, a subquery is still generated.

Expected result:

  • Reuse a list of primary keys instead, to prevent the database from performing the subquery.
cities = City.objects.all()
list(cities)  # Force evaluation of cities to populate cache
print cities._result_cache is not None  # Verify that the cache is filled
Event.objects.filter(venue__city__in=cities)  # This should use just the pks instead of a subquery

I began digging through the ORM, and found the source of the problem being that the origin QuerySet object is being clone()'d, causing it's _result_cache to be None.

The simple solution for now is to not pass a QuerySet into an IN clause if we know the set of ids. Can be remedied by something like [m.pk for m in qs], or just list(qs).

For the latter solution, I think the documentation should be updated to reflect this distinction to avoid unexpected results and performance degradation.

Attachments (0)

Change History (7)

comment:1 Changed 19 months ago by mattrobenolt

  • Has patch set
  • Needs documentation unset
  • Needs tests unset
  • Owner changed from nobody to mattrobenolt
  • Patch needs improvement unset
  • Triage Stage changed from Unreviewed to Fixed on a branch

comment:2 Changed 19 months ago by apollo13

  • Triage Stage changed from Fixed on a branch to Design decision needed

I think we should leave the behavior as it is, you can use Event.objects.filter(venue__city__in=cities.values_list(…)) if you want to prevent a subquery. If we change this behavior there is no easy way to get back to the current status without manually resetting the _result_cache. I also think that a subquery will usually perform better than putting a large list of numbers into the in-clause.

Last edited 19 months ago by apollo13 (previous) (diff)

comment:3 Changed 19 months ago by apollo13

  • Cc apollo13 added

comment:4 Changed 19 months ago by akaariai

  • Resolution set to wontfix
  • Status changed from new to closed

I did a test using postgres. For 1000 Base objects, each having one Related object doing a subquery is roughly 3x faster:

Base.objects.filter(related__in=Related.objects.all())
vs
Base.objects.filter(related__in=list_of_related_pks)

There are for sure cases where the numbers favour passing the list of pk values, too. But, we should not change what we have currently, as we will just be favouring another use case and penalizing cases where we currently are faster.

Django can't know when it is better to use subquery and when to use the already fetched values. The user can choose that by doing related__in=[o.pk for o in qs] when needed.

Closing as wontfix as I don't believe this to be an optimization in general.

If there are benchmarks showing that in common real world cases this is an optimization, then lets reconsider. However this will boil down to what are the common real world cases...

comment:5 Changed 19 months ago by mattrobenolt

I believe the examples that were posted are a very poor representation of the actual problem. Of course a subquery of Related.objects.all() isn't going to be much of a gain. But it's very easy to look beyond that and imagine if we had something like: Related.objects.filter(another_related__and_another__field__contains="foo") We definitely can agree that preventing a complex query like that from running over and over again as a subquery is a good thing.

And with that said, I agree that Django can't determine when is better and make that choice for the user. So the ultimate solution in my mind is then documentation explaining what is happening with __in. I wrote a tiny blog post about it: http://mattrobenolt.com/the-django-orm-and-subqueries/

Should I create another ticket just for the documentation update? Or is it sufficient to use this current one?

comment:6 Changed 19 months ago by mattrobenolt

  • Resolution wontfix deleted
  • Status changed from closed to reopened

comment:7 Changed 19 months ago by charettes

  • Cc charette.s@… added
  • Resolution set to wontfix
  • Status changed from reopened to closed

Please don't reopen tickets closed by a core committer, if you disagree with the decision the right place to have this discussion is on the django-dev mailing list.

Add Comment

Modify Ticket

Change Properties
<Author field>
Action
as closed
as The resolution will be set. Next status will be 'closed'
The resolution will be deleted. Next status will be 'new'
Author


E-mail address and user name can be saved in the Preferences.

 
Note: See TracTickets for help on using tickets.