Opened 12 years ago
Closed 12 years ago
#19029 closed Cleanup/optimization (wontfix)
Reuse QuerySet._result_cache during IN clause
Reported by: | Matt Robenolt | Owned by: | Matt Robenolt |
---|---|---|---|
Component: | Database layer (models, ORM) | Version: | 1.4 |
Severity: | Normal | Keywords: | orm, queryset, result_cache, optimization, subquery |
Cc: | Florian Apolloner, charette.s@… | Triage Stage: | Design decision needed |
Has patch: | yes | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
If I have a QuerySet
that has been evaluated previously, and shove it into an __in
clause, a subquery is still generated.
Expected result:
- Reuse a list of primary keys instead, to prevent the database from performing the subquery.
cities = City.objects.all() list(cities) # Force evaluation of cities to populate cache print cities._result_cache is not None # Verify that the cache is filled Event.objects.filter(venue__city__in=cities) # This should use just the pks instead of a subquery
I began digging through the ORM, and found the source of the problem being that the origin QuerySet
object is being clone()'d, causing it's _result_cache
to be None
.
The simple solution for now is to not pass a QuerySet
into an IN clause if we know the set of ids. Can be remedied by something like [m.pk for m in qs]
, or just list(qs)
.
For the latter solution, I think the documentation should be updated to reflect this distinction to avoid unexpected results and performance degradation.
Change History (7)
comment:1 by , 12 years ago
Has patch: | set |
---|---|
Owner: | changed from | to
Triage Stage: | Unreviewed → Fixed on a branch |
comment:2 by , 12 years ago
Triage Stage: | Fixed on a branch → Design decision needed |
---|
I think we should leave the behavior as it is, you can use Event.objects.filter(venue__city__in=cities.values_list(…))
if you want to prevent a subquery. If we change this behavior there is no easy way to get back to the current status without manually resetting the _result_cache
. I also think that a subquery will usually perform better than putting a large list of numbers into the in-clause.
comment:3 by , 12 years ago
Cc: | added |
---|
comment:4 by , 12 years ago
Resolution: | → wontfix |
---|---|
Status: | new → closed |
I did a test using postgres. For 1000 Base objects, each having one Related object doing a subquery is roughly 3x faster:
Base.objects.filter(related__in=Related.objects.all()) vs Base.objects.filter(related__in=list_of_related_pks)
There are for sure cases where the numbers favour passing the list of pk values, too. But, we should not change what we have currently, as we will just be favouring another use case and penalizing cases where we currently are faster.
Django can't know when it is better to use subquery and when to use the already fetched values. The user can choose that by doing related__in=[o.pk for o in qs]
when needed.
Closing as wontfix as I don't believe this to be an optimization in general.
If there are benchmarks showing that in common real world cases this is an optimization, then lets reconsider. However this will boil down to what are the common real world cases...
comment:5 by , 12 years ago
I believe the examples that were posted are a very poor representation of the actual problem. Of course a subquery of Related.objects.all()
isn't going to be much of a gain. But it's very easy to look beyond that and imagine if we had something like: Related.objects.filter(another_related__and_another__field__contains="foo")
We definitely can agree that preventing a complex query like that from running over and over again as a subquery is a good thing.
And with that said, I agree that Django can't determine when is better and make that choice for the user. So the ultimate solution in my mind is then documentation explaining what is happening with __in
. I wrote a tiny blog post about it: http://mattrobenolt.com/the-django-orm-and-subqueries/
Should I create another ticket just for the documentation update? Or is it sufficient to use this current one?
comment:6 by , 12 years ago
Resolution: | wontfix |
---|---|
Status: | closed → reopened |
comment:7 by , 12 years ago
Cc: | added |
---|---|
Resolution: | → wontfix |
Status: | reopened → closed |
Please don't reopen tickets closed by a core committer, if you disagree with the decision the right place to have this discussion is on the django-dev mailing list.
See: https://github.com/django/django/pull/396