Opened 10 years ago
Last modified 13 months ago
#22757 new Cleanup/optimization
prefetch_related isn't as effecient as it could be with GenericForeignKey and proxy models
Reported by: | Owned by: | ||
---|---|---|---|
Component: | Database layer (models, ORM) | Version: | dev |
Severity: | Normal | Keywords: | prefetch_related |
Cc: | Simon Charette | Triage Stage: | Accepted |
Has patch: | no | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
prefetch_related
is able to prefetch across a GenericForiegnKey. However, it does more queries than necessary when using proxy models.
Our models:
from django.db import models from django.contrib.contenttypes.generic import GenericForeignKey from django.contrib.contenttypes.models import ContentType class Parent(models.Model): content_type = models.ForeignKey(ContentType) object_id = models.PositiveIntegerField() child = GenericForeignKey('content_type', 'object_id', for_concrete_model=False) class Child(models.Model): pass class ProxiedChild(Child): class Meta: proxy = True
Now create some instances
>>> child = Child.objects.create() >>> proxied_child = ProxiedChild.objects.create() >>> p1 = Parent.objects.create(child=child) >>> p2 = Parent.objects.create(child=proxied_child)
And query using prefetch_related:
>>> Parent.objects.all().prefetch_related('child') SELECT "foo_parent"."id", "foo_parent"."content_type_id", "foo_parent"."object_id" FROM "foo_parent" LIMIT 21 SELECT "foo_child"."id" FROM "foo_child" WHERE "foo_child"."id" IN (3) SELECT "foo_child"."id" FROM "foo_child" WHERE "foo_child"."id" IN (2)
This is doing 3 queries instead of the expected 2. The two queries for the foo_child
table should be combined to be "foo_child"."id" IN (2, 3)
Change History (7)
comment:1 by , 10 years ago
Component: | contrib.contenttypes → Database layer (models, ORM) |
---|---|
Keywords: | prefetch_related added |
Triage Stage: | Unreviewed → Accepted |
Type: | Uncategorized → Cleanup/optimization |
Version: | 1.6 → master |
comment:2 by , 7 years ago
Owner: | changed from | to
---|---|
Status: | new → assigned |
comment:3 by , 2 years ago
Owner: | removed |
---|---|
Status: | assigned → new |
comment:4 by , 13 months ago
Owner: | set to |
---|---|
Status: | new → assigned |
comment:5 by , 13 months ago
Owner: | removed |
---|---|
Status: | assigned → new |
I started working on this at DjangoCon US. I had difficulty coming up with a true optimization. I started from GenericForeignKey.get_prefetch_queryset
like Simon recommended. Grouping the queries by concrete model was fairly straightforward, the difficulty began with turning all instances back into their correct concrete/proxy types correctly.
The only method I could come up with required iterating over every object and checking if its instance type was proxy
. I created a dict of the instances content types to instance PK during the query grouping so that I could use it to identify the correct ContentType
. In the below code fkeys
was a list of grouped ids for all concrete and proxy objects:
objs = concrete_ct.get_all_objects_for_this_type(pk__in=fkeys) for obj in objs: instance = instance_dict[obj.pk] if instance.content_type.model_class()._meta.proxy: obj.__dict__.pop("_state") # necessary? ret_obj = instance.content_type.model_class()(**obj.__dict__) else: ret_obj = obj ret_val.append(ret_obj)
There may be a cleaner way to get the correct Instance type back that I'm unaware of as this piece felt quite ugly:
obj.__dict__.pop("_state") # necessary? ret_obj = instance.content_type.model_class()(**obj.__dict__)
Unless there is a way to avoid iterating over all objects to determine the correct instance I question the viability of optimizing the amount of queries here. Realistically it seems that the amount of objects returned is likely to be far higher than the amount of proxy models queried.
If i'm way off on this approach or if there is a way to get the correct type for return instances without iterating all objects I'd love to learn about it! I don't anticipate having a lot of time to make progress on this in the near future so I'm going to un-assign myself, but I'm happy to pick it back up in the future if anyone has recommendations on the approach.
comment:6 by , 13 months ago
Cc: | added |
---|
comment:7 by , 13 months ago
Thanks for having a shot at this Nolan!
I think that a lot of the complexity here stems from the fact doesn't support polymorphic querysets that is querysets that can return multiple models.
This poses a challenge from an implementation perspective for this issue because the (simplified here) return signature of get_prefetch_querysets
contains three members
- A collection of prefetch querysets
- A callable that takes a model instance returned from one of the prefetch queries and return a tuple uniquely identifying the item (that's
lambda obj: (obj.pk, obj.__class__)
) - A callable that take model instances related objects are prefetched for and return a tuple uniquely identifying the related items (that's
gfk_key
)
The prefetch_related
logic then executes all the querysets from 1., build a map from 2, and inserts it with 3. to do the proper _prefetch_related_cache
assignments.
Since Django only allows for a single model class to be used to construct objects returned from a queryset, remember the lack of polymorphism support, then one might be tempted to annotate objects with a _content_type_id
attribute to have the logic in 2. changed to lambda obj: (obj.pk, self.get_content_type(id=obj._content_type_id, using=obj._state.db).model_class()
so the proper mapping occurs. Doing do doesn't solve the problem of proper __class__
assignment as you've brought up Nolan nor does it address another edge case that would have probably caused during review.
Even if we're able to return a specialized Queryset
that overrides the _iterable_class
to return mixed models (it's the tactic used by packages like django-polymorphic for example) how we deal with the case where the same id is needs to be prefetched both concretely and for it's proxy.
In other words, given the reported scenario, what kind of SQL query would we generate for
child = Child.objects.create(id=1) proxy_child = ProxiedChild.objects.get(id=1) parent = Parent.objects.create(child=child) proxy_parent = Parent.objects.create(child=proxy_child) Parent.objects.prefetch_related("child")
In this case the same foo_child
row needs to returned twice one as a child and one as a proxy_child which cannot be achieved with a simple id IN (1, 1)
.
The only way I can think of dealing with this edge case is by using a UNION ALL
with combined annotation so the following SQL would be generated
-- Assuming the content type of the Child model is 2 and the content type of the ProxiedChild model is 3 SELECT "foo_child"."id", 2 AS "_content_type_id" FROM "foo_child" WHERE "foo_child"."id" IN (1) UNION ALL SELECT "foo_child"."id", 3 AS "_content_type_id" FROM "foo_child" WHERE "foo_child"."id" IN (1)
And that must then be combined with the _iterable_class
override mentioned above.
There's even more fun now that we've added GenericPrefetch
in 5.0 as it allows you to specify querysets for each of the models involved which means that it allows passing querysets with select_related
and annotations which cannot use the UNION ALL
approach described above so the optimization would have to be disabled.
After looking at this into more details I also agree that the addition in complexity to avoid this query is likely not worth it as it would either require some potentially expensive Python workaround or significant boiler plate to get working properly. I suspect we'll want to wontfix this one unfortunately.
This is achievable by grouping queries by concrete model in
GenericForeignKey.get_prefetch_queryset
but will also likely require adjusting the prefetching logic to allow for some custom logic to be run on assignment so returnedChild
instances can be turned intoProxiedChild
.