Opened 10 years ago

Last modified 5 months ago

#22757 new Cleanup/optimization

prefetch_related isn't as efficient as it could be with GenericForeignKey and proxy models

Reported by: gwahl@… Owned by:
Component: Database layer (models, ORM) Version: dev
Severity: Normal Keywords: prefetch_related
Cc: Simon Charette Triage Stage: Accepted
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

prefetch_related is able to prefetch across a GenericForeignKey. However, it performs more queries than necessary when using proxy models.

Our models:

from django.db import models
from django.contrib.contenttypes.fields import GenericForeignKey
from django.contrib.contenttypes.models import ContentType

class Parent(models.Model):
    content_type = models.ForeignKey(ContentType, on_delete=models.CASCADE)
    object_id = models.PositiveIntegerField()
    child = GenericForeignKey('content_type', 'object_id', for_concrete_model=False)

class Child(models.Model):
    pass

class ProxiedChild(Child):
    class Meta:
        proxy = True

Now create some instances

>>> child = Child.objects.create()
>>> proxied_child = ProxiedChild.objects.create()
>>> p1 = Parent.objects.create(child=child)
>>> p2 = Parent.objects.create(child=proxied_child)

And query using prefetch_related:

>>> Parent.objects.all().prefetch_related('child')
SELECT "foo_parent"."id", "foo_parent"."content_type_id", "foo_parent"."object_id" FROM "foo_parent" LIMIT 21

SELECT "foo_child"."id" FROM "foo_child" WHERE "foo_child"."id" IN (3)

SELECT "foo_child"."id" FROM "foo_child" WHERE "foo_child"."id" IN (2)

This is doing 3 queries instead of the expected 2. The two queries against the foo_child table should be combined into a single query with "foo_child"."id" IN (2, 3).
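That is, the expected single combined query would be:

SELECT "foo_child"."id" FROM "foo_child" WHERE "foo_child"."id" IN (2, 3)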

Change History (7)

comment:1 by Simon Charette, 10 years ago

Component: contrib.contenttypes → Database layer (models, ORM)
Keywords: prefetch_related added
Triage Stage: Unreviewed → Accepted
Type: Uncategorized → Cleanup/optimization
Version: 1.6 → master

comment:2 by Paulo, 7 years ago

Owner: changed from nobody to Paulo
Status: new → assigned

comment:3 by Simon Charette, 19 months ago

Owner: Paulo removed
Status: assigned → new

This is achievable by grouping queries by concrete model in GenericForeignKey.get_prefetch_queryset, but it will also likely require adjusting the prefetching logic to allow for some custom logic to be run on assignment, so that returned Child instances can be turned into ProxiedChild instances.
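A minimal sketch of what that grouping step might look like, assuming instances is the list of objects being prefetched for and using the field names from the report (the variable names here are hypothetical, not Django internals):

from collections import defaultdict

# Group object ids per concrete model so proxy and concrete references
# share one query, while remembering which model each pair was requested as.
fkeys_by_concrete_model = defaultdict(set)
requested_model = {}
for instance in instances:
    model = instance.content_type.model_class()  # may be a proxy model
    concrete_model = model._meta.concrete_model
    fkeys_by_concrete_model[concrete_model].add(instance.object_id)
    requested_model[(concrete_model, instance.object_id)] = model

# One query per concrete model instead of one per (proxy, concrete) pair.
for concrete_model, fkeys in fkeys_by_concrete_model.items():
    objs = concrete_model._base_manager.filter(pk__in=fkeys)
    # ... remaining problem: turning each obj back into requested_model[...]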

comment:4 by Nolan Little, 5 months ago

Owner: set to Nolan Little
Status: new → assigned

comment:5 by Nolan Little, 5 months ago

Owner: Nolan Little removed
Status: assigned → new

I started working on this at DjangoCon US. I had difficulty coming up with a true optimization. I started from GenericForeignKey.get_prefetch_queryset like Simon recommended. Grouping the queries by concrete model was fairly straightforward; the difficulty began with turning all instances back into their correct concrete/proxy types.

The only method I could come up with required iterating over every returned object and checking whether its requested type was a proxy. During the query grouping I built a dict mapping object PKs to the instances being prefetched for, so that I could use it to look up the correct ContentType. In the code below, fkeys is the list of grouped ids for all concrete and proxy objects:

objs = concrete_ct.get_all_objects_for_this_type(pk__in=fkeys)
for obj in objs:
    instance = instance_dict[obj.pk]
    if instance.content_type.model_class()._meta.proxy:
        obj.__dict__.pop("_state")  # necessary?
        ret_obj = instance.content_type.model_class()(**obj.__dict__)
    else:
        ret_obj = obj
    ret_val.append(ret_obj)

There may be a cleaner way to get the correct instance type back that I'm unaware of, as this piece felt quite ugly:

obj.__dict__.pop("_state")  # necessary?
ret_obj = instance.content_type.model_class()(**obj.__dict__)

Unless there is a way to avoid iterating over all objects to determine the correct instance type, I question the viability of optimizing the number of queries here. Realistically, the number of objects returned is likely to be far higher than the number of proxy models queried.

If I'm way off on this approach, or if there is a way to get the correct type for returned instances without iterating over all objects, I'd love to learn about it! I don't anticipate having a lot of time to make progress on this in the near future, so I'm going to un-assign myself, but I'm happy to pick it back up in the future if anyone has recommendations on the approach.

comment:6 by Simon Charette, 5 months ago

Cc: Simon Charette added

comment:7 by Simon Charette, 5 months ago

Thanks for having a shot at this, Nolan!

I think that a lot of the complexity here stems from the fact that the ORM doesn't support polymorphic querysets, that is, querysets that can return instances of multiple models.

This poses a challenge from an implementation perspective for this issue because the (simplified here) return signature of get_prefetch_querysets must contain three members:

  1. A collection of prefetching querysets
  2. A callable that takes a model instance returned from the prefetching querysets and returns a tuple uniquely identifying the item (that's lambda obj: (obj.pk, obj.__class__))
  3. A callable that takes the model instances that objects are prefetched for and returns a tuple uniquely identifying the related items (that's gfk_key)

The prefetch_related logic then executes all the querysets from 1., builds a map from 2., and intersects it with 3. to do the proper _prefetch_related_cache assignments.
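Illustrative pseudocode (not Django's actual implementation) of that matching step; all names here are made up:

def assign_prefetches(parents, prefetched_objects, rel_obj_key, instance_key):
    # Build a cache keyed by the tuple from point 2.
    cache = {}
    for rel_obj in prefetched_objects:
        cache.setdefault(rel_obj_key(rel_obj), []).append(rel_obj)
    # Look each parent's related objects up via the tuple from point 3.
    for parent in parents:
        # With (pk, obj.__class__) as the key, (3, Child) never matches
        # (3, ProxiedChild), which is why a naive merge of the per-model
        # queries breaks the assignment step.
        parent._prefetched_child = cache.get(instance_key(parent), [])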

Since Django only allows a single model class to be used to construct the objects returned from a queryset (remember the lack of polymorphism support), one might be tempted to annotate objects with a _content_type_id attribute (e.g. using Case(When) based on the id, like we do in bulk_update) so that the logic in 2. is changed to lambda obj: (obj.pk, self.get_content_type(id=obj._content_type_id, using=obj._state.db).model_class()) and the proper mapping occurs. Doing so doesn't solve the problem of proper __class__ assignment that you brought up, Nolan, nor does it address another edge case that would probably have come up during review.
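A rough sketch of that annotation, assuming child_pks and proxied_pks are sets of the grouped ids and child_ct / proxied_ct are the corresponding ContentType instances (all hypothetical names):

from django.db.models import Case, IntegerField, Value, When

queryset = Child.objects.filter(pk__in=child_pks | proxied_pks).annotate(
    _content_type_id=Case(
        # Rows requested through ProxiedChild get tagged with its content type.
        When(pk__in=proxied_pks, then=Value(proxied_ct.pk)),
        default=Value(child_ct.pk),
        output_field=IntegerField(),
    )
)

Note that this already hints at the edge case below: a single non-UNION query cannot tag the same pk with two different _content_type_id values.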

Even if we're able to return a specialized QuerySet that overrides _iterable_class to return mixed models (the tactic used by packages like django-polymorphic, for example), we'd have to deal with the case where the same id needs to be prefetched both concretely and for its proxy.

In other words, given the reported scenario, what kind of SQL query would we generate for the following?

child = Child.objects.create(id=1)
proxy_child = ProxiedChild.objects.get(id=1)

parent = Parent.objects.create(child=child)
proxy_parent = Parent.objects.create(child=proxy_child)

Parent.objects.prefetch_related("child")

In this case the same foo_child row needs to be returned twice, once as a Child and once as a ProxiedChild, which cannot be achieved with a simple id IN (1, 1) since that cannot make one row span two result rows.

The only way I can think of dealing with this edge case is to use a UNION ALL with a combined annotation, so that the following SQL would be generated:

-- Assuming the content type of the Child model is 2 and the content type of the ProxiedChild model is 3
SELECT "foo_child"."id", 2 AS "_content_type_id" FROM "foo_child" WHERE "foo_child"."id" IN (1)
UNION ALL
SELECT "foo_child"."id", 3 AS "_content_type_id" FROM "foo_child" WHERE "foo_child"."id" IN (1)

And that must then be combined with the _iterable_class logic mentioned above to create model instances with the proper model class based on the _content_type_id annotation.
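For example, a minimal sketch of such an iterable, assuming the combined queryset above exposes the _content_type_id annotation (this relies on the private _iterable_class/ModelIterable machinery and is illustrative only, not a proposed implementation):

from django.contrib.contenttypes.models import ContentType
from django.db.models.query import ModelIterable

class ContentTypeIterable(ModelIterable):
    def __iter__(self):
        for obj in super().__iter__():
            # Swap the instance's class to the (possibly proxy) model named
            # by the _content_type_id annotation, django-polymorphic style.
            model_cls = ContentType.objects.db_manager(obj._state.db).get_for_id(
                obj._content_type_id
            ).model_class()
            if model_cls is not obj.__class__:
                obj.__class__ = model_cls
            yield obj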

There's even more fun now that we've added GenericPrefetch in 5.0, as it allows you to specify a queryset for each of the models involved. That means it allows passing querysets with select_related and annotations, which result in SELECT statements with differing column lists that cannot be combined with the UNION ALL approach described above, so the optimization would have to be disabled in that case.
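For reference, here's roughly what that looks like with GenericPrefetch; the per-model querysets are illustrative, and because each can carry its own select_related() or annotations, their SELECT lists can differ:

from django.contrib.contenttypes.prefetch import GenericPrefetch
from django.db.models import Value

Parent.objects.prefetch_related(
    GenericPrefetch(
        "child",
        [
            Child.objects.all(),
            # This queryset's SELECT list differs from the one above, so the
            # two cannot be folded into a single UNION ALL query.
            ProxiedChild.objects.annotate(flag=Value(1)),
        ],
    )
)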

After looking into this in more detail, I also agree that the added complexity required to avoid this query is likely not worth it, as it would either require a potentially expensive Python workaround or significant boilerplate to get working properly. I suspect we'll want to wontfix this one, unfortunately, at least until someone can come up with a clever way of achieving the above efficiently or the ORM gains polymorphic capabilities, which is a long shot.

Last edited 5 months ago by Simon Charette