Opened 25 hours ago

Closed 24 hours ago

Last modified 23 hours ago

#36157 closed Bug (duplicate)

Unusual behaviour when pre-fetching with only applied on the related fields

Reported by: Tim McCurrach Owned by:
Component: Uncategorized Version: 5.1
Severity: Normal Keywords:
Cc: Tim McCurrach Triage Stage: Unreviewed
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description (last modified by Tim McCurrach)

When prefetching related models. If you apply .only() to the related queryset, django performs additional lookups for related IDs that have been left out of the only.

It is probably easiest to explain the issue with an example.

Example Situation

Suppose you have these models:

class Blog(models.Model):
    name = models.CharField(max_length=100)

class Post(models.Model):
    name = models.CharField(max_length=100)
    blog = models.ForeignKey(Blog, on_delete=models.CASCADE, related_name="posts")
    ...lots of other big fields

And you create a few items for each:

blog = Blog.objects.create(name="Django Tricks")
blog2 = Blog.objects.create(name="React Tricks")
Post.objects.create(name="prefetching", blog=blog)
Post.objects.create(name="models", blog=blog)
Post.objects.create(name="templates", blog=blog)
Post.objects.create(name="hooks", blog=blog2)
Post.objects.create(name="components", blog=blog2)

If I wish to pre-fetch the posts for some blogs, but only want the names of each post, rather than the content of each post I can do the following:

Blog.objects.prefetch_related(Prefetch("posts", queryset=Post.objects.only("name")))

I would expect this to result in just 2 database queries. One to fetch the data for the Blog instances, and another to fetch the the data for the related Posts. Instead, there is an n+1 issue where there are 5 extra follow up requests for each of the related Post instances. This is the SQL that is generated:

SELECT "app_blog"."id", "app_blog"."name" FROM "app_blog" LIMIT 21
SELECT "app_post"."id", "app_post"."name" FROM "app_post" WHERE "app_post"."blog_id" IN (1, 2)
SELECT "app_post"."id", "app_post"."blog_id" FROM "app_post" WHERE "app_post"."id" = 1 LIMIT 21
SELECT "app_post"."id", "app_post"."blog_id" FROM "app_post" WHERE "app_post"."id" = 2 LIMIT 21
SELECT "app_post"."id", "app_post"."blog_id" FROM "app_post" WHERE "app_post"."id" = 3 LIMIT 21
SELECT "app_post"."id", "app_post"."blog_id" FROM "app_post" WHERE "app_post"."id" = 4 LIMIT 21
SELECT "app_post"."id", "app_post"."blog_id" FROM "app_post" WHERE "app_post"."id" = 5 LIMIT 21

I can understand it might be a good idea to have the related-id's for the blog on hand should you need them later. But I also think, that by using .only() you are explicitly telling django - I don't need these. This is a real problem for larger data-sets, where you end up with thousands of extra round-trips to the database.

Context

This is an issue that came up in the wild. I'm using a third-party optimiser that improves the performance of graphQL queries by decorating querysets with only(), select_related() etc. It correctly identifies that I am only using certain fields and applies only() to them, knowing I will never need to access certain related fields. This unfortunately results in django producing many additional hits to the database. I don't think this is expected behaviour from django.

Change History (4)

comment:1 by Tim McCurrach, 25 hours ago

Description: modified (diff)

comment:2 by Tim McCurrach, 25 hours ago

Description: modified (diff)

comment:3 by Simon Charette, 24 hours ago

Resolution: duplicate
Status: newclosed

Hello Tim,

This is a duplicate of #33835 which was won't fixed. See how the request out there was the opposite of what you requested here which is to implicitly add the required fields for prefetching.

I can understand it might be a good idea to have the related-id's for the blog on hand should you need them later. But I also think, that by using .only() you are explicitly telling django - I don't need these. This is a real problem for larger data-sets, where you end up with thousands of extra round-trips to the database.

The thing is prefetching must have blog_id otherwise it has no way to build the associative map between Blog instances and Post to populate blog.posts.all(). In other words if I gave you a list of the form posts = [{"id": 1, "name": "Some blog Post"}, {"id": 2, "name": "Some other blog Post"} how would you partition it by blog_id?

Adapting prefetch_related to error out if provided an inadequate Prefetch(queryset) could potentially be done but that would not solve your actual problem.The third-party library you are using to automatically generate these queries is flawed and should include blog_id in the select mask (AKA the only) call if it relies on prefetching.

If you'd like to catch these early I'd suggest looking at the potential solution to have warnings emitted on query leaks in ticket:33835#comment:2.

Version 1, edited 24 hours ago by Simon Charette (previous) (next) (diff)

comment:4 by Tim McCurrach, 23 hours ago

Hello Simon,

Thanks for the speedy reply, and the explanation. That's very useful, and makes a lot of sense.

I'll raise the issue with the third-party, and check out your library.

Note: See TracTickets for help on using tickets.
Back to Top