Opened 6 months ago
Last modified 4 months ago
#35442 assigned Bug
N+1 queries from RelatedManager + only("pk")
Reported by: | REGNIER Guillaume | Owned by: | Rish |
---|---|---|---|
Component: | Database layer (models, ORM) | Version: | 4.2 |
Severity: | Normal | Keywords: | |
Cc: | Simon Charette | Triage Stage: | Accepted |
Has patch: | no | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
When iterating over a queryset constructed from a RelatedManager and a .only(...)
call that does not include the related field, a query occurs when instances are produced from the queryset.
Steps to Reproduce:
class Company(models.Model): pass class Employee(models.Model): company = models.ForeignKey(Company, on_delete=models.CASCADE, related_name="employees")
company = Company.objects.create() Employee.objects.bulk_create(Employee(company=company) for _ in range(10)) for employee in company.employees.only("pk"): # Some code that only access pk _ = employee.pk
Expected Behavior:
One query like
SELECT "employee"."id" FROM "employee" WHERE "employee"."company_id" = {COMPANY_ID}
Actual Behavior:
10 additional queries like:
SELECT "employee"."id", "employee"."company_id" FROM "employee" WHERE "employee"."id" = {EMPLOYEE_ID}
Analysis:
My understanding is that there is an optimization that fills the parent model on related instances without needing additional SQL join/query. However, when only a subset of fields is selected (in this case, only the primary key), the parent ID might not be loaded from the database, resulting in additional queries to perform said optimization.
Workaround:
company = Company.objects.create() Employee.objects.bulk_create(Employee(company=company) for _ in range(10)) for employee in Employee.objects.filter(company=company).only("pk"): # Some code that only access pk _ = employee.pk
Change History (12)
comment:2 by , 6 months ago
Hello Guillaume!
I'm having a hard time understanding the goal of your code. I assume you have provided a simplified version to be used as a small example, which I appreciate, but with the information given, the proper way to fetch records from the database without generating N+1 queries would be:
>>> from django.db import connection, reset_queries >>> reset_queries(); connection.queries [] >>> company = Company.objects.prefetch_related("employees").last() >>> for e in company.employees.all(): >>> print(e.pk) 7 8 9 10 >>> connection.queries [{'sql': 'SELECT "ticket_35442_company"."id", "ticket_35442_company"."name" FROM "ticket_35442_company" ORDER BY "ticket_35442_company"."id" DESC LIMIT 1', 'time': '0.002'}, {'sql': 'SELECT "ticket_35442_employee"."id", "ticket_35442_employee"."company_id" FROM "ticket_35442_employee" WHERE "ticket_35442_employee"."company_id" IN (5)', 'time': '0.001'}]
Or, my preferred (a single query and the same workaround as was posted originally):
>>> for e in Employee.objects.filter(company=company).only("pk"): print(e.pk) 7 8 9 10 >>> connection.queries [{'sql': 'SELECT "ticket_35442_employee"."id" FROM "ticket_35442_employee" WHERE "ticket_35442_employee"."company_id" = 5', 'time': '0.000'}]
comment:3 by , 6 months ago
Hey Simon, thank you for your message. I fail to see how Django is at fault here, considering that there are alternative usages of the ORM and its queries to accomplish the same result (as far as we understand the use case) that would not execute N+1 queries. This seems like a (likely?) dupe of #20923 which was closed as wontfix
.
comment:4 by , 6 months ago
Resolution: | → duplicate |
---|---|
Status: | new → closed |
I'll close as duplicate
of #20923 for now following the ticket triaging process, but any further commenting is welcomed and I'm happy to reconsider the ticket status with more information.
comment:5 by , 6 months ago
Thank you both for answering!
Yes, i did provide a minimal exemple. Here i just wanted to use company.employees.all()
instead of Employee.objects.filter(company=company)
.
I often use RelatedManagers this way to avoid importing the related model and i do find the final code easier to read.
I thought the two pieces of code above were meant to be equivalent.
I might have understood the RelatedManager wrong but when i'm passing a queryset along, i have no way to know if it originated from a RelatedManager or a regular Manager, meaning i have no way to know if .only()
is safe to use or if it'll cause N+1 queries.
If this behavior is intended/not an issue, it means that i either need to ban .only()
or RelatedManagers from my coding habbits since i don't have any way to ensure this won't happen again.
comment:6 by , 6 months ago
Accessing objects though a related manager should still allow for only
to work as expected. The fact that the ORM doesn't even warn you when it silently issues queries on field deferral leaks (#22492) makes this behavior really insidious and prevents safe usage of related managers as Guillaume brought up.
I think the the ORM should either have RelatedManager.only
include the reverse field implicitly (which we've kind of rules out against in #33835) or we should find a way to more safely assign known related objects in the face of deferred fields (e.g. make sure required attrs
are in obj.__dict__
).
Potentially naive patch
-
django/db/models/query.py
diff --git a/django/db/models/query.py b/django/db/models/query.py index cb5c63c0d1..883a7a71e4 100644
a b def __iter__(self): (this hunk was shorter than expected) 107 107 ( 108 108 field, 109 109 related_objs, 110 operator.attrgetter( 111 *[ 112 ( 113 field.attname 114 if from_field == "self" 115 else queryset.model._meta.get_field(from_field).attname 116 ) 117 for from_field in field.from_fields 118 ] 119 ), 110 attnames := [ 111 ( 112 field.attname 113 if from_field == "self" 114 else queryset.model._meta.get_field(from_field).attname 115 ) 116 for from_field in field.from_fields 117 ], 118 operator.attrgetter(*attnames), 120 119 ) 121 120 for field, related_objs in queryset._known_related_objects.items() 122 121 ] 123 122 for row in compiler.results_iter(results): 124 123 obj = model_cls.from_db( … … def __iter__(self): 131 131 setattr(obj, attr_name, row[col_pos]) 132 132 133 133 # Add the known related objects to the model. 134 for field, rel_objs, rel_ getter in known_related_objects:134 for field, rel_objs, rel_attnames, rel_getter in known_related_objects: 135 135 # Avoid overwriting objects loaded by, e.g., select_related(). 136 136 if field.is_cached(obj): 137 137 continue 138 # Avoid fetching potentially deferred attributes that would 139 # result in unexpected queries. 140 if any(attname not in obj.__dict__ for attname in rel_attnames): 141 continue 138 142 rel_obj_id = rel_getter(obj) 139 143 try: 140 144 rel_obj = rel_objs[rel_obj_id]
comment:7 by , 6 months ago
So I tried running the suite against the above patch and it delivered!
I was pointed at a single failure being defer.tests.DeferTests.test_only
====================================================================== FAIL: test_only (defer.tests.DeferTests) ---------------------------------------------------------------------- Traceback (most recent call last): File "/Users/charettes/Workspace/django/tests/defer/tests.py", line 52, in test_only self.assert_delayed(self.s1.primary_set.only("pk")[0], 2) File "/Users/charettes/Workspace/django/tests/defer/tests.py", line 24, in assert_delayed self.assertEqual(count, num) AssertionError: 3 != 2
What's interesting though is that it might have a been a sneaky regression caused by the refactor to how field deferral is implemented that flew under the radar.
It was not discussed at all in the two PRs where #26207 was resolved
comment:8 by , 6 months ago
Cc: | added |
---|
comment:9 by , 6 months ago
For the record, I checked out I checked out Django 1.9 and it seems the issue is present there so while the commit mentioned above that landed in 1.10 did alter a related test it didn't actually break anything in this regard.
comment:10 by , 6 months ago
Resolution: | duplicate |
---|---|
Status: | closed → new |
Triage Stage: | Unreviewed → Accepted |
Thank you for the investigation Simon, reopening 👍
comment:11 by , 6 months ago
Type: | Uncategorized → Bug |
---|
comment:12 by , 4 months ago
Owner: | changed from | to
---|---|
Status: | new → assigned |
This relates to #20927, #18177, and at least another ticket that asked for
"company_id"
to be implicitly included in theonly
call that I just can't find anymore.I think we should try to resolve that in a different way and a potential solution could be something like
django/db/models/query.py
But in order to work properly it would require a refactor of
known_related_objects
as the current approach is naive wrt/ to how it namespace related objects by field as the same model field can be included multiple times in the the same queryset (e.g. see #35356)