Opened 6 months ago
Last modified 4 months ago
#35442 assigned Bug
N+1 queries from RelatedManager + only("pk")
Reported by: | REGNIER Guillaume | Owned by: | Rish |
---|---|---|---|
Component: | Database layer (models, ORM) | Version: | 4.2 |
Severity: | Normal | Keywords: | |
Cc: | Simon Charette | Triage Stage: | Accepted |
Has patch: | no | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
When iterating over a queryset constructed from a RelatedManager and a .only(...)
call that does not include the related field, a query occurs when instances are produced from the queryset.
Steps to Reproduce:
class Company(models.Model): pass class Employee(models.Model): company = models.ForeignKey(Company, on_delete=models.CASCADE, related_name="employees")
company = Company.objects.create() Employee.objects.bulk_create(Employee(company=company) for _ in range(10)) for employee in company.employees.only("pk"): # Some code that only access pk _ = employee.pk
Expected Behavior:
One query like
SELECT "employee"."id" FROM "employee" WHERE "employee"."company_id" = {COMPANY_ID}
Actual Behavior:
10 additional queries like:
SELECT "employee"."id", "employee"."company_id" FROM "employee" WHERE "employee"."id" = {EMPLOYEE_ID}
Analysis:
My understanding is that there is an optimization that fills the parent model on related instances without needing additional SQL join/query. However, when only a subset of fields is selected (in this case, only the primary key), the parent ID might not be loaded from the database, resulting in additional queries to perform said optimization.
Workaround:
company = Company.objects.create() Employee.objects.bulk_create(Employee(company=company) for _ in range(10)) for employee in Employee.objects.filter(company=company).only("pk"): # Some code that only access pk _ = employee.pk
Change History (12)
comment:2 by , 6 months ago
Hello Guillaume!
I'm having a hard time understanding the goal of your code. I assume you have provided a simplified version to be used as a small example, which I appreciate, but with the information given, the proper way to fetch records from the database without generating N+1 queries would be:
>>> from django.db import connection, reset_queries >>> reset_queries(); connection.queries [] >>> company = Company.objects.prefetch_related("employees").last() >>> for e in company.employees.all(): >>> print(e.pk) 7 8 9 10 >>> connection.queries [{'sql': 'SELECT "ticket_35442_company"."id", "ticket_35442_company"."name" FROM "ticket_35442_company" ORDER BY "ticket_35442_company"."id" DESC LIMIT 1', 'time': '0.002'}, {'sql': 'SELECT "ticket_35442_employee"."id", "ticket_35442_employee"."company_id" FROM "ticket_35442_employee" WHERE "ticket_35442_employee"."company_id" IN (5)', 'time': '0.001'}]
Or, my preferred (a single query and the same workaround as was posted originally):
>>> for e in Employee.objects.filter(company=company).only("pk"): print(e.pk) 7 8 9 10 >>> connection.queries [{'sql': 'SELECT "ticket_35442_employee"."id" FROM "ticket_35442_employee" WHERE "ticket_35442_employee"."company_id" = 5', 'time': '0.000'}]
comment:3 by , 6 months ago
Hey Simon, thank you for your message. I fail to see how Django is at fault here, considering that there are alternative usages of the ORM and its queries to accomplish the same result (as far as we understand the use case) that would not execute N+1 queries. This seems like a (likely?) dupe of #20923 which was closed as wontfix
.
comment:4 by , 6 months ago
Resolution: | → duplicate |
---|---|
Status: | new → closed |
I'll close as duplicate
of #20923 for now following the ticket triaging process, but any further commenting is welcomed and I'm happy to reconsider the ticket status with more information.
comment:5 by , 6 months ago
Thank you both for answering!
Yes, i did provide a minimal exemple. Here i just wanted to use company.employees.all()
instead of Employee.objects.filter(company=company)
.
I often use RelatedManagers this way to avoid importing the related model and i do find the final code easier to read.
I thought the two pieces of code above were meant to be equivalent.
I might have understood the RelatedManager wrong but when i'm passing a queryset along, i have no way to know if it originated from a RelatedManager or a regular Manager, meaning i have no way to know if .only()
is safe to use or if it'll cause N+1 queries.
If this behavior is intended/not an issue, it means that i either need to ban .only()
or RelatedManagers from my coding habbits since i don't have any way to ensure this won't happen again.
comment:6 by , 6 months ago
Accessing objects though a related manager should still allow for only
to work as expected. The fact that the ORM doesn't even warn you when it silently issues queries on field deferral leaks (#22492) makes this behavior really insidious and prevents sage usage of related managers as Guillaume brought up.
I think the the ORM should either have RelatedManager.only
include the reverse field implicitly (which we've kind of rules out against in #33835) or we should find a way to more safely assign known related objects in the face of deferred fields.
comment:7 by , 6 months ago
So I tried running the suite against the above patch and it delivered!
I was pointed at a single failure being defer.tests.DeferTests.test_only
====================================================================== FAIL: test_only (defer.tests.DeferTests) ---------------------------------------------------------------------- Traceback (most recent call last): File "/Users/charettes/Workspace/django/tests/defer/tests.py", line 52, in test_only self.assert_delayed(self.s1.primary_set.only("pk")[0], 2) File "/Users/charettes/Workspace/django/tests/defer/tests.py", line 24, in assert_delayed self.assertEqual(count, num) AssertionError: 3 != 2
What's interesting though is that it might have a been a sneaky regression caused by the field deferral refactor in Django 1.10 that flew under the radar.
It was not discussed at all in the two PRs where #26207 was resolved
comment:8 by , 6 months ago
Cc: | added |
---|
comment:9 by , 6 months ago
For the record, I checked out I checked out Django 1.9 and it seems the issue is present there so while the commit mentioned above that landed in 1.10 did alter a related test it didn't actually break anything in this regard.
comment:10 by , 6 months ago
Resolution: | duplicate |
---|---|
Status: | closed → new |
Triage Stage: | Unreviewed → Accepted |
Thank you for the investigation Simon, reopening 👍
comment:11 by , 6 months ago
Type: | Uncategorized → Bug |
---|
comment:12 by , 4 months ago
Owner: | changed from | to
---|---|
Status: | new → assigned |
This relates to #20927, #18177, and at least another ticket that asked for
"company_id"
to be implicitly included in theonly
call that I just can't find anymore.I think we should try to resolve that in a different way and a potential solution could be something like
django/db/models/query.py
But in order to work properly it would require a refactor of
known_related_objects
as the current approach is naive wrt/ to how it namespace related objects by field as the same model field can be included multiple times in the the same queryset (e.g. see #35356)