Opened 6 years ago
Closed 9 months ago
#29725 closed Cleanup/optimization (fixed)
Inefficient SQL generated when counting a ManyToMany
Reported by: | Gavin Wahl | Owned by: | ontowhee |
---|---|---|---|
Component: | Database layer (models, ORM) | Version: | dev |
Severity: | Normal | Keywords: | |
Cc: | Simon Charette | Triage Stage: | Ready for checkin |
Has patch: | yes | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
When calling count() on an unfiltered many to many relation, a useless join is included in the SQL that makes it much slower than it should be. On my dataset, the difference is 1000ms to 100ms, because an index-only scan can be used.
This is the SQL that is currently generated:
SELECT COUNT(*) AS "__count" FROM "app_foo" INNER JOIN "app_foo_bar" ON ("app_foo"."id" = "app_foo_bar"."foo_id") WHERE "app_foo_bar"."foo_id" = ?;
This is the SQL that should be generated:
SELECT COUNT(*) AS "__count" FROM "app_foo_bar" WHERE "app_foo_bar"."foo_id" = ?;
This optimization can only be applied when there are no filters applied, because then the join is used to satisfy the filters. In the no-filters case, only the through table needs to be consulted.
Change History (29)
comment:1 by , 6 years ago
Owner: | changed from | to
---|---|
Status: | new → assigned |
follow-up: 4 comment:2 by , 6 years ago
Triage Stage: | Unreviewed → Accepted |
---|---|
Type: | Uncategorized → Cleanup/optimization |
Version: | 2.1 → master |
comment:3 by , 6 years ago
Cc: | added |
---|
comment:4 by , 6 years ago
Thanks for your advice.
I will try to correct it according to your advice
Replying to Simon Charette:
Hello Olivier,
I think you should be able to achieve this by defining a
count()
method on the dynamic class created bycreate_forward_many_to_many_manager
by filteringself.through._default_manager
based onself.instance
and return itscount()
.
comment:6 by , 6 years ago
Has patch: | set |
---|---|
Needs tests: | set |
comment:7 by , 6 years ago
Have we considered making this change for exists()
as well? I'm sure this will generate the join in the same way that count()
does.
Also somewhat related: https://code.djangoproject.com/ticket/28477
comment:9 by , 6 years ago
Needs tests: | unset |
---|---|
Patch needs improvement: | set |
comment:10 by , 6 years ago
Patch needs improvement: | unset |
---|
comment:14 by , 6 years ago
Resolution: | fixed |
---|---|
Status: | closed → new |
Solution has been reverted because it caused the inconsistent behavior of count()
and exists()
on a reverse many-to-many relationship with a custom manager (see new tests in 9ac8520fcde29840a1345be19d80dbda53aa6d03).
comment:15 by , 6 years ago
Has patch: | unset |
---|
comment:16 by , 6 years ago
Has patch: | set |
---|---|
Patch needs improvement: | set |
Switched back to patch needs improvement instead as it can probably be adapted to skip the optimization when a custom manager is defined.
comment:17 by , 6 years ago
Ticket #30325 revealed one more issue, i.e. optimization doesn't work when chaining all()
after reverse M2M relations.
comment:18 by , 6 years ago
Additionally, the optimization causes a query to be done if the related objects have already been prefetched.
comment:19 by , 19 months ago
Owner: | changed from | to
---|---|
Status: | new → assigned |
Thanks everyone for your insights. Im going to try to implement this optimization, accounting for all the issues listed here so far.
comment:20 by , 18 months ago
Needs tests: | set |
---|
Patch needs a few tweaks to properly disable the optimization when dealing with custom manager. It also lacks coverage for the prefetched case as reported in comment:18.
comment:21 by , 18 months ago
Needs tests: | unset |
---|---|
Patch needs improvement: | unset |
Updated the patch to properly deal with cases involving custom managers and prefetch queries, and implemented test cases for them.
comment:22 by , 17 months ago
Needs tests: | set |
---|---|
Patch needs improvement: | set |
comment:23 by , 10 months ago
Owner: | changed from | to
---|
comment:24 by , 10 months ago
I'm trying to pick up on the work from PR #16863, and I am beginning to learn about M2M relationships, RelatedManagers, and QuerySets. From reading the discussions on this ticket and looking through the comments on the PR, this needs to handle chaining .all() for the reverse M2M relation. Is that the self.article.publications
?
Questions:
- What other chaining needs to be considered?
- How many levels and combinations of chaining need to be considered? For example, I imagine it is desirable to ensure the optimization also occurs for .all().all().all().count() too. Another example would be .all().filter().count(), where the filter is technically an empty filter, but it would return a QuerySet.
- Where would be the best level to implement the optimization? The PR handles it in ManyRelatedManager. Another possibility could be to implement this on compiler.get_from_clause(). I found this idea from reading the comment on another ticket https://code.djangoproject.com/ticket/20535#comment:3.
- If it is handled in the RelatedManager level, is there a way for the RelatedManager to look ahead to determine that the next function in the chain will be a .count() or .exists()?
I'm not sure if I'm asking the right questions. If anyone has any pointers for what areas I should look into, please let me know.
comment:25 by , 10 months ago
Thank you for working on this ticket ontowhee.
From my understanding the closed PR is 95% of the way there as it builds on top of the initial work that had to be reverted and adds extra coverage. I'd start by creating a new PR out of it and deal with the unaddressed left by Mariusz to see how the test suite reacts.
How many levels and combinations of chaining need to be considered? For example, I imagine it is desirable to ensure the optimization also occurs for .all().all().all().count() too. Another example would be .all().filter().count(), where the filter is technically an empty filter, but it would return a QuerySet.
I'd focus on the many-to-many manager case discussed here only, that's already hard enough to get right.
Where would be the best level to implement the optimization? The PR handles it in ManyRelatedManager. Another possibility could be to implement this on compiler.get_from_clause(). I found this idea from reading the comment on another ticket https://code.djangoproject.com/ticket/20535#comment:3.
Exactly where it's defined in the PR you referenced. Skipping of unnecessary intermediate joins as described in #20535 is a much larger problem again that might help here in the long run but I think there are benefits in solving this well scoped issue first.
If it is handled in the RelatedManager level, is there a way for the RelatedManager to look ahead to determine that the next function in the chain will be a .count() or .exists()?
No need to as at that point it will be a QuerySet
. Lets keep this ticket focused on the common many-to-many case where .count
and .exists
are directly called.
comment:26 by , 10 months ago
I'm not sure if this is a viable solution. It passing the ManyRelatedManager
to the queryset as an instance variable _many_related_queryset
. (Note: It is failing tests in foreign_object/, so I will need to look into it.)
No need to as at that point it will be a QuerySet
I briefly looked into QuerySet._query
to see if a new optimized sql query could be generated when the aliasMap
lists certain table types. I ended up passing the ManyRelatedManager
to get something working. I'm open to exploring a different approach.
comment:28 by , 9 months ago
Needs tests: | unset |
---|---|
Patch needs improvement: | unset |
Triage Stage: | Accepted → Ready for checkin |
Hello Olivier,
I think you should be able to achieve this by defining a
count()
method on the dynamic class created bycreate_forward_many_to_many_manager
by filteringself.through._default_manager
based onself.instance
and return itscount()
.