Opened 3 years ago
Last modified 22 months ago
#28477 assigned Cleanup/optimization
Strip unused annotations from count queries
Reported by: | Tom Forbes | Owned by: | Tom Forbes |
---|---|---|---|
Component: | Database layer (models, ORM) | Version: | master |
Severity: | Normal | Keywords: | |
Cc: | Reupen Shah | Triage Stage: | Accepted |
Has patch: | yes | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | yes |
Easy pickings: | no | UI/UX: | no |
Description
The query below produces a SQL statement that includes the Count('chapters'), despite not not being used in any filter operations.
Book.objects.annotate(Count('chapters')).count()
It produces the same results as:
Book.objects.count()
Django could be more intelligent about what annotations to include in the query produced by queryset.count()
, stripping out any annotations that are not referenced by filters, other annotations or ordering. This should speed up calls to count() with complex annotations.
There seems to be precedent for this: select_related calls are ignored with count() queries.
Change History (11)
comment:1 Changed 3 years ago by
Triage Stage: | Unreviewed → Accepted |
---|
comment:2 Changed 3 years ago by
Owner: | changed from nobody to Tom Forbes |
---|---|
Status: | new → assigned |
comment:3 Changed 3 years ago by
comment:5 Changed 2 years ago by
Has patch: | set |
---|
comment:6 Changed 2 years ago by
Patch needs improvement: | set |
---|
The PR is still marked [WIP] and there are test failures.
comment:7 Changed 2 years ago by
I have also run into problems with QuerySet.count()
being very slow on annotated query sets. Django uses a subquery for the count but injects a group by into the subquery. This typically causes the database server to deduplicate all matched rows using the group by columns which is, in general, extremely slow when there are a large number of matched rows.
For example, consider the following model:
class Person(models.Model): """Person model.""" first_name = models.TextField() last_name = models.TextField() country = models.TextField(null=True, blank=True)
and query set:
from django.db.models.functions import Concat from django.db.models import Value queryset = Person.objects.annotate(full_name=Concat('first_name', Value(' '), 'last_name'))
queryset.count()
generates the following query under PostgreSQL:
SELECT COUNT(*) FROM (SELECT "support_person"."id" AS Col1, CONCAT("support_person"."first_name", CONCAT(' ', "support_person"."last_name")) AS "full_name" FROM "support_person" GROUP BY "support_person"."id", CONCAT("support_person"."first_name", CONCAT(' ', "support_person"."last_name"))) subquery
list(queryset)
generates:
SELECT "support_person"."id", "support_person"."first_name", "support_person"."last_name", "support_person"."country", CONCAT("support_person"."first_name", CONCAT(' ', "support_person"."last_name")) AS "full_name" FROM "support_person"
I am not entirely sure why the subquery for the count needs to be any different from the query used when the query set itself is evaluated. There are some relevant comments in the source code here: https://github.com/django/django/blob/5deb7a86e8b54d052a3b1dbed1ae7142d362b1c5/django/db/models/sql/query.py#L404-L414
This has all been tested under Django 2.1.7.
comment:8 Changed 2 years ago by
This is somewhat related to #30158 where the compiler is not smart enough to determine if it can exclude subquery annotations from GROUP BY
on aggregation.
comment:9 Changed 23 months ago by
The behaviour is actually a bit more bizarre than I thought.
With the same person model, here are four count variations and the generated queries:
1.
Person.objects.count()
SELECT COUNT(*) AS "__count" FROM "people_person"
2.
Person.objects.values('pk').count()
SELECT COUNT(*) AS "__count" FROM "people_person"
3.
Person.objects.annotate(full_name=Concat('first_name', Value(' '), 'last_name')).count()
SELECT COUNT(*) FROM ( SELECT "people_person"."id" AS Col1, CONCAT("people_person"."first_name", CONCAT(' ', "people_person"."last_name")) AS "full_name" FROM "people_person" GROUP BY "people_person"."id", CONCAT("people_person"."first_name", CONCAT(' ', "people_person"."last_name")) ) subquery
4.
Person.objects.annotate(full_name=Concat('first_name', Value(' '), 'last_name')).values('pk').count()
SELECT COUNT(*) FROM (SELECT "people_person"."id" AS Col1 FROM "people_person") subquery
So there's a simple workaround for the case I gave in my last comment in calling .values('pk')
before .count()
. However, that's not much help if Django or other libraries are the ones calling .count()
.
On the plus side, I have a better understanding of what the problem is from looking at the Django source code.
comment:10 Changed 23 months ago by
Cc: | Reupen Shah added |
---|
comment:11 Changed 22 months ago by
Just a note to say that the behaviour I was describing was rectified in https://github.com/django/django/pull/11062.
The third case from above now executes the following SQL query:
SELECT COUNT(*) FROM ( SELECT CONCAT("people_person"."first_name", CONCAT(' ', "people_person"."last_name")) AS "full_name" FROM "people_person" ) subquery
Although the original ticket was about eliminating unneeded annotations completely, the result above is good enough for me, as the group by has been eliminated which was the real performance killer.
Same can be done for
exists()