Opened 3 years ago

Last modified 17 months ago

#28477 assigned Cleanup/optimization

Strip unused annotations from count queries

Reported by: Tom Forbes Owned by: Tom Forbes
Component: Database layer (models, ORM) Version: master
Severity: Normal Keywords:
Cc: Reupen Shah Triage Stage: Accepted
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: yes
Easy pickings: no UI/UX: no

Description

The query below produces a SQL statement that includes the Count('chapters'), despite not not being used in any filter operations.

Book.objects.annotate(Count('chapters')).count()

It produces the same results as:

Book.objects.count()

Django could be more intelligent about what annotations to include in the query produced by queryset.count(), stripping out any annotations that are not referenced by filters, other annotations or ordering. This should speed up calls to count() with complex annotations.

There seems to be precedent for this: select_related calls are ignored with count() queries.

Change History (11)

comment:1 Changed 3 years ago by Tim Graham

Triage Stage: UnreviewedAccepted

comment:2 Changed 3 years ago by Tom Forbes

Owner: changed from nobody to Tom Forbes
Status: newassigned

comment:3 Changed 3 years ago by Tom Forbes

Same can be done for exists()

comment:5 Changed 2 years ago by Simon Charette

Has patch: set

comment:6 Changed 2 years ago by Tim Graham

Patch needs improvement: set

The PR is still marked [WIP] and there are test failures.

comment:7 Changed 18 months ago by Reupen Shah

I have also run into problems with QuerySet.count() being very slow on annotated query sets. Django uses a subquery for the count but injects a group by into the subquery. This typically causes the database server to deduplicate all matched rows using the group by columns which is, in general, extremely slow when there are a large number of matched rows.

For example, consider the following model:

class Person(models.Model):
    """Person model."""

    first_name = models.TextField()
    last_name = models.TextField()
    country = models.TextField(null=True, blank=True)

and query set:

from django.db.models.functions import Concat
from django.db.models import Value

queryset = Person.objects.annotate(full_name=Concat('first_name', Value(' '), 'last_name'))

queryset.count() generates the following query under PostgreSQL:

SELECT COUNT(*)
FROM
  (SELECT "support_person"."id" AS Col1,
          CONCAT("support_person"."first_name", CONCAT(' ', "support_person"."last_name")) AS "full_name"
   FROM "support_person"
   GROUP BY "support_person"."id",
            CONCAT("support_person"."first_name", CONCAT(' ', "support_person"."last_name"))) subquery

list(queryset) generates:

SELECT "support_person"."id",
       "support_person"."first_name",
       "support_person"."last_name",
       "support_person"."country",
       CONCAT("support_person"."first_name", CONCAT(' ', "support_person"."last_name")) AS "full_name"
FROM "support_person"

I am not entirely sure why the subquery for the count needs to be any different from the query used when the query set itself is evaluated. There are some relevant comments in the source code here: https://github.com/django/django/blob/5deb7a86e8b54d052a3b1dbed1ae7142d362b1c5/django/db/models/sql/query.py#L404-L414

This has all been tested under Django 2.1.7.

comment:8 Changed 18 months ago by Simon Charette

This is somewhat related to #30158 where the compiler is not smart enough to determine if it can exclude subquery annotations from GROUP BY on aggregation.

comment:9 Changed 17 months ago by Reupen Shah

The behaviour is actually a bit more bizarre than I thought.

With the same person model, here are four count variations and the generated queries:

1.

Person.objects.count()
SELECT COUNT(*) AS "__count" FROM "people_person"

2.

Person.objects.values('pk').count()
SELECT COUNT(*) AS "__count" FROM "people_person"

3.

Person.objects.annotate(full_name=Concat('first_name', Value(' '), 'last_name')).count()
SELECT COUNT(*) FROM (
    SELECT "people_person"."id" AS Col1, CONCAT("people_person"."first_name", CONCAT(' ', "people_person"."last_name")) AS "full_name"
    FROM "people_person"
    GROUP BY "people_person"."id", CONCAT("people_person"."first_name", CONCAT(' ', "people_person"."last_name"))
) subquery

4.

Person.objects.annotate(full_name=Concat('first_name', Value(' '), 'last_name')).values('pk').count()
SELECT COUNT(*) FROM (SELECT "people_person"."id" AS Col1 FROM "people_person") subquery

So there's a simple workaround for the case I gave in my last comment in calling .values('pk') before .count(). However, that's not much help if Django or other libraries are the ones calling .count().

On the plus side, I have a better understanding of what the problem is from looking at the Django source code.

Last edited 17 months ago by Reupen Shah (previous) (diff)

comment:10 Changed 17 months ago by Reupen Shah

Cc: Reupen Shah added

comment:11 Changed 17 months ago by Reupen Shah

Just a note to say that the behaviour I was describing was rectified in https://github.com/django/django/pull/11062.

The third case from above now executes the following SQL query:

SELECT COUNT(*) FROM (
    SELECT CONCAT("people_person"."first_name", CONCAT(' ', "people_person"."last_name")) AS "full_name" FROM "people_person"
) subquery

Although the original ticket was about eliminating unneeded annotations completely, the result above is good enough for me, as the group by has been eliminated which was the real performance killer.

Note: See TracTickets for help on using tickets.
Back to Top