Context Navigation

← Previous Ticket
Next Ticket →

#28477 closed Cleanup/optimization (fixed)

Strip unused annotations from count queries

Reported by:	Tom Forbes	Owned by:	Simon Charette
Component:	Database layer (models, ORM)	Version:	dev
Severity:	Normal	Keywords:
Cc:	Reupen Shah	Triage Stage:	Ready for checkin
Has patch:	yes	Needs documentation:	no
Needs tests:	no	Patch needs improvement:	no
Easy pickings:	no	UI/UX:	no

Description

The query below produces a SQL statement that includes the Count('chapters'), despite not not being used in any filter operations.

Book.objects.annotate(Count('chapters')).count()

It produces the same results as:

Book.objects.count()

Django could be more intelligent about what annotations to include in the query produced by queryset.count(), stripping out any annotations that are not referenced by filters, other annotations or ordering. This should speed up calls to count() with complex annotations.

There seems to be precedent for this: select_related calls are ignored with count() queries.

Attachments (1)

ProductAndServiceUsage.json.zip (2.2 KB ) - added by shazad sarwar 4 years ago.

Download all attachments as: .zip

Change History (29)

comment:1 by Tim Graham, 8 years ago

Triage Stage:	Unreviewed → Accepted

comment:2 by Tom Forbes, 8 years ago

Owner:	changed from nobody to Tom Forbes
Status:	new → assigned

comment:3 by Tom Forbes, 8 years ago

Same can be done for exists()

comment:4 by Tom Forbes, 7 years ago

WIP PR: https://github.com/django/django/pull/8928/files

comment:5 by Simon Charette, 7 years ago

Has patch:	set

comment:6 by Tim Graham, 7 years ago

Patch needs improvement:	set

The PR is still marked [WIP] and there are test failures.

comment:7 by Reupen Shah, 6 years ago

I have also run into problems with QuerySet.count() being very slow on annotated query sets. Django uses a subquery for the count but injects a group by into the subquery. This typically causes the database server to deduplicate all matched rows using the group by columns which is, in general, extremely slow when there are a large number of matched rows.

For example, consider the following model:

class Person(models.Model):
    """Person model."""

    first_name = models.TextField()
    last_name = models.TextField()
    country = models.TextField(null=True, blank=True)

and query set:

from django.db.models.functions import Concat
from django.db.models import Value

queryset = Person.objects.annotate(full_name=Concat('first_name', Value(' '), 'last_name'))

queryset.count() generates the following query under PostgreSQL:

SELECT COUNT(*)
FROM
  (SELECT "support_person"."id" AS Col1,
          CONCAT("support_person"."first_name", CONCAT(' ', "support_person"."last_name")) AS "full_name"
   FROM "support_person"
   GROUP BY "support_person"."id",
            CONCAT("support_person"."first_name", CONCAT(' ', "support_person"."last_name"))) subquery

list(queryset) generates:

SELECT "support_person"."id",
       "support_person"."first_name",
       "support_person"."last_name",
       "support_person"."country",
       CONCAT("support_person"."first_name", CONCAT(' ', "support_person"."last_name")) AS "full_name"
FROM "support_person"

I am not entirely sure why the subquery for the count needs to be any different from the query used when the query set itself is evaluated. There are some relevant comments in the source code here: https://github.com/django/django/blob/5deb7a86e8b54d052a3b1dbed1ae7142d362b1c5/django/db/models/sql/query.py#L404-L414

This has all been tested under Django 2.1.7.

comment:8 by Simon Charette, 6 years ago

This is somewhat related to #30158 where the compiler is not smart enough to determine if it can exclude subquery annotations from GROUP BY on aggregation.

comment:9 by Reupen Shah, 6 years ago

The behaviour is actually a bit more bizarre than I thought.

With the same person model, here are four count variations and the generated queries:

Person.objects.count()

SELECT COUNT(*) AS "__count" FROM "people_person"

Person.objects.values('pk').count()

SELECT COUNT(*) AS "__count" FROM "people_person"

Person.objects.annotate(full_name=Concat('first_name', Value(' '), 'last_name')).count()

SELECT COUNT(*) FROM (
    SELECT "people_person"."id" AS Col1, CONCAT("people_person"."first_name", CONCAT(' ', "people_person"."last_name")) AS "full_name"
    FROM "people_person"
    GROUP BY "people_person"."id", CONCAT("people_person"."first_name", CONCAT(' ', "people_person"."last_name"))
) subquery

Person.objects.annotate(full_name=Concat('first_name', Value(' '), 'last_name')).values('pk').count()

SELECT COUNT(*) FROM (SELECT "people_person"."id" AS Col1 FROM "people_person") subquery

So there's a simple workaround for the case I gave in my last comment in calling .values('pk') before .count(). However, that's not much help if Django or other libraries are the ones calling .count().

On the plus side, I have a better understanding of what the problem is from looking at the Django source code.

Last edited 6 years ago by Reupen Shah (previous) (diff)

comment:10 by Reupen Shah, 6 years ago

Cc:	Reupen Shah added

comment:11 by Reupen Shah, 6 years ago

Just a note to say that the behaviour I was describing was rectified in https://github.com/django/django/pull/11062.

The third case from above now executes the following SQL query:

SELECT COUNT(*) FROM (
    SELECT CONCAT("people_person"."first_name", CONCAT(' ', "people_person"."last_name")) AS "full_name" FROM "people_person"
) subquery

Although the original ticket was about eliminating unneeded annotations completely, the result above is good enough for me, as the group by has been eliminated which was the real performance killer.

by shazad sarwar, 4 years ago

Attachment:	ProductAndServiceUsage.json.zip added

comment:12 by Simon Charette, 3 years ago

Owner:	changed from Tom Forbes to Simon Charette

comment:13 by Simon Charette, 3 years ago

Patch needs improvement:	unset

I submitted a revised PR that strip unused annotations not only for .count() for but any usage of .aggregate.

To take example from comment:7 and comment:9 the resulting query is now

Person.objects.annotate(full_name=Concat('first_name', Value(' '), 'last_name')).count()

SELECT COUNT(*) FROM support_person

The adjusted logic goes along these lines.

Instead of systematically performing a subquery wrapping when there are existing annotations only do so when the pre-existing annotation are aggregate or window functions. In both cases, pre-existing aggregation/window or not, strip annotations that are not referenced by the aggregates.

comment:14 by Mariusz Felisiak, 3 years ago

Triage Stage:	Accepted → Ready for checkin

comment:15 by Mariusz Felisiak <felisiak.mariusz@…>, 3 years ago

Resolution:	→ fixed
Status:	assigned → closed

In 59bea9ef:

Fixed #28477 -- Stripped unused annotations on aggregation.

Also avoid an unnecessary pushdown when aggregating over a query that doesn't
have aggregate annotations.

comment:16 by Mariusz Felisiak <felisiak.mariusz@…>, 3 years ago

In a9d2d8d:

Refs #28477 -- Reduced complexity of aggregation over qualify queries.

comment:17 by GitHub <noreply@…>, 3 years ago

In 1003713:

Refs #28477 -- Fixed handling aliased annotations on aggregation.

Just like when using .annotate(), the .alias() method will generate the
necessary JOINs to resolve the alias even if not selected.

Since these JOINs could be multi-valued non-selected aggregates must be
considered to require subquery wrapping as a GROUP BY is required to
combine duplicated tuples from the base table.

Regression in 59bea9efd2768102fc9d3aedda469502c218e9b7.

comment:18 by Mariusz Felisiak <felisiak.mariusz@…>, 2 years ago

In 9daf8b41:

Fixed #34464 -- Fixed queryset aggregation over group by reference.

Regression in 59bea9efd2768102fc9d3aedda469502c218e9b7.

Refs #28477.

Thanks Ian Cubitt for the report.

comment:19 by Mariusz Felisiak <felisiak.mariusz@…>, 2 years ago

In 511dc3db:

[4.2.x] Fixed #34464 -- Fixed queryset aggregation over group by reference.

Regression in 59bea9efd2768102fc9d3aedda469502c218e9b7.

Refs #28477.

Thanks Ian Cubitt for the report.

Backport of 9daf8b4109c3e133eb57349bb44d73cc60c5773c from main

comment:20 by Mariusz Felisiak <felisiak.mariusz@…>, 2 years ago

In e5c844d6:

Fixed #34551 -- Fixed QuerySet.aggregate() crash when referencing subqueries.

Regression in 59bea9efd2768102fc9d3aedda469502c218e9b7.

Refs #28477.

Thanks Denis Roldán and Mariusz for the test.

comment:21 by Mariusz Felisiak <felisiak.mariusz@…>, 2 years ago

In c78a442:

[4.2.x] Fixed #34551 -- Fixed QuerySet.aggregate() crash when referencing subqueries.

Regression in 59bea9efd2768102fc9d3aedda469502c218e9b7.

Refs #28477.

Thanks Denis Roldán and Mariusz for the test.

Backport of e5c844d6f2a4ac6ae674d741b5f1fa2a688cedf4 from main

comment:22 by Dustin Lorres, 2 years ago

Is there a way to opt out of this behavior from application code? After upgrading from 3.2 to 4.2 some of our pagination stopped working correctly and it was traced back to .count() not providing the correct results, likely from the updates in this ticket.

Here is an example of the failure (using PostgreSQL JSON query):

>>> qs = MyModel.objects.annotate(table_element=Func("data", Value("$.MyArray[*]"),function="jsonb_path_query",output_field=JSONField())).filter(pk=1)
>>> qs.count()
1
>>> len(qs)
2

This assumes a very simple model MyModel with a JSON field data that is has an instance (pk of 1) that has something the data field set to:

{
  "MyArray": [{"id": 1, "name": "test"}, {"id": 2, "name": "test2"}]
}

The issue is now the count is not factoring in the annotation which will actually increase the number of rows returned in the queryset (for this example due to the jsonb_path_query which returns a set). It is only counting the number of rows of MyModel which due to the pk filter will only have one row returned.

Is there anyway to force the count operation to expand the query and include the annotation?

I tried to filter on the count of the jsonb path query, but it failed:

>>> qs = MyModel.objects.annotate(my_count=Count(Func("data", Value("$.MyArray[*]"),function="jsonb_path_query",output_field=JSONField()))).filter(pk=1, my_count__gt=0)
>>> qs.count()
Exception:
django.db.utils.NotSupportedError: set-returning functions are not allowed in HAVING

comment:23 by Simon Charette, 2 years ago

Dustin, I think this answer is your way out.

Basically the optimization should be disabled for any set-returning function but since Django only has single native one,Subquery, the optimization is only disabled through duck typing when the subquery = True attribute is set.

In order to truly solve this issue I think we should introduce a new documented Expression.set_returning: bool (better name welcome!) flag that defaults to False but is set to True for Subquery.

The root of this problem is that the ORM simply doesn't support functions that return rows in a generic way (your assignment of output_field=JSONField() is wrong but there's no way to say output_field=Set(JSONField())). Instead it branches out using getattr(expr, "subquery", False) in all cases that it makes the most sense to support them (e.g. __in looups).

Last edited 2 years ago by Simon Charette (previous) (diff)