#35586 closed Bug (fixed)

Aggregation optimization doesn't account for not referenced set-returning annotations on Postgres

Reported by: Devin Cox Owned by: Devin Cox
Component: Database layer (models, ORM) Version: 5.0
Severity: Normal Keywords: postgres, set-returning, aggregation, annotation
Cc: Simon Charette, David Sanders, Jacob Walls Triage Stage: Ready for checkin
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

There is a bug where the queryset count is inaccurate when doing an annotation with JSONB_PATH_QUERY. The count is fixed after the queryset is evaluated.

I discovered this bug via PageNumberPagination from rest_framework.pagination, specifically in the function paginate_queryset. This Paginator takes in an object_list (in this case a Queryset) and determines the length by first checking if a .count() method is available, otherwise by doing len(object_list). In this case, it is determining length through Queryset.count().

When the Queryset is annotated, and the length of the results increases, the count remains stale--only returning the original count.

Currently I have a workaround by adding a DISTINCT, INTERSECTION, UNION, or DIFFERENCE to the queryset. Unfortunately, I now also need to have ordering which makes this tricky to continue working around.

I believe this was a regression back in 4.2. This was previously asked about here: https://code.djangoproject.com/ticket/28477#comment:22, and this thread was referred: https://forum.djangoproject.com/t/django-4-2-behavior-change-when-using-arrayagg-on-unnested-arrayfield-postgresql-specific/21547. However, the Unnest approach does not work for our use case.

Here are steps to reproduce:

models.py:

class TestModel(models.Model):
    test_json = models.JSONField(default=dict, blank=True)
    id = models.IntegerField(primary_key=True)

tests.py

from .models import TestModel
from django.contrib.postgres.fields import JSONField
from django.db.models import Func, Value

    def test_bug(self):
        test_model = TestModel.objects.create(
            test_json={
                "test_key" : [
                    {
                        "id" : 1,
                        "name" : "test1"
                    },
                    {
                        "id" : 2,
                        "name" : "test2"
                    }
                ]
            },
            id=1
        )
        test_model.save()
        qs = TestModel.objects.annotate(
            table_element=Func(
                "test_json",
                Value("$.test_key[*]"),
                function="jsonb_path_query",
                output_field=JSONField(),
                subquery=True
            )
        ).filter(pk=1)

        # qs.count() and len(qs) should be equal, but currently they are not. Running qs.count() after len(qs) is currently equal because the queryset was evaluated.

        self.assertEqual(qs.count(), len(qs))

Thank you for any guidance and/or support!

Change History (14)

comment:1 by Devin Cox, 16 months ago

Edit: the subquery=True parameter was from another recommendation I found, however whether or not it is included has had no impact on the net result.

Last edited 16 months ago by Devin Cox (previous) (diff)

comment:2 by Devin Cox, 16 months ago

Component: UncategorizedDatabase layer (models, ORM)

comment:3 by Simon Charette, 16 months ago

Cc: Simon Charette added
Triage Stage: UnreviewedAccepted

This one is tricky, the ORM and its aggregation logic was not built in a way that accounts for annotations spawning rows themselves (instead of through JOINs) like some Postgres set-returning functions (e.g. UNNEST). These do effectively throw a wrench in the optimization introduced in 4.2 (see #28477) if the set-returning annotation is not referenced by the aggregation.

What is really needed here is likely a Expression.set_returning: bool = False attribute that the ORM can consider to enforce the subquery pushdown and preventing of annotation eliding.

Something like

  • django/db/models/expressions.py

    diff --git a/django/db/models/expressions.py b/django/db/models/expressions.py
    index 4ee22420d9..8656979630 100644
    a b class BaseExpression:  
    182182    allowed_default = False
    183183    # Can the expression be used during a constraint validation?
    184184    constraint_validation_compatible = True
     185    # Does the expression possibly returns more than one row?
     186    set_returning = False
    185187
    186188    def __init__(self, output_field=None):
    187189        if output_field is not None:
  • django/db/models/sql/query.py

    diff --git a/django/db/models/sql/query.py b/django/db/models/sql/query.py
    index 438bb5ddbd..4f1210c49e 100644
    a b def get_aggregation(self, using, aggregate_exprs):  
    491491            )
    492492            or having
    493493        )
     494        set_returning_annotations = {
     495            alias
     496            for alias, annotation in self.annotation_select.items()
     497            if getattr(annotation, "set_returning", False)
     498        }
    494499        # Decide if we need to use a subquery.
    495500        #
    496501        # Existing aggregations would cause incorrect results as
    def get_aggregation(self, using, aggregate_exprs):  
    510515            or qualify
    511516            or self.distinct
    512517            or self.combinator
     518            or set_returning_annotations
    513519        ):
    514520            from django.db.models.sql.subqueries import AggregateQuery
    515521
    def get_aggregation(self, using, aggregate_exprs):  
    550556                    for annotation_alias, annotation in self.annotation_select.items():
    551557                        if annotation.get_group_by_cols():
    552558                            annotation_mask.add(annotation_alias)
     559                    # Annotations that possibly return multiple rows cannot
     560                    # be masked as they might have an incidence on the query.
     561                    annotation_mask |= set_returning_annotations
    553562                    inner_query.set_annotation_mask(annotation_mask)
    554563
    555564            # Add aggregates to the outer AggregateQuery. This requires making

That could then be used like

class JSONPathQuery(Func):
    function = "jsonb_path_query"
    output_field = JSONField()
    set_returning = True

JSONPathQuery("test_json", Value("$.test_key[*]"))

The challenge here is that Django doesn't have any core set-returning function (which also explains why this slipped under the radar for so long) so maybe we should also consider adding support for contrib.postgres.Unnest which is a common one that would allow us to ensure proper test coverage and document it as an example of when this flag should be set?

Last edited 16 months ago by Simon Charette (previous) (diff)

comment:4 by Simon Charette, 16 months ago

Keywords: postgres set-returning aggregation annotation added
Summary: Queryset count and length do not match when annotating using JSONB_PATH_QUERYAggregation optimization doesn't account for not referenced set-returning annotations on Postgres

comment:5 by Simon Charette, 16 months ago

Adding a bit more details about set-returning functions close equivalents on SQLite, MySQL, and Oracle.

The gist is that a the query

SELECT
   testmodel.*
   , jsonb_path_query(testmodel.data, '$.test_key[*]') AS table_element
FROM testmodel

can also be expressed as

SELECT
   testmodel.*
   , table_element_tbl.value AS table_element
FROM
    testmodel
    , jsonb_path_query(testmodel.data, '$.test_key[*]') AS table_element_tbl(value)

And if we added support the automatic addition of set_returning (or table_valued functions?) to alias_map (what is used to generate the FROM clause) it could also possibly allow to solve the long standing problem of adding support for features such as generate_series? That would allow the sql.Query.get_aggregation logic to remain unchanged and keep pruning unreferenced aliased (which is really specific to Postgres) as the reference in the FROM clause would still span the rows

SELECT COUNT(*) FROM (
    SELECT id
    FROM testmodel, jsonb_path_query(testmodel.data, '$.test_key[*]') AS table_element_tbl(value)
)
Last edited 16 months ago by Simon Charette (previous) (diff)

comment:6 by David Sanders, 16 months ago

Cc: David Sanders added

comment:7 by Jacob Walls, 16 months ago

Cc: Jacob Walls added

comment:8 by Devin Cox, 15 months ago

Went ahead and implemented/tested your suggestions and they work great for our individual use case. Also pushed a PR for this. I know there may be additional items now within the scope, such as support for Unnest, but I wanted to go ahead and get the ball rolling. Let me know if there are other ways I can help. Thanks!

Last edited 15 months ago by Devin Cox (previous) (diff)

comment:9 by Devin Cox, 15 months ago

Has patch: set

comment:10 by Sarah Boyce, 15 months ago

Owner: set to Devin Cox
Status: newassigned

comment:11 by Sarah Boyce, 15 months ago

Patch needs improvement: set

comment:12 by Sarah Boyce, 15 months ago

Patch needs improvement: unset

comment:13 by Sarah Boyce, 15 months ago

Triage Stage: AcceptedReady for checkin

comment:14 by Sarah Boyce <42296566+sarahboyce@…>, 15 months ago

Resolution: fixed
Status: assignedclosed

In e030839:

Fixed #35586 -- Added support for set-returning database functions.

Aggregation optimization didn't account for not referenced set-returning annotations on Postgres.

Co-authored-by: Simon Charette <charette.s@…>

Note: See TracTickets for help on using tickets.
Back to Top