Context Navigation

← Previous Ticket
Next Ticket →

#35586 closed Bug (fixed)

Aggregation optimization doesn't account for not referenced set-returning annotations on Postgres

Reported by:	Devin Cox	Owned by:	Devin Cox
Component:	Database layer (models, ORM)	Version:	5.0
Severity:	Normal	Keywords:	postgres, set-returning, aggregation, annotation
Cc:	Simon Charette, David Sanders, Jacob Walls	Triage Stage:	Ready for checkin
Has patch:	yes	Needs documentation:	no
Needs tests:	no	Patch needs improvement:	no
Easy pickings:	no	UI/UX:	no

Description

There is a bug where the queryset count is inaccurate when doing an annotation with JSONB_PATH_QUERY. The count is fixed after the queryset is evaluated.

I discovered this bug via PageNumberPagination from rest_framework.pagination, specifically in the function paginate_queryset. This Paginator takes in an object_list (in this case a Queryset) and determines the length by first checking if a .count() method is available, otherwise by doing len(object_list). In this case, it is determining length through Queryset.count().

When the Queryset is annotated, and the length of the results increases, the count remains stale--only returning the original count.

Currently I have a workaround by adding a DISTINCT, INTERSECTION, UNION, or DIFFERENCE to the queryset. Unfortunately, I now also need to have ordering which makes this tricky to continue working around.

I believe this was a regression back in 4.2. This was previously asked about here: https://code.djangoproject.com/ticket/28477#comment:22, and this thread was referred: https://forum.djangoproject.com/t/django-4-2-behavior-change-when-using-arrayagg-on-unnested-arrayfield-postgresql-specific/21547. However, the Unnest approach does not work for our use case.

Here are steps to reproduce:

models.py:

class TestModel(models.Model):
    test_json = models.JSONField(default=dict, blank=True)
    id = models.IntegerField(primary_key=True)

tests.py

from .models import TestModel
from django.contrib.postgres.fields import JSONField
from django.db.models import Func, Value

    def test_bug(self):
        test_model = TestModel.objects.create(
            test_json={
                "test_key" : [
                    {
                        "id" : 1,
                        "name" : "test1"
                    },
                    {
                        "id" : 2,
                        "name" : "test2"
                    }
                ]
            },
            id=1
        )
        test_model.save()
        qs = TestModel.objects.annotate(
            table_element=Func(
                "test_json",
                Value("$.test_key[*]"),
                function="jsonb_path_query",
                output_field=JSONField(),
                subquery=True
            )
        ).filter(pk=1)

        # qs.count() and len(qs) should be equal, but currently they are not. Running qs.count() after len(qs) is currently equal because the queryset was evaluated.

        self.assertEqual(qs.count(), len(qs))

Thank you for any guidance and/or support!

Change History (14)

comment:1 by Devin Cox, 12 months ago

Edit: the subquery=True component was from another recommendation I found, however whether or not it is included has had no impact on the net result.

Version 0, edited 12 months ago by Devin Cox (next)

comment:2 by Devin Cox, 12 months ago

Component:	Uncategorized → Database layer (models, ORM)

comment:3 by Simon Charette, 12 months ago

Cc:	Simon Charette added
Triage Stage:	Unreviewed → Accepted

This one is tricky, the ORM and its aggregation logic was not built in a way that accounts for annotations spawning rows themselves (instead of through JOINs) like some Postgres set-returning functions (e.g. UNNEST). These do effectively throw a wrench in the optimization introduced in 4.2 (see #28477) if the set-returning annotation is not referenced by the aggregation.

What is really needed here is likely a Expression.set_returning: bool = False attribute that the ORM can consider to enforce the subquery pushdown and preventing of annotation eliding.

Something like

django/db/models/expressions.py

diff --git a/django/db/models/expressions.py b/django/db/models/expressions.py
index 4ee22420d9..8656979630 100644

                class BaseExpression:
     allowed_default = False
     # Can the expression be used during a constraint validation?
     constraint_validation_compatible = True
+    # Does the expression possibly returns more than one row?
+    set_returning = False
     def __init__(self, output_field=None):
         if output_field is not None:

django/db/models/sql/query.py

diff --git a/django/db/models/sql/query.py b/django/db/models/sql/query.py
index 438bb5ddbd..4f1210c49e 100644

                def get_aggregation(self, using, aggregate_exprs):
+            )
             or having
+        )
+        set_returning_annotations = {
+            alias
+            for alias, annotation in self.annotation_select.items()
+            if getattr(annotation, "set_returning", False)
+        }
         # Decide if we need to use a subquery.
+        #
         # Existing aggregations would cause incorrect results as
-…
+               def get_aggregation(self, using, aggregate_exprs):
             or qualify
             or self.distinct
             or self.combinator
+            or set_returning_annotations
         ):
             from django.db.models.sql.subqueries import AggregateQuery
-…
+               def get_aggregation(self, using, aggregate_exprs):
                     for annotation_alias, annotation in self.annotation_select.items():
                         if annotation.get_group_by_cols():
                             annotation_mask.add(annotation_alias)
+                    # Annotations that possibly return multiple rows cannot
+                    # be masked as they might have an incidence on the query.
+                    annotation_mask |= set_returning_annotations
                     inner_query.set_annotation_mask(annotation_mask)
             # Add aggregates to the outer AggregateQuery. This requires making

That could then be used like

class JSONPathQuery(Func):
    function = "jsonb_path_query"
    output_field = JSONField()
    set_returning = True

JSONPathQuery("test_json", Value("$.test_key[*]"))

The challenge here is that Django doesn't have any core set-returning function (which also explains why this slipped under the radar for so long) so maybe we should also consider adding support for contrib.postgres.Unnest which is a common one that would allow us to ensure proper test coverage and document it as an example of when this flag should be set?

Last edited 12 months ago by Simon Charette (previous) (diff)

comment:4 by Simon Charette, 12 months ago

Keywords:	postgres set-returning aggregation annotation added
Summary:	Queryset count and length do not match when annotating using JSONB_PATH_QUERY → Aggregation optimization doesn't account for not referenced set-returning annotations on Postgres

comment:5 by Simon Charette, 12 months ago

Adding a bit more details about set-returning functions close equivalents on SQLite, MySQL, and Oracle.

The gist is that a the query

SELECT
   testmodel.*
   , jsonb_path_query(testmodel.data, '$.test_key[*]') AS table_element
FROM testmodel

can also be expressed as

SELECT
   testmodel.*
   , table_element_tbl.value AS table_element
FROM
    testmodel
    , jsonb_path_query(testmodel.data, '$.test_key[*]') AS table_element_tbl(value)

And if we added support the automatic addition of set_returning (or table_valued functions?) to alias_map (what is used to generate the FROM clause) it could also possibly allow to solve the long standing problem of adding support for features such as generate_series? That would allow the sql.Query.get_aggregation logic to remain unchanged and keep pruning unreferenced aliased (which is really specific to Postgres) as the reference in the FROM clause would still span the rows

SELECT COUNT(*) FROM (
    SELECT id
    FROM testmodel, jsonb_path_query(testmodel.data, '$.test_key[*]') AS table_element_tbl(value)
)

Last edited 12 months ago by Simon Charette (previous) (diff)

comment:6 by David Sanders, 12 months ago

Cc:	David Sanders added

comment:7 by Jacob Walls, 12 months ago

Cc:	Jacob Walls added

comment:8 by Devin Cox, 12 months ago

Went ahead and implemented/tested your suggestions and they work great for our individual use case. Also pushed a PR for this. I know there may be additional items now within the scope, such as support for Unnest, but I wanted to go ahead and get the ball rolling. Let me know if there are other ways I can help. Thanks!

Last edited 12 months ago by Devin Cox (previous) (diff)

comment:9 by Devin Cox, 12 months ago

Has patch:	set

comment:10 by Sarah Boyce, 11 months ago

Owner:	set to Devin Cox
Status:	new → assigned

comment:11 by Sarah Boyce, 11 months ago

Patch needs improvement:	set

comment:12 by Sarah Boyce, 11 months ago

Patch needs improvement:	unset

comment:13 by Sarah Boyce, 11 months ago

Triage Stage:	Accepted → Ready for checkin

comment:14 by Sarah Boyce <42296566+sarahboyce@…>, 11 months ago

Resolution:	→ fixed
Status:	assigned → closed

In e030839:

Fixed #35586 -- Added support for set-returning database functions.

Aggregation optimization didn't account for not referenced set-returning annotations on Postgres.

Co-authored-by: Simon Charette <charette.s@…>

Note: See TracTickets for help on using tickets.

Download in other formats:

Issues