Opened 16 months ago
Closed 15 months ago
#35586 closed Bug (fixed)
Aggregation optimization doesn't account for not referenced set-returning annotations on Postgres
| Reported by: | Devin Cox | Owned by: | Devin Cox | 
|---|---|---|---|
| Component: | Database layer (models, ORM) | Version: | 5.0 | 
| Severity: | Normal | Keywords: | postgres, set-returning, aggregation, annotation | 
| Cc: | Simon Charette, David Sanders, Jacob Walls | Triage Stage: | Ready for checkin | 
| Has patch: | yes | Needs documentation: | no | 
| Needs tests: | no | Patch needs improvement: | no | 
| Easy pickings: | no | UI/UX: | no | 
Description
There is a bug where the queryset count is inaccurate when doing an annotation with JSONB_PATH_QUERY. The count is fixed after the queryset is evaluated.  
I discovered this bug via PageNumberPagination from rest_framework.pagination, specifically in the function paginate_queryset. This Paginator takes in an object_list (in this case a Queryset) and determines the length by first checking if a .count() method is available, otherwise by doing len(object_list). In this case, it is determining length through Queryset.count().
When the Queryset is annotated, and the length of the results increases, the count remains stale--only returning the original count. 
Currently I have a workaround by adding a DISTINCT, INTERSECTION, UNION, or DIFFERENCE to the queryset. Unfortunately, I now also need to have ordering which makes this tricky to continue working around.
I believe this was a regression back in 4.2. This was previously asked about here: https://code.djangoproject.com/ticket/28477#comment:22, and this thread was referred: https://forum.djangoproject.com/t/django-4-2-behavior-change-when-using-arrayagg-on-unnested-arrayfield-postgresql-specific/21547. However, the Unnest approach does not work for our use case.
Here are steps to reproduce:
models.py:
class TestModel(models.Model):
    test_json = models.JSONField(default=dict, blank=True)
    id = models.IntegerField(primary_key=True)
tests.py
from .models import TestModel
from django.contrib.postgres.fields import JSONField
from django.db.models import Func, Value
    def test_bug(self):
        test_model = TestModel.objects.create(
            test_json={
                "test_key" : [
                    {
                        "id" : 1,
                        "name" : "test1"
                    },
                    {
                        "id" : 2,
                        "name" : "test2"
                    }
                ]
            },
            id=1
        )
        test_model.save()
        qs = TestModel.objects.annotate(
            table_element=Func(
                "test_json",
                Value("$.test_key[*]"),
                function="jsonb_path_query",
                output_field=JSONField(),
                subquery=True
            )
        ).filter(pk=1)
        # qs.count() and len(qs) should be equal, but currently they are not. Running qs.count() after len(qs) is currently equal because the queryset was evaluated.
        self.assertEqual(qs.count(), len(qs))
Thank you for any guidance and/or support!
Change History (14)
comment:2 by , 16 months ago
| Component: | Uncategorized → Database layer (models, ORM) | 
|---|
comment:3 by , 16 months ago
| Cc: | added | 
|---|---|
| Triage Stage: | Unreviewed → Accepted | 
This one is tricky, the ORM and its aggregation logic was not built in a way that accounts for annotations spawning rows themselves (instead of through JOINs)  like some Postgres set-returning functions (e.g. UNNEST). These do effectively throw a wrench in the optimization introduced in 4.2 if the set-returning annotation is not referenced by the aggregation.
What is really needed here is likely a Expression.set_returning: bool = False attribute that the ORM can consider to enforce the subquery pushdown and preventing of annotation eliding. 
Something like
- 
      django/db/models/expressions.pydiff --git a/django/db/models/expressions.py b/django/db/models/expressions.py index 4ee22420d9..589fd30b0f 100644 a b class BaseExpression: 182 182 allowed_default = False 183 183 # Can the expression be used during a constraint validation? 184 184 constraint_validation_compatible = True 185 # Does the expression possibly return more than one row? 186 set_returning = False 185 187 186 188 def __init__(self, output_field=None): 187 189 if output_field is not None: 
- 
      django/db/models/sql/query.pydiff --git a/django/db/models/sql/query.py b/django/db/models/sql/query.py index 438bb5ddbd..0eb1a2476c 100644 a b def get_aggregation(self, using, aggregate_exprs): 491 491 ) 492 492 or having 493 493 ) 494 has_set_returning_annotation = any( 495 getattr(annotation, "set_returning", False) 496 for annotation in self.annotations.values() 497 ) 494 498 # Decide if we need to use a subquery. 495 499 # 496 500 # Existing aggregations would cause incorrect results as … … def get_aggregation(self, using, aggregate_exprs): 510 514 or qualify 511 515 or self.distinct 512 516 or self.combinator 517 or has_set_returning_annotation 513 518 ): 514 519 from django.db.models.sql.subqueries import AggregateQuery 
That could then be used like
class JSONPathQuery(Func): function = "jsonb_path_query" output_field = JSONField() set_returning = True JSONPathQuery("test_json", Value("$.test_key[*]"))
The challenge here is that Django doesn't have any core set-returning function (which also explains why this slipped under the radar for so long) so maybe we should also consider adding support for contrib.postgres.Unnest which is a common one that would allow us to ensure proper test coverage and document it as an example of when this flag should be set?
comment:4 by , 16 months ago
| Keywords: | postgres set-returning aggregation annotation added | 
|---|---|
| Summary: | Queryset count and length do not match when annotating using JSONB_PATH_QUERY → Aggregation optimization doesn't account for not referenced set-returning annotations on Postgres | 
comment:5 by , 16 months ago
Adding a bit more details about set-returning functions close equivalents on SQLite, MySQL, and Oracle.
The gist is that a the query
SELECT testmodel.* , jsonb_path_query(testmodel.data, '$.test_key[*]') AS table_element FROM testmodel
can also be expressed as
SELECT testmodel.* , table_element_tbl.value AS table_element FROM testmodel , jsonb_path_query(testmodel.data, '$.test_key[*]') AS table_element_tbl(value)
And if we added support the automatic addition of set_returning (or table_valued functions?) to alias_map (what is used to generate the FROM clause) it could also possibly allow to solve the long standing problem of adding support for features such as generate_series? That would allow the sql.Query.get_aggregation logic to remain unchanged and keep pruning unreferenced aliased (which is really specific to Postgres) as the reference in the FROM clause would still span the rows
SELECT COUNT(*) FROM ( SELECT id FROM testmodel, jsonb_path_query(testmodel.data, '$.test_key[*]') AS table_element_tbl(value) )
comment:6 by , 16 months ago
| Cc: | added | 
|---|
comment:7 by , 16 months ago
| Cc: | added | 
|---|
comment:8 by , 15 months ago
Went ahead and implemented/tested your suggestions and they work great for our individual use case. Also pushed a PR for this. I know there may be additional items now within the scope, such as support for Unnest, but I wanted to go ahead and get the ball rolling. Let me know if there are other ways I can help. Thanks!
comment:9 by , 15 months ago
| Has patch: | set | 
|---|
comment:10 by , 15 months ago
| Owner: | set to | 
|---|---|
| Status: | new → assigned | 
comment:11 by , 15 months ago
| Patch needs improvement: | set | 
|---|
comment:12 by , 15 months ago
| Patch needs improvement: | unset | 
|---|
comment:13 by , 15 months ago
| Triage Stage: | Accepted → Ready for checkin | 
|---|
Edit: the
subquery=Trueparameter was from another recommendation I found, however whether or not it is included has had no impact on the net result.