Context Navigation

← Previous Ticket
Next Ticket →

#35833 closed Bug (invalid)

Annotation yielding an empty set causes entire QuerySet to become empty

Reported by:	Jacob Walls	Owned by:
Component:	Database layer (models, ORM)	Version:	5.1
Severity:	Normal	Keywords:
Cc:	Devin Cox, Simon Charette	Triage Stage:	Unreviewed
Has patch:	no	Needs documentation:	no
Needs tests:	no	Patch needs improvement:	no
Easy pickings:	no	UI/UX:	no

Description

With a nullable JSONField in my model:

class TileModel(models.Model):
   provisionaledits = JSONField(null=True)

If I then annotate with a set-returning function (e.g. JSONB_ARRAY_ELEMENTS on postgres) that happens to return an empty set, the entire QuerySet is zeroed out.

In [1]: from django.db.models import Func, F

In [2]: from arches.app.models.models import TileModel

In [3]: class JsonbArrayElements(Func):
   ...:     arity = 1
   ...:     contains_subquery = True  # Django 5.2: set_returning = True
   ...:     function = "JSONB_ARRAY_ELEMENTS"
   ...: 

In [4]: TileModel.objects.count()
Out[4]: 9

In [5]: TileModel.objects.annotate(empty_set=JsonbArrayElements(F("provisionaledits")))
Out[5]: <QuerySet []>

I distilled this from a more complicated failure. If it helps, here is a query that fails more helpfully when you actually have JSON in the queried field but still can't extract array elements because it's not an array:

In [14]: TileModel.objects.create(provisionaledits={})
Out[14]: <TileModel: TileModel object (6ee450f7-b516-4c61-b6e7-52e876b878fe)>

In [15]: TileModel.objects.annotate(no_longer_empty=JsonbArrayElements(F("provisionaledits")))
Out[15]: DataError: cannot extract elements from an object

I guess I would expect an error in both cases instead of an empty queryset in the former. At the very least, I would only expect the annotation to yield some sort of empty value, not for the whole queryset to silently become empty.

Change History (6)

comment:1 by Sarah Boyce, 5 months ago

Cc:	Devin Cox added
Triage Stage:	Unreviewed → Accepted

Agree this looks like a bug, here's a rough testy thing in case folks find it useful

tests/annotations/models.py

                class Ticket(models.Model):
 class JsonModel(models.Model):
     data = models.JSONField(default=dict, blank=True)
+    data = models.JSONField(null=True)
     id = models.IntegerField(primary_key=True)
     class Meta:

tests/annotations/tests.py

diff --git a/tests/annotations/tests.py b/tests/annotations/tests.py
index 29660a827e..31dfed6ca4 100644

                from decimal import Decimal
 from unittest import skipUnless
 from django.core.exceptions import FieldDoesNotExist, FieldError
 from django.db import connection
+from django.db import connection, DataError
 from django.db.models import (
     BooleanField,
     Case,
-…
+               class NonAggregateAnnotationTestCase(TestCase):
         self.assertEqual(qs.count(), len(qs))
+    @skipUnless(connection.vendor == "postgresql", "PostgreSQL tests")
+    @skipUnlessDBFeature("supports_json_field")
+    def test_set_returning_functions(self):
+        class JsonbArrayElements(Func):
+            function = "JSONB_ARRAY_ELEMENTS"
+            arity = 1
+            set_returning = True
+        JsonModel.objects.create(id=1)
+        qs = JsonModel.objects.annotate(empty_set=JsonbArrayElements(F("data")))
+        self.assertEqual(qs.count(), 0)  # <-- expect to be a DataError
+        JsonModel.objects.create(id=2, data={})
+        with self.assertRaises(DataError):
+            JsonModel.objects.annotate(empty_set=JsonbArrayElements(F("data"))).count()
 class AliasTests(TestCase):

comment:2 by Jacob Walls, 5 months ago

Thanks for the draft test. Looking again, it seems providing NULL to this function is fine, and doesn't raise an error:

% python manage.py dbshell
psql (16.1, server 15.4)
Type "help" for help.

my_project=# SELECT JSONB_ARRAY_ELEMENTS(NULL);
 jsonb_array_elements 
----------------------
(0 rows)

So I guess my expectation is that I can use the annotation as written without interfering with my queryset as a whole. My use case was providing a path lookup that happened to not exist (returning NULL) rather than just querying the root of the field.

comment:3 by Simon Charette, 5 months ago

Cc:	Simon Charette added

I could not reproduce the difference of count and len locally with the provided test case when set_returning = True is appropriately set; both result in returning no rows when dealing with empty sequence references. This is definitely a bug if we can reproduce the mismatch but I couldn't.

So I guess my expectation is that I can use the annotation as written without interfering with my queryset as a whole.

I'm not sure how much of a this is a bug over a misunderstanding of how JSONB_ARRAY_ELEMENTS and set-returning functions actually works. AFAIK when using a set-returning function in a SELECT clause (which is what annotate does) Postgres performs a Cartesian product of the existing rows and the each values returned for the row. The Cartesian product of any set with the empty set is the empty set.

If you want to avoid the empty set product then you must avoid providing a value to JSONB_ARRAY_ELEMENTS that results in one. In the case of NULL this can be avoided using Coalesce but you also have to deal with the empty array case. Something like the following is possibly what you're after?

class JsonbArrayLength(Func):
    arity = 1
    ouput_field = models.IntegerField()
    function = "jsonb_array_length"


empty_set_condition = Q(provisionaledits=None) | LesserThan(JsonbArrayLength("provisionaledits"), 1)
empty_set_value = Value(["<empty>"], JSONField())

TileModel.objects.annotate(
    annotation=JsonbArrayElements(
        Case(
            When(empty_set_condition, then=empty_set_value),
            default="provisionaledits",
        )
    )
)

So to summarize here I don't think that returning no rows when JSONB_ARRAY_ELEMENTS results in an empty set is a bug (it's just how set-returning annotations work) and I cannot reproduce the mismatch between count() and len against main when setting set_returning on the function used for annotation (it correctly prevents it from being elided when performing the aggregation).

Last edited 5 months ago by Simon Charette (previous) (diff)

comment:4 by Jacob Walls, 5 months ago

Resolution:	→ invalid
Status:	new → closed

Thanks very much, Simon. This is user error, then. I guess rather than a Cartesian product I was expecting an implicit subquery so that anything that happened during annotate would not "filter" the QuerySet. I didn't intend to suggest there was a count/len mismatch; I was just trying to show that I had some number of model instances persisted in general, and I expected to see all of them no matter what happened in annotate(). But your explanation helps me see why that's wrong.

Thanks for the draft query, too. While this ticket was being triaged, I ended up using an explicit subquery, along these lines:

# Start from an FK-related model
ResourceInstance.objects.annotate(
    annotation=ArraySubquery(
        TileModel.objects.filter(resourceinstance_id=OuterRef("resourceinstanceid"))
        .annotate(as_array=JsonbArrayElements(F("more_complex_path_lookup"))
        .values("as_array")
    )
)

Then, in some queries where my json path lookup needed to be resilient against missing keys in I ended up wrapping the expression given to the JsonbArrayElements in a Case as you suggested.

I would have come back and closed the ticket if I could have put all these pieces together more quickly (sorry!), but I guess I was still bewitched by the assumption that whatever I did inside annotate() wouldn't possibly filter down my queryset.

comment:5 by Natalia Bidart, 5 months ago

Triage Stage:	Accepted → Unreviewed

comment:6 by Simon Charette, 5 months ago

The way Postgres set-returning annotations work is definitely unintuitive if you are used to subqueries which forces you to have a single row and implicitly use OUTER instead of INNER semantics so I wouldn't be too harsh on yourself here.

What I mean is that the fact subqueries which are somewhat set-returning constructs behave in an OUTER JOIN LATERAL manner

SELECT 
    author.id, (
    SELECT post.id
    FROM post
    WHERE post.author_id = author.id
    ORDER BY id LIMIT 1
) first_post_id
FROM author

Is equivalent to

SELECT
  author.id,
  first_post.id
FROM author
LEFT OUTER JOIN LATERAL (
  SELECT post.*
  FROM post
  WHERE post.author_id = author.id
  ORDER BY id
  LIMIT 1
) first_post ON true

id	first_post_id
1	1
2	null

While other set-returning functions behave in a INNER JOIN LATERAL manner

SELECT
  author.id,
  jsonb_array_elements(jsonb_path_query_array(post_ids, '$[0 to 0]')) first_post_id
FROM author

Is equivalent to

SELECT
  author.id,
  first_post.id AS first_post_id
FROM author
INNER JOIN LATERAL (
  SELECT * FROM jsonb_array_elements(jsonb_path_query_array(post_ids, '$[0 to 0]'))
) first_post(id) ON true

id	first_post_id
1	1

Definitely breaks the principle of least astonishment, at least it did for me!

Last edited 5 months ago by Simon Charette (previous) (diff)

Note: See TracTickets for help on using tickets.

Download in other formats:

Issues