Opened 23 months ago

Closed 23 months ago

Last modified 22 months ago

#34255 closed Bug (fixed)

Annotation/group by with an expression on psycopg3

Reported by: Guillaume Andreu Sabater Owned by: Mariusz Felisiak
Component: Database layer (models, ORM) Version: dev
Severity: Release blocker Keywords: orm postgres psycopg3 annotation groupby
Cc: Florian Apolloner, Simon Charette Triage Stage: Accepted
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

Given the following code:

import zoneinfo

from django.db import models
from django.db.models.functions import ExtractYear

from django.contrib.postgres.fields import ArrayField

TZ = zoneinfo.ZoneInfo("Europe/Paris")


class JsonBuildObject(models.Func):
    function = "jsonb_build_object"
    output_field = models.JSONField()


class SubqueryArray(models.Subquery):
    template = "ARRAY(%(subquery)s)"
    output_field = ArrayField(base_field=models.JSONField())


class CurveQuerySet(models.QuerySet["Curve"]):
    """Curve QuerySet."""

    def annotate_loads(self) -> "CurveQuerySet":
        """Annotate baseload by year."""
        baseload_qs = (
            Point.objects.filter(curve=models.OuterRef("pk"))
            .annotate(year=ExtractYear("start_at", tzinfo=TZ))
            .values("year")
            .alias(baseload=models.Avg("value"))
            .annotate(
                json=JsonBuildObject(
                    models.Value("year"),
                    models.F("year"),
                    models.Value("baseload"),
                    models.F("baseload"),
                )
            )
            .values("json")
        )

        return self.annotate(_baseloads=SubqueryArray(baseload_qs))


CurveManager = models.Manager.from_queryset(CurveQuerySet)


class Curve(models.Model):
    """Curve."""

    objects = CurveManager()


class Point(models.Model):
    """Curve point."""

    curve = models.ForeignKey(
        Curve,
        on_delete=models.CASCADE,
        related_name="points",
        related_query_name="point",
    )
    start_at = models.DateTimeField()
    value = models.FloatField()

I use the annotate_loads to compute yearly averages (with .values("year") acting as a GROUP BY) and dump the results in a json field.

With psycopg3, from what I've seen, the query params/values are not interpolated in the query anymore, but sent alongside the query to the server.

In my case, it looks like this:

SELECT
    "fail_curve"."id",
    ARRAY(
        SELECT
            jsonb_build_object(
                $1,
                EXTRACT(
                    YEAR
                    FROM
                        U0."start_at" AT TIME ZONE $2
                ),
                $3,
                AVG(U0."value")
            ) AS "json"
        FROM
            "fail_point" U0
        WHERE
            U0."curve_id" = ("fail_curve"."id")
        GROUP BY
            EXTRACT(
                YEAR
                FROM
                    U0."start_at" AT TIME ZONE $4
            )
    ) AS "_baseloads"
FROM
    "fail_curve"
WHERE
    "fail_curve"."id" = $5

But postgres doesn't like: django.db.utils.ProgrammingError: column "u0.start_at" must appear in the GROUP BY clause or be used in an aggregate function

because

EXTRACT(
    YEAR
    FROM
        U0."start_at" AT TIME ZONE $2
)

is different from

EXTRACT(
    YEAR
    FROM
        U0."start_at" AT TIME ZONE $4
)

I tested an updated query using the same placeholder ($4 -> $2) and it worked as expected:

PREPARE working (text, text, text, int) AS
    SELECT
        "fail_curve"."id",
        ARRAY(
            SELECT
                jsonb_build_object(
                    $1,
                    EXTRACT(
                        YEAR
                        FROM
                            U0."start_at" AT TIME ZONE $2
                    ),
                    $3,
                    AVG(U0."value")
                ) AS "json"
            FROM
                "fail_point" U0
            WHERE
                U0."curve_id" = ("fail_curve"."id")
            GROUP BY
                EXTRACT(
                    YEAR
                    FROM
                        U0."start_at" AT TIME ZONE $2
                )
        ) AS "_baseloads"
    FROM
        "fail_curve"
    WHERE
        "fail_curve"."id" = $4
    LIMIT
        1;
EXECUTE working('year', 'Europe/Paris', 'base', 1);

My understanding is as follow:

  • group by is an expression
  • this expression is also used in select
  • they have a different placeholder in the query generated by django/psycopg3
  • postgres rejects it

Let me know if this needs extra details.

Change History (22)

comment:1 by Mariusz Felisiak, 23 months ago

Cc: Florian Apolloner Simon Charette added
Severity: NormalRelease blocker
Triage Stage: UnreviewedAccepted

Thanks for the report. I was to able to reproduce the issue without SubqueryArray() and with built-in JSONObject:

    @override_settings(USE_TZ=True)
    def test_group_by_crash(self):
        tz = zoneinfo.ZoneInfo("Europe/Paris")
        qs = Point.objects.annotate(
            year=ExtractYear("start_at", tzinfo=tz),
        ).values("year").annotate(
            baseload=Avg("value"),
            json=JSONObject(
                year=F("year"),
                baseload=F("baseload"),
            )   
        ).values("json")
        list(qs)

Bug in 09ffc5c1212d4ced58b708cbbf3dfbfb77b782ca.

Last edited 23 months ago by Mariusz Felisiak (previous) (diff)

comment:2 by Simon Charette, 23 months ago

This will be hard to solve as group by aliasing cannot be used since year is stripped from the query. All I can think of right now is doing something along the lines of

SELECT
    "fail_curve"."id",
    ARRAY(
        SELECT "json" FROM (
        SELECT
            EXTRACT(
                YEAR
                FROM
                    U0."start_at" AT TIME ZONE $1
            ) AS "year",
            jsonb_build_object(
                $2,
                "year",
                $3,
                AVG(U0."value")
            ) AS "json"
        FROM
            "fail_point" U0
        WHERE
            U0."curve_id" = ("fail_curve"."id")
        GROUP BY "year"
        )
    ) AS "_baseloads"
FROM
    "fail_curve"
WHERE
    "fail_curve"."id" = $4

When a group alias is ultimately masked, similarly to how we've done it with window functions filtering and value masking.

---

Edit, another solution could be to have the psycopg3 query %s -> $n logic de-duplicate equal values and avoid using multiple placeholders for them. That would result in the query being written as the user reported here at the expense of equality checks (or possibly identity/hash checks which are cheaper) on each query creation which I believe would require adjustments to the current caching strategy.

Version 1, edited 23 months ago by Simon Charette (previous) (next) (diff)

comment:3 by Mariusz Felisiak, 23 months ago

This will be hard to solve as group by aliasing cannot be used since year is stripped from the query.

What about grouping by the entire function? as we do when it's not pushed into subquery, e.g.

Point.objects.annotate(
    year=ExtractYear("start_at", tzinfo=tz),
).values("year").annotate(
    baseload=Avg("value"),
).values("year", "baseload")
SELECT
    EXTRACT(YEAR FROM "ticket_34255_point"."start_at" AT TIME ZONE 'Europe/Paris') AS "year",
    AVG("ticket_34255_point"."value") AS "baseload"
FROM "ticket_34255_point"
GROUP BY 1

comment:4 by Simon Charette, 23 months ago

I believe the issue here is that the year annotation is elided from the query for a good reason, only one column must be present in the ARRAY/SubqueryArray construct.

In other words, the problem is not about grouping by alias or column index but grouping by an expression that is elided from the select clause and thus cannot be grouped by reference.

We need to perform some column masking for the query to ensure the query preserves it semantic

SELECT baseload FROM (
    SELECT
        EXTRACT(YEAR FROM "ticket_34255_point"."start_at" AT TIME ZONE 'Europe/Paris') AS "year",
        AVG("ticket_34255_point"."value") AS "baseload"
    FROM "ticket_34255_point"
    GROUP BY 1
) masked_subquery
Last edited 23 months ago by Simon Charette (previous) (diff)

comment:5 by Mariusz Felisiak, 23 months ago

Just want to add that it also crashes without ARRAY, e.g. for a queryset from comment:1:

SELECT
    JSONB_BUILD_OBJECT(
        (%s)::text,
        EXTRACT(YEAR FROM "ticket_34255_point"."start_at" AT TIME ZONE %s),
        (%s)::text,
        AVG("ticket_34255_point"."value")
    ) AS "json"
FROM "ticket_34255_point"
GROUP BY
    EXTRACT(YEAR FROM "ticket_34255_point"."start_at" AT TIME ZONE %s)
psycopg.errors.GroupingError: column "ticket_34255_point.start_at" must appear in the GROUP BY clause or be used in an aggregate function

comment:6 by Simon Charette, 23 months ago

Right, it doesn't need ARRAY to be present to reproduce the crash but the reason why we can't simply augment the select clause with year in order to GROUP BY 1 (or GROUP BY year) is that it would yield tuples with more columns than the user is asking for when doing an explicit values("json").

If we want to augment the SELECT clause when explicitly grouping by a parameterized expression that is not selected (ExtractYear("start_at", tzinfo=TZ) is this case) then we must make sure to de-augment the resulting query (mask it) to ensure the query preserves its semantic. The usage of ARRAY/SubqueryArray just happens to be a good example of why this is important but the same could be said of an __in lookup that also only expects a single column to be selected.

Last edited 23 months ago by Simon Charette (previous) (diff)

comment:7 by Mariusz Felisiak, 23 months ago

Edit, another solution could be to have the psycopg3 query %s -> $n logic de-duplicate equal values and avoid using multiple placeholders for them. That would result in the query being written as the user reported here at the expense of equality checks (or possibly identity/hash checks which are cheaper) on each query creation which I believe would require adjustments to the current caching strategy.

Yeah, the same as we do on Oracle. Unfortunately, it has side-effects :|

in reply to:  2 comment:8 by Florian Apolloner, 23 months ago

Replying to Simon Charette:

Edit, another solution could be to have the psycopg3 query %s -> $n logic de-duplicate equal values and avoid using multiple placeholders for them.

I would love to see this done for Django in general and not at the individual backend level. That said it is probably to late for 4.1

If we cannot fix the issue at hand nicely I think we could fall back to client side cursors and provide a setting (backend option) to switch back to server side cursors so people can experiment with them. Btw from the looks of it and issues we ran into developing psycopg3 support, this also means that the postgresql backend is basically the only backend using server side bindings or is it just stricter than every other backend?

comment:9 by Mariusz Felisiak, 23 months ago

Regression test for the Django test suite:

  • tests/aggregation/tests.py

    diff --git a/tests/aggregation/tests.py b/tests/aggregation/tests.py
    index 54e1e6f13a..bd3ea55647 100644
    a b from django.db.models.functions import (  
    3434    Cast,
    3535    Coalesce,
    3636    Greatest,
     37    Least,
    3738    Lower,
    3839    Now,
    3940    Pi,
    class AggregateTestCase(TestCase):  
    16141615        ).annotate(total=Count("*"))
    16151616        self.assertEqual(dict(has_long_books_breakdown), {True: 2, False: 3})
    16161617
     1618    def test_group_by_nested_expression_with_params(self):
     1619        books_qs = (
     1620            Book.objects.annotate(
     1621                greatest_pages=Greatest("pages", Value(600))
     1622            )
     1623            .values(
     1624                "greatest_pages",
     1625            )
     1626            .annotate(
     1627                min_pages=Min("pages"),
     1628                least=Least("min_pages", "greatest_pages"),
     1629            )
     1630            .values_list("least", flat=True)
     1631        )
     1632        self.assertCountEqual(books_qs, [300, 946, 1132])
     1633
    16171634    @skipUnlessDBFeature("supports_subqueries_in_group_by")
    16181635    def test_aggregation_subquery_annotation_related_field(self):
    16191636        publisher = Publisher.objects.create(name=self.a9.name, num_awards=2)
Last edited 23 months ago by Mariusz Felisiak (previous) (diff)

comment:10 by Simon Charette, 23 months ago

I would love to see this done for Django in general and not at the individual backend level. That said it is probably to late for 4.1

Agreed that it would be nice to get but there's no way we can squeeze it in for 4.2.

If we cannot fix the issue at hand nicely I think we could fall back to client side cursors and provide a setting (backend option) to switch back to server side cursors so people can experiment with them. Btw from the looks of it and issues we ran into developing psycopg3 support, this also means that the postgresql backend is basically the only backend using server side bindings or is it just stricter than every other backend?

I'm not sure what you mean here by server side cursors. Did you mean server side bindings?

I'm not sure what's causing it to be stricter but if pyscopg>=3 allows for psycopg2 style of parameter bindings we should default to it until we can figure out an adequate solution here. If that doesn't work I can commit time to solving this issue but I fear the solution won't be pretty.

Basically we'd have to adapt SQLCompiler.get_group_by to do the following

  1. Change its return signature so it's allowed to return members that should be added to the SELECT clause
  2. Adjust its compile logic so when a compiled expression is parametrized (len(params) >= 1) and is not part of selected_expr_indices then it's added to of the members that must be added to the SELECT and are returned by the method.
  3. Then adjust .as_sql to the SELECT *original_select (SELECT *original_select, *extra_group_by_select ... GROUP BY ... *extra_group_by_select_indices) when there are extra_group_by_select

The only interesting bit is that we might be able to reuse some logic that was added to implement window function filtering.

That's a lot of work to support server side bindings and that's only for the group by clause, there might be other cases lurking around.

in reply to:  10 comment:11 by Florian Apolloner, 23 months ago

Replying to Simon Charette:

I'm not sure what you mean here by server side cursors. Did you mean server side bindings?

Lol yes

I'm not sure what's causing it to be stricter but if pyscopg>=3 allows for psycopg2 style of parameter bindings we should default to it until we can figure out an adequate solution here. If that doesn't work I can commit time to solving this issue but I fear the solution won't be pretty.

ACK. As Mariusz pointed out we already do some shenanigans for Oracle -- so maybe Oracle does server-side bindings as well? Still I wouldn't surprised if Postgresql is stricter and complains about more things :) But with psycopg3 using server-side bindings we might have more motivation to fix this than for Oracle -- though I think everyone would benefit from named params if we manage to do this. I am just not sure if we can do this easily, we might have to support a mix of %s and %(name) in queries for a while or so :/

As for psycopg2 style parameter bindings: Yes psycopg3 supports this since 3.1 https://www.psycopg.org/psycopg3/docs/advanced/cursors.html#client-side-binding-cursors and those should bring it back to a more psycopg2 like behavior. This was to be my escape hatch all along ;) We still might need a fix or two in Django itself since I am not 100% sure that the geo adapters might work fully in "text"-mode which is required for client-side bindings.

comment:12 by Florian Apolloner, 23 months ago

If my tests didn't fail me, this allows us to switch to client-side binding cursors again while enabling the server-side binding via OPTIONS:

  • django/db/backends/postgresql/base.py

    diff --git a/django/db/backends/postgresql/base.py b/django/db/backends/postgresql/base.py
    index 17a3c7a377..ccf483cebf 100644
    a b class DatabaseWrapper(BaseDatabaseWrapper):  
    222222            conn_params = {**settings_dict["OPTIONS"]}
    223223
    224224        conn_params.pop("assume_role", None)
     225        conn_params.pop("server_side_binding", None)
    225226        conn_params.pop("isolation_level", None)
    226227        if settings_dict["USER"]:
    227228            conn_params["user"] = settings_dict["USER"]
    class DatabaseWrapper(BaseDatabaseWrapper):  
    268269        connection = self.Database.connect(**conn_params)
    269270        if set_isolation_level:
    270271            connection.isolation_level = self.isolation_level
    271         if not is_psycopg3:
     272        if is_psycopg3:
     273            connection.cursor_factory = (
     274                ServerBindingCursor if options.get("server_side_binding") else Cursor
     275            )
     276        else:
    272277            # Register dummy loads() to avoid a round trip from psycopg2's
    273278            # decode to json.dumps() to json.loads(), when using a custom
    274279            # decoder in JSONField.
    275280            psycopg2.extras.register_default_jsonb(
    276281                conn_or_curs=connection, loads=lambda x: x
    277282            )
    278         connection.cursor_factory = Cursor
     283            connection.cursor_factory = Cursor
    279284        return connection
    280285
    281286    def ensure_timezone(self):
    class DatabaseWrapper(BaseDatabaseWrapper):  
    436441
    437442if is_psycopg3:
    438443
    439     class Cursor(Database.Cursor):
    440         """
    441         A subclass of psycopg cursor implementing callproc.
    442         """
    443 
     444    class CursorMixin:
    444445        def callproc(self, name, args=None):
    445446            if not isinstance(name, sql.Identifier):
    446447                name = sql.Identifier(name)
    if is_psycopg3:  
    457458            self.execute(stmt)
    458459            return args
    459460
     461    class ServerBindingCursor(CursorMixin, Database.Cursor):
     462        pass
     463
     464    class Cursor(CursorMixin, Database.ClientCursor):
     465        pass
     466
    460467    class CursorDebugWrapper(BaseCursorDebugWrapper):
    461468        def copy(self, statement):
    462469            with self.debug_sql(statement):

comment:13 by Mariusz Felisiak, 23 months ago

Has patch: set
Owner: changed from nobody to Mariusz Felisiak
Status: newassigned

comment:14 by Mariusz Felisiak <felisiak.mariusz@…>, 23 months ago

In c8a7605:

Refs #34255 -- Bumped required psycopg version to 3.1.8.

comment:15 by Mariusz Felisiak <felisiak.mariusz@…>, 23 months ago

Resolution: fixed
Status: assignedclosed

In 0e2649fd:

Fixed #34255 -- Made PostgreSQL backend use client-side parameters binding with psycopg version 3.

Thanks Guillaume Andreu Sabater for the report.

Co-authored-by: Florian Apolloner <apollo13@…>

comment:16 by Tim Graham, 22 months ago

I suggest that CI have a build that uses server_side_binding so that regressions aren't introduced for that configuration. For example, in the current PostgreSQL build that defaults to client-side binding, ConcatPair.as_postgresql can be removed without any test failures.

test_group_by_nested_expression_with_params needs to be marked as expected failure when using server_side_binding. Is it appropriate to create a follow-up ticket if it might be possible to fix that in the future?

When I read the current "Server-side parameters binding" section in Django's docs, it left me wondering how to choose between the two. Perhaps there's not much more to be said than what's in the psycopg documentation. Though we typically don't document bugs, I wonder if Django's documentation should note the limitation for queries like the one reported here.

comment:17 by Simon Charette, 22 months ago

I agree with Tim, having a bit more documentation on the known limitations/bugs of server-side binding (we should probably create an issue to track the group by one or maybe use #34262 with a dual purpose) and its gotchas which have ties to the usage of prepared statements (#20516).

FWIW I was surprised to learn that psycopg>=3 also prepares statement automatically on the fifth execution which might result in unexpected behaviour.

I'm not sure if this is the case when client-side binding is used though, I assume not as that requires the usage of server-side bindings? confirmed from the docs

Client-side cursors don’t support binary parameters and return values and don’t support prepared statements.

All that to say that we should probably have a section with a summary of what to expect when switching to psycopg>=3 and enabling server-side binding and maybe advertise the support as experimental for this first release?

Last edited 22 months ago by Simon Charette (previous) (diff)

comment:18 by Florian Apolloner, 22 months ago

Yes more docs will certainly not hurt :) @tim Can you submit PRs for the expected failures etc?

FWIW I was surprised to learn that psycopg>=3 also ​prepares statement automatically on the fifth execution which might result in unexpected behaviour.

Yes, but Django disables that, we probably should expose that via options as well ;)

comment:19 by Mariusz Felisiak, 22 months ago

I added pull-requests-pg-server-side-binding builds.

Last edited 22 months ago by Mariusz Felisiak (previous) (diff)

in reply to:  16 comment:20 by Mariusz Felisiak, 22 months ago

Replying to Tim Graham:

I suggest that CI have a build that uses server_side_binding so that regressions aren't introduced for that configuration. For example, in the current PostgreSQL build that defaults to client-side binding, ConcatPair.as_postgresql can be removed without any test failures.

test_group_by_nested_expression_with_params needs to be marked as expected failure when using server_side_binding.

Done, see PR.

comment:21 by GitHub <noreply@…>, 22 months ago

In 82dad11b:

Refs #34255 -- Skipped test_group_by_nested_expression_with_params test on PostgreSQL when server-side binding cursors are used.

Thanks Tim Graham for the review.

comment:22 by Mariusz Felisiak <felisiak.mariusz@…>, 22 months ago

In d42e47f:

[4.2.x] Refs #34255 -- Skipped test_group_by_nested_expression_with_params test on PostgreSQL when server-side binding cursors are used.

Thanks Tim Graham for the review.
Backport of 82dad11bfe45f96f15e2330f58f62919cab9f14c from main

Note: See TracTickets for help on using tickets.
Back to Top