Opened 4 years ago

Closed 4 years ago

Last modified 4 years ago

#31097 closed Bug (fixed)

StringAgg And ArrayAgg with filtering in subquery generates invalid string_agg() SQL function call

Reported by: Laurent Tramoy Owned by: David Wobrock
Component: contrib.postgres Version: 3.0
Severity: Normal Keywords:
Cc: Simon Charette Triage Stage: Accepted
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

This issue is similar to issue #30315 but using the filter keyword instead of ordering, so I'll reuse the same structure. I tested it on Django 2.2.8

Consider the following models (in the people app):

from django.db import models
from django.contrib.postgres.fields import ArrayField


class Person(models.Model):
    """Person model."""

    first_name = models.TextField()
    last_name = models.TextField()
    country = models.TextField(null=True, blank=True)


class Book(models.Model):
    """Book model."""
    category = ArrayField(models.CharField(max_length=32), null=True, default=list)
    people = models.ManyToManyField(Person)

with the following objects:

p1 = Person.objects.create(first_name="John", last_name="Doe")
p2 = Person.objects.create(first_name="Jane", last_name="Doe")

b1 = Book.objects.create()
b1.people.add(p1)

b2 = Book.objects.create()
b2.people.add(p1, p2)

b3 = Book.objects.create()

This fails:

from django.contrib.postgres.aggregates import StringAgg
from django.db.models import Subquery, OuterRef, Q

from people.models import  Book

subquery = (
    Book.objects.annotate(
        _annotated_value=StringAgg(
            "people__first_name", ",", filter=Q(people__first_name__startswith="Ja")
        ),
    )
    .filter(pk=OuterRef("pk"),)
    .exclude(_annotated_value="")
    .values("id")
)
Book.objects.filter(id__in=Subquery(subquery))

The generated SQL is

SELECT
    "people_book"."id",
    "people_book"."category"
FROM
    "people_book"
WHERE
    "people_book"."id" IN (
        SELECT
            U0."id"
        FROM
            "people_book" U0
        LEFT OUTER JOIN "people_book_people" U1 ON (U0."id" = U1."book_id")
    LEFT OUTER JOIN "people_person" U2 ON (U1."person_id" = U2."id")
WHERE
    U0."id" = ("people_book"."id")
GROUP BY
    U0."id"
HAVING
    NOT (STRING_AGG(, ',') FILTER (WHERE U2."first_name") =))

as we can see, the `STRING_AGG argument is wrong.

The same query without the filter works fine:

subquery = (
    Book.objects.annotate(
        _annotated_value=StringAgg(
            "people__first_name", ","
        ),
    )
    .filter(pk=OuterRef("pk"),)
    .exclude(_annotated_value="")
    .values("id")
)
Book.objects.filter(id__in=Subquery(subquery))

SQL query:

SELECT
    "people_book"."id",
    "people_book"."category"
FROM
    "people_book"
WHERE
    "people_book"."id" IN (
        SELECT
            U0."id"
        FROM
            "people_book" U0
        LEFT OUTER JOIN "people_book_people" U1 ON (U0. "id" = U1."book_id")
    LEFT OUTER JOIN "people_person" U2 ON (U1."person_id" = U2."id")
WHERE
    U0."id" = ("people_book"."id")
GROUP BY
    U0."id"
HAVING
    NOT (STRING_AGG(U2."first_name", ',') =))

as well as the same query without using Subquery:

query = (
    Book.objects.annotate(
        _annotated_value=StringAgg(
            "people__first_name", ",", filter=Q(people__first_name__startswith="Ja")
        ),
    )
    .exclude(_annotated_value="")
)


SQL query:

SELECT
    "people_book"."id",
    "people_book"."category",
    STRING_AGG("people_person"."first_name", ',') FILTER (WHERE "people_person"."first_name"::text LIKE Ja %) AS "_annotated_value"
FROM
    "people_book"
    LEFT OUTER JOIN "people_book_people" ON ("people_book"."id" = "people_book_people"."book_id")
    LEFT OUTER JOIN "people_person" ON ("people_book_people"."person_id" = "people_person"."id")
GROUP BY
    "people_book"."id"
HAVING
    NOT (STRING_AGG("people_person"."first_name", ',') FILTER (WHERE ("people_person"."first_name"::text LIKE Ja %)) =)

Just to make sure I wasn't using an old version, I tried the query from #30315, which works fine.

NB: I originally noticed that bug using ArrayAgg instead of StringAgg, but I encountered another bug using ArrayAgg (or at least what I think is a bug) while writing the example code, I'll report it later if needed

Change History (29)

comment:1 Changed 4 years ago by Simon Charette

The fact the grouping conditions are surfaced to the outer query makes me believe it could be related to another issue that was addressed in 3.0. Could you try reproducing with Django 3.0 as well.

comment:2 Changed 4 years ago by Baptiste Mispelon

Triage Stage: UnreviewedAccepted
Version: 2.23.0

Reproduced on current master (ff00a053478fee06bdfb4206c6d4e079e98640ff) using the code in the ticket.

Here's the prettified query reported by --debug-sql after the testcase failure:

Line 
1SELECT
2  "aaaaaaticket31097_book"."id",
3  "aaaaaaticket31097_book"."category"
4FROM
5  "aaaaaaticket31097_book"
6WHERE
7  "aaaaaaticket31097_book"."id" IN (
8    SELECT
9      U0."id"
10    FROM
11      "aaaaaaticket31097_book" U0
12      LEFT OUTER JOIN "aaaaaaticket31097_book_people" U1 ON (U0."id" = U1."book_id")
13      LEFT OUTER JOIN "aaaaaaticket31097_person" U2 ON (U1."person_id" = U2."id")
14    WHERE
15      U0."id" = "aaaaaaticket31097_book"."id"
16    GROUP BY
17      U0."id"
18    HAVING
19      NOT (
20        STRING_AGG(, ',') FILTER (
21          WHERE
22            U2."first_name"
23        ) = ''
24      )
25  )

comment:3 Changed 4 years ago by David Wobrock

Owner: set to David Wobrock
Status: newassigned

Hi, I'm claiming the ticket if nobody bothers.
I already wrote a testcase that reproduced the bug and started to pin down the source of the issue.

The first (very) technical info I found:
It seems to be linked to the logic of relabeling aliases of the the filter clause in django.db.models.sql.query.Query.change_aliases, in the self.where.relabel_aliases(change_map) logic.
The first node itself is a WhereNode containing (AND: (NOT (AND: <django.db.models.lookups.Exact object at 0x7f9ea9bfc9b0>)))
When going down the filter tree, we first hit again a WhereNode with the negation: (NOT (AND: <django.db.models.lookups.Exact object at 0x7f9ea9bfc9b0>)) for our exclude clause.
Finally coming to the Exact object, we relabel the left hand-side of the condition, being the StringAgg.
This calls the django.db.models.expressions.BaseExpression.relabeled_clone where we depend on the fields found by self.get_source_expressions()

Either, those source expressions are incomplete because of the way the OrderableAggMixin plays with the self.get_source_expressions() (not calling the super), or the change_map which is passed down is missing some potential aliases.

I'll try to keep investigating and propose a patch in the coming weeks :)

comment:4 Changed 4 years ago by Simon Charette

Cc: Simon Charette added

Did you manage to reproduce against the latest master as well? It looks like it might have been fixed by commits related to #31094.

If it's still the case you seem to be heading in the right direction; there's likely an issue with OrderableAggMixin and the use of filter given how both require some special source expressions handling.

comment:5 Changed 4 years ago by David Wobrock

Yes, I do reproduce on the latest master. I'll keep digging in that direction.

comment:6 Changed 4 years ago by David Wobrock

Has patch: set

comment:7 Changed 4 years ago by Simon Charette

Patch needs improvement: set

Left comments for improvement on the tests but the corrective seems good to me.

comment:8 Changed 4 years ago by Simon Charette

Patch needs improvement: unset

Adjusted patch seems ready for a final review.

comment:9 Changed 4 years ago by Mariusz Felisiak <felisiak.mariusz@…>

In 7d44aeb3:

Refs #31097 -- Added tests for filter in ArrayAgg and StringAgg.

comment:10 Changed 4 years ago by Mariusz Felisiak <felisiak.mariusz@…>

Resolution: fixed
Status: assignedclosed

In 2f565f84:

Fixed #31097 -- Fixed crash of ArrayAgg and StringAgg with filter when used in Subquery.

comment:11 in reply to:  10 ; Changed 4 years ago by John Fieldman

Replying to Mariusz Felisiak <felisiak.mariusz@…>:

In 2f565f84:

Fixed #31097 -- Fixed crash of ArrayAgg and StringAgg with filter when used in Subquery.

Can this fix be backported to 3.0.x and 2.2.x also, please?

Last edited 4 years ago by John Fieldman (previous) (diff)

comment:12 in reply to:  11 ; Changed 4 years ago by Mariusz Felisiak

Replying to John Fieldman:

Can this fix be backported to 3.0.x and 2.2.x also, please?

This patch doesn't qualify for a backport.

comment:13 in reply to:  12 Changed 4 years ago by John Fieldman

Replying to felixxm:

This patch doesn't qualify for a backport.

The issue is happening for 3.0.x and 2.2.x also, should it be solved differently for these versions, or won't be solved at all?

comment:14 Changed 4 years ago by Simon Charette

John, based on the investigation that lead to this patch the issue has been present since the feature was introduced.

Per our backporting policy this means it doesn't qualify for a backport to 3.0.x or 2.2.x anymore.

See https://docs.djangoproject.com/en/3.0/internals/release-process/ for more details.

comment:15 Changed 4 years ago by Reupen Shah

It does seem that there was a regression in Django 2.2 to me.

The following query (using the models from above) works in Django 2.1.15 and master, but not 2.2.9 and 3.0.x:

from django.contrib.postgres.aggregates import StringAgg
from django.db.models import CharField, OuterRef, Q, Subquery

subquery = Book.objects.annotate(
    _annotated_value=StringAgg(
        'people__first_name', ', ',
        filter=Q(people__first_name='Jane'),
        output_field=CharField(),
    ),
).filter(
    pk=OuterRef('pk'),
).values(
    '_annotated_value',
)

queryset = Book.objects.annotate(names=Subquery(subquery))

print([book.names for book in queryset])

(I can file a separate bug for that if you want, though.)

Last edited 4 years ago by Reupen Shah (previous) (diff)

comment:16 Changed 4 years ago by Simon Charette

Reupen, even if this was a regression in 2.2 this release series is only receiving security and data loss backport at this point.

Can you confirm this is the case Mariusz?

comment:17 Changed 4 years ago by Mariusz Felisiak

Yes it's a regression in 96199e562dcc409ab4bdc2b2146fa7cf73c7c5fe (Django 2.2) so when we added support for ordering we also introduced a regression in filter. Nevertheless Django 2.2 is already in extended support so this doesn't qualify for a backport.

comment:18 Changed 4 years ago by Claude Paroz

But maybe a backport to 3.0.x?

comment:19 Changed 4 years ago by Mariusz Felisiak

Normally we backport only regression from the latest version (in the mainstream support), e.g. we'll not backport regressions introduced in Django < 3.0 to the Django 3.0, unless it's "critical".

comment:20 Changed 4 years ago by Claude Paroz

Our documentation is not clear about that. As for me, a regression is a regression and as such should be backported to supported versions.

comment:21 Changed 4 years ago by Mariusz Felisiak

Do you think that we should backport fixes for regressions created few versions ago? and treat them as release blockers. IMO backporting regressions from the latest version is a reasonable policy, i.e. if no one reported a regression in 9 months then it's not really crucial.

Last edited 4 years ago by Mariusz Felisiak (previous) (diff)

comment:22 Changed 4 years ago by Carlton Gibson

Tim beat me with a stick plenty of times when I was starting :) — Data loss bugs and security issues to all supported versions. Regressions and bugs in new features to current main version. I did think we should be more liberal but backporting itself isn't risk free, so the more I see it the more I think this is correct.

A change would be a change in policy though so we'd need to go via the discussion group and technical board I think.

comment:23 Changed 4 years ago by Claude Paroz

Carlton, I'm not sure about your comment. Does "Regressions and bugs in new features to current main version." mean you also are in favor of a 3.0.x backport?
Mariusz is suggesting treating differently the case where a regression is found after a release cycle passed. I'm not in favor of this distinction.

comment:24 Changed 4 years ago by Carlton Gibson

Hi Claude. Good. I think "Regressions" here and in the supported versions docs isn't quite clear: it should say "Regressions introduced since the last release" — that's how it's been applied that I've seen, and I think it's also the historical intention. (Guess but, I suspect widening that might lead us to the land of several hundred release blockers.) I shall propose a small edit to the docs.

Having discussed with Mariusz, we're happy to backport this fix to 3.0. It's similar enough to #30315, which is in 3.0, and is a quite central use-case.

Last edited 4 years ago by Carlton Gibson (previous) (diff)

comment:25 Changed 4 years ago by Claude Paroz

Carlton,

Guess but, I suspect widening that might lead us to the land of several hundred release blockers

I don't think so. Thanks to its extensive test suite and its wide usage, I think regressions from older versions are very rare in Django.

Good if this is backported to 3.0.x.

comment:26 Changed 4 years ago by Carlton Gibson <carlton@…>

In 927c903f:

Refs #31097 -- Added release notes for 2f565f84aca136d9cc4e4d061f3196ddf9358ab8.

.

comment:27 Changed 4 years ago by Carlton Gibson <carlton.gibson@…>

In 0e6cf439:

[3.0.x] Fixed #31097 -- Fixed crash of ArrayAgg and StringAgg with filter when used in Subquery.

Backport of 2f565f84aca136d9cc4e4d061f3196ddf9358ab8 from master

comment:28 Changed 4 years ago by Carlton Gibson <carlton.gibson@…>

In 6aac9f6b:

[3.0.x] Refs #31097 -- Added release notes for 2f565f84aca136d9cc4e4d061f3196ddf9358ab8.

.

Backport of 927c903f3cd25c817c21738328b53991c035b415 from master

comment:29 Changed 4 years ago by Mariusz Felisiak <felisiak.mariusz@…>

In 5b6778b8:

[3.0.x] Refs #31097 -- Added django.db.models.Q import to contrib.postgres aggregates tests.

Note: See TracTickets for help on using tickets.
Back to Top