Opened 10 months ago

Closed 8 months ago

Last modified 8 months ago

#31150 closed Bug (fixed)

Subquery annotations are omitted in group by query section if multiple annotation are declared

Reported by: Johannes Hoppe Owned by: Mariusz Felisiak
Component: Database layer (models, ORM) Version: 3.0
Severity: Release blocker Keywords:
Cc: Simon Charette Triage Stage: Ready for checkin
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description (last modified by Johannes Hoppe)

Sadly there is more regression in Django 3.0.2 even after #31094.

Background: It's the same query as #31094. I tried upgrading to Django 3.0.2 and now I get duplicate results. Even tho they query should be distinct. Where on 2.2 the queryset yields 490 results, it's 519 on 3.0.

A quick diff on the queries still reveals a different grouped by section:

This is the new query on 3.0.2:

SELECT DISTINCT "camps_offer"."id",
                "camps_offer"."title",
                "camps_offer"."slug",
                "camps_offer"."is_active",
                "camps_offer"."modified",
                "camps_offer"."created",
                "camps_offer"."provider_id",
                "camps_offer"."activity_type",
                "camps_offer"."description",
                "camps_offer"."highlights",
                "camps_offer"."important_information",
                "camps_offer"."min_age",
                "camps_offer"."max_age",
                "camps_offer"."food",
                "camps_offer"."video",
                "camps_offer"."accommodation",
                "camps_offer"."accommodation_type",
                "camps_offer"."room_type",
                "camps_offer"."room_size_min",
                "camps_offer"."room_size_max",
                "camps_offer"."external_url",
                "camps_offer"."application_form",
                "camps_offer"."caseload",
                "camps_offer"."field_trips",
                MIN(T4."retail_price") AS "min_retail_price",
                (SELECT U0."id"
                 FROM "camps_servicepackage" U0
                          INNER JOIN "camps_region" U2 ON (U0."region_id" = U2."id")
                 WHERE (U0."company_id" = 1 AND U0."option" = "camps_offer"."activity_type" AND
                        ST_Contains(U2."locations", T4."position"))
                 LIMIT 1)              AS "in_package",
                "camps_provider"."id",
                "camps_provider"."title",
                "camps_provider"."slug",
                "camps_provider"."is_active",
                "camps_provider"."modified",
                "camps_provider"."created",
                "camps_provider"."logo",
                "camps_provider"."description",
                "camps_provider"."video",
                "camps_provider"."external_url",
                "camps_provider"."terms",
                "camps_provider"."cancellation_policy",
                "camps_provider"."privacy_policy",
                "camps_provider"."application_form"
FROM "camps_offer"
         LEFT OUTER JOIN "camps_bookingoption" ON ("camps_offer"."id" = "camps_bookingoption"."offer_id")
         INNER JOIN "camps_provider" ON ("camps_offer"."provider_id" = "camps_provider"."id")
         INNER JOIN "camps_bookingoption" T4 ON ("camps_offer"."id" = T4."offer_id")
WHERE ("camps_offer"."is_active" = True AND "camps_provider"."is_active" = True AND
       T4."end" >= STATEMENT_TIMESTAMP() AND T4."is_active" = True AND "camps_offer"."max_age" >= 5 AND
       "camps_offer"."min_age" <= 13 AND (SELECT U0."id"
                                          FROM "camps_servicepackage" U0
                                                   INNER JOIN "camps_region" U2 ON (U0."region_id" = U2."id")
                                          WHERE (U0."company_id" = 1 AND U0."option" = "camps_offer"."activity_type" AND
                                                 ST_Contains(U2."locations", T4."position"))
                                          LIMIT 1) IS NOT NULL)
GROUP BY "camps_offer"."id", T4."position", "camps_provider"."id"
ORDER BY "camps_offer"."created" ASC

And what it was (and should be) on 2.2.9:

SELECT DISTINCT "camps_offer"."id",
                "camps_offer"."title",
                "camps_offer"."slug",
                "camps_offer"."is_active",
                "camps_offer"."modified",
                "camps_offer"."created",
                "camps_offer"."provider_id",
                "camps_offer"."activity_type",
                "camps_offer"."description",
                "camps_offer"."highlights",
                "camps_offer"."important_information",
                "camps_offer"."min_age",
                "camps_offer"."max_age",
                "camps_offer"."food",
                "camps_offer"."video",
                "camps_offer"."accommodation",
                "camps_offer"."accommodation_type",
                "camps_offer"."room_type",
                "camps_offer"."room_size_min",
                "camps_offer"."room_size_max",
                "camps_offer"."external_url",
                "camps_offer"."application_form",
                "camps_offer"."caseload",
                "camps_offer"."field_trips",
                MIN(T4."retail_price") AS "min_retail_price",
                (SELECT U0."id"
                 FROM "camps_servicepackage" U0
                          INNER JOIN "camps_region" U2 ON (U0."region_id" = U2."id")
                 WHERE (U0."company_id" = 1 AND U0."option" = ("camps_offer"."activity_type") AND
                        ST_Contains(U2."locations", (T4."position")))
                 LIMIT 1)              AS "in_package",
                "camps_provider"."id",
                "camps_provider"."title",
                "camps_provider"."slug",
                "camps_provider"."is_active",
                "camps_provider"."modified",
                "camps_provider"."created",
                "camps_provider"."logo",
                "camps_provider"."description",
                "camps_provider"."video",
                "camps_provider"."external_url",
                "camps_provider"."terms",
                "camps_provider"."cancellation_policy",
                "camps_provider"."privacy_policy",
                "camps_provider"."application_form"
FROM "camps_offer"
         LEFT OUTER JOIN "camps_bookingoption" ON ("camps_offer"."id" = "camps_bookingoption"."offer_id")
         INNER JOIN "camps_provider" ON ("camps_offer"."provider_id" = "camps_provider"."id")
         INNER JOIN "camps_bookingoption" T4 ON ("camps_offer"."id" = T4."offer_id")
WHERE ("camps_offer"."is_active" = True AND "camps_provider"."is_active" = True AND
       T4."end" >= (STATEMENT_TIMESTAMP()) AND T4."is_active" = True AND (SELECT U0."id"
                                                                          FROM "camps_servicepackage" U0
                                                                                   INNER JOIN "camps_region" U2 ON (U0."region_id" = U2."id")
                                                                          WHERE (U0."company_id" = 1 AND
                                                                                 U0."option" = ("camps_offer"."activity_type") AND
                                                                                 ST_Contains(U2."locations", (T4."position")))
                                                                          LIMIT 1) IS NOT NULL)
GROUP BY "camps_offer"."id",
         (SELECT U0."id"
          FROM "camps_servicepackage" U0
                   INNER JOIN "camps_region" U2 ON (U0."region_id" = U2."id")
          WHERE (U0."company_id" = 1 AND U0."option" = ("camps_offer"."activity_type") AND
                 ST_Contains(U2."locations", (T4."position")))
          LIMIT 1), "camps_provider"."id"
ORDER BY "camps_offer"."created" ASC

Change History (17)

comment:1 Changed 10 months ago by Johannes Hoppe

Description: modified (diff)

comment:2 Changed 10 months ago by Mariusz Felisiak

Johannes, I need to repeat my gentle request for a queryset (see comment:2 and comment:6 ) Can you provide a queryset? It's really hard to restore the original queryset from a raw SQL.

comment:3 Changed 9 months ago by Mariusz Felisiak

Resolution: needsinfo
Status: newclosed

I really try but without a queryset I was not able to reproduce this issue.

comment:4 Changed 8 months ago by Johannes Hoppe

Resolution: needsinfo
Status: closednew

@felixxm, it seems that Subquery annotation are omitted from the grouping section in 3.0. I will try to create a test case today.

If you add the following test to 2.2.* anywhere in tests.aggregation it will pass, in 3.0 it will fail, because the subquery is missing from the grouping bit:

    def test_filtered_aggregate_ref_subquery_annotation_e_31150(self):
        from django.db.models import OuterRef, Subquery
        aggs = Author.objects.annotate(
            earliest_book_year=Subquery(
                Book.objects.filter(
                    contact__pk=OuterRef('pk'),
                ).order_by('pubdate').values('pubdate__year')[:1]
            ),
        ).annotate(max_id=Max('id'))
        print(aggs.query)
        self.assertIn(str(aggs.query), """
        SELECT "aggregation_author"."id", "aggregation_author"."name", "aggregation_author"."age", (SELECT django_date_extract('year', U0."pubdate") FROM "aggregation_book" U0 WHERE U0."contact_id" = ("aggregation_author"."id") ORDER BY U0."pubdate" ASC  LIMIT 1) AS "earliest_book_year", MAX("aggregation_author"."id") AS "max_id" FROM "aggregation_author" GROUP BY "aggregation_author"."id", "aggregation_author"."name", "aggregation_author"."age", (SELECT django_date_extract('year', U0."pubdate") FROM "aggregation_book" U0 WHERE U0."contact_id" = ("aggregation_author"."id") ORDER BY U0."pubdate" ASC  LIMIT 1)
        """)
Last edited 8 months ago by Carlton Gibson (previous) (diff)

comment:5 Changed 8 months ago by Johannes Hoppe

Summary: SELECT DISTINCT is not not so distictSubquery annotations are omitted in group by query section if multiple annotation are declared

comment:6 Changed 8 months ago by Carlton Gibson

Triage Stage: UnreviewedAccepted

OK, thank you for the test case Joe.
I added the import line so it pops straight into ,aggregation.tests.AggregateTestCase and then, yes, there's a change of behaviour between 2.2 and 3.0.

comment:7 Changed 8 months ago by Simon Charette

The generated query effectively changed but the returned resultset should be the same since the subquery is a function of the outer query's primary key and the query is grouped by the outer query primary key.

I don't think asserting against the generated query string is a proper test case; the generated SQL will change between Django versions but the returned result set should be the same.

comment:8 Changed 8 months ago by Carlton Gibson

Yes. I didn’t take the test case as final. Just illustrating the issue. 😀

If it’s equivalent then there’s no issue, but I’m assuming Joe is seeing what he thinks is a substantive regression?

comment:9 Changed 8 months ago by Johannes Hoppe

Replying to Simon Charette:

The generated query effectively changed but the returned resultset should be the same since the subquery is a function of the outer query's primary key and the query is grouped by the outer query primary key.

I came from a wrong result set and deduced the error from here on out. It took me a while to figure it out to, but sadly grouping by the outer reference is not the same as grouping by the query. Here an example:

with t_w_douplicates as (
    select abs(n) as pk, n as val from generate_series(-2, 2, 1) n
)

select pk, (
    select n
    from generate_series(2, 2) n
    where n = val
)
from t_w_douplicates
where (
    select n from generate_series(2, 2) n where n = val
    ) is null
GROUP BY pk, (
    select n
    from generate_series(2, 2) n
    where n = val
);

Which returns

(0, null)
(1, null)
(2, null)

And just using the outer ref:

with t_w_douplicates as (
    select abs(n) as pk, n as val from generate_series(-2, 2, 1) n
)

select pk, (
    select n
    from generate_series(2, 2) n
    where n = val
)
from t_w_douplicates
where (
    select n from generate_series(2, 2) n where n = val
    ) is null
GROUP BY pk, val;

Which returns 4 results:

(2, null)
(1, null)
(0, null)
(1, null)

I don't think asserting against the generated query string is a proper test case; the generated SQL will change between Django versions but the returned result set should be the same.

Certainly, haha, that was just to illustrate how to get from the ORM to the SQL query in the description. The SQL sample now, shows how they produce different result sets.

comment:10 Changed 8 months ago by Simon Charette

Thanks for the extra details, I think I have a better picture of what's happening here.

In the cases where an outer query spawns a multi-valued relationship and the subquery references one of these multi-valued relationship we must keep grouping by the subquery.

In your original reports the INNER JOIN "camps_bookingoption" T4 joins spawns multiple rows for the same camps.Offer and then your subquery filter against (T4."position").

In ORM terms that means we must keep returning self in Subquery.get_group_by_cols when any of our OuterRef include a __ which could point to a multi-valued relationship.

We could go a bit further and only disable the optimization if any of the outer-ref in the __ chain is multi-valued by relying on getattr(rel, 'many_to_many', True) introspection but in all cases that means adding a way to taint Col instances with whether or not they are/could be multi-valued so get_group_by_cols can return self if any of get_external_cols is multi-valued

I could see this tainting as being useful to warn about #10060 in the future since the logic could be based on this column or alias tainting.

I'm happy to submit a patch or discuss any alternatives but it feels like the above solution solves the reported problem while maintaining the optimization for most of the cases.

Last edited 8 months ago by Simon Charette (previous) (diff)

comment:11 Changed 8 months ago by Simon Charette

I give a quick shot at the above solution and came with this rough patch

https://github.com/django/django/compare/master...charettes:ticket-31150

Johannes, could you give it a shot and report whether or not it addresses your problem? Do you confirm your original queryset was using an OuterRef with a reference to an outer-query multi-valued relationship?

comment:12 Changed 8 months ago by Johannes Hoppe

I tested Simon's patch applied to 3.0.3 and it now works as expected. So yes, I'd say go ahead!

comment:13 Changed 8 months ago by Mariusz Felisiak

Owner: changed from nobody to Mariusz Felisiak
Status: newassigned

comment:14 Changed 8 months ago by Mariusz Felisiak

Has patch: set

comment:15 Changed 8 months ago by Carlton Gibson

Triage Stage: AcceptedReady for checkin

comment:16 Changed 8 months ago by Mariusz Felisiak <felisiak.mariusz@…>

Resolution: fixed
Status: assignedclosed

In 7b8fa16:

Fixed #31150 -- Included subqueries that reference related fields in GROUP BY clauses.

Thanks Johannes Hoppe for the report.

Regression in fb3f034f1c63160c0ff13c609acd01c18be12f80.

Co-authored-by: Simon Charette <charette.s@…>

comment:17 Changed 8 months ago by Mariusz Felisiak <felisiak.mariusz@…>

In c5cfaad2:

[3.0.x] Fixed #31150 -- Included subqueries that reference related fields in GROUP BY clauses.

Thanks Johannes Hoppe for the report.

Regression in fb3f034f1c63160c0ff13c609acd01c18be12f80.

Co-authored-by: Simon Charette <charette.s@…>

Backport of 7b8fa1653fde578ab3a496d9974ab1d4261b8b26 from master

Note: See TracTickets for help on using tickets.
Back to Top