Opened 5 years ago

Closed 5 years ago

Last modified 2 years ago

#31150 closed Bug (fixed)

Subquery annotations are omitted in group by query section if multiple annotation are declared

Reported by: Johannes Maron Owned by: Mariusz Felisiak
Component: Database layer (models, ORM) Version: 3.0
Severity: Release blocker Keywords:
Cc: Simon Charette Triage Stage: Ready for checkin
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description (last modified by Johannes Maron)

Sadly there is more regression in Django 3.0.2 even after #31094.

Background: It's the same query as #31094. I tried upgrading to Django 3.0.2 and now I get duplicate results. Even tho they query should be distinct. Where on 2.2 the queryset yields 490 results, it's 519 on 3.0.

A quick diff on the queries still reveals a different grouped by section:

This is the new query on 3.0.2:

SELECT DISTINCT "camps_offer"."id",
                "camps_offer"."title",
                "camps_offer"."slug",
                "camps_offer"."is_active",
                "camps_offer"."modified",
                "camps_offer"."created",
                "camps_offer"."provider_id",
                "camps_offer"."activity_type",
                "camps_offer"."description",
                "camps_offer"."highlights",
                "camps_offer"."important_information",
                "camps_offer"."min_age",
                "camps_offer"."max_age",
                "camps_offer"."food",
                "camps_offer"."video",
                "camps_offer"."accommodation",
                "camps_offer"."accommodation_type",
                "camps_offer"."room_type",
                "camps_offer"."room_size_min",
                "camps_offer"."room_size_max",
                "camps_offer"."external_url",
                "camps_offer"."application_form",
                "camps_offer"."caseload",
                "camps_offer"."field_trips",
                MIN(T4."retail_price") AS "min_retail_price",
                (SELECT U0."id"
                 FROM "camps_servicepackage" U0
                          INNER JOIN "camps_region" U2 ON (U0."region_id" = U2."id")
                 WHERE (U0."company_id" = 1 AND U0."option" = "camps_offer"."activity_type" AND
                        ST_Contains(U2."locations", T4."position"))
                 LIMIT 1)              AS "in_package",
                "camps_provider"."id",
                "camps_provider"."title",
                "camps_provider"."slug",
                "camps_provider"."is_active",
                "camps_provider"."modified",
                "camps_provider"."created",
                "camps_provider"."logo",
                "camps_provider"."description",
                "camps_provider"."video",
                "camps_provider"."external_url",
                "camps_provider"."terms",
                "camps_provider"."cancellation_policy",
                "camps_provider"."privacy_policy",
                "camps_provider"."application_form"
FROM "camps_offer"
         LEFT OUTER JOIN "camps_bookingoption" ON ("camps_offer"."id" = "camps_bookingoption"."offer_id")
         INNER JOIN "camps_provider" ON ("camps_offer"."provider_id" = "camps_provider"."id")
         INNER JOIN "camps_bookingoption" T4 ON ("camps_offer"."id" = T4."offer_id")
WHERE ("camps_offer"."is_active" = True AND "camps_provider"."is_active" = True AND
       T4."end" >= STATEMENT_TIMESTAMP() AND T4."is_active" = True AND "camps_offer"."max_age" >= 5 AND
       "camps_offer"."min_age" <= 13 AND (SELECT U0."id"
                                          FROM "camps_servicepackage" U0
                                                   INNER JOIN "camps_region" U2 ON (U0."region_id" = U2."id")
                                          WHERE (U0."company_id" = 1 AND U0."option" = "camps_offer"."activity_type" AND
                                                 ST_Contains(U2."locations", T4."position"))
                                          LIMIT 1) IS NOT NULL)
GROUP BY "camps_offer"."id", T4."position", "camps_provider"."id"
ORDER BY "camps_offer"."created" ASC

And what it was (and should be) on 2.2.9:

SELECT DISTINCT "camps_offer"."id",
                "camps_offer"."title",
                "camps_offer"."slug",
                "camps_offer"."is_active",
                "camps_offer"."modified",
                "camps_offer"."created",
                "camps_offer"."provider_id",
                "camps_offer"."activity_type",
                "camps_offer"."description",
                "camps_offer"."highlights",
                "camps_offer"."important_information",
                "camps_offer"."min_age",
                "camps_offer"."max_age",
                "camps_offer"."food",
                "camps_offer"."video",
                "camps_offer"."accommodation",
                "camps_offer"."accommodation_type",
                "camps_offer"."room_type",
                "camps_offer"."room_size_min",
                "camps_offer"."room_size_max",
                "camps_offer"."external_url",
                "camps_offer"."application_form",
                "camps_offer"."caseload",
                "camps_offer"."field_trips",
                MIN(T4."retail_price") AS "min_retail_price",
                (SELECT U0."id"
                 FROM "camps_servicepackage" U0
                          INNER JOIN "camps_region" U2 ON (U0."region_id" = U2."id")
                 WHERE (U0."company_id" = 1 AND U0."option" = ("camps_offer"."activity_type") AND
                        ST_Contains(U2."locations", (T4."position")))
                 LIMIT 1)              AS "in_package",
                "camps_provider"."id",
                "camps_provider"."title",
                "camps_provider"."slug",
                "camps_provider"."is_active",
                "camps_provider"."modified",
                "camps_provider"."created",
                "camps_provider"."logo",
                "camps_provider"."description",
                "camps_provider"."video",
                "camps_provider"."external_url",
                "camps_provider"."terms",
                "camps_provider"."cancellation_policy",
                "camps_provider"."privacy_policy",
                "camps_provider"."application_form"
FROM "camps_offer"
         LEFT OUTER JOIN "camps_bookingoption" ON ("camps_offer"."id" = "camps_bookingoption"."offer_id")
         INNER JOIN "camps_provider" ON ("camps_offer"."provider_id" = "camps_provider"."id")
         INNER JOIN "camps_bookingoption" T4 ON ("camps_offer"."id" = T4."offer_id")
WHERE ("camps_offer"."is_active" = True AND "camps_provider"."is_active" = True AND
       T4."end" >= (STATEMENT_TIMESTAMP()) AND T4."is_active" = True AND (SELECT U0."id"
                                                                          FROM "camps_servicepackage" U0
                                                                                   INNER JOIN "camps_region" U2 ON (U0."region_id" = U2."id")
                                                                          WHERE (U0."company_id" = 1 AND
                                                                                 U0."option" = ("camps_offer"."activity_type") AND
                                                                                 ST_Contains(U2."locations", (T4."position")))
                                                                          LIMIT 1) IS NOT NULL)
GROUP BY "camps_offer"."id",
         (SELECT U0."id"
          FROM "camps_servicepackage" U0
                   INNER JOIN "camps_region" U2 ON (U0."region_id" = U2."id")
          WHERE (U0."company_id" = 1 AND U0."option" = ("camps_offer"."activity_type") AND
                 ST_Contains(U2."locations", (T4."position")))
          LIMIT 1), "camps_provider"."id"
ORDER BY "camps_offer"."created" ASC

Change History (20)

comment:1 by Johannes Maron, 5 years ago

Description: modified (diff)

comment:2 by Mariusz Felisiak, 5 years ago

Johannes, I need to repeat my gentle request for a queryset (see comment:2 and comment:6 ) Can you provide a queryset? It's really hard to restore the original queryset from a raw SQL.

comment:3 by Mariusz Felisiak, 5 years ago

Resolution: needsinfo
Status: newclosed

I really try but without a queryset I was not able to reproduce this issue.

comment:4 by Johannes Maron, 5 years ago

Resolution: needsinfo
Status: closednew

@felixxm, it seems that Subquery annotation are omitted from the grouping section in 3.0. I will try to create a test case today.

If you add the following test to 2.2.* anywhere in tests.aggregation it will pass, in 3.0 it will fail, because the subquery is missing from the grouping bit:

    def test_filtered_aggregate_ref_subquery_annotation_e_31150(self):
        from django.db.models import OuterRef, Subquery
        aggs = Author.objects.annotate(
            earliest_book_year=Subquery(
                Book.objects.filter(
                    contact__pk=OuterRef('pk'),
                ).order_by('pubdate').values('pubdate__year')[:1]
            ),
        ).annotate(max_id=Max('id'))
        print(aggs.query)
        self.assertIn(str(aggs.query), """
        SELECT "aggregation_author"."id", "aggregation_author"."name", "aggregation_author"."age", (SELECT django_date_extract('year', U0."pubdate") FROM "aggregation_book" U0 WHERE U0."contact_id" = ("aggregation_author"."id") ORDER BY U0."pubdate" ASC  LIMIT 1) AS "earliest_book_year", MAX("aggregation_author"."id") AS "max_id" FROM "aggregation_author" GROUP BY "aggregation_author"."id", "aggregation_author"."name", "aggregation_author"."age", (SELECT django_date_extract('year', U0."pubdate") FROM "aggregation_book" U0 WHERE U0."contact_id" = ("aggregation_author"."id") ORDER BY U0."pubdate" ASC  LIMIT 1)
        """)
Last edited 5 years ago by Carlton Gibson (previous) (diff)

comment:5 by Johannes Maron, 5 years ago

Summary: SELECT DISTINCT is not not so distictSubquery annotations are omitted in group by query section if multiple annotation are declared

comment:6 by Carlton Gibson, 5 years ago

Triage Stage: UnreviewedAccepted

OK, thank you for the test case Joe.
I added the import line so it pops straight into ,aggregation.tests.AggregateTestCase and then, yes, there's a change of behaviour between 2.2 and 3.0.

comment:7 by Simon Charette, 5 years ago

The generated query effectively changed but the returned resultset should be the same since the subquery is a function of the outer query's primary key and the query is grouped by the outer query primary key.

I don't think asserting against the generated query string is a proper test case; the generated SQL will change between Django versions but the returned result set should be the same.

comment:8 by Carlton Gibson, 5 years ago

Yes. I didn’t take the test case as final. Just illustrating the issue. 😀

If it’s equivalent then there’s no issue, but I’m assuming Joe is seeing what he thinks is a substantive regression?

comment:9 by Johannes Maron, 5 years ago

Replying to Simon Charette:

The generated query effectively changed but the returned resultset should be the same since the subquery is a function of the outer query's primary key and the query is grouped by the outer query primary key.

I came from a wrong result set and deduced the error from here on out. It took me a while to figure it out to, but sadly grouping by the outer reference is not the same as grouping by the query. Here an example:

with t_w_douplicates as (
    select abs(n) as pk, n as val from generate_series(-2, 2, 1) n
)

select pk, (
    select n
    from generate_series(2, 2) n
    where n = val
)
from t_w_douplicates
where (
    select n from generate_series(2, 2) n where n = val
    ) is null
GROUP BY pk, (
    select n
    from generate_series(2, 2) n
    where n = val
);

Which returns

(0, null)
(1, null)
(2, null)

And just using the outer ref:

with t_w_douplicates as (
    select abs(n) as pk, n as val from generate_series(-2, 2, 1) n
)

select pk, (
    select n
    from generate_series(2, 2) n
    where n = val
)
from t_w_douplicates
where (
    select n from generate_series(2, 2) n where n = val
    ) is null
GROUP BY pk, val;

Which returns 4 results:

(2, null)
(1, null)
(0, null)
(1, null)

I don't think asserting against the generated query string is a proper test case; the generated SQL will change between Django versions but the returned result set should be the same.

Certainly, haha, that was just to illustrate how to get from the ORM to the SQL query in the description. The SQL sample now, shows how they produce different result sets.

comment:10 by Simon Charette, 5 years ago

Thanks for the extra details, I think I have a better picture of what's happening here.

In the cases where an outer query spawns a multi-valued relationship and the subquery references one of these multi-valued relationship we must keep grouping by the subquery.

In your original reports the INNER JOIN "camps_bookingoption" T4 joins spawns multiple rows for the same camps.Offer and then your subquery filter against (T4."position") which

In ORM terms that means we must keep returning self in Subquery.get_group_by_cols when any of our OuterRef include a __ which could point to a multi-valued relationship.

We could go a bit further and only disable the optimization if any of the outer-ref in the __ chain is multi-valued by relying on getattr(rel, 'many_to_many', True) introspection but in all cases that means adding a way to taint Col instances with whether or not they are/could be multi-valued so get_group_by_cols can return self if any of get_external_cols is multi-valued

I could see this tainting as being useful to warn about #10060 in the future since the logic could be based on this column or alias tainting.

I'm happy to submit a patch or discuss any alternatives but it feels like the above solution solves the reported problem while maintaining the optimization for most of the cases.

Version 0, edited 5 years ago by Simon Charette (next)

comment:11 by Simon Charette, 5 years ago

I give a quick shot at the above solution and came with this rough patch

https://github.com/django/django/compare/master...charettes:ticket-31150

Johannes, could you give it a shot and report whether or not it addresses your problem? Do you confirm your original queryset was using an OuterRef with a reference to an outer-query multi-valued relationship?

comment:12 by Johannes Maron, 5 years ago

I tested Simon's patch applied to 3.0.3 and it now works as expected. So yes, I'd say go ahead!

comment:13 by Mariusz Felisiak, 5 years ago

Owner: changed from nobody to Mariusz Felisiak
Status: newassigned

comment:14 by Mariusz Felisiak, 5 years ago

Has patch: set

comment:15 by Carlton Gibson, 5 years ago

Triage Stage: AcceptedReady for checkin

comment:16 by Mariusz Felisiak <felisiak.mariusz@…>, 5 years ago

Resolution: fixed
Status: assignedclosed

In 7b8fa16:

Fixed #31150 -- Included subqueries that reference related fields in GROUP BY clauses.

Thanks Johannes Hoppe for the report.

Regression in fb3f034f1c63160c0ff13c609acd01c18be12f80.

Co-authored-by: Simon Charette <charette.s@…>

comment:17 by Mariusz Felisiak <felisiak.mariusz@…>, 5 years ago

In c5cfaad2:

[3.0.x] Fixed #31150 -- Included subqueries that reference related fields in GROUP BY clauses.

Thanks Johannes Hoppe for the report.

Regression in fb3f034f1c63160c0ff13c609acd01c18be12f80.

Co-authored-by: Simon Charette <charette.s@…>

Backport of 7b8fa1653fde578ab3a496d9974ab1d4261b8b26 from master

comment:18 by Mariusz Felisiak <felisiak.mariusz@…>, 4 years ago

In 277eea8f:

Fixed #32478 -- Included nested columns referenced by subqueries in GROUP BY on aggregations.

Regression in fb3f034f1c63160c0ff13c609acd01c18be12f80.

Refs #31094, #31150.

Thanks Igor Pejic for the report.

comment:19 by Mariusz Felisiak <felisiak.mariusz@…>, 4 years ago

In 7a6ca01:

[3.2.x] Fixed #32478 -- Included nested columns referenced by subqueries in GROUP BY on aggregations.

Regression in fb3f034f1c63160c0ff13c609acd01c18be12f80.

Refs #31094, #31150.

Thanks Igor Pejic for the report.

Backport of 277eea8fcced7f04f3800617f189beb349a3212e from master

comment:20 by Mariusz Felisiak <felisiak.mariusz@…>, 2 years ago

In b7b28c7c:

Refs #31150 -- Enabled implicit GROUP BY aliases.

This ensures implicit grouping from aggregate function annotations
groups by uncollapsed selected aliases if supported.

The feature is disabled on Oracle because it doesn't support it.

Note: See TracTickets for help on using tickets.
Back to Top