Opened 2 years ago

Closed 2 years ago

#33655 closed Cleanup/optimization (fixed)

Unnecessary column in a GROUP BY clause with QuerySet.exists()

Reported by: Marc Perrin Owned by: Marc Perrin
Component: Database layer (models, ORM) Version: 4.0
Severity: Normal Keywords: exists group by
Cc: Simon Charette Triage Stage: Ready for checkin
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

I've got a question about the interaction between exists() and group by.

For example something like:

Manager.values('field').annotate(cnt=Count('id')).filter(cnt__gt=1).exists()

The corresp. query with PostgreSQL looks like this:

SELECT (1) AS "a"
FROM "app_model"
GROUP BY "app_model"."field", (1)
HAVING COUNT("app_model"."id") > 1
LIMIT 1

exists() (see https://github.com/django/django/blob/470708f50d8c13a50770893b8d7181f5218bf3ac/django/db/models/sql/query.py#L563) clears the SELECT clause and replaces it by (if I understand correctly) a hardcoded value 1 (as "a"), along with a limit of 1, which makes sense to me.

But get_group_by() (see https://github.com/django/django/blob/6b53114dd862ec97c282fdfdc83579cbd6d1560d/django/db/models/sql/compiler.py#L79) pushes this hardcoded value to the GROUP BY clause and we end up with the query above.

Now, on PostgreSQL, that works, but to me it sounds like it works by luck/lucky robustness... and certainly the same query without the , (1) in the GROUP BY clause yields the same result.
The problem is that outside of PostgreSQL that GROUP BY clause can be downright invalid...
Note that if I alter exists() to use {'a': 2} instead of {'a': 1}, it does not work anymore (on PostgreSQL), because (2) in the SELECT clause means the hardcoded number 2 while (2) in the GROUP BY clause presumably (??) refers to the 2nd expr of the SELECT clause, but we only got one...

My feeling is that get_group_by() should avoid adding extra/raw sql (or at least extra/raw pure values, if we can detect that...) to the group by expressions?

NB: the existing/old ticket that seems most related to my question would probably be: https://code.djangoproject.com/ticket/24835

Change History (15)

comment:1 by Mariusz Felisiak, 2 years ago

Cc: Simon Charette added
Summary: Interaction between exists() and group byUnnecessary column in a GROUP BY clause with QuerySet.exists()
Triage Stage: UnreviewedAccepted
Type: UncategorizedCleanup/optimization

Thanks for the report. I agree that (1) is unnecessary in the GROUP BY clause unfortunately we cannot always remove constants from it because folks can use them as an alias for n-th column (as you already noticed). Maybe we could select only the first column from the GROUP BY clause instead:

  • django/db/models/sql/query.py

    diff --git a/django/db/models/sql/query.py b/django/db/models/sql/query.py
    index 64e7927f7a..940d3141d1 100644
    a b class Query(BaseExpression):  
    582582        q.clear_ordering(force=True)
    583583        if limit:
    584584            q.set_limits(high=1)
    585         q.add_extra({"a": 1}, None, None, None, None, None)
    586         q.set_extra_mask(["a"])
     585        if q.group_by and not q.select:
     586            q.add_select_col(q.group_by[0], "a")
     587        else:
     588            q.add_extra({"a": 1}, None , None, None, None, None)
     589            q.set_extra_mask(["a"])
    587590        return q
    588591
    589592    def has_results(self, using):

Tentatively accepted for future investigation.

in reply to:  1 comment:2 by Marc Perrin, 2 years ago

Thanks for the quick answer.

Your suggested workaround seems interesting indeed (provided the value of said column does not affect the exists() return value - for example if it's 0?)

NB: I didn't mean removing all constants from GROUP BY, rather: removing the automatic transmission of constants from the SELECT clause to the GROUP BY clause, since their meaning automatically switches from "constant number n" to "nth element of the select clause" which imo can never be intended.

comment:3 by Mariusz Felisiak, 2 years ago

provided the value of said column does not affect the exists() return value - for example if it's 0?

No, it's not. It's enough for the query to return any rows.

comment:4 by Simon Charette, 2 years ago

Maybe we should stop relying on extra and use Value(1) instead which already disables grouping.

  • django/db/models/sql/query.py

    diff --git a/django/db/models/sql/query.py b/django/db/models/sql/query.py
    index 64e7927f7a..1097de4f79 100644
    a b  
    2727    OuterRef,
    2828    Ref,
    2929    ResolvedOuterRef,
     30    Value,
    3031)
    3132from django.db.models.fields import Field
    3233from django.db.models.fields.related_lookups import MultiColSource
    def exists(self, using, limit=True):  
    582583        q.clear_ordering(force=True)
    583584        if limit:
    584585            q.set_limits(high=1)
    585         q.add_extra({"a": 1}, None, None, None, None, None)
    586         q.set_extra_mask(["a"])
     586        q.add_annotation(Value(1), "a")
    587587        return q
    588588
    589589    def has_results(self, using):

Anything that relies on extra is a black box to the compiler as it could be an aggregate function or not.

in reply to:  4 ; comment:5 by Mariusz Felisiak, 2 years ago

Replying to Simon Charette:

Maybe we should stop relying on extra and use Value(1) instead which already disables grouping.

Great idea!

in reply to:  5 comment:6 by Marc Perrin, 2 years ago

Sounds good indeed!

Replying to Mariusz Felisiak:

Replying to Simon Charette:

Maybe we should stop relying on extra and use Value(1) instead which already disables grouping.

Great idea!

comment:7 by Mariusz Felisiak, 2 years ago

Marc, Would you like to prepare a patch?

comment:8 by Marc Perrin, 2 years ago

Yeah, I guess I can do that, with Simon's suggestion the whole thing should be straightforward enough.
I'll take the ticket - I'll have to look up the contribution guide etc. / especially about the tests, as I haven't really contributed before.

comment:9 by Marc Perrin, 2 years ago

Owner: changed from nobody to Marc Perrin
Status: newassigned

in reply to:  8 comment:10 by Mariusz Felisiak, 2 years ago

Replying to Marc Perrin:

Yeah, I guess I can do that, with Simon's suggestion the whole thing should be straightforward enough.
I'll take the ticket - I'll have to look up the contribution guide etc. / especially about the tests, as I haven't really contributed before.

Great, thanks! In this ticket we are forced to check a generated SQL, IMO it's fine to add assertions to an existing test e.g. aggregation.tests.AggregateTestCase.test_aggregation_subquery_annotation_exists:

        with self.assertNumQueries(1) as ctx:
            self.assertTrue(publisher_qs.exists())
        sql = ctx[0]["sql"]
        self.assertIn("SELECT 1 ", sql)
        self.assertNotIn("(1)", sql)
        # There is just one GROUP BY clause (zero commas means at most one
        # clause).
        self.assertEqual(sql[sql.index("GROUP BY") :].count(", "), 0)

comment:11 by Marc Perrin, 2 years ago

Yeah, looks like the most relevant existing test case indeed, thanks for the pointer

comment:12 by Marc Perrin, 2 years ago

Though after debugging this test, it might be a bit more complex than we want/need (e.g. there are 4 fields in the GROUP BY anyway...) and I might just add a new test e.g. "test_aggregation_annotation_exists" that gets close to what I mentioned originally (e.g.

Book.objects.values('publisher').annotate(cnt=Count('isbn')).filter(cnt__gt=1).exists()

)

For reference, with the fix, this query goes from:

SELECT (1) AS "a" FROM "aggregation_book" GROUP BY "aggregation_book"."publisher_id", (1) HAVING COUNT("aggregation_book"."isbn") > 1 LIMIT 1

to:

SELECT 1 AS "a" FROM "aggregation_book" GROUP BY "aggregation_book"."publisher_id" HAVING COUNT("aggregation_book"."isbn") > 1 LIMIT 1

(with the test default settings i.e. sqlite I guess)
I'll try to be diligent with the asserts though, because if the test must work on all backends, even

self.assertIn("SELECT 1 ", sql)

isn't good, because we can get SELECT TOP 1 1 (...) instead of SELECT 1 (...) LIMIT 1

So perhaps a simple

self.assertNotIn(", (1)", sql)

would be enough (I'm not really fond of your assert about GROUP BY either, because it's not really targeting the GROUP BY clause, it includes everything after it incl. HAVING clause)

Last edited 2 years ago by Marc Perrin (previous) (diff)

comment:13 by Marc Perrin, 2 years ago

But anyway I've started the PR: https://github.com/django/django/pull/15630 se we can discuss the details in there

comment:14 by Mariusz Felisiak, 2 years ago

Has patch: set
Triage Stage: AcceptedReady for checkin

comment:15 by Mariusz Felisiak <felisiak.mariusz@…>, 2 years ago

Resolution: fixed
Status: assignedclosed

In 4282fd4:

Fixed #33655 -- Removed unnecessary constant from GROUP BY clause for QuerySet.exists().

Note: See TracTickets for help on using tickets.
Back to Top