Opened 12 months ago

Closed 11 months ago

Last modified 11 months ago

#34978 closed Bug (fixed)

Annotating through an aggregate with RawSQL() raises 1056 "Can't group on" on MySQL/MariaDB.

Reported by: Matthew Somerville Owned by: Mariusz Felisiak
Component: Database layer (models, ORM) Version: 4.2
Severity: Release blocker Keywords:
Cc: Simon Charette Triage Stage: Ready for checkin
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

I have some code I am trying to update from Django 3.2 to 4.2, using MariaDB, that has worked in previous Django versions, and works fine in 4.1, but fails in 4.2. You can see an example GitHub Action output at https://github.com/dracos/Theatricalia/actions/runs/6922955832 showing 3 and 4.1 passing, but 4.2 failing.

One problem query is the following code:

    seen = user.visit_set.annotate(min_press_date=Min('production__place__press_date')).annotate(best_date=RawSQL('MIN(IFNULL(productions_place.press_date, IF(productions_place.end_date!="", productions_place.end_date, productions_place.start_date)))', ())).order_by('-best_date')

In Django 4.1, this produces the following SQL, which works fine:

SELECT `productions_visit`.`id`, `productions_visit`.`production_id`, `productions_visit`.`user_id`, `productions_visit`.`recommend`, `productions_visit`.`date`, MIN(`productions_place`.`press_date`) AS `min_press_date`, (MIN(IFNULL(productions_place.press_date, IF(productions_place.end_date!="", productions_place.end_date, productions_place.start_date)))) AS `best_date` FROM `productions_visit` INNER JOIN `productions_production` ON (`productions_visit`.`production_id` = `productions_production`.`id`) LEFT OUTER JOIN `productions_place` ON (`productions_production`.`id` = `productions_place`.`production_id`) WHERE `productions_visit`.`user_id` = 1 GROUP BY `productions_visit`.`id` ORDER BY `best_date` DESC

Whilst the SQL produced by Django 4.2 is:

SELECT `productions_visit`.`id`, `productions_visit`.`production_id`, `productions_visit`.`user_id`, `productions_visit`.`recommend`, `productions_visit`.`date`, MIN(`productions_place`.`press_date`) AS `min_press_date`, (MIN(IFNULL(productions_place.press_date, IF(productions_place.end_date!="", productions_place.end_date, productions_place.start_date)))) AS `best_date` FROM `productions_visit` INNER JOIN `productions_production` ON (`productions_visit`.`production_id` = `productions_production`.`id`) LEFT OUTER JOIN `productions_place` ON (`productions_production`.`id` = `productions_place`.`production_id`) WHERE `productions_visit`.`user_id` = 1 GROUP BY `productions_visit`.`id`, 7 ORDER BY 7 DESC LIMIT 21

It has added a group by on column 7 (which is best_date) and this then gives a "1056 Can't group by best_date" error from MySQL/MariaDB.

I have bisected Django between 4.1 and 4.2, and the problem was introduced by the fix for #31331 in 041551d716b69ee7c81199eee86a2d10a72e15ab. Somehow that fix means my annotation is now being included in the group by when it shouldn't be, as it's an aggregate per visit ID, as far as I understand. Let me know if you need any other details.

Change History (17)

comment:1 by Mariusz Felisiak, 12 months ago

Cc: Simon Charette added
Severity: NormalRelease blocker
Summary: Annotating through an aggregate with MySQL/MariaDB raises 1056 "Can't group on" errorAnnotating through an aggregate with RawSQL() raises 1056 "Can't group on" on MySQL/MariaDB.
Triage Stage: UnreviewedAccepted

Thanks for the report. Regression in 041551d716b69ee7c81199eee86a2d10a72e15ab.

comment:2 by David Sanders, 12 months ago

Failing test in case that helps:

--- a/tests/aggregation/tests.py
+++ b/tests/aggregation/tests.py
@@ -2135,6 +2135,39 @@ class AggregateTestCase(TestCase):
         )
         self.assertEqual(list(author_qs), [337])

+    def test_annotate_raw_expression(self):
+        qs = (
+            Book.objects.values("publisher")
+            .annotate(min_price=Min("price"))
+            .annotate(max_price=RawSQL("MAX(price)", params=[]))
+            .values("publisher__name", "min_price", "max_price")
+        )
+        self.assertEqual(
+            list(qs),
+            [
+                {
+                    "max_price": Decimal("30.00"),
+                    "min_price": Decimal("29.69"),
+                    "publisher__name": "Apress",
+                },
+                {
+                    "max_price": Decimal("23.09"),
+                    "min_price": Decimal("23.09"),
+                    "publisher__name": "Sams",
+                },
+                {
+                    "max_price": Decimal("82.80"),
+                    "min_price": Decimal("29.69"),
+                    "publisher__name": "Prentice Hall",
+                },
+                {
+                    "max_price": Decimal("75.00"),
+                    "min_price": Decimal("75.00"),
+                    "publisher__name": "Morgan Kaufmann",
+                },
+            ],
+        )
+

 class AggregateAnnotationPruningTests(TestCase):
     @classmethod

This one's interesting because we can't dig into RawSQL to determine whether it's valid in a GROUP BY or not.

EDIT: Actually this test fails before 041551d716b69ee7c81199eee86a2d10a72e15ab. The difference is the values('publisher') vs the OP's example of sorting by the raw expression. The latter includes the pk so it can be reduced, the former cannot.

Last edited 12 months ago by David Sanders (previous) (diff)

comment:3 by Simon Charette, 12 months ago

Owner: changed from nobody to Simon Charette
Status: newassigned

comment:4 by Simon Charette, 12 months ago

See #26602 Provide a way to manage grouping with RawSQL which was closed because due to a lack of justified use case.

I would say that the same can be said about this ticket as the reported problem can be fixed in multiple ways with the provided ORM interfaces.

The most obvious and non-invasive one is to use Min(RawSQL(...)) instead

annotate(
    best_date=Min(
        RawSQL(
            'IFNULL(productions_place.press_date, IF(productions_place.end_date!="", productions_place.end_date, productions_place.start_date))',
            (),
        ),
    )
)

a second one is to use expressions entirely

annotate(best_date=Min(Coalesce("press_date", Case(When(end_date="", then=F("start_date")), default=F("end_date"))))

a third one, assuming the user wants to stick to IF and IFNULL

from django.db.models import Func

class If(Func):
    function = "IF"

class IfNull(Func):
    function = "IFNULL"
 

annotate(best_date=Min(
    IfNull("press_date", If(end_date="", "start_date", "end_date"))
))

This one's interesting because we can't dig into RawSQL to determine whether it's valid in a GROUP BY or not.

That's right David, it's the crux of the issue. The ORM must choose between grouping or not by the expression and since most RawSQL usage are assumed to not be aggregation it determines that it must do so as it's a blackbox from an introspection perspective.

The reason why 041551d716b69ee7c81199eee86a2d10a72e15ab broke the reported use that is that prior to this change the ORM supported a non-standard feature of MySQL disabled in recent versions that allowed grouping solely by the primary key of the first entry in FROM. It's important to note that using RawSQL to aggregate was only working on MySQL due to this feature and never worked on any of the other backends that follow the functional dependency detection in GROUP BY clauses as specified by the SQL:1999 standard.

The nail in the coffin of this feature was went it was discovered that it had a peculiarity when dealing with subqueries #31331 that would have required a significant amount of work to get working.

I could see us go three ways about dealing with this issue

  1. Revert the changes made in 041551d716b69ee7c81199eee86a2d10a72e15ab while making allows_group_by_pk based on the absence of ONLY_FULL_GROUP_BY. Note that this won't resolve the aggregation over the annotation of a dependant subquery but will restore the usage of RawSQL for aggregation on MySQL only when ONLY_FULL_GROUP_BY is disabled.
  2. 1 + adjustments to the allows_group_by_pk to special case dependant subquery annotations
  3. Adjust the 4.2 existing release notes about this change to better communicate that this version of Django removed support for doing RawSQL aggregations on MySQL and that they should use proper expressions instead going forward.

Due to lack of demonstration that some aggregates or window function cannot be expressed using the documented primitives offered by the ORM, that non-ONLY_FULL_GROUP_BY model is a non standard MySQL feature that is not enabled by default since 8.0, and that this change happens to standardize the usage of RawSQL for aggregations on all backends I'd be inclined to go with 3.

Last edited 12 months ago by Simon Charette (previous) (diff)

in reply to:  4 comment:5 by Natalia Bidart, 12 months ago

Replying to Simon Charette:

I could see us go three ways about dealing with this issue

  1. Revert the changes made in 041551d716b69ee7c81199eee86a2d10a72e15ab while making allows_group_by_pk based on the absence of ONLY_FULL_GROUP_BY. Note that this won't resolve the aggregation over the annotation of a dependant subquery but will restore the usage of RawSQL for aggregation on MySQL only when ONLY_FULL_GROUP_BY is disabled.
  2. 1 + adjustments to the allows_group_by_pk to special case dependant subquery annotations
  3. Adjust the 4.2 existing release notes about this change to better communicate that this version of Django removed support for doing RawSQL aggregations on MySQL and that they should use proper expressions instead going forward.

Given your rationale and considering that there are multiple workarounds as you proposed, I'm also in favor of option 3.

comment:6 by Matthew Somerville, 12 months ago

Given the same query does already use Min(), as quoted, I am at a loss as to why I wasn't using it in the other part! Thanks for the investigation; an addition to the 4.2 release notes explaining the change further (I did read those, but only saw the reference to changes in "third-party database backends") would be welcome, thank you.

With the change to your first option¹, my code passes on Django 4.2 fine with ONLY_FULL_GROUP_BY turned off, thank you; turning that option on I get a lot of 1055 errors, even with a query that's only e.g. Play.objects.annotate(Count('authors')), without any RawSQL, I get (1055, "'theatricalia.plays_play.title' isn't in GROUP BY"), but assume that's my issue somehow.

¹ If you're interested, regarding your second/third code change options, press_date is a DateField but start_date/end_date are ApproximateDateFields from my https://pypi.org/project/django-date-extensions/ so that becomes a bit more complex.

comment:7 by Simon Charette, 12 months ago

I get a lot of 1055 errors, even with a query that's only e.g. Play.objects.annotate(Count('authors')), without any RawSQL, I get (1055, "'theatricalia.plays_play.title' isn't in GROUP BY"), but assume that's my issue somehow.

That's interesting. If it's happening for models of the form

class Author(models.Model):
    pass

class Play(models.Model):
    title = models.CharField()
    authors = models.ManyToManyField(Author)

I would expect Play.objects.annotate(Count('authors')) to generate

SELECT play.id, play.name, COUNT(author.id)
FROM play
LEFT JOIN play_authors ON (play_authors.play_id = play.id)
LEFT JOIN author ON (play_authors.author_id = author.id)
GROUP BY play.id

And by MySQL docs

MySQL implements detection of functional dependence. If the ONLY_FULL_GROUP_BY SQL mode is enabled (which it is by default), MySQL rejects queries for which the select list, HAVING condition, or ORDER BY list refer to nonaggregated columns that are neither named in the GROUP BY clause nor are functionally dependent on them.

So in this case play.name is functionally dependant on play.id (as it's the primary key of play) so if you're using a version of MySQL supported on Django 4.2 we'd definitely like to learn more about it as it's unexpected.

in reply to:  4 comment:8 by David Sanders, 12 months ago

Replying to Simon Charette:

  1. Revert the changes made in 041551d716b69ee7c81199eee86a2d10a72e15ab while making allows_group_by_pk based on the absence of ONLY_FULL_GROUP_BY. Note that this won't resolve the aggregation over the annotation of a dependant subquery but will restore the usage of RawSQL for aggregation on MySQL only when ONLY_FULL_GROUP_BY is disabled.
  2. 1 + adjustments to the allows_group_by_pk to special case dependant subquery annotations
  3. Adjust the 4.2 existing release notes about this change to better communicate that this version of Django removed support for doing RawSQL aggregations on MySQL and that they should use proper expressions instead going forward.

We could also have a user-definable attribute RawSQL.contains_aggregates though I think that's making things too complex.

Option 3 sounds good 👍

comment:9 by Mariusz Felisiak, 12 months ago

Agreed, let's document it.

comment:10 by Matthew Somerville, 12 months ago

So in this case play.name is functionally dependant on play.id (as it's the primary key of play) so if you're using a version of MySQL supported on Django 4.2 we'd definitely like to learn more about it as it's unexpected.

So it turns out the database is MariaDB (11.1.2), not MySQL, and MariaDB does not appear to include/have the functional dependency requirement that this is based on. I don't know if you'd like me to raise that as a separate ticket, if Django is supposed to support both entirely equally, but yes, it looks like the code will not work at all with MariaDB with ONLY_FULL_GROUP_BY turned on.

comment:11 by Simon Charette, 12 months ago

it looks like the code will not work at all with MariaDB with ONLY_FULL_GROUP_BY turned on.

That's good to know thanks for investigating further that's appreciated! I think it's worth having a separate ticket for it yes. The solution will likely be to turn off the allows_group_by_selected_pks feature on MariaDB when ONLY_FULL_GROUP_BY mode is turned on. Note that the allows_group_by_selected_pks feature is different from the allows_group_by_pk feature removed in 041551d716b69ee7c81199eee86a2d10a72e15ab.

comment:12 by Matthew Somerville, 12 months ago

Have opened #34992. Thanks :)

comment:13 by Mariusz Felisiak, 11 months ago

Has patch: set
Owner: changed from Simon Charette to Mariusz Felisiak

comment:14 by Natalia Bidart, 11 months ago

Triage Stage: AcceptedReady for checkin

comment:15 by Mariusz Felisiak <felisiak.mariusz@…>, 11 months ago

Resolution: fixed
Status: assignedclosed

In a652f07:

Fixed #34978, Refs #31331 -- Added backward incompatibility note about raw aggregations on MySQL.

Thanks Matthew Somerville for the report.

comment:16 by Natalia <124304+nessita@…>, 11 months ago

In cdb14cc1:

[4.2.x] Fixed #34978, Refs #31331 -- Added backward incompatibility note about raw aggregations on MySQL.

Thanks Matthew Somerville for the report.

Backport of a652f0759651dd7103ed04336ef85dc410f680c1 from main

comment:17 by Natalia <124304+nessita@…>, 11 months ago

In cbd1e91:

[5.0.x] Fixed #34978, Refs #31331 -- Added backward incompatibility note about raw aggregations on MySQL.

Thanks Matthew Somerville for the report.

Backport of a652f0759651dd7103ed04336ef85dc410f680c1 from main

Note: See TracTickets for help on using tickets.
Back to Top