#34978 closed Bug (fixed)
Annotating through an aggregate with RawSQL() raises 1056 "Can't group on" on MySQL/MariaDB.
Reported by: | Matthew Somerville | Owned by: | Mariusz Felisiak |
---|---|---|---|
Component: | Database layer (models, ORM) | Version: | 4.2 |
Severity: | Release blocker | Keywords: | |
Cc: | Simon Charette | Triage Stage: | Ready for checkin |
Has patch: | yes | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
I have some code I am trying to update from Django 3.2 to 4.2, using MariaDB, that has worked in previous Django versions, and works fine in 4.1, but fails in 4.2. You can see an example GitHub Action output at https://github.com/dracos/Theatricalia/actions/runs/6922955832 showing 3 and 4.1 passing, but 4.2 failing.
One problem query is the following code:
seen = user.visit_set.annotate(min_press_date=Min('production__place__press_date')).annotate(best_date=RawSQL('MIN(IFNULL(productions_place.press_date, IF(productions_place.end_date!="", productions_place.end_date, productions_place.start_date)))', ())).order_by('-best_date')
In Django 4.1, this produces the following SQL, which works fine:
SELECT `productions_visit`.`id`, `productions_visit`.`production_id`, `productions_visit`.`user_id`, `productions_visit`.`recommend`, `productions_visit`.`date`, MIN(`productions_place`.`press_date`) AS `min_press_date`, (MIN(IFNULL(productions_place.press_date, IF(productions_place.end_date!="", productions_place.end_date, productions_place.start_date)))) AS `best_date` FROM `productions_visit` INNER JOIN `productions_production` ON (`productions_visit`.`production_id` = `productions_production`.`id`) LEFT OUTER JOIN `productions_place` ON (`productions_production`.`id` = `productions_place`.`production_id`) WHERE `productions_visit`.`user_id` = 1 GROUP BY `productions_visit`.`id` ORDER BY `best_date` DESC
Whilst the SQL produced by Django 4.2 is:
SELECT `productions_visit`.`id`, `productions_visit`.`production_id`, `productions_visit`.`user_id`, `productions_visit`.`recommend`, `productions_visit`.`date`, MIN(`productions_place`.`press_date`) AS `min_press_date`, (MIN(IFNULL(productions_place.press_date, IF(productions_place.end_date!="", productions_place.end_date, productions_place.start_date)))) AS `best_date` FROM `productions_visit` INNER JOIN `productions_production` ON (`productions_visit`.`production_id` = `productions_production`.`id`) LEFT OUTER JOIN `productions_place` ON (`productions_production`.`id` = `productions_place`.`production_id`) WHERE `productions_visit`.`user_id` = 1 GROUP BY `productions_visit`.`id`, 7 ORDER BY 7 DESC LIMIT 21
It has added a group by on column 7 (which is best_date) and this then gives a "1056 Can't group by best_date" error from MySQL/MariaDB.
I have bisected Django between 4.1 and 4.2, and the problem was introduced by the fix for #31331 in 041551d716b69ee7c81199eee86a2d10a72e15ab. Somehow that fix means my annotation is now being included in the group by when it shouldn't be, as it's an aggregate per visit ID, as far as I understand. Let me know if you need any other details.
Change History (17)
comment:1 by , 12 months ago
Cc: | added |
---|---|
Severity: | Normal → Release blocker |
Summary: | Annotating through an aggregate with MySQL/MariaDB raises 1056 "Can't group on" error → Annotating through an aggregate with RawSQL() raises 1056 "Can't group on" on MySQL/MariaDB. |
Triage Stage: | Unreviewed → Accepted |
comment:2 by , 12 months ago
Failing test in case that helps:
--- a/tests/aggregation/tests.py +++ b/tests/aggregation/tests.py @@ -2135,6 +2135,39 @@ class AggregateTestCase(TestCase): ) self.assertEqual(list(author_qs), [337]) + def test_annotate_raw_expression(self): + qs = ( + Book.objects.values("publisher") + .annotate(min_price=Min("price")) + .annotate(max_price=RawSQL("MAX(price)", params=[])) + .values("publisher__name", "min_price", "max_price") + ) + self.assertEqual( + list(qs), + [ + { + "max_price": Decimal("30.00"), + "min_price": Decimal("29.69"), + "publisher__name": "Apress", + }, + { + "max_price": Decimal("23.09"), + "min_price": Decimal("23.09"), + "publisher__name": "Sams", + }, + { + "max_price": Decimal("82.80"), + "min_price": Decimal("29.69"), + "publisher__name": "Prentice Hall", + }, + { + "max_price": Decimal("75.00"), + "min_price": Decimal("75.00"), + "publisher__name": "Morgan Kaufmann", + }, + ], + ) + class AggregateAnnotationPruningTests(TestCase): @classmethod
This one's interesting because we can't dig into RawSQL to determine whether it's valid in a GROUP BY or not.
EDIT: Actually this test fails before 041551d716b69ee7c81199eee86a2d10a72e15ab. The difference is the values('publisher')
vs the OP's example of sorting by the raw expression. The latter includes the pk so it can be reduced, the former cannot.
comment:3 by , 12 months ago
Owner: | changed from | to
---|---|
Status: | new → assigned |
follow-ups: 5 8 comment:4 by , 12 months ago
See #26602 Provide a way to manage grouping with RawSQL which was closed because of a justified use case.
I would say that the same can be said about this ticket as the reported problem can be fixed in multiple ways with the provided ORM interfaces.
The most obvious and non-invasive one is to use Min(RawSQL(...))
instead
annotate( best_date=Min( RawSQL( 'IFNULL(productions_place.press_date, IF(productions_place.end_date!="", productions_place.end_date, productions_place.start_date))', (), ), ) )
a second one is to use expressions entirely
annotate(best_date=Min(Coalesce("press_date", Case(When(end_date="", then=F("start_date")), default=F("end_date"))))
a third one, assuming the user wants to stick to IF
and IFNULL
from django.db.models import Func class If(Func): function = "IF" class IfNull(Func): function = "IFNULL" annotate(best_date=Min( IfNull("press_date", If(end_date="", "start_date", "end_date")) ))
This one's interesting because we can't dig into RawSQL to determine whether it's valid in a GROUP BY or not.
That's right David, it's the crux of the issue. The ORM must choose between grouping or not by the expression and since most RawSQL
usage are assumed to not be aggregation it determines that it must do so as it's a blackbox from an introspection perspective.
The reason why 041551d716b69ee7c81199eee86a2d10a72e15ab broke the reported use that is that prior to this change the ORM supported a non-standard feature of MySQL disabled in recent versions that allowed grouping solely by the primary key of the first entry in FROM
. It's important to note that using RawSQL
to aggregate was only working on MySQL due to this feature and never worked on any of the other backends that follow the functional dependency detection in GROUP BY
clauses as specified by the SQL:1999
standard.
The nail in the coffin of this feature was went it was discovered that it had a peculiarity when dealing with subqueries #31331 that would have required a significant amount of work to get working.
I could see us go three ways about dealing with this issue
- Revert the changes made in 041551d716b69ee7c81199eee86a2d10a72e15ab while making
allows_group_by_pk
based on the absence ofONLY_FULL_GROUP_BY
. Note that this won't resolve the aggregation over the annotation of a dependant subquery but will restore the usage ofRawSQL
for aggregation on MySQL only whenONLY_FULL_GROUP_BY
is disabled. - 1 + adjustments to the
allows_group_by_pk
to special case dependant subquery annotations - Adjust the 4.2 existing release notes about this change to better communicate that this version of Django removed support for doing
RawSQL
aggregations on MySQL and that they should use proper expressions instead going forward.
Due to lack of demonstration that some aggregates or window function cannot be expressed using the advanced ORM primitives, that non-ONLY_FULL_GROUP_BY
model is a non standard MySQL feature that is not enabled by default since 8.0, and that this change happen to standardize the usage of RawSQL
for aggregations on all backends I'd be inclined to go with 3.
comment:5 by , 12 months ago
Replying to Simon Charette:
I could see us go three ways about dealing with this issue
- Revert the changes made in 041551d716b69ee7c81199eee86a2d10a72e15ab while making
allows_group_by_pk
based on the absence ofONLY_FULL_GROUP_BY
. Note that this won't resolve the aggregation over the annotation of a dependant subquery but will restore the usage ofRawSQL
for aggregation on MySQL only whenONLY_FULL_GROUP_BY
is disabled.- 1 + adjustments to the
allows_group_by_pk
to special case dependant subquery annotations- Adjust the 4.2 existing release notes about this change to better communicate that this version of Django removed support for doing
RawSQL
aggregations on MySQL and that they should use proper expressions instead going forward.
Given your rationale and considering that there are multiple workarounds as you proposed, I'm also in favor of option 3.
comment:6 by , 12 months ago
Given the same query does already use Min()
, as quoted, I am at a loss as to why I wasn't using it in the other part! Thanks for the investigation; an addition to the 4.2 release notes explaining the change further (I did read those, but only saw the reference to changes in "third-party database backends") would be welcome, thank you.
With the change to your first option¹, my code passes on Django 4.2 fine with ONLY_FULL_GROUP_BY
turned off, thank you; turning that option on I get a lot of 1055 errors, even with a query that's only e.g. Play.objects.annotate(Count('authors'))
, without any RawSQL, I get (1055, "'theatricalia.plays_play.title' isn't in GROUP BY")
, but assume that's my issue somehow.
¹ If you're interested, regarding your second/third code change options, press_date is a DateField but start_date/end_date are ApproximateDateFields from my https://pypi.org/project/django-date-extensions/ so that becomes a bit more complex.
comment:7 by , 12 months ago
I get a lot of 1055 errors, even with a query that's only e.g.
Play.objects.annotate(Count('authors'))
, without anyRawSQL
, I get(1055, "'theatricalia.plays_play.title' isn't in GROUP BY")
, but assume that's my issue somehow.
That's interesting. If it's happening for models of the form
class Author(models.Model): pass class Play(models.Model): title = models.CharField() authors = models.ManyToManyField(Author)
I would expect Play.objects.annotate(Count('authors'))
to generate
SELECT play.id, play.name, COUNT(author.id) FROM play LEFT JOIN play_authors ON (play_authors.play_id = play.id) LEFT JOIN author ON (play_authors.author_id = author.id) GROUP BY play.id
And by MySQL docs
MySQL implements detection of functional dependence. If the
ONLY_FULL_GROUP_BY
SQL mode is enabled (which it is by default), MySQL rejects queries for which the select list, HAVING condition, or ORDER BY list refer to nonaggregated columns that are neither named in theGROUP BY
clause nor are functionally dependent on them.
So in this case play.name
is functionally dependant on play.id
(as it's the primary key of play
) so if you're using a version of MySQL supported on Django 4.2 we'd definitely like to learn more about it as it's unexpected.
comment:8 by , 12 months ago
Replying to Simon Charette:
- Revert the changes made in 041551d716b69ee7c81199eee86a2d10a72e15ab while making
allows_group_by_pk
based on the absence ofONLY_FULL_GROUP_BY
. Note that this won't resolve the aggregation over the annotation of a dependant subquery but will restore the usage ofRawSQL
for aggregation on MySQL only whenONLY_FULL_GROUP_BY
is disabled.- 1 + adjustments to the
allows_group_by_pk
to special case dependant subquery annotations- Adjust the 4.2 existing release notes about this change to better communicate that this version of Django removed support for doing
RawSQL
aggregations on MySQL and that they should use proper expressions instead going forward.
We could also have a user-definable attribute RawSQL.contains_aggregates
though I think that's making things too complex.
Option 3 sounds good 👍
comment:10 by , 12 months ago
So in this case play.name is functionally dependant on play.id (as it's the primary key of play) so if you're using a version of MySQL supported on Django 4.2 we'd definitely like to learn more about it as it's unexpected.
So it turns out the database is MariaDB (11.1.2), not MySQL, and MariaDB does not appear to include/have the functional dependency requirement that this is based on. I don't know if you'd like me to raise that as a separate ticket, if Django is supposed to support both entirely equally, but yes, it looks like the code will not work at all with MariaDB with ONLY_FULL_GROUP_BY turned on.
comment:11 by , 12 months ago
it looks like the code will not work at all with MariaDB with ONLY_FULL_GROUP_BY turned on.
That's good to know thanks for investigating further that's appreciated! I think it's worth having a separate ticket for it yes. The solution will likely be to turn off the allows_group_by_selected_pks
feature on MariaDB when ONLY_FULL_GROUP_BY
mode is turned on. Note that the allows_group_by_selected_pks
feature is different from the allows_group_by_pk
feature removed in 041551d716b69ee7c81199eee86a2d10a72e15ab.
comment:14 by , 12 months ago
Triage Stage: | Accepted → Ready for checkin |
---|
Thanks for the report. Regression in 041551d716b69ee7c81199eee86a2d10a72e15ab.