Opened 2 years ago
Last modified 3 days ago
#34262 assigned Bug
Queryset grouped by annotation with aggregates on another annotated expression crashes on MySQL with sql_mode=only_full_group_by.
Reported by: | Mariusz Felisiak | Owned by: | ontowhee |
---|---|---|---|
Component: | Database layer (models, ORM) | Version: | 4.1 |
Severity: | Normal | Keywords: | mysql only_full_group_by |
Cc: | Simon Charette, David Wobrock | Triage Stage: | Accepted |
Has patch: | no | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
Queryset grouped by annotation with aggregates on another annotated expression crashed on MySQL with sql_mode=only_full_group_by
, e.g.
def test_group_by_nested_expression_with_params(self): books_qs = ( Book.objects.annotate(greatest_pages=Greatest("pages", Value(600))) .values( "greatest_pages", ) .annotate( min_pages=Min("pages"), least=Least("min_pages", "greatest_pages"), ) .values_list("least", flat=True) ) self.assertCountEqual(books_qs, [300, 946, 1132])
crashes with:
django.db.utils.OperationalError: (1055, "Expression #1 of SELECT list is not in GROUP BY clause and contains nonaggregated column 'test_django_2.aggregation_book.pages' which is not functionally dependent on columns in GROUP BY clause; this is incompatible with sql_mode=only_full_group_by")
Change History (15)
comment:1 by , 2 years ago
Cc: | added |
---|
comment:2 by , 2 years ago
Triage Stage: | Unreviewed → Accepted |
---|
comment:3 by , 2 years ago
Cc: | added |
---|
follow-up: 5 comment:4 by , 2 years ago
Thanks for the analysis David, option 1. and 2. were also the conclusion of my limited investigation on the subject so it's great to have cross peer validation on the subject.
I think there might be a way to implement 3. by reusing some of the logic used to implement masking of columns when filtering against window functions which requires two level of subquery wrapping.
A different of approaching 3. is to think that any form of masking of annotations/aliases used for grouping purposes would result in a subquery pushdown. So to reuse your example, instead of performing a subquery pushdown to compute expressions used with aggregate queries we'd do it the other way around to have MySQL group against top level SELECT
expressions which it's good at inferring dependencies from
SELECT LEAST(min_pages, greatest_pages) AS `least` FROM ( SELECT GREATEST(`aggregation_book`.`pages`, 600) greatest_pages, MIN(`aggregation_book`.`pages`) min_pages FROM `aggregation_book` GROUP BY 1 ORDER BY NULL ) masked
This should reduce the area of impact of this change to aggregating queries that group against a value that isn't explicitly selected and would also happen to solve the last remaining known issue with using server-side parameters binding for Postgres (#34255).
follow-up: 6 comment:5 by , 19 months ago
Replying to Simon Charette:
Thanks for the analysis David, option 1. and 2. were also the conclusion of my limited investigation on the subject so it's great to have cross peer validation on the subject.
I think there might be a way to implement 3. by reusing some of the logic used to implement masking of columns when filtering against window functions which requires two level of subquery wrapping.
A different of approaching 3. is to think that any form of masking of annotations/aliases used for grouping purposes would result in a subquery pushdown. So to reuse your example, instead of performing a subquery pushdown to compute expressions used with aggregate queries we'd do it the other way around to have MySQL group against top level
SELECT
expressions which it's good at inferring dependencies from
SELECT LEAST(min_pages, greatest_pages) AS `least` FROM ( SELECT GREATEST(`aggregation_book`.`pages`, 600) greatest_pages, MIN(`aggregation_book`.`pages`) min_pages FROM `aggregation_book` GROUP BY 1 ORDER BY NULL ) maskedThis should reduce the area of impact of this change to aggregating queries that group against a value that isn't explicitly selected and would also happen to solve the last remaining known issue with using server-side parameters binding for Postgres (#34255).
I'm curios to know what happened with this issue. Any updates?
comment:6 by , 19 months ago
Replying to Amir Karimi:
I'm curios to know what happened with this issue. Any updates?
Feel-free to work on this issue. Please don't leave comments like any updates?
they don't help and cause unnecessary noise to the history.
comment:7 by , 19 months ago
Owner: | changed from | to
---|---|
Status: | new → assigned |
comment:8 by , 18 months ago
I think @David Wobrock's query is easier to implement and covers more cases.
For example, if we have a followinfg queryset:
books_qs = ( Book.objects.annotate(greatest_pages=Greatest("pages", Value(600))) .values( "greatest_pages", ) .annotate( min_pages=Min("pages"), least=Least("min_pages", "greatest_pages"), ) )
Creating the following query that @David Wobrock presented seems like more sense to me and covers many other cases.
SELECT greatest_pages, MIN(pages), LEAST(MIN(pages), greatest_pages) AS least FROM (SELECT GREATEST(pages, 600) greatest_pages, pages FROM aggregation_book2) AS t GROUP BY greatest_pages ORDER BY NULL;
If we were to take @Simon Charette's query, it could be like this:
SELECT geatest_pages, min_pages, LEAST(min_pages, greatest_pages) AS `least` FROM ( SELECT GREATEST(`aggregation_book`.`pages`, 600) greatest_pages, MIN(`aggregation_book`.`pages`) min_pages FROM `aggregation_book` GROUP BY 1 ORDER BY NULL ) masked
I think the position of "MIN(aggregation_book
.pages
) min_pages" looks awkward.
With .values_list("least", flat=True)
clause present, there was a obvious reason for "MIN(aggregation_book
.pages
) min_pages" to be pushed down because it is a dependency for least
, but without .values_list("least", flat=True)
it loses it's reason to be pushed down.
I am a bit suspicious that choosing which additional item to be pushed down by looking at values_list
worth it's effort considering frequency of this use case is thought to be small.
comment:9 by , 15 months ago
Owner: | removed |
---|---|
Status: | assigned → new |
comment:10 by , 15 months ago
FWIW this relates to #34992 where we had to disable allows_group_by_selected_pks
on MariaDB entirely as it doesn't implement any form of functional dependence resolition.
comment:11 by , 12 months ago
A recent article on any_value and functional dependency if it can be of help to anyone working on this issue.
comment:12 by , 6 days ago
Is there an expression in Django for ANY_VALUE()? I did a quick search for "anyvalue” and “any.*value” but it did not come up with results. Would it be useful to support such an expression?
My thought is, option 2 seems to be the lowest effort and risk to implement (the other options can potentially be added in later if there is a good solution). It can raise an error and suggest that the user apply such an expression to the offending column. This way, django is not making an arbitrary decision on wrapping the column with ANY_VALUE(). I haven’t dived into what it would take to support such an expression, so this may be naive. Any thoughts here?
comment:13 by , 6 days ago
There's none but you can easily write your own and circumvent the problems MySQL exhibits
from django.db.models.aggregate import Aggregate class AnyValue(Aggregate): function = "ANY_VALUE"
Given MySQL, Postgres 16+, and Oracle 19c+ support it it might be worth considering adding it to core and documenting that it must be used under some circumstances on MySQL? It can also be useful under other circumstances on Postgres and Oracle which adhere more closely to the spec regarding GROUP BY
rules.
comment:14 by , 5 days ago
Owner: | set to |
---|---|
Status: | new → assigned |
Should a new ticket be created for adding support for ANY_VALUE()? For MySQL, since ANY_VALUE() is not an aggregate function, would it be implemented as an Aggregate or Func or a different expression type?
Just thinking out loud here. If option 2 is the path forward,
- Would the warning be raised before the query is evaluated? That means the query needs to detect if there are nonaggregated expressions. I'll need to dig more to understand how that might work.
- Or, raise the warning after the query is evaluated by catching the OperationalError in the CursorWrapper?
codes_for_warnings = ( 1055, # Expression not in GROUP BY and contains nonaggregated colum ) ... def execute(self, query, args=None): try: # args is None means no string interpolation return self.cursor.execute(query, args) except Database.OperationalError as e: # Map some error codes to IntegrityError, since they seem to be # misclassified and Django would prefer the more logical place. if e.args[0] in self.codes_for_integrityerror: raise IntegrityError(*tuple(e.args)) else if e.args[0] in self.codes_for_warnings: warnings.warn( "%s " "Consider wrapping the nonaggregated expression using AnyValue." %(e), RuntimeWarning, ) return raise
I'm going to dig around and see if the warning can be raised before the query is evaluated. I might start looking through the resolve_expression() functions, since there seem to be patterns of raising errors there.
comment:15 by , 3 days ago
Should a new ticket be created for adding support for ANY_VALUE()?
I think this that this ticket can be re-purposed for the introduction of AnyValue(Aggregate)
given that the reported problem here is that MySQL query planer isn't smart enough to transitively infer the presence of members in the GROUP BY
clause and that explicit usage of ANY_VALUE
solves that.
For MySQL, since ANY_VALUE() is not an aggregate function, would it be implemented as an Aggregate or Func or a different expression type?
It's effectively not an aggregate function per-se on MySQL but more of a sentinel to tell the query planner to ignore the only_full_group_by
check only for a single expression but in the Django sense it must be defined as an Aggregate
subclass otherwise it will be included in the GROUP BY
clause. In other words, it should be implemented as an Aggregate
subclass even if MySQL doesn't follow the SQL spec by clearly defining what kind of function ANY_VALUE
is.
Just thinking out loud here. If option 2 is the path forward,
I think that raising awareness is the way to go here but I don't think that the proposed implementation of capturing errors and emitting warning is how we should do it. First because it would be brittle given we don't know the extent of this problem and the MySQL implementation could change and secondly because it's not an approach we've taken with this class of problems.
IMO this problem is very similar to how we let inappropriate casting errors bubble through and expect users to use CAST
where necessary from the error messages over trying to detect such errors and point them directly at django.db.models.functions.Cast
. In other words I think that it's better to have the user go through the following chain of thoughts
- Encounter type /
only_full_group_by
error - Do their research on the subject and learn about
CAST
/ANY_VALUE
- Search for Django
CAST
/ANY_VALUE
and endup on theCast
/AnyValue
docs
than committing to building a bullet proof solution that hides away these details from users. For all we know MySQL could finally implement functional dependency detection properly and all of our efforts (and bugs trying to get it right but failing) would be wasted.
For these reasons I believe that the proper way of raising awareness here is to
- Introduce an
AnyValue(Aggregate)
- Make sure the documentation mentions that its usage might be necessary on MySQL when mixing aggregate and non-aggregate functions when
sql_mode=only_full_group_by
in a.. note
- Link to
AnyValue
documentation from the aggregating annotations section of the docs.
Hey there,
Took a look at what is happening and why MySQL is failing with
ONLY_FULL_GROUP_BY
.In short and simplified, this statement works:
And when you try to add a third expression, that uses the two first:
MySQL is not happy, even though it seems rather straightforward that this query should work.
The resources I could find on the topic are here:
And the blog post explains in more depth why it's not working.
That leaves us with a choice to make for Django's behavior I reckon :)
Some options:
1. Add an
ANY_VALUE
around the problematic expressionThat solves the issue here for instance:
However, I fear that detecting when to wrap the expression with an
ANY_VALUE
is a rabbit hole we don't want to go down, as we might end up implementing what the MySQL team didn't want to implement.2. Raise awareness
We could, firstly, document the potential issue, and secondly raise a warning when such an error occurs when executing a query a Django.
That way, users are at least aware that's not entirely their or Django's fault.
3. Generally change query generation
Another type of workaround suggested by the MySQL blog post is to use a subquery/derived table:
So that we always try to group on a column, and not an expression.
Even though, it might be worse in terms of performances, depending the DB implementation I guess.
This change would then affect all databases I reckon, which is a much larger change, and therefore riskier.
4. Any other option! :D
I hope all of this makes sense, happy to read any thoughts on this :)
See ya!