Context Navigation

← Previous Ticket
Next Ticket →

#34262 closed Bug (fixed)

Queryset grouped by annotation with aggregates on another annotated expression crashes on MySQL with sql_mode=only_full_group_by.

Reported by:	Mariusz Felisiak	Owned by:	ontowhee
Component:	Database layer (models, ORM)	Version:	4.1
Severity:	Normal	Keywords:	mysql only_full_group_by
Cc:	Simon Charette, David Wobrock	Triage Stage:	Ready for checkin
Has patch:	yes	Needs documentation:	no
Needs tests:	no	Patch needs improvement:	no
Easy pickings:	no	UI/UX:	no

Description

Queryset grouped by annotation with aggregates on another annotated expression crashed on MySQL with sql_mode=only_full_group_by, e.g.

    def test_group_by_nested_expression_with_params(self):
        books_qs = (
            Book.objects.annotate(greatest_pages=Greatest("pages", Value(600)))
            .values(
                "greatest_pages",
            )
            .annotate(
                min_pages=Min("pages"),
                least=Least("min_pages", "greatest_pages"),
            )
            .values_list("least", flat=True)
        )
        self.assertCountEqual(books_qs, [300, 946, 1132])

crashes with:

django.db.utils.OperationalError: (1055, "Expression #1 of SELECT list is not in GROUP BY clause and contains nonaggregated column 'test_django_2.aggregation_book.pages' which is not functionally dependent on columns in GROUP BY clause; this is incompatible with sql_mode=only_full_group_by")

Change History (25)

comment:1 by Mariusz Felisiak, 3 years ago

Cc:	Simon Charette added

comment:2 by Carlton Gibson, 3 years ago

Triage Stage:	Unreviewed → Accepted

comment:3 by David Wobrock, 3 years ago

Cc:	David Wobrock added

Hey there,

Took a look at what is happening and why MySQL is failing with ONLY_FULL_GROUP_BY.

In short and simplified, this statement works:

mysql> SELECT GREATEST(pages, 600), MIN(pages) FROM aggregation_book GROUP BY GREATEST(pages, 600) ORDER BY NULL;
+----------------------+------------+
| GREATEST(pages, 600) | MIN(pages) |
+----------------------+------------+
|                  600 |        300 |
|                 1132 |       1132 |
|                  946 |        946 |
+----------------------+------------+
3 rows in set (0,01 sec)

And when you try to add a third expression, that uses the two first:

mysql> SELECT GREATEST(pages, 600), MIN(pages), LEAST(MIN(pages), GREATEST(pages, 600)) AS least FROM aggregation_book GROUP BY GREATEST(pages, 600) ORDER BY NULL;
ERROR 1055 (42000): Expression #3 of SELECT list is not in GROUP BY clause and contains nonaggregated column 'test_django_tests.aggregation_book2.pages' which is not functionally dependent on columns in GROUP BY clause; this is incompatible with sql_mode=only_full_group_by

MySQL is not happy, even though it seems rather straightforward that this query should work.

The resources I could find on the topic are here:

And the blog post explains in more depth why it's not working.

That leaves us with a choice to make for Django's behavior I reckon :)
Some options:

1. Add an `ANY_VALUE` around the problematic expression

That solves the issue here for instance:

mysql> SELECT GREATEST(pages, 600), MIN(pages), ANY_VALUE(LEAST(MIN(pages), GREATEST(pages, 600))) AS least FROM aggregation_book2 GROUP BY GREATEST(pages, 600) ORDER BY NULL;
+----------------------+------------+-------+
| GREATEST(pages, 600) | MIN(pages) | least |
+----------------------+------------+-------+
|                  600 |        300 |   300 |
|                 1132 |       1132 |  1132 |
|                  946 |        946 |   946 |
+----------------------+------------+-------+
3 rows in set (0,00 sec)

However, I fear that detecting when to wrap the expression with an ANY_VALUE is a rabbit hole we don't want to go down, as we might end up implementing what the MySQL team didn't want to implement.

2. Raise awareness

We could, firstly, document the potential issue, and secondly raise a warning when such an error occurs when executing a query a Django.
That way, users are at least aware that's not entirely their or Django's fault.

3. Generally change query generation

Another type of workaround suggested by the MySQL blog post is to use a subquery/derived table:

mysql> SELECT greatest_pages,
           MIN(pages),
           LEAST(MIN(pages), greatest_pages) AS least
    FROM (SELECT GREATEST(pages, 600) greatest_pages,
                 pages
          FROM aggregation_book2) AS t
    GROUP BY greatest_pages
    ORDER BY NULL;
+----------------+------------+-------+
| greatest_pages | MIN(pages) | least |
+----------------+------------+-------+
|            600 |        300 |   300 |
|           1132 |       1132 |  1132 |
|            946 |        946 |   946 |
+----------------+------------+-------+
3 rows in set (0,00 sec)

So that we always try to group on a column, and not an expression.
Even though, it might be worse in terms of performances, depending the DB implementation I guess.

This change would then affect all databases I reckon, which is a much larger change, and therefore riskier.

4. Any other option! :D

I hope all of this makes sense, happy to read any thoughts on this :)
See ya!

follow-up: 5 comment:4 by Simon Charette, 3 years ago

Thanks for the analysis David, option 1. and 2. were also the conclusion of my limited investigation on the subject so it's great to have cross peer validation on the subject.

I think there might be a way to implement 3. by reusing some of the logic used to implement masking of columns when filtering against window functions which requires two level of subquery wrapping.

A different of approaching 3. is to think that any form of masking of annotations/aliases used for grouping purposes would result in a subquery pushdown. So to reuse your example, instead of performing a subquery pushdown to compute expressions used with aggregate queries we'd do it the other way around to have MySQL group against top level SELECT expressions which it's good at inferring dependencies from

SELECT LEAST(min_pages, greatest_pages) AS `least` FROM (
  SELECT
    GREATEST(`aggregation_book`.`pages`, 600) greatest_pages,
    MIN(`aggregation_book`.`pages`) min_pages
  FROM `aggregation_book`
  GROUP BY 1 ORDER BY NULL
) masked

This should reduce the area of impact of this change to aggregating queries that group against a value that isn't explicitly selected and would also happen to solve the last remaining known issue with using server-side parameters binding for Postgres (#34255).

in reply to: 4 ; follow-up: 6 comment:5 by Amir Karimi, 2 years ago

Replying to Simon Charette:

Thanks for the analysis David, option 1. and 2. were also the conclusion of my limited investigation on the subject so it's great to have cross peer validation on the subject.

I think there might be a way to implement 3. by reusing some of the logic used to implement masking of columns when filtering against window functions which requires two level of subquery wrapping.

A different of approaching 3. is to think that any form of masking of annotations/aliases used for grouping purposes would result in a subquery pushdown. So to reuse your example, instead of performing a subquery pushdown to compute expressions used with aggregate queries we'd do it the other way around to have MySQL group against top level SELECT expressions which it's good at inferring dependencies from
SELECT LEAST(min_pages, greatest_pages) AS `least` FROM (
  SELECT
    GREATEST(`aggregation_book`.`pages`, 600) greatest_pages,
    MIN(`aggregation_book`.`pages`) min_pages
  FROM `aggregation_book`
  GROUP BY 1 ORDER BY NULL
) masked
This should reduce the area of impact of this change to aggregating queries that group against a value that isn't explicitly selected and would also happen to solve the last remaining known issue with using server-side parameters binding for Postgres (#34255).

I'm curios to know what happened with this issue. Any updates?

in reply to: 5 comment:6 by Mariusz Felisiak, 2 years ago

Replying to Amir Karimi:

I'm curios to know what happened with this issue. Any updates?

Feel-free to work on this issue. Please don't leave comments like any updates? they don't help and cause unnecessary noise to the history.

comment:7 by Jonny Park, 2 years ago

Owner:	changed from nobody to Jonny Park
Status:	new → assigned

comment:8 by Jonny Park, 2 years ago

I think @David Wobrock's query is easier to implement and covers more cases.

For example, if we have a followinfg queryset:

        books_qs = (
            Book.objects.annotate(greatest_pages=Greatest("pages", Value(600)))
            .values(
                "greatest_pages",
            )
            .annotate(
                min_pages=Min("pages"),
                least=Least("min_pages", "greatest_pages"),
            )
        )

Creating the following query that @David Wobrock presented seems like more sense to me and covers many other cases.

SELECT greatest_pages,
           MIN(pages),
           LEAST(MIN(pages), greatest_pages) AS least
    FROM (SELECT GREATEST(pages, 600) greatest_pages,
                 pages
          FROM aggregation_book2) AS t
    GROUP BY greatest_pages
    ORDER BY NULL;

If we were to take @Simon Charette's query, it could be like this:

SELECT geatest_pages, min_pages, LEAST(min_pages, greatest_pages) AS `least` FROM (
  SELECT
    GREATEST(`aggregation_book`.`pages`, 600) greatest_pages,
    MIN(`aggregation_book`.`pages`) min_pages
  FROM `aggregation_book`
  GROUP BY 1 ORDER BY NULL
) masked

I think the position of "MIN(aggregation_book.pages) min_pages" looks awkward.
With .values_list("least", flat=True) clause present, there was a obvious reason for "MIN(aggregation_book.pages) min_pages" to be pushed down because it is a dependency for least, but without .values_list("least", flat=True) it loses it's reason to be pushed down.
I am a bit suspicious that choosing which additional item to be pushed down by looking at values_list worth it's effort considering frequency of this use case is thought to be small.

Last edited 2 years ago by Jonny Park (previous) (diff)

comment:9 by Jonny Park, 2 years ago

Owner:	Jonny Park removed
Status:	assigned → new

comment:10 by Simon Charette, 2 years ago

FWIW this relates to #34992 where we had to disable allows_group_by_selected_pks on MariaDB entirely as it doesn't implement any form of functional dependence resolition.

comment:11 by Simon Charette, 22 months ago

A recent article on any_value and functional dependency if it can be of help to anyone working on this issue.

comment:12 by ontowhee, 11 months ago

Is there an expression in Django for ANY_VALUE()? I did a quick search for "anyvalue” and “any.*value” but it did not come up with results. Would it be useful to support such an expression?

My thought is, option 2 seems to be the lowest effort and risk to implement (the other options can potentially be added in later if there is a good solution). It can raise an error and suggest that the user apply such an expression to the offending column. This way, django is not making an arbitrary decision on wrapping the column with ANY_VALUE(). I haven’t dived into what it would take to support such an expression, so this may be naive. Any thoughts here?

comment:13 by Simon Charette, 11 months ago

There's none but you can easily write your own and circumvent the problems MySQL exhibits

from django.db.models.aggregate import Aggregate

class AnyValue(Aggregate):
    function = "ANY_VALUE"

Given MySQL, Postgres 16+, and Oracle 19c+ support it it might be worth considering adding it to core and documenting that it must be used under some circumstances on MySQL? It can also be useful under other circumstances on Postgres and Oracle which adhere more closely to the spec regarding GROUP BY rules.

Last edited 11 months ago by Simon Charette (previous) (diff)

comment:14 by ontowhee, 11 months ago

Owner:	set to ontowhee
Status:	new → assigned

Should a new ticket be created for adding support for ANY_VALUE()? For MySQL, since ANY_VALUE() is not an aggregate function, would it be implemented as an Aggregate or Func or a different expression type?

Just thinking out loud here. If option 2 is the path forward,

Would the warning be raised before the query is evaluated? That means the query needs to detect if there are nonaggregated expressions. I'll need to dig more to understand how that might work.
Or, raise the warning after the query is evaluated by catching the OperationalError in the CursorWrapper?

    codes_for_warnings = (
        1055,  # Expression not in GROUP BY and contains nonaggregated colum
    )
    ...

    def execute(self, query, args=None):
        try:
            # args is None means no string interpolation
            return self.cursor.execute(query, args)
        except Database.OperationalError as e:
            # Map some error codes to IntegrityError, since they seem to be
            # misclassified and Django would prefer the more logical place.
            if e.args[0] in self.codes_for_integrityerror:
                raise IntegrityError(*tuple(e.args))
            else if e.args[0] in self.codes_for_warnings:
                warnings.warn(
                    "%s "
                    "Consider wrapping the nonaggregated expression using AnyValue." %(e),
                    RuntimeWarning,
                )
                return
            raise

I'm going to dig around and see if the warning can be raised before the query is evaluated. I might start looking through the resolve_expression() functions, since there seem to be patterns of raising errors there.

comment:15 by Simon Charette, 11 months ago

Should a new ticket be created for adding support for ANY_VALUE()?

I think this that this ticket can be re-purposed for the introduction of AnyValue(Aggregate) given that the reported problem here is that MySQL query planer isn't smart enough to transitively infer the presence of members in the GROUP BY clause and that explicit usage of ANY_VALUE solves that.

For MySQL, since ANY_VALUE() is not an aggregate function, would it be implemented as an Aggregate or Func or a different expression type?

It's effectively not an aggregate function per-se on MySQL but more of a sentinel to tell the query planner to ignore the only_full_group_by check only for a single expression but in the Django sense it must be defined as an Aggregate subclass otherwise it will be included in the GROUP BY clause. In other words, it should be implemented as an Aggregate subclass even if MySQL doesn't follow the SQL spec by clearly defining what kind of function ANY_VALUE is.

Just thinking out loud here. If option 2 is the path forward,

I think that raising awareness is the way to go here but I don't think that the proposed implementation of capturing errors and emitting warning is the way to go here. First because it's brittle given we don't know the extent of this problem and the MySQL implementation could change and secondly because it's not an approach we've taken with this class of problems.

IMO this problem is very similar to how we let inappropriate casting errors bubble through and expect users to use CAST where necessary from the error messages over trying to detect such errors and point them directly at django.db.models.functions.Cast. In other words I think that it's better to have the user go through the following chain of thoughts

Encounter type / only_full_group_by error
Do their research on the subject and learn about CAST / ANY_VALUE
Search for Django CAST / ANY_VALUE and endup on the Cast / AnyValue docs

than committing to building a bullet proof solution that hides away these details from users. For all we know MySQL could finally implement functional dependency detection properly and all of our efforts (and bugs trying to get it right but failing) would be wasted efforts.

For these reasons I believe that raising awareness is effectively the way to go but I think the way to do so is

Introduce an AnyValue(Aggregate)
Make sure the documentation mentions that its usage might be necessary on MySQL when mixing aggregate and non-aggregate functions when sql_mode=only_full_group_by in a .. note
Link to AnyValue documentation from the aggregating annotations section of the docs

Version 1, edited 11 months ago by Simon Charette (previous) (next) (diff)

comment:16 by ontowhee, 10 months ago

Thank you so much for the detailed clarifications! It helps me understand the context. The chain of thoughts for the user that you outlined makes a lot of sense, and it is what I would prefer in my own workflow, too.

I have WIP in this branch. I’ve added AnyValue, along with some tests.

One thing I'm not sure about is, in MySQL, the expression ANY_VALUE(LEAST(MIN(pages), GREATEST(pages, 600))) is valid, but creating the expression in django with AnyValue(Least("min_pages", "greatest_page")) leads to the following error during the resolving phase,

Cannot compute AnyValue(’Least(F(min_pages), F(greatest_pages))’): ‘Least(F(min_pages), F(greatest_pages))’ is an aggregate

For Postgresql and Oracle, it makes sense to raise this error, because ANY_VALUE is considered an aggregate function, and nested aggregates are not allowed. However, for MySQL, would it make sense to not raise the error? Would that mean changing how resolve_expression is checking the contains_aggregate attribute? The resolving phase is backend agnostic. Trying to catch it in the compilation phase in as_mysql() would be too late, because the error would have already been raise. I don't have a clear path forward at this point, but will dig around and think about this more. Any ideas and suggestions are welcomed, of course!

I'm also a bit stuck on understanding how to write the SQL statement for ANY_VALUE(…) FILTER(…) OVER(…). I was running statements directly in MySQL to better understand the different combinations of the FILTER, OVER, GROUP BY clauses, trying add support according to the chart in the article, but I'm getting syntax errors.

Here is a simple example I was running,

CREATE TABLE person(id int, first_name varchar(255), dob date);
INSERT INTO person(id, first_name, dob)
VALUES(1, 'alice', '2025-01-01'),
(2, 'alice', '2024-02-02'),
(3, 'alice', '2025-03-03'),
(4, 'bob', '2025-04-04'),
(5, 'charlie', '2025-05-05');

These statement worked in MySQL.

SELECT ANY_VALUE(id) FROM person;
SELECT ANY_VALUE(id) FROM person GROUP BY first_name;

These statements did not work. They returned a syntax error. I expected the last one, ANY_VALUE(..) FILTER(...) OVER(...), to be valid, according to the chart, but was surprised to see the syntax error. All the statements work in Postgresql.

SELECT ANY_VALUE(id) OVER() FROM person;
SELECT ANY_VALUE(id) FILTER(WHERE id > 1) FROM person;
SELECT ANY_VALUE(id) FILTER(WHERE id > 1) OVER() FROM person;
SELECT ANY_VALUE(id) FILTER(WHERE id > 1) OVER(partition by first_name) FROM person;

MySQL does not support FILTER clause, and django will emulate using CASE instead. I wonder if the chart in the article is accurate, or if the FILTER clause in ANY_VALUE is a special case.

If anyone can see what I'm doing wrong here, and give an example of a valid statement with ANY_VALUE(..) FILTER(...) OVER(...), that would be helpful!

comment:17 by ontowhee, 10 months ago

Draft PR

I opened a draft PR. There are some CI test failures, but I would love to get someone's opinion on to make sure this is going in the right direction. This table shows what is implemented. It strays a bit from the chart:

+-----------------------------+---------------+------------+------------+
|                             | MySQL         | Oracle     | PostgreSQL |
+-----------------------------+---------------+------------+------------+
| ANY_VALUE()                 | Yes           | Yes        | Yes        |
| ANY_VALUE() FILTER()        | No, uses CASE | No         | Yes        |
| ANY_VALUE() OVER()          | No            | Yes        | Yes        |
| ANY_VALUE() FILTER() OVER() | No, uses CASE | No         | Yes        |
+-----------------------------+---------------+------------+------------+

The over clause has not been implemented for AnyValue(). The tests are using the Window function and partition_by parameter to achieve this.
- Question: Is this a good approach? Or should AnyValue() implement over clause?
The filter clause is leveraging the existing filter implementation for expressions. For MySQL, this means it creates CASE() instead of FILTER(). MySQL does not support filter clause.
The chart in the article shows a lightning bolt MySQL. I may be misunderstanding what that means, but I have not been able to directly run a “ANY_VALUE() FILTER() OVER()” expression. I tried to keep the scope of this ticket simple by not addressing this one for now.
For the specific case described in this ticket, instead of creating the django expression AnyValue(Least("min_pages", "greatest_page")) , the tests are creating Least("min_pages", AnyValue("greatest_page")). It wraps AnyValue() on "greatest_page" instead of the entire expression. This avoids the error that is raised Cannot compute ... is an aggregate error in resolve_expression, and it seems to make sense now that all the columns have an "aggregate" function operating on them.

Last edited 10 months ago by ontowhee (previous) (diff)

comment:18 by Simon Charette, 9 months ago

Has patch:	set
Patch needs improvement:	set

I left some comments on the PR but this is looking promising. I provided a way to make it work for SQLite as well.

Question: Is this a good approach? Or should AnyValue() implement over clause?

Given SQLite doesn't support it and the FirstValue window-compatible function can be used instead I think it might be better to set AnyValue.window_compatible = False for now and revisit in the future if truly needed.

The filter clause is leveraging the existing filter implementation for expressions. For MySQL, this means it creates CASE() instead of FILTER().

That's expected just like with any other aggregates. Is this causing issues with NULL values generated by members that should be filtered out?

The chart in the article shows a lightning bolt MySQL. I may be misunderstanding what that means, but I have not been able to directly run a “ANY_VALUE() FILTER() OVER()” expression. I tried to keep the scope of this ticket simple by not addressing this one for now.

The issue goes away if we don't initially support its usage in window functions which I think is okay for now.

This avoids the error that is raised Cannot compute ... is an aggregate error in resolve_expression, and it seems to make sense now that all the columns have an "aggregate" function operating on them.

That makes sense to me as well.

Last edited 9 months ago by Simon Charette (previous) (diff)

comment:19 by ontowhee, 9 months ago

Patch needs improvement:	unset

I've pushed changes to the PR that reflect this table, and updated the table to reflect what is supported. Does this look good?

+-----------------------------+--------+---------------+------------+------------+
|                             | SQLite | MySQL         | Oracle     | PostgreSQL |
+-----------------------------+--------+---------------+------------+------------+
| ANY_VALUE()                 | Yes    | Yes           | Yes        | Yes        |
| ANY_VALUE() FILTER()        | Yes    | No [*]        | No         | Yes        |
| ANY_VALUE() OVER()          | No     | No            | No         | No         |
| ANY_VALUE() FILTER() OVER() | No     | No            | No         | No         |
+-----------------------------+--------+---------------+------------+------------+

[*] MySQL does not seem to support this, but Django will use CASE if the filter clause is present.

The filter clause is leveraging the existing filter implementation for expressions. For MySQL, this means it creates CASE() instead of FILTER().

That's expected just like with any other aggregates. Is this causing issues with NULL values generated by members that should be filtered out?

I think I initially wrote the comment to clarify to myself how Django is handling FILTER, even though I was observing that running ANY_VALUE() FILTER() directly in the db would raise a syntax error.

With CASE, it does return NULL values. I don't know if that is an issue, because ANY_VALUE can return NULL for MySQL. For example:

CREATE TABLE person(id int, first_name varchar(255), dob date);

INSERT INTO person(id, first_name, dob)
VALUES(1, null, '2025-01-01'),
(2, 'alice', '2024-01-01'),
(3, 'alice', '2025-02-02'),
(4, 'bob', '2025-04-04'),
(5, 'charlie', '2025-05-05');

SELECT ANY_VALUE(first_name), EXTRACT(YEAR FROM dob) FROM person GROUP BY EXTRACT(YEAR FROM dob);

For me, this returns (NULL, 2025), ('alice', 2024).

comment:20 by Simon Charette, 9 months ago

Patch needs improvement:	set

With CASE, it does return NULL values. I don't know if that is an issue, because ANY_VALUE can return NULL for MySQL.

I think that's fine given an explicit COALESCE could be used if needs be. I suspect AnyValue will end up mostly being useful to assist MySQL functional dependency resolver and rarely with filter anyway.

Left some comments about the mention of support for SQLite in the release notes as well as supported Tim's request for slight tweaks of docs in MySQL specific parts. It also seems like ANY_VALUE can be used on MySQL even when ONLY_FULL_GROUP_BY is disabled.

comment:21 by ontowhee, 9 months ago

Patch needs improvement:	unset

comment:22 by Sarah Boyce, 7 months ago

Patch needs improvement:	set

comment:23 by ontowhee, 7 months ago

Patch needs improvement:	unset

comment:24 by Sarah Boyce, 7 months ago

Triage Stage:	Accepted → Ready for checkin

comment:25 by Sarah Boyce <42296566+sarahboyce@…>, 7 months ago

Resolution:	→ fixed
Status:	assigned → closed

In ddb8529:

Fixed #34262 -- Added support for AnyValue for SQLite, MySQL, Oracle, and Postgresql 16+.

Thanks Simon Charette for the guidance and review. Thanks Tim Schilling for the
documentation review. Thanks David Wobrock for investigation and solution proposals.

Note: See TracTickets for help on using tickets.

Download in other formats:

Issues