#33768 closed Bug (fixed)
MySQL ordering of nulls last/first is broken in combination with UNION
Reported by: | Florian Apolloner | Owned by: | Simon Charette |
---|---|---|---|
Component: | Database layer (models, ORM) | Version: | dev |
Severity: | Normal | Keywords: | |
Cc: | Adam Johnson, Simon Charette | Triage Stage: | Ready for checkin |
Has patch: | yes | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description (last modified by )
putting the following into test_qs_combinators.py:
def test_ordering_with_null_first(self): Number.objects.filter(other_num=5).update(other_num=None) qs = ( Number.objects.filter(num__lte=1) .union(Number.objects.filter(num__gte=2)) .order_by(F("other_num").asc(nulls_first=True)) .values_list("other_num", flat=True) ) self.assertQuerysetEqual(qs, [None, 1, 2, 3, 4, 6, 7, 8, 9, 10]) qs = ( Number.objects.filter(num__lte=1) .union(Number.objects.filter(num__gte=2)) .order_by(F("other_num").asc(nulls_last=True)) .values_list("other_num", flat=True) ) self.assertQuerysetEqual(qs, [1, 2, 3, 4, 6, 7, 8, 9, 10, None])
results in:
====================================================================== FAIL: test_ordering_with_null_first (queries.test_qs_combinators.QuerySetSetOperationTests) ---------------------------------------------------------------------- Traceback (most recent call last): File "/home/florian/sources/django.git/tests/queries/test_qs_combinators.py", line 171, in test_ordering_with_null_first self.assertQuerysetEqual(qs, [1, 2, 3, 4, 6, 7, 8, 9, 10, None]) File "/home/florian/sources/django.git/django/test/testcases.py", line 1328, in assertQuerysetEqual return self.assertEqual(list(items), values, msg=msg) AssertionError: Lists differ: [None, 1, 2, 3, 4, 6, 7, 8, 9, 10] != [1, 2, 3, 4, 6, 7, 8, 9, 10, None] First differing element 0: None 1 - [None, 1, 2, 3, 4, 6, 7, 8, 9, 10] + [1, 2, 3, 4, 6, 7, 8, 9, 10, None] ---------------------------------------------------------------------- (0.000) UPDATE `queries_number` SET `other_num` = NULL WHERE `queries_number`.`other_num` = 5; args=(5,); alias=default (0.000) (SELECT `queries_number`.`other_num` FROM `queries_number` WHERE `queries_number`.`num` <= 1) UNION (SELECT `queries_number`.`other_num` FROM `queries_number` WHERE `queries_number`.`num` >= 2) ORDER BY (1) ASC; args=(1, 2); alias=default (0.000) (SELECT `queries_number`.`other_num` FROM `queries_number` WHERE `queries_number`.`num` <= 1) UNION (SELECT `queries_number`.`other_num` FROM `queries_number` WHERE `queries_number`.`num` >= 2) ORDER BY (1) IS NULL, (1) ASC; args=(1, 2); alias=default ----------------------------------------------------------------------
Due to the UNION
we are rewriting queries to use column numbers so we can somewhat reliably target the columns. This breaks down when nulls_first/last
is used because (1) IS NULL
is interpreted as expression in mysql and not as the colname 1 being NULL. Which brings me back to https://twitter.com/fapolloner/status/1533512493208936450 -- I think that the rewriting of this expression should happen in the compiler and not in OrderBy
itself. This way we could duplicate the OrderBy
into two and push them into the relevant select clauses like we already do in https://github.com/django/django/blob/49b470b9187b6be60e573fed08c8f4a87f133750/django/db/models/sql/compiler.py#L410 for other things.
Sadly it is a mess… Thoughts? Paging Adam since he was the resident MySQL specialist IIRC :)
Change History (11)
comment:1 by , 2 years ago
Description: | modified (diff) |
---|
comment:2 by , 2 years ago
Cc: | added |
---|
comment:3 by , 2 years ago
Triage Stage: | Unreviewed → Accepted |
---|
comment:4 by , 2 years ago
What if we performed a subquery pushdown instead to make compilation more predictable so we can have the order by clause refer to the union columns by name?
Maybe, in hindsight that would probably should have been the way to implement unions (etc) in the first place. Maybe now is the time to switch to that :) What does a subquery mean in terms of performance -- I'd hope the database wouldn't care to much…
I don't see
OrderBy.as_sql
usage of template for compilation is an abuse at all; it's in charge of it's own compilation unit.
That is fair, but then again it still requires help from the outside because we have to pull it into the select clause when used with distinct. Which then results in horrible queries like the one outlined by Tim in https://github.com/django/django/pull/15687#issuecomment-1133798092 (see "The old SQL") where we have the initial select value and then once with IS NULL
and once without because we can't match it up to the initial select value that is aliased to last_date
anymore. In practice part of this is caused by https://github.com/django/django/blob/49b470b9187b6be60e573fed08c8f4a87f133750/django/db/models/sql/compiler.py#L466 where we use a regex to parse the generated SQL from OrderBy
and match it back to the select clause, which then obviously fails because the compilation of OrderBy
generated two clauses. Imo this is all rather fragile…
Also, and this is where #33767 comes back into play, we should try as hard as possible to reuse aliases where possible. Tim's old SQL example executed on PG nowadays (his is CockroachDB) looks like this:
SELECT DISTINCT "ordering_author"."id", "ordering_author"."name", "ordering_author"."editor_id", (SELECT Max(U0."pub_date") AS "last_date" FROM "ordering_article" U0 WHERE ( U0."author_id" = ( "ordering_author"."id" ) AND Upper(U0."headline" :: text) LIKE Upper( '%Article%') ) GROUP BY U0."author_id") AS "last_date", FROM "ordering_author" ORDER BY (SELECT Max(U0."pub_date") AS "last_date" FROM "ordering_article" U0 WHERE ( U0."author_id" = ( "ordering_author"."id" ) AND Upper(U0."headline" :: text) LIKE Upper('%Article%') ) GROUP BY U0."author_id") DESC NULLS FIRST
when it could be
SELECT DISTINCT "ordering_author"."id", "ordering_author"."name", "ordering_author"."editor_id", (SELECT Max(U0."pub_date") AS "last_date" FROM "ordering_article" U0 WHERE ( U0."author_id" = ( "ordering_author"."id" ) AND Upper(U0."headline" :: text) LIKE Upper( '%Article%') ) GROUP BY U0."author_id") AS "last_date", FROM "ordering_author" ORDER BY "last_date" DESC NULLS FIRST
Reusing aliases and references seems like a good idea. It makes the query imo more readable and also means less parsing overhead for the server (probably negligible but still).
Last but not least, the reason I discovered this is the new code for psycopg3 support. Psycopg3 makes use of server side param binding which introduces a whole new class of issues. To be fair this is not postgres alone but also informix and others. When actually using server side params you run into funny things like this: Assume Django ends up generating a query like this:
cursor.execute(""" SELECT left(name, %s) AS prefix, count(*) AS number FROM people GROUP BY left(name, %s) ORDER BY left(name, %s) """, [2, 2, 2])
This works perfectly fine with client side merging (simply replace %s with 2), but breaks done once you use server side binding because the server gets $1 to $3
for the params and doesn't know that they are actually the same and as such fails because it no longer knows if the group by is part of the select clause or not. Mind you, this example is rather simplistic but Django actually generates those like shown in Tim's example above.
To fix this we have two options:
- Switch to named parameters and introduce even more logic to always write %(same_param)s in the previous example.
- Switch our usage of order by / group by to use aliases and column numbers whenever possible.
comment:5 by , 2 years ago
Owner: | changed from | to
---|---|
Status: | new → assigned |
Hello Everyone, I am new to open source contribution. But had used Django for the last 3 years. I want to contribute to the resolution of this bug so that I can start my journey as an open-source contributor to Django.
comment:6 by , 2 years ago
Aniruddh, we appreciate your interest but this issue is quite complex so you might want to focus on another issue instead for your first contributions.
Thanks for the great summary Florian!
Maybe, in hindsight that would probably should have been the way to implement unions (etc) in the first place. Maybe now is the time to switch to that :) What does a subquery mean in terms of performance -- I'd hope the database wouldn't care to much…
From some limited local testing on SQLite, PostgreSQL, and MySQL and looking at the generate EXPLAIN
plan it doesn't seem like they care too much. But it might not be necessary after some more thoughts as the ORDER BY
clause allows for ordering by aliases as well.
Reusing aliases and references seems like a good idea. It makes the query imo more readable and also means less parsing overhead for the server (probably negligible but still).
I completely agree with this statement but I don't understand how this relates to using selected field indexes instead of aliases?
From my understanding and testing engines locally it seems that all engines allow to refer to the first query member of the union by column aliases just like by index so once we've addressed the issue with ordering by selected references not being used we could entirely drop the specialized combinator logic that uses RawSQL
to order by column index?
comment:7 by , 2 years ago
Hi Simon, sorry I lost track of this one in my mails.
From my understanding and testing engines locally it seems that all engines allow to refer to the first query member of the union by column aliases just like by index
Interesting, that would be great indeed. I think I might have had issues with UNION at the time I implemented it but if aliases work around the board, that would be preferably imo.
we could entirely drop the specialized combinator logic that uses RawSQL to order by column index?
Hopefully yes :)
comment:9 by , 2 years ago
Has patch: | set |
---|---|
Owner: | changed from | to
Triage Stage: | Accepted → Ready for checkin |
What if we performed a subquery pushdown instead to make compilation more predictable so we can have the order by clause refer to the union columns by name?
This is a strategy that is used successfully for aggregation for example. I don't see
OrderBy.as_sql
usage oftemplate
for compilation is an abuse at all; it's in charge of it's own compilation unit. As mentioned on #33767 I'd prefer if we errored out on constant order by usage instead of making leaps to support it because it has ambiguous evaluation rules.