Opened 9 years ago
Last modified 5 weeks ago
#26434 assigned Bug
Inconsistent results of QuerySet.count() when ordering is not a subset of explicit grouping.
Reported by: | kamandol | Owned by: | Michal Mládek |
---|---|---|---|
Component: | Database layer (models, ORM) | Version: | dev |
Severity: | Normal | Keywords: | postgresql queryset count annotate aggreagate order_by |
Cc: | github@…, Simon Charette | Triage Stage: | Accepted |
Has patch: | yes | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
Using a PostgreSQL backend database, in some cases with QuerySets involving annotations, values or values_list and ordering(order_by) clauses the resulting QuerySet count() method fails to predict the real row result unless that QuerySet is evaluated.
For instance:
q = BundleVersion.objects.order_by('-id').values_list("port__id", "asset_bundle__id").annotate(max=models.Max("id"))
This QuerySet using PostgreSQL would, in fact, group the results by BundleVersion.id instead of the tuple ("port__id", "asset_bundle__id") due to the order_by clause using a column not in the values_list. This is a documented behavior.
The problem is that if q is not evaluated yet, calling q.count() will return an amount of rows as if the grouping was done on the tuple ("port__id", "asset_bundle__id"). After evaluating q, or calling len(q), the result of q.count() will be the expected, as if the grouping was done using 'BundleVersion.id'.
Attachments (2)
Change History (27)
follow-up: 2 comment:1 by , 9 years ago
comment:2 by , 9 years ago
Replying to timgraham:
Could you provide a test for Django's test suite in
tests/queries/tests.py
that demonstrates the issue?
Here is the commit with the demonstration:
https://github.com/sp-ricard-valverde/django/commit/c50afafdb02880e4c941c2a72a215bee80de3aed
follow-up: 5 comment:3 by , 9 years ago
Tests looks legitimate, but I don't get this one https://github.com/sp-ricard-valverde/django/commit/c50afafdb02880e4c941c2a72a215bee80de3aed#diff-ce9b52a66d03e851a9828377263dc04bR3864 ; or rather I don't get why is labeled as 'OK' as it the concatenation of the previous ones
comment:4 by , 9 years ago
Cc: | added |
---|
comment:5 by , 9 years ago
Replying to yakky:
Tests looks legitimate, but I don't get this one https://github.com/sp-ricard-valverde/django/commit/c50afafdb02880e4c941c2a72a215bee80de3aed#diff-ce9b52a66d03e851a9828377263dc04bR3864 ; or rather I don't get why is labeled as 'OK' as it the concatenation of the previous ones
Sorry, my bad, it's a bad choice for a test title. What it demonstrates is that the second assert in that test is successful even though a new QuerySet is built and, according to the other failed tests it should fail too.
comment:6 by , 9 years ago
Updated test comments and clearer test names
https://github.com/sp-ricard-valverde/django/commit/42c1002a00b434331fd2c9970b05baf02f0a262f
comment:7 by , 9 years ago
Triage Stage: | Unreviewed → Accepted |
---|
If you could put the tests on a branch of you fork other than "master" that seems a bit safer. There is probably no need for skipUnlessDBEngine
- is there a reason the queries shouldn't also work on other databases? Anyway, the usual way to skip tests is based on database features.
comment:8 by , 9 years ago
I will change the branch as requested.
You're right, the faulty behavior was detected on PostgreSQL, but on MySQL worked correctly so the test should pass. Could not test other db engines though. I will remove that extra piece of code.
comment:10 by , 6 years ago
Has patch: | unset |
---|---|
Version: | 1.9 → master |
comment:11 by , 6 years ago
I want to discuss when collecting group by columns why are order by columns added into extensions.
For example:
Model.objects.values("col_a").annotate(max=Max("col_b")).order_by('col_c')
current query:
select col_a, MAX(col_b) as max from Model group by col_a, col_c order by col_c
I think the expected behavior has should be like that:
django.db.utils.ProgrammingError: column "Model.col_c" must appear in the GROUP BY clause or be used in an aggregate function
Because this behavior changes my group by and result. if I'm lucky, my result doesn't change.
If we check test_annotation_with_value, 'name' column is being added into group_by because Book model has Meta.ordering.
comment:12 by , 6 years ago
Has patch: | set |
---|
comment:13 by , 5 years ago
Patch needs improvement: | set |
---|
comment:14 by , 2 months ago
Owner: | changed from | to
---|---|
Status: | new → assigned |
follow-up: 16 comment:15 by , 2 months ago
Summary: | Inconsistent results of QuerySet count() method using PostgreSQL backend prior and post the QuerySet evaluation → Inconsistent results of QuerySet.count() when ordering is not a subset of explicit grouping. |
---|
FWIW the issue is not Postgres specific (I reproduced against SQLite).
The problem appears to be caused by sql.Query.get_aggregation
's call to clear_ordering(force=False)
. It seems like the later should skip clearing when isinstance(self.group_by, tuple)
and self.order_by + self.extra_order_by
is not a subset of self.group_by
as force=False
must only clear if doing so preserves the semantic of the query which clearly doesn't if the ordering includes members missing from the group by (as they would normally be added at a later stage).
I adjusted the ticket summary accordingly.
follow-up: 17 comment:16 by , 2 months ago
Hello Simon,
Thank you very much for clarifying the bug — especially for pointing out that it’s not Postgres-specific, and also for the hint about a subset. I think everything will go much more smoothly and quickly now.
Replying to Simon Charette:
FWIW the issue is not Postgres specific (I reproduced against SQLite).
The problem appears to be caused by
sql.Query.get_aggregation
's call toclear_ordering(force=False)
. It seems like the later should skip clearing whenisinstance(self.group_by, tuple)
andself.order_by + self.extra_order_by
is not a subset ofself.group_by
asforce=False
must only clear if doing so preserves the semantic of the query which clearly doesn't if the ordering includes members missing from the group by (as they would normally be added at a later stage).
I adjusted the ticket summary accordingly.
comment:17 by , 2 months ago
Replying to Simon Charette:
Using order_by('...')
on a column that is also used implicitly for grouping in an aggregation queryset is a misuse of the ORM AFAIK. So what should we do about it?
Should we implement a patch that triggers a warning if order_by('...')
includes a column that ends up in the annotate(...)
, I mean GROUP BY ...
clause? Or should we instead modify the behavior of queryset.count()
to preserve ordering in such edge cases?
In case of implementing a warning, this doesn't look like a bug - it would be better treated as a feature request, I believe.
Also, the documentation already warns about this interaction quite clearly:
https://docs.djangoproject.com/en/5.2/topics/db/aggregation/#interaction-with-order-by — and that warning has been present since the LTS version Django 3.2.
Hello Simon,
Thank you very much for clarifying the bug — especially for pointing out that it’s not Postgres-specific, and also for the hint about a subset. I think everything will go much more smoothly and quickly now.
Replying to Simon Charette:
FWIW the issue is not Postgres specific (I reproduced against SQLite).
The problem appears to be caused by
sql.Query.get_aggregation
's call toclear_ordering(force=False)
. It seems like the later should skip clearing whenisinstance(self.group_by, tuple)
andself.order_by + self.extra_order_by
is not a subset ofself.group_by
asforce=False
must only clear if doing so preserves the semantic of the query which clearly doesn't if the ordering includes members missing from the group by (as they would normally be added at a later stage).
I adjusted the ticket summary accordingly.
comment:18 by , 2 months ago
Using order_by('...') on a column that is also used implicitly for grouping in an aggregation queryset is a misuse of the ORM AFAIK. So what should we do about it?
I share your sentiment that a warning should be emitted but that likely warrant a separate ticket inspired by #14357, #32546 (there might be one already I just can't find it now).
I think we should keep this one focused on making sure the right results are returned in the mean time as it's caused by an over-eager optimization (pruning an ORDER BY
for performance reasons).
The way things are currently setup the not quite correct example is wrong as the order_by("name")
is elided when count()
is called but not when the full query is materialized.
The example you linked is about aggregate function annotations but you'll notice that the gotcha is not honored when aggregation is used on top of the query.
In other words, we should make the gotcha more explicit but we should first make sure it's coherent if aggregation is used over a queryset making use of aggregate annotations.
comment:19 by , 2 months ago
Cc: | added |
---|
comment:20 by , 8 weeks ago
I added patch in attachment, I am still working on in it. I think I am one step closer to the solution.
by , 8 weeks ago
Attachment: | main_logic.patch added |
---|
by , 8 weeks ago
Attachment: | main_logic.2.patch added |
---|
follow-up: 22 comment:21 by , 8 weeks ago
Now the patch is done. I am continue with regression tests...
comment:23 by , 7 weeks ago
Patch needs improvement: | set |
---|
Left a few comments on the PR, the patch is quite invasive and could be reduced to a subtle changes in how sql.Query.get_aggregation
calls clear_ordering
as described in comment:15.
comment:24 by , 6 weeks ago
Patch needs improvement: | unset |
---|
comment:25 by , 5 weeks ago
I left there 2 PRs because I reached 2 similar solutions, one of them seems more detailed but more power/time consuming and the other is simple and straightforward.
Could you provide a test for Django's test suite in
tests/queries/tests.py
that demonstrates the issue?