Opened 6 years ago
Closed 6 years ago
#30130 closed Bug (duplicate)
Django .values().distinct() returns a lot more records than .values().distinct().count()
Reported by: | James Lin | Owned by: | nobody |
---|---|---|---|
Component: | Database layer (models, ORM) | Version: | 2.1 |
Severity: | Normal | Keywords: | |
Cc: | Triage Stage: | Unreviewed | |
Has patch: | no | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description (last modified by )
I have a table virtualmachineresources, which has 100k+ rows, it has columns machine
and cluster
, some rows the cluster field is empty. it has repeating rows of machine + with/without cluster, hence I want to use the distinct() method.
using .values().distinct().count(), it returned 2k rows
In [6]: VirtualMachineResources.objects.all().values('machine', 'cluster') ...: .distinct().count() Out[6]: 2247
When I loop through the distinct query
for resource in VirtualMachineResources.objects.all().values('machine', 'cluster').distinct(): print(resource['machine'], resource['cluster'])
I observed it returned 100k rows, with repeating rows that the same 'machine` with/without the cluster.
Here is the corresponding stackoverflow question https://stackoverflow.com/questions/54354462/django-distinct-returns-more-records-than-count
Change History (5)
comment:1 by , 6 years ago
Description: | modified (diff) |
---|
comment:2 by , 6 years ago
Description: | modified (diff) |
---|
comment:4 by , 6 years ago
I'm pretty sure that's what's at play here as I managed to reproduce locally. You have a defined Meta.ordering
(or are performing an order_by
) which makes your non-count query something along the lines of
SELECT DISTINCT "machine", "cluster", "ordering" FROM "resources" ORDER BY "ordering" ASC
The ordering
column is added to the SELECT
clause to prevent the generation of invalid SQL.
SELECT DISTINCT "machine", "cluster" FROM "resources" ORDER BY "ordering" ASC > ERROR: for SELECT DISTINCT, ORDER BY expressions must appear in select list
The .count()
query happens to be the following
SELECT COUNT(*) FROM (SELECT DISTINCT "machine", "cluster" FROM "resources") subquery
It looks like the issue is that we clear the ordering when we shouldn't on the .count()
. I think the logic needs to be adjusted to branch of not self.distinct
instead of self.distinct_fields
. Thing is switching this line generate this really nasty double nested query
SELECT COUNT(*) FROM ( SELECT "machine", "cluster" FROM ( SELECT DISTINCT "machine", "cluster", "ordering" FROM "resources" ORDER BY "ordering" ASC subquery )) subquery
Ideally we'd want
SELECT COUNT(*) FROM (SELECT DISTINCT "machine", "cluster", "ordering" FROM "resources") subquery
Where the ORDER BY
is cleared as it's not necessary but the columns that would have made it to SELECT
clause to satisfy the DISTINCT
/ORDER BY
rule are all included.
comment:5 by , 6 years ago
Resolution: | → duplicate |
---|---|
Status: | new → closed |
Do you happen to have a
VirtualMachineResources.Meta.ordering
defined? What doesprint(VirtualMachineResources.objects.values('machine', 'cluster').query)
yield?