Opened 5 years ago

Closed 5 years ago

#30130 closed Bug (duplicate)

Django .values().distinct() returns a lot more records than .values().distinct().count()

Reported by: James Lin Owned by: nobody
Component: Database layer (models, ORM) Version: 2.1
Severity: Normal Keywords:
Cc: Triage Stage: Unreviewed
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description (last modified by James Lin)

I have a table virtualmachineresources, which has 100k+ rows, it has columns machine and cluster, some rows the cluster field is empty. it has repeating rows of machine + with/without cluster, hence I want to use the distinct() method.

using .values().distinct().count(), it returned 2k rows

In [6]: VirtualMachineResources.objects.all().values('machine', 'cluster')
   ...: .distinct().count()                                               
Out[6]: 2247

When I loop through the distinct query

for resource in VirtualMachineResources.objects.all().values('machine', 'cluster').distinct(): 
    print(resource['machine'], resource['cluster'])

I observed it returned 100k rows, with repeating rows that the same 'machine` with/without the cluster.

Here is the corresponding stackoverflow question https://stackoverflow.com/questions/54354462/django-distinct-returns-more-records-than-count

Change History (5)

comment:1 by James Lin, 5 years ago

Description: modified (diff)

comment:2 by James Lin, 5 years ago

Description: modified (diff)

comment:3 by Simon Charette, 5 years ago

Do you happen to have a VirtualMachineResources.Meta.ordering defined? What does print(VirtualMachineResources.objects.values('machine', 'cluster').query) yield?

Last edited 5 years ago by Simon Charette (previous) (diff)

comment:4 by Simon Charette, 5 years ago

I'm pretty sure that's what's at play here as I managed to reproduce locally. You have a defined Meta.ordering (or are performing an order_by) which makes your non-count query something along the lines of

SELECT DISTINCT "machine", "cluster", "ordering" FROM "resources" ORDER BY "ordering" ASC

The ordering column is added to the SELECT clause to prevent the generation of invalid SQL.

SELECT DISTINCT "machine", "cluster" FROM "resources" ORDER BY "ordering" ASC
> ERROR:  for SELECT DISTINCT, ORDER BY expressions must appear in select list

The .count() query happens to be the following

SELECT COUNT(*) FROM (SELECT DISTINCT "machine", "cluster" FROM "resources") subquery

It looks like the issue is that we clear the ordering when we shouldn't on the .count(). I think the logic needs to be adjusted to branch of not self.distinct instead of self.distinct_fields. Thing is switching this line generate this really nasty double nested query

SELECT COUNT(*) FROM (
    SELECT "machine", "cluster" FROM (
        SELECT DISTINCT "machine", "cluster", "ordering" FROM "resources" ORDER BY "ordering" ASC subquery
    )) subquery

Ideally we'd want

SELECT COUNT(*) FROM (SELECT DISTINCT "machine", "cluster", "ordering" FROM "resources") subquery

Where the ORDER BY is cleared as it's not necessary but the columns that would have made it to SELECT clause to satisfy the DISTINCT/ORDER BY rule are all included.

Last edited 5 years ago by Simon Charette (previous) (diff)

comment:5 by Simon Charette, 5 years ago

Resolution: duplicate
Status: newclosed

There doesn't seem to be a consensus on how to deal with these type of queries right now so I'm just going to close as a duplicate of #14357, #16025, #28560.

Note: See TracTickets for help on using tickets.
Back to Top