Opened 12 months ago

Last modified 9 months ago

#30685 assigned Cleanup/optimization

Optimize QuerySet.count() with distinct()

Reported by: Adam Sołtysik Owned by: Ivaylo Donchev
Component: Database layer (models, ORM) Version: 2.2
Severity: Normal Keywords:
Cc: Triage Stage: Accepted
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

I have a PostgreSQL table with 100 000 records and 15 columns.

A simple .count() query results in a fast SQL (execution time in seconds on the left):

(0.015) SELECT COUNT(*) AS "__count" FROM "table";

When we add .distinct(), a subquery is created with all columns SELECTed:

(0.178) SELECT COUNT(*) FROM (SELECT DISTINCT "table"."id" AS Col1, ... (15 columns) FROM "table") subquery;

When instead of .distinct() we write .distinct('id') and add .order_by('id', 'col1', 'col2'), the subquery is additionally ORDERed:

(0.151) SELECT COUNT(*) FROM (SELECT DISTINCT ON ("table"."id") "table"."id" AS Col1, ... (15 columns) FROM "table" ORDER BY "table"."id" ASC, "table"."col1" ASC, "table"."col2" ASC) subquery;

Funny thing is that without .distinct('id') we can write .order_by('non_existing_column') and it works without any exception.

After adding .values('id') and an empty .order_by(), the query is as fast as it can be with DISTINCT:

(0.053) SELECT COUNT(*) FROM (SELECT DISTINCT ON ("table"."id") "table"."id" AS Col1 FROM "table") subquery;

I think that the subquery for count() should never contain additional columns nor be ordered (and the same probably goes for other aggregations).

Change History (13)

comment:1 Changed 12 months ago by Carlton Gibson

Resolution: invalid
Status: newclosed

Hi Adam,

From the description here it just looks like the expected behaviour, as per the various `Note` callouts in the `distinct()` documentation.

(If you really think there's a bug here, can I ask you to narrow it down to something more specific? What are the exact models and ORM calls in play. What is the exact SQL generated? Vs What do you expect and why? Thanks.)

comment:2 Changed 12 months ago by Adam Sołtysik

I don't think it's a bug, but I think there is a place for optimization. I can't see a reason why all columns are SELECTed in a subquery just to perform a simple count. I've attached the SQLs generated and what I would expect.

So, for any model, a query like this:

Model.objects.distinct().count()

should produce an SQL like this:

SELECT COUNT(*) FROM (SELECT DISTINCT "table"."id" AS Col1 FROM "table") subquery;

instead of:

SELECT COUNT(*) FROM (SELECT DISTINCT "table"."id" AS Col1, ... (all other columns) FROM "table") subquery;

which, depending on the number of columns and size of the table, can be several times slower (at least on PostgreSQL).

If you think this is not worth the effort, then OK, I won't insist.

comment:3 Changed 12 months ago by Carlton Gibson

I think the issue is that there's no way for the distinct() call to know that you're going to follow up with a count() — and that code to special case for that isn't going to worth the effort.

If you know you only want to select a single field tell the ORM via `only()`. This'll give you want you need, I think: Model.objects.only('id').distinct().count().

(Maybe the ORM experts can tell you more, but I suspect that be more or less the end of the story...)

comment:4 Changed 12 months ago by Simon Charette

It should feasible to adapt sql.Query.get_count() to perform these optimizations by setting a .values mask when no columns are specified.

https://github.com/django/django/blob/05964b2198e53a8d66e34d83d9123e3051720b28/django/db/models/sql/query.py#L505-L514

You might have to tweak the suquery pushdown logic a bit as that's what kicks in when distinct is used but it should be feasible.

https://github.com/django/django/blob/05964b2198e53a8d66e34d83d9123e3051720b28/django/db/models/sql/query.py#L421-L457

comment:5 Changed 12 months ago by Carlton Gibson

Resolution: invalid
Status: closednew
Triage Stage: UnreviewedAccepted

OK, fine, thank you for the input Simon. Let’s accept on that basis. 👍

comment:6 Changed 12 months ago by Adam Sołtysik

Thank you for accepting. I have added .values('pk').order_by() before my .count() queries, but I'll be glad to be able to get rid of it.

Just to add a bit more information. The "problem" is not only with .distinct(), also any .annotate() makes count() produce a subquery with the redundant annotations inside. For example, a query like this results in SQL that is 4 times slower for my data than the bare count:

Model.objects.annotate(id2=F('id')).count()
(0.057) SELECT COUNT(*) FROM (SELECT "table"."id" AS Col1, "table"."id" AS "id2" FROM "table" GROUP BY "table"."id") subquery;

Adding .values('pk') of course solves this as well.

comment:7 Changed 12 months ago by Carlton Gibson

Adding .values('pk') of course solves this as well.

Yes… and for the same reason, right? 🙂

Fancy taking on a PR as per Simon’s suggestion? (We’re happy to advise…)

comment:8 Changed 12 months ago by Adam Sołtysik

Summary: Suboptimal QuerySet.distinct().count()Optimize QuerySet.count() with distinct() or annotate().

Actually, wouldn't it be as simple as adding these two lines after obj = self.clone() here: https://github.com/django/django/blob/05964b2198e53a8d66e34d83d9123e3051720b28/django/db/models/sql/query.py#L505-L514

obj.set_values(['pk'])
obj.clear_ordering(force_empty=True)

I tried it and it seems to be working (didn't run tests, though). But I think I don't understand the second part of Simon's advice.

comment:9 Changed 12 months ago by Simon Charette

Summary: Optimize QuerySet.count() with distinct() or annotate().Optimize QuerySet.count() with distinct()

The annotation part is tracked in #28477 so I'll rename this ticket to only mention distinct.

I'm afraid we can't simply clear the values and ordering, else quite bit of code in get_aggregate would be rendundant.

One case that these changes don't consider is the fact that .values('foo').distinct() could influence the .count()

e.g.

SELECT COUNT(*) FROM (SELECT DISTINCT author_id FROM book)

Is not equivalent to

SELECT COUNT(*) FROM (SELECT DISTINCT book_id FROM book)

I suggest you open a PR with your changes so CI can report what exactly that breaks.

I'm getting three failures locally on SQLite; one related to the aforementioned .values().distinct() issue, one related to .datetimes().distinct() and one related to order_by spawning a multiple valued relationship.

comment:10 Changed 9 months ago by Ivaylo Donchev

Hi :)
Is this ticket free now ? Do you mind me start playing with an implemention of it

comment:11 Changed 9 months ago by Adam Sołtysik

I probably won't find time for a PR in the nearest time, so as far as I'm concerned, feel free to take it :)

comment:12 Changed 9 months ago by Ivaylo Donchev

Okay, thanks :) . Will play with it these days.

comment:13 Changed 9 months ago by Ivaylo Donchev

Owner: changed from nobody to Ivaylo Donchev
Status: newassigned
Note: See TracTickets for help on using tickets.
Back to Top