Opened 2 years ago

Last modified 2 years ago

#25230 new Cleanup/optimization

Change to Query.get_count() causes big performance hit

Reported by: dexity Owned by: nobody
Component: Database layer (models, ORM) Version: 1.8
Severity: Normal Keywords:
Cc: Triage Stage: Accepted
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description (last modified by dexity)

The recently merged commit https://github.com/django/django/commit/c7fd9b242d2d63406f1de6cc3204e35aaa025233?diff=unified generates SQL which causes a severe performance hit when the distinct parameter is used.
Instead of generating something like:
SELECT DISTINCT COUNT('*') FROM some_model;

it generates an SQL query which looks like:
SELECT COUNT('*') FROM (SELECT DISTINCT * FROM some_model) subquery;
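For illustration, here is a minimal plain-Python sketch (names and structure hypothetical, not Django's actual implementation) of the decision the commit introduced: when the query is marked distinct, the count is wrapped around a subquery instead of being a plain aggregate.

```python
# Hypothetical sketch of the count-SQL decision described in this ticket:
# when distinct is set, COUNT(*) is wrapped around a subquery, forcing the
# database to materialize and de-duplicate every row before counting.

def build_count_sql(table, distinct=False):
    """Loosely mimic how the new code chooses between the two SQL forms."""
    if distinct:
        # New behavior: force a subquery around the distinct selection.
        return f"SELECT COUNT(*) FROM (SELECT DISTINCT * FROM {table}) subquery"
    # Fast path: a plain aggregate over the table.
    return f"SELECT COUNT(*) FROM {table}"

print(build_count_sql("some_model"))
print(build_count_sql("some_model", distinct=True))
```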

Related ticket: https://code.djangoproject.com/ticket/23875

Change History (10)

comment:1 Changed 2 years ago by dexity

Description: modified (diff)

comment:2 Changed 2 years ago by Josh Smeaton

Triage Stage: Unreviewed → Accepted

The exact line causing the performance regression is: https://github.com/django/django/commit/c7fd9b242d2d63406f1de6cc3204e35aaa025233#diff-0edd853580d56db07e4020728d59e193R339

The `or self.distinct` component now forces a subquery. We might need Anssi to confirm whether that part is needed, and whether we can be smarter in a restricted set of use cases.

comment:3 Changed 2 years ago by Anssi Kääriäinen

The above queries are identical because the distinct isn't actually doing anything. It does nothing whenever we can guarantee there is some unique column in the query: for example, when there are no joins and the primary key of the main table is included in the query, the distinct is redundant. We could be even smarter and detect when there are no multivalued joins in the query (and the pk is present).
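The redundancy check described above can be sketched in a few lines of plain Python (a hypothetical helper, not code from Django; the column/join representation is assumed for illustration):

```python
# Hypothetical sketch of the redundancy check described above: if the query
# has no joins and selects the primary key of the main table, every row is
# already unique, so DISTINCT provably cannot change the result set.

def distinct_is_redundant(selected_columns, pk_column, joins):
    """Return True when DISTINCT is a no-op (assumed representation)."""
    # Without joins, rows come straight from the base table; including the
    # pk then guarantees each selected row is unique.
    return not joins and pk_column in selected_columns

# DISTINCT is a no-op here: pk selected, no joins.
assert distinct_is_redundant(["id", "name"], "id", joins=[])
# A join may duplicate rows, so DISTINCT must be kept.
assert not distinct_is_redundant(["id", "name"], "id", joins=["author"])
```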

Not sure how much effort we should put into fixing this - the distinct in the original query is redundant, so it could be removed in the reporter's application, at least for the count query.

comment:4 Changed 2 years ago by dexity

For our database the newly generated query takes 30-40 seconds, while the original query took 0.1 seconds. Huge difference!

Last edited 2 years ago by dexity (previous) (diff)

comment:5 Changed 2 years ago by Aymeric Augustin

MySQL?

comment:6 Changed 2 years ago by Simon Charette

I do get a non-negligible slowdown on PostgreSQL with a fully analyzed table containing ~30M rows.

database=# EXPLAIN ANALYZE SELECT COUNT('*') FROM (SELECT DISTINCT * FROM schema.table) subquery;
                                                                                                                 QUERY PLAN                                                                                                                  
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=6213779.77..6213779.78 rows=1 width=0) (actual time=88610.804..88610.804 rows=1 loops=1)
   ->  Unique  (cost=5290067.41..5858505.79 rows=28421919 width=117) (actual time=53265.590..86513.151 rows=28176350 loops=1)
         ->  Sort  (cost=5290067.41..5361122.20 rows=28421919 width=117) (actual time=53265.588..76718.379 rows=28176350 loops=1)
               Sort Key: table.col0, table.col1, table.col2, table.col2, table.col3, table.col4, table.col5
               Sort Method: external merge  Disk: 1296744kB
               ->  Seq Scan on table  (cost=0.00..522350.19 rows=28421919 width=117) (actual time=0.840..15830.723 rows=28176350 loops=1)
 Total runtime: 88695.934 ms
(7 rows)

database=# EXPLAIN ANALYZE SELECT COUNT(*) FROM schema.table;
                                                               QUERY PLAN                                                               
----------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=593404.99..593405.00 rows=1 width=0) (actual time=19139.072..19139.073 rows=1 loops=1)
   ->  Seq Scan on table  (cost=0.00..522350.19 rows=28421919 width=0) (actual time=5.683..17387.798 rows=28176350 loops=1)
 Total runtime: 19139.108 ms
(3 rows)

Should we escalate this to release blocker?

comment:7 Changed 2 years ago by Josh Smeaton

akaariai's argument against fixing this makes sense to me.

The distinct in Model.objects.distinct().count() is redundant, because the pk ensures that each row is distinct anyway. For the cases where distinct would actually change the result, a subquery is required.

So even if we decide to fix this, I don't think it warrants release-blocker status. Fix the queryset by removing the redundant distinct. If we can demonstrate that correcting the queryset is non-obvious, that would strengthen the case for implementing a fix.

Really though, distinct is very rarely useful, and it is often indicative of a poorly constructed query (in SQL or in Django). There are obviously useful distinct queries, but in my experience they are the exception.
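The distinction drawn above can be made concrete with a small self-contained illustration (plain Python, no Django; rows simulated as tuples):

```python
# Plain-Python illustration of when DISTINCT matters for a count.

# Base-table rows that include the pk are already unique, so
# distinct().count() == count() and the subquery is pure overhead.
base_rows = [(1, "a"), (2, "a"), (3, "b")]
assert len(set(base_rows)) == len(base_rows)  # distinct is redundant

# After a multivalued join, the same parent row can appear several times
# in the selected columns; here DISTINCT genuinely changes the count,
# so a subquery is required for a correct result.
joined_rows = [(1, "a"), (1, "a"), (2, "b")]
assert len(set(joined_rows)) == 2
assert len(joined_rows) == 3  # plain COUNT(*) would over-count
```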

comment:8 Changed 2 years ago by dexity

We use MySQL. Model.objects.distinct().count() is a simple query to make the problem obvious. There might be intermediate filters which apparently make distinct() more relevant.

comment:9 Changed 2 years ago by Anssi Kääriäinen

I'm not against fixing this. In general, we could remove distinct from queries whenever we can prove it is redundant.

But I don't think this is a high-priority issue. I will probably get to fixing it at some point, and I'm definitely willing to review patches.

comment:10 in reply to:  7 Changed 2 years ago by Shai Berger

Replying to jarshwah:

Really though, distinct is very rarely useful, and is indicative of a poorly constructed query (in sql or in django). Obviously there are useful distinct queries, but they are the exception in my experience.

It becomes necessary if your filter traverses reverse FK relations; SQL is stupid and multiplies results in that case.
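A minimal sketch of this point, with a hypothetical parent/child schema simulated in plain Python: filtering on a reverse FK joins the child table, so a parent matching two children appears twice in the result unless DISTINCT is applied.

```python
# Sketch of the reverse-FK duplication described above (schema hypothetical).
# Parent 1 has two children matching the filter, so the join yields that
# parent twice; a plain count over-counts, and DISTINCT becomes necessary.

parents = {1: "alice", 2: "bob"}
children = [  # (child_id, parent_id, flag)
    (10, 1, True),
    (11, 1, True),
    (12, 2, False),
]

# Rough equivalent of Parent.objects.filter(child__flag=True): an inner
# join that keeps one output row per matching (parent, child) pair.
matched = [p for (_, p, flag) in children if flag and p in parents]
assert len(matched) == 2       # parent 1 counted twice without distinct
assert len(set(matched)) == 1  # distinct().count() gives the right answer
```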
