Opened 2 years ago

Last modified 2 years ago

#25230 new Cleanup/optimization

Change to Query.get_count() causes big performance hit

Reported by: dexity Owned by: nobody
Component: Database layer (models, ORM) Version: 1.8
Severity: Normal Keywords:
Cc: Triage Stage: Accepted
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description (last modified by dexity)

The recently merged commit https://github.com/django/django/commit/c7fd9b242d2d63406f1de6cc3204e35aaa025233?diff=unified generates SQL which causes a severe performance hit when the distinct parameter is used.
Instead of generating something like:
SELECT DISTINCT COUNT('*') FROM some_model;

it generates an SQL query which looks like:
SELECT COUNT('*') FROM (SELECT DISTINCT * FROM some_model) subquery;
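For illustration, here is a minimal plain-Python sketch (names and structure hypothetical, not Django's actual implementation) of the decision the commit introduced: when the query is marked distinct, the count is wrapped around a subquery instead of being a plain aggregate.

```python
# Hypothetical sketch of the count-SQL decision described in this ticket:
# when distinct is set, COUNT(*) is wrapped around a subquery, forcing the
# database to materialize and de-duplicate every row before counting.

def build_count_sql(table, distinct=False):
    """Loosely mimic how the new code chooses between the two SQL forms."""
    if distinct:
        # New behavior: force a subquery around the distinct selection.
        return f"SELECT COUNT(*) FROM (SELECT DISTINCT * FROM {table}) subquery"
    # Fast path: a plain aggregate over the table.
    return f"SELECT COUNT(*) FROM {table}"

print(build_count_sql("some_model"))
print(build_count_sql("some_model", distinct=True))
```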

Related ticket: https://code.djangoproject.com/ticket/23875

Change History (10)

comment:1 Changed 2 years ago by dexity

Description: modified (diff)

comment:2 Changed 2 years ago by Josh Smeaton

Triage Stage: Unreviewed → Accepted

The exact line causing the performance regression is: https://github.com/django/django/commit/c7fd9b242d2d63406f1de6cc3204e35aaa025233#diff-0edd853580d56db07e4020728d59e193R339

The `or self.distinct` component now forces a subquery. We might need Anssi to confirm whether that part is needed, and whether we can be smarter in a restricted set of use cases.

comment:3 Changed 2 years ago by Anssi Kääriäinen

The above queries are identical because the distinct isn't actually doing anything. It does nothing whenever we can guarantee there is some unique column in the query: for example, when there are no joins and the primary key of the main table is included in the query, the distinct is redundant. We could be even smarter and detect when there are no multivalued joins in the query (and the pk is present).
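The redundancy check described above can be sketched in a few lines of plain Python (a hypothetical helper, not code from Django; the column/join representation is assumed for illustration):

```python
# Hypothetical sketch of the redundancy check described above: if the query
# has no joins and selects the primary key of the main table, every row is
# already unique, so DISTINCT provably cannot change the result set.

def distinct_is_redundant(selected_columns, pk_column, joins):
    """Return True when DISTINCT is a no-op (assumed representation)."""
    # Without joins, rows come straight from the base table; including the
    # pk then guarantees each selected row is unique.
    return not joins and pk_column in selected_columns

# DISTINCT is a no-op here: pk selected, no joins.
assert distinct_is_redundant(["id", "name"], "id", joins=[])
# A join may duplicate rows, so DISTINCT must be kept.
assert not distinct_is_redundant(["id", "name"], "id", joins=["author"])
```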

Not sure how much effort we should put into fixing this - the distinct in the original query is redundant, so it could be removed in the reporter's application, at least for the count query.

comment:4 Changed 2 years ago by dexity

For our database the newly generated query takes 30-40 seconds, while the original query took 0.1 seconds. Huge difference!

Last edited 2 years ago by dexity (previous) (diff)

comment:5 Changed 2 years ago by Aymeric Augustin

MySQL?

comment:6 Changed 2 years ago by Simon Charette

I do get a non-negligible slowdown on PostgreSQL with a fully analyzed table containing ~30M rows.

database=# EXPLAIN ANALYZE SELECT COUNT('*') FROM (SELECT DISTINCT * FROM schema.table) subquery;
                                                                                                                 QUERY PLAN                                                                                                                  
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=6213779.77..6213779.78 rows=1 width=0) (actual time=88610.804..88610.804 rows=1 loops=1)
   ->  Unique  (cost=5290067.41..5858505.79 rows=28421919 width=117) (actual time=53265.590..86513.151 rows=28176350 loops=1)
         ->  Sort  (cost=5290067.41..5361122.20 rows=28421919 width=117) (actual time=53265.588..76718.379 rows=28176350 loops=1)
               Sort Key: table.col0, table.col1, table.col2, table.col2, table.col3, table.col4, table.col5
               Sort Method: external merge  Disk: 1296744kB
               ->  Seq Scan on table  (cost=0.00..522350.19 rows=28421919 width=117) (actual time=0.840..15830.723 rows=28176350 loops=1)
 Total runtime: 88695.934 ms
(7 rows)

database=# EXPLAIN ANALYZE SELECT COUNT(*) FROM schema.table;
                                                               QUERY PLAN                                                               
----------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=593404.99..593405.00 rows=1 width=0) (actual time=19139.072..19139.073 rows=1 loops=1)
   ->  Seq Scan on table  (cost=0.00..522350.19 rows=28421919 width=0) (actual time=5.683..17387.798 rows=28176350 loops=1)
 Total runtime: 19139.108 ms
(3 rows)

Should we escalate this to release blocker?

comment:7 Changed 2 years ago by Josh Smeaton

akaariai's argument against fixing this makes sense to me.

The distinct in Model.objects.distinct().count() is redundant, because the pk ensures that each row is distinct anyway. For the cases where distinct would actually change the result, a subquery is required.

So even if we decide to fix this, I don't think it warrants release-blocker status. Fix the queryset by removing the redundant distinct. If we can demonstrate that correcting the queryset is non-obvious, that would strengthen the case for implementing a fix.

Really though, distinct is very rarely useful, and it is often indicative of a poorly constructed query (in SQL or in Django). There are obviously useful distinct queries, but in my experience they are the exception.
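The distinction drawn above can be made concrete with a small self-contained illustration (plain Python, no Django; rows simulated as tuples):

```python
# Plain-Python illustration of when DISTINCT matters for a count.

# Base-table rows that include the pk are already unique, so
# distinct().count() == count() and the subquery is pure overhead.
base_rows = [(1, "a"), (2, "a"), (3, "b")]
assert len(set(base_rows)) == len(base_rows)  # distinct is redundant

# After a multivalued join, the same parent row can appear several times
# in the selected columns; here DISTINCT genuinely changes the count,
# so a subquery is required for a correct result.
joined_rows = [(1, "a"), (1, "a"), (2, "b")]
assert len(set(joined_rows)) == 2
assert len(joined_rows) == 3  # plain COUNT(*) would over-count
```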

comment:8 Changed 2 years ago by dexity

We use MySQL. Model.objects.distinct().count() is a simple query to make the problem obvious. There might be intermediate filters which apparently make distinct() more relevant.

comment:9 Changed 2 years ago by Anssi Kääriäinen

I'm not against fixing this. In general, we could remove distinct from queries whenever we can prove it is redundant.

But I don't think this is a high-priority issue. I will probably get to fixing it at some point, and I'm definitely willing to review patches.

comment:10 in reply to:  7 Changed 2 years ago by Shai Berger

Replying to jarshwah:

Really though, distinct is very rarely useful, and is indicative of a poorly constructed query (in sql or in django). Obviously there are useful distinct queries, but they are the exception in my experience.

It becomes necessary if your filter traverses reverse FK relations; SQL is stupid and multiplies results in that case.
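A minimal sketch of this point, with a hypothetical parent/child schema simulated in plain Python: filtering on a reverse FK joins the child table, so a parent matching two children appears twice in the result unless DISTINCT is applied.

```python
# Sketch of the reverse-FK duplication described above (schema hypothetical).
# Parent 1 has two children matching the filter, so the join yields that
# parent twice; a plain count over-counts, and DISTINCT becomes necessary.

parents = {1: "alice", 2: "bob"}
children = [  # (child_id, parent_id, flag)
    (10, 1, True),
    (11, 1, True),
    (12, 2, False),
]

# Rough equivalent of Parent.objects.filter(child__flag=True): an inner
# join that keeps one output row per matching (parent, child) pair.
matched = [p for (_, p, flag) in children if flag and p in parents]
assert len(matched) == 2       # parent 1 counted twice without distinct
assert len(set(matched)) == 1  # distinct().count() gives the right answer
```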
