Opened 11 years ago
Closed 11 years ago
#22278 closed Uncategorized (invalid)
Excessive GROUP BY when using the Count annotation
Reported by: | Mark Jones | Owned by: | nobody |
---|---|---|---|
Component: | Uncategorized | Version: | 1.6 |
Severity: | Normal | Keywords: | |
Cc: | Shai Berger | Triage Stage: | Unreviewed |
Has patch: | no | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
The following code illustrates a group by that seems computationally expensive since it is grouping by a large number of fields when it could be grouping by ID.
The generated SQL looks like this:
SELECT "foo_employee"."id", "foo_employee"."last_name", "foo_employee"."first_name", "foo_employee"."another_attribute", COUNT("foo_shift"."id") AS "overlapping_events" FROM "foo_employee" LEFT OUTER JOIN "foo_shift" ON ( "foo_employee"."id" = "foo_shift"."employee_id" ) WHERE ("foo_shift"."end_at" <= 2014-03-16 08:28:43.040428-07:00 AND "foo_shift"."start_at" >= 2014-03-16 08:28:43.040407-07:00 ) GROUP BY "foo_employee"."id", "foo_employee"."last_name", "foo_employee"."first_name", "foo_employee"."another_attribute" ORDER BY "overlapping_events" ASC, "foo_employee"."last_name" ASC, "foo_employee"."first_name" ASC
Everything after foo_employee.id in the GROUP BY isn't needed. Now the sql statement optimizer might be smart enough to figure this out, but then again, it might not.... Seems like if the pk is in the GROUP BY list, the rest of it isn't needed.
import datetime from django.db import models class Employee(models.Model): last_name = models.CharField(max_length=64) first_name = models.CharField(max_length=64) another_attribute = models.CharField(max_length=64) class Shift(models.Model): start_at = models.DateTimeField() end_at = models.DateTimeField() employee = models.ForeignKey(Employee, related_name="shift_set") start_at = datetime.datetime.now() end_at = datetime.datetime.now() qs = Employee.objects.filter(shift_set__start_at__gte=start_at, shift_set__end_at__lte=end_at ).annotate(overlapping_events=models.Count('shift_set')) print qs.order_by('overlapping_events', 'last_name', 'first_name').query
Changing the Count('shift_set')
to Count('id')
has no effect on the group by.
Oh, it is needed on any backend that isn't MySql, and that is for two reasons:
1) The SQL standard is stupid; once you have aggregates, it doesn't allow any column to be selected unless it is in the group-by clause.
2) MySQL is stupid, and ignores the above restriction in the SQL standard.
The stupidity of MySQL is because it doesn't actually check that the group-by columns determine all the other selected columns, and so it allows queries with non-deterministic results.