Opened 12 years ago
Closed 9 years ago
#19259 closed Cleanup/optimization (fixed)
Annotations generating inefficient SQL on PostgreSQL
Reported by: | Henrique C. Alves | Owned by: | Simon Charette |
---|---|---|---|
Component: | Database layer (models, ORM) | Version: | dev |
Severity: | Normal | Keywords: | |
Cc: | hcarvalhoalves@…, dev@…, Simon Charette | Triage Stage: | Ready for checkin |
Has patch: | yes | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
Considering the following models:
class Movie(EditableBase, TimestampBase): year = models.PositiveSmallIntegerField(u"ano", null=True) country = models.CharField(u"país", max_length=100, blank=True, null=True) release_date = models.DateField(u"data de estréia", blank=True, null=True) length = models.PositiveSmallIntegerField(u"duração", blank=True, null=True) rating = models.PositiveIntegerField(u"avaliação", blank=True, null=True, choices=RATING_CHOICES) suitable_for = models.PositiveSmallIntegerField(u"censura", blank=True, null=True, choices=AGE_CHOICES) website_original = models.URLField(u"website", max_length=255, blank=True, null=True) website_national = models.URLField(u"website (nacional)", max_length=255, blank=True, null=True) synopsis = models.TextField(u"sinopse", blank=True) cover = models.ImageField(u"poster", upload_to=u'uploads/fichatecnica/', blank=True, null=True) genres = models.ManyToManyField(MovieGenre, verbose_name=u"gêneros") class MovieTheaterSession(models.Model): start_date = models.DateField(u"início") end_date = models.DateField(u"fim") movie = models.ForeignKey(Movie, verbose_name=u"filme", related_name='sessions')
If we annotate the query to count the number of sessions:
>>> Movie.objects.all().annotate(sessions_count=Count('sessions')).order_by('release_date')[:60]
Django generates pretty inneficient SQL by GROUPing BY on all fields from the parent:
SELECT "movies_movie"."id", "movies_movie"."creation_date", "movies_movie"."modification_date", "movies_movie"."year", "movies_movie"."country", "movies_movie"."release_date", "movies_movie"."length", "movies_movie"."rating", "movies_movie"."suitable_for", "movies_movie"."website_original", "movies_movie"."website_national", "movies_movie"."synopsis", "movies_movie"."cover", COUNT("theaters_movietheatersession"."id") AS "sessions_count" FROM "movies_movie" LEFT OUTER JOIN "theaters_movietheatersession" ON ("movies_movie"."id" = "theaters_movietheatersession"."movie_id") GROUP BY "movies_movie"."id", "movies_movie"."creation_date", "movies_movie"."modification_date", "movies_movie"."year", "movies_movie"."country", "movies_movie"."release_date", "movies_movie"."length", "movies_movie"."rating", "movies_movie"."suitable_for", "movies_movie"."website_original", "movies_movie"."website_national", "movies_movie"."synopsis", "movies_movie"."cover" ORDER BY "movies_movie"."release_date" ASC LIMIT 60 Time: 892,122 ms
EXPLAIN shows the database is sorting all fields from the GROUP BY clause:
Limit (cost=2250.37..2250.52 rows=60 width=502) -> Sort (cost=2250.37..2256.15 rows=2311 width=502) Sort Key: (count(theaters_movietheatersession.id)), movies_movie.release_date -> GroupAggregate (cost=1975.47..2170.56 rows=2311 width=502) -> Sort (cost=1975.47..1986.94 rows=4586 width=502) Sort Key: movies_movie.id, movies_movie.creation_date, movies_movie.modification_date, movies_movie.year, movies_movie.country, movies_movie.release_date, movies_movie.length, movies_movie.rating, movies_movie.suitable_for, movies_movie.website_original, movies_movie.website_national, movies_movie.synopsis, movies_movie.cover -> Merge Left Join (cost=0.00..660.58 rows=4586 width=502) Merge Cond: (movies_movie.id = theaters_movietheatersession.movie_id) -> Index Scan using movies_movie_pkey on movies_movie (cost=0.00..283.14 rows=2311 width=498) -> Index Scan using theaters_movietheatersession_movie_id on theaters_movietheatersession (cost=0.00..314.34 rows=4586 width=8)
It suffices to GROUP BY by the PK field:
SELECT "movies_movie"."id", "movies_movie"."creation_date", "movies_movie"."modification_date", "movies_movie"."year", "movies_movie"."country", "movies_movie"."release_date", "movies_movie"."length", "movies_movie"."rating", "movies_movie"."suitable_for", "movies_movie"."website_original", "movies_movie"."website_national", "movies_movie"."synopsis", "movies_movie"."cover", COUNT("theaters_movietheatersession"."id") AS "sessions_count" FROM "movies_movie" LEFT OUTER JOIN "theaters_movietheatersession" ON ("movies_movie"."id" = "theaters_movietheatersession"."movie_id") GROUP BY "movies_movie"."id" ORDER BY "movies_movie"."release_date" ASC LIMIT 60 Time: 16,285 ms
EXPLAIN shows the database doesn't need to sort anymore:
Limit (cost=786.42..786.57 rows=60 width=502) -> Sort (cost=786.42..792.20 rows=2311 width=502) Sort Key: movies_movie.release_date -> GroupAggregate (cost=0.00..706.62 rows=2311 width=502) -> Merge Left Join (cost=0.00..660.58 rows=4586 width=502) Merge Cond: (movies_movie.id = theaters_movietheatersession.movie_id) -> Index Scan using movies_movie_pkey on movies_movie (cost=0.00..283.14 rows=2311 width=498) -> Index Scan using theaters_movietheatersession_movie_id on theaters_movietheatersession (cost=0.00..314.34 rows=4586 width=8)
I've tried fixing it by setting the undocumented group_by
attribute, but Django doesn't seem to pick it up:
class MovieManager(models.Manager): def with_sessions(self): qs = self.get_query_set().annotate( sessions_count=Count('sessions')) qs.group_by = ['id'] return qs
I'm using Django 1.4 and PostgreSQL 9.1.
Attachments (1)
Change History (25)
comment:1 by , 12 years ago
comment:2 by , 12 years ago
Would be the case of adding allows_group_by_pk = True
to DatabaseFeatures
at https://github.com/django/django/blob/master/django/db/backends/postgresql_psycopg2/base.py#L111 ?
comment:3 by , 12 years ago
That's the right flag, not sure that's the right place off the top of my head.
by , 12 years ago
Attachment: | 19259.diff added |
---|
group_by pk is supported starting 9.1 http://www.postgresql.org/docs/9.1/static/sql-select.html#SQL-GROUPBY
comment:4 by , 12 years ago
Has patch: | set |
---|
comment:5 by , 12 years ago
Thanks. I've tried your patch but it doesn't solve the issue, Django still adds all fields to GROUP BY
clause (the SQL is identical to the original one).
follow-up: 8 comment:6 by , 12 years ago
In PostgreSQL you must include all the primary keys in the query, so if you have joins that means you will need to add those PKs too. This isn't true for MySQL, just one pk is enough.
This has a pre-existing ticket, #18016. I am not sure which ticket to close as duplicate at this point.
Also, the MySQL code isn't currently working, see #17144. I should really commit the patch in that ticket...
comment:7 by , 12 years ago
I've been discussing this on irc with hcarvalhoalves and I hit my head on #17144 as well.
In addition the patch attached to this ticket does not work because the sql can be constructed before the connection to the database is established.
This can be worked around by forcing a connection when a new DatabaseWrapper instance is created but that's a bit dirty and I don't think a patch like that would be accepted.
comment:8 by , 12 years ago
Replying to akaariai:
In PostgreSQL you must include all the primary keys in the query, so if you have joins that means you will need to add those PKs too. This isn't true for MySQL, just one pk is enough.
This has a pre-existing ticket, #18016. I am not sure which ticket to close as duplicate at this point.
Also, the MySQL code isn't currently working, see #17144. I should really commit the patch in that ticket...
You might want to close #18016 as dup to keep the discussion here since there's an example use case and the output of the query planner so we can compare.
comment:9 by , 12 years ago
comment:10 by , 12 years ago
Cc: | added |
---|
follow-up: 12 comment:11 by , 12 years ago
For the version dependant check you could probably use a cached_property (django.utils.functional IIRC). The idea is that when accessed the getter will make sure version information is available. Once the version information has been checked, the value can be cached and there is no longer need to do version information checks. This could mean opening a connection in the getter.
Another approach is to generate inefficient SQL if the version information isn't yet available.
I am leaning towards just generating inefficient SQL if the query happens to be ran first thing in a new ConnectionWrapper (not just as first thing in a connection - first thing for the object). The reason is that otherwise we have to close the connection used for version checking and that is somewhat expensive. I am not feeling strongly at all about this...
The harder problem is that you can't group by the main PK alone. You will need the primary keys of joined tables too, and that information isn't directly available in group by clause generation. (Actually, it might be in select and related_select_cols as "second" part of the tuple, not sure of this). In any case the MySQL code will not work directly.
comment:12 by , 12 years ago
Replying to akaariai:
For the version dependant check you could probably use a cached_property (django.utils.functional IIRC). The idea is that when accessed the getter will make sure version information is available. Once the version information has been checked, the value can be cached and there is no longer need to do version information checks. This could mean opening a connection in the getter.
Another approach is to generate inefficient SQL if the version information isn't yet available.
I am leaning towards just generating inefficient SQL if the query happens to be ran first thing in a new ConnectionWrapper (not just as first thing in a connection - first thing for the object). The reason is that otherwise we have to close the connection used for version checking and that is somewhat expensive. I am not feeling strongly at all about this...
Since it's a chicken and egg problem, is it really necessary to "autodetect" the feature then? I couldn't find any other example where Django enables/disables features based on the database version. What about just introducing a django.db.backends.postgresql9_psycopg2
backend? It's a trivial patch.
The harder problem is that you can't group by the main PK alone. You will need the primary keys of joined tables too, and that information isn't directly available in group by clause generation. (Actually, it might be in select and related_select_cols as "second" part of the tuple, not sure of this). In any case the MySQL code will not work directly.
That's a bigger problem, I'm afraid I'm too crud on the ORM code to figure it out.
comment:13 by , 12 years ago
Triage Stage: | Unreviewed → Design decision needed |
---|---|
Type: | Uncategorized → Cleanup/optimization |
Adding a new backend might be trivial in code, but not in maintenance, docs, user support, etc.
Marking as DDN because the problem is valid but there isn't a consensus on how to fix it.
comment:14 by , 11 years ago
Triage Stage: | Design decision needed → Accepted |
---|
Moving back to accepted: we should fix it, at least once we figure out how :)
follow-up: 17 comment:15 by , 11 years ago
I think the GROUP BY clause could be constructed using the select and related_select_cols' field information. If a select column's field is available, then include the field only if field.primary_key is True. If the field isn't available, include it always in the GROUP BY.
For the version dependency there are other example of disabling/enabling functionality based on version, at least regex searches on Oracle are such. In any case, a separate backend for this is a no-go.
comment:16 by , 11 years ago
Cc: | added |
---|
comment:17 by , 11 years ago
Replying to akaariai:
I think the GROUP BY clause could be constructed using the select and related_select_cols' field information. If a select column's field is available, then include the field only if field.primary_key is True. If the field isn't available, include it always in the GROUP BY.
I think #20971 is related. In particular, when building the GROUP BY clause, also exclude the field if it is deferred.
comment:18 by , 11 years ago
Needs tests: | set |
---|---|
Patch needs improvement: | set |
comment:19 by , 10 years ago
Cc: | added |
---|
comment:21 by , 9 years ago
Owner: | changed from | to
---|---|
Status: | new → assigned |
comment:22 by , 9 years ago
Needs tests: | unset |
---|---|
Patch needs improvement: | unset |
Patch passes the full test suite including the adjusted existing tests for allows_group_by_pk
.
comment:23 by , 9 years ago
Triage Stage: | Accepted → Ready for checkin |
---|
We already have logic to do this in MySQL, it's just a matter of making the postgresql backend set the right flag (note that this is new in a recent version of postgresql, 9.1?) so we need to make sure it's only set there.