Opened 11 years ago

Closed 9 years ago

#19259 closed Cleanup/optimization (fixed)

Annotations generating inefficient SQL on PostgreSQL

Reported by: Henrique C. Alves Owned by: Simon Charette
Component: Database layer (models, ORM) Version: dev
Severity: Normal Keywords:
Cc: hcarvalhoalves@…, dev@…, Simon Charette Triage Stage: Ready for checkin
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

Considering the following models:

    class Movie(EditableBase, TimestampBase):
        year = models.PositiveSmallIntegerField(u"ano", null=True)
        country = models.CharField(u"país", max_length=100, blank=True, null=True)
        release_date = models.DateField(u"data de estréia", blank=True, null=True)
        length = models.PositiveSmallIntegerField(u"duração", blank=True, null=True)
        rating = models.PositiveIntegerField(u"avaliação", blank=True, null=True, choices=RATING_CHOICES)
        suitable_for = models.PositiveSmallIntegerField(u"censura", blank=True, null=True, choices=AGE_CHOICES)
        website_original = models.URLField(u"website", max_length=255, blank=True, null=True)
        website_national = models.URLField(u"website (nacional)", max_length=255, blank=True, null=True)
        synopsis = models.TextField(u"sinopse", blank=True)
        cover = models.ImageField(u"poster", upload_to=u'uploads/fichatecnica/', blank=True, null=True)
        genres = models.ManyToManyField(MovieGenre, verbose_name=u"gêneros")

    class MovieTheaterSession(models.Model):
        start_date = models.DateField(u"início")
        end_date = models.DateField(u"fim")
        movie = models.ForeignKey(Movie, verbose_name=u"filme", related_name='sessions')

If we annotate the query to count the number of sessions:

    >>> Movie.objects.all().annotate(sessions_count=Count('sessions')).order_by('release_date')[:60]

Django generates pretty inneficient SQL by GROUPing BY on all fields from the parent:

    SELECT "movies_movie"."id", "movies_movie"."creation_date", "movies_movie"."modification_date", "movies_movie"."year", "movies_movie"."country", "movies_movie"."release_date", "movies_movie"."length", "movies_movie"."rating", "movies_movie"."suitable_for", "movies_movie"."website_original", "movies_movie"."website_national", "movies_movie"."synopsis", "movies_movie"."cover", COUNT("theaters_movietheatersession"."id") AS "sessions_count" FROM "movies_movie" LEFT OUTER JOIN "theaters_movietheatersession" ON ("movies_movie"."id" = "theaters_movietheatersession"."movie_id") GROUP BY "movies_movie"."id", "movies_movie"."creation_date", "movies_movie"."modification_date", "movies_movie"."year", "movies_movie"."country", "movies_movie"."release_date", "movies_movie"."length", "movies_movie"."rating", "movies_movie"."suitable_for", "movies_movie"."website_original", "movies_movie"."website_national", "movies_movie"."synopsis", "movies_movie"."cover" ORDER BY "movies_movie"."release_date" ASC LIMIT 60
    Time: 892,122 ms

EXPLAIN shows the database is sorting all fields from the GROUP BY clause:

    Limit  (cost=2250.37..2250.52 rows=60 width=502)
      ->  Sort  (cost=2250.37..2256.15 rows=2311 width=502)
            Sort Key: (count(theaters_movietheatersession.id)), movies_movie.release_date
            ->  GroupAggregate  (cost=1975.47..2170.56 rows=2311 width=502)
                  ->  Sort  (cost=1975.47..1986.94 rows=4586 width=502)
                        Sort Key: movies_movie.id, movies_movie.creation_date, movies_movie.modification_date, movies_movie.year, movies_movie.country, movies_movie.release_date, movies_movie.length, movies_movie.rating, movies_movie.suitable_for, movies_movie.website_original, movies_movie.website_national, movies_movie.synopsis, movies_movie.cover
                        ->  Merge Left Join  (cost=0.00..660.58 rows=4586 width=502)
                              Merge Cond: (movies_movie.id = theaters_movietheatersession.movie_id)
                              ->  Index Scan using movies_movie_pkey on movies_movie  (cost=0.00..283.14 rows=2311 width=498)
                              ->  Index Scan using theaters_movietheatersession_movie_id on theaters_movietheatersession  (cost=0.00..314.34 rows=4586 width=8)

It suffices to GROUP BY by the PK field:

    SELECT "movies_movie"."id", "movies_movie"."creation_date", "movies_movie"."modification_date", "movies_movie"."year", "movies_movie"."country", "movies_movie"."release_date", "movies_movie"."length", "movies_movie"."rating", "movies_movie"."suitable_for", "movies_movie"."website_original", "movies_movie"."website_national", "movies_movie"."synopsis", "movies_movie"."cover", COUNT("theaters_movietheatersession"."id") AS "sessions_count" FROM "movies_movie" LEFT OUTER JOIN "theaters_movietheatersession" ON ("movies_movie"."id" = "theaters_movietheatersession"."movie_id") GROUP BY "movies_movie"."id" ORDER BY "movies_movie"."release_date" ASC LIMIT 60
    Time: 16,285 ms

EXPLAIN shows the database doesn't need to sort anymore:

    Limit  (cost=786.42..786.57 rows=60 width=502)
       ->  Sort  (cost=786.42..792.20 rows=2311 width=502)
             Sort Key: movies_movie.release_date
             ->  GroupAggregate  (cost=0.00..706.62 rows=2311 width=502)
                   ->  Merge Left Join  (cost=0.00..660.58 rows=4586 width=502)
                         Merge Cond: (movies_movie.id = theaters_movietheatersession.movie_id)
                         ->  Index Scan using movies_movie_pkey on movies_movie  (cost=0.00..283.14 rows=2311 width=498)
                         ->  Index Scan using theaters_movietheatersession_movie_id on theaters_movietheatersession  (cost=0.00..314.34 rows=4586 width=8)

I've tried fixing it by setting the undocumented group_by attribute, but Django doesn't seem to pick it up:

class MovieManager(models.Manager):
        def with_sessions(self):
            qs = self.get_query_set().annotate(
                sessions_count=Count('sessions'))
            qs.group_by = ['id']
            return qs

I'm using Django 1.4 and PostgreSQL 9.1.

Attachments (1)

19259.diff (748 bytes ) - added by hgeerts@… 11 years ago.
group_by pk is supported starting 9.1 http://www.postgresql.org/docs/9.1/static/sql-select.html#SQL-GROUPBY

Download all attachments as: .zip

Change History (25)

comment:1 by Alex Gaynor, 11 years ago

We already have logic to do this in MySQL, it's just a matter of making the postgresql backend set the right flag (note that this is new in a recent version of postgresql, 9.1?) so we need to make sure it's only set there.

comment:2 by Henrique C. Alves, 11 years ago

Would be the case of adding allows_group_by_pk = True to DatabaseFeatures at https://github.com/django/django/blob/master/django/db/backends/postgresql_psycopg2/base.py#L111 ?

comment:3 by Alex Gaynor, 11 years ago

That's the right flag, not sure that's the right place off the top of my head.

by hgeerts@…, 11 years ago

Attachment: 19259.diff added

comment:4 by hgeerts@…, 11 years ago

Has patch: set

comment:5 by Henrique C. Alves, 11 years ago

Thanks. I've tried your patch but it doesn't solve the issue, Django still adds all fields to GROUP BY clause (the SQL is identical to the original one).

comment:6 by Anssi Kääriäinen, 11 years ago

In PostgreSQL you must include all the primary keys in the query, so if you have joins that means you will need to add those PKs too. This isn't true for MySQL, just one pk is enough.

This has a pre-existing ticket, #18016. I am not sure which ticket to close as duplicate at this point.

Also, the MySQL code isn't currently working, see #17144. I should really commit the patch in that ticket...

comment:7 by Harm Geerts <hgeerts@…>, 11 years ago

I've been discussing this on irc with hcarvalhoalves and I hit my head on #17144 as well.
In addition the patch attached to this ticket does not work because the sql can be constructed before the connection to the database is established.
This can be worked around by forcing a connection when a new DatabaseWrapper instance is created but that's a bit dirty and I don't think a patch like that would be accepted.

in reply to:  6 comment:8 by Henrique C. Alves, 11 years ago

Replying to akaariai:

In PostgreSQL you must include all the primary keys in the query, so if you have joins that means you will need to add those PKs too. This isn't true for MySQL, just one pk is enough.

This has a pre-existing ticket, #18016. I am not sure which ticket to close as duplicate at this point.

Also, the MySQL code isn't currently working, see #17144. I should really commit the patch in that ticket...

You might want to close #18016 as dup to keep the discussion here since there's an example use case and the output of the query planner so we can compare.

comment:9 by Anssi Kääriäinen, 11 years ago

#18016 was closed as duplicate.

I will commit the #17144 patch soon unless something unexpected is found.

comment:10 by Henrique C. Alves, 11 years ago

Cc: hcarvalhoalves@… added

comment:11 by Anssi Kääriäinen, 11 years ago

For the version dependant check you could probably use a cached_property (django.utils.functional IIRC). The idea is that when accessed the getter will make sure version information is available. Once the version information has been checked, the value can be cached and there is no longer need to do version information checks. This could mean opening a connection in the getter.

Another approach is to generate inefficient SQL if the version information isn't yet available.

I am leaning towards just generating inefficient SQL if the query happens to be ran first thing in a new ConnectionWrapper (not just as first thing in a connection - first thing for the object). The reason is that otherwise we have to close the connection used for version checking and that is somewhat expensive. I am not feeling strongly at all about this...

The harder problem is that you can't group by the main PK alone. You will need the primary keys of joined tables too, and that information isn't directly available in group by clause generation. (Actually, it might be in select and related_select_cols as "second" part of the tuple, not sure of this). In any case the MySQL code will not work directly.

in reply to:  11 comment:12 by Henrique C. Alves, 11 years ago

Replying to akaariai:

For the version dependant check you could probably use a cached_property (django.utils.functional IIRC). The idea is that when accessed the getter will make sure version information is available. Once the version information has been checked, the value can be cached and there is no longer need to do version information checks. This could mean opening a connection in the getter.

Another approach is to generate inefficient SQL if the version information isn't yet available.

I am leaning towards just generating inefficient SQL if the query happens to be ran first thing in a new ConnectionWrapper (not just as first thing in a connection - first thing for the object). The reason is that otherwise we have to close the connection used for version checking and that is somewhat expensive. I am not feeling strongly at all about this...

Since it's a chicken and egg problem, is it really necessary to "autodetect" the feature then? I couldn't find any other example where Django enables/disables features based on the database version. What about just introducing a django.db.backends.postgresql9_psycopg2 backend? It's a trivial patch.

The harder problem is that you can't group by the main PK alone. You will need the primary keys of joined tables too, and that information isn't directly available in group by clause generation. (Actually, it might be in select and related_select_cols as "second" part of the tuple, not sure of this). In any case the MySQL code will not work directly.

That's a bigger problem, I'm afraid I'm too crud on the ORM code to figure it out.

Last edited 11 years ago by Henrique C. Alves (previous) (diff)

comment:13 by Aymeric Augustin, 11 years ago

Triage Stage: UnreviewedDesign decision needed
Type: UncategorizedCleanup/optimization

Adding a new backend might be trivial in code, but not in maintenance, docs, user support, etc.

Marking as DDN because the problem is valid but there isn't a consensus on how to fix it.

comment:14 by Jacob, 11 years ago

Triage Stage: Design decision neededAccepted

Moving back to accepted: we should fix it, at least once we figure out how :)

comment:15 by Anssi Kääriäinen, 11 years ago

I think the GROUP BY clause could be constructed using the select and related_select_cols' field information. If a select column's field is available, then include the field only if field.primary_key is True. If the field isn't available, include it always in the GROUP BY.

For the version dependency there are other example of disabling/enabling functionality based on version, at least regex searches on Oracle are such. In any case, a separate backend for this is a no-go.

comment:16 by Ryan Kaskel, 11 years ago

Cc: dev@… added

in reply to:  15 comment:17 by Shai Berger, 11 years ago

Replying to akaariai:

I think the GROUP BY clause could be constructed using the select and related_select_cols' field information. If a select column's field is available, then include the field only if field.primary_key is True. If the field isn't available, include it always in the GROUP BY.

I think #20971 is related. In particular, when building the GROUP BY clause, also exclude the field if it is deferred.

comment:18 by Tim Graham, 11 years ago

Needs tests: set
Patch needs improvement: set

comment:19 by Simon Charette, 10 years ago

Cc: Simon Charette added

comment:20 by Simon Charette, 9 years ago

Version: 1.4master

I started to work on a PR.

comment:21 by Simon Charette, 9 years ago

Owner: changed from nobody to Simon Charette
Status: newassigned

comment:22 by Simon Charette, 9 years ago

Needs tests: unset
Patch needs improvement: unset

Patch passes the full test suite including the adjusted existing tests for allows_group_by_pk.

comment:23 by Tim Graham, 9 years ago

Triage Stage: AcceptedReady for checkin

comment:24 by Simon Charette <charette.s@…>, 9 years ago

Resolution: fixed
Status: assignedclosed

In dc27f3e:

Fixed #19259 -- Added group by selected primary keys support.

Note: See TracTickets for help on using tickets.
Back to Top