Opened 3 years ago

Closed 8 weeks ago

#19259 closed Cleanup/optimization (fixed)

Annotations generating inefficient SQL on PostgreSQL

Reported by: hcarvalhoalves Owned by: charettes
Component: Database layer (models, ORM) Version: master
Severity: Normal Keywords:
Cc: hcarvalhoalves@…, dev@…, charettes Triage Stage: Ready for checkin
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

Considering the following models:

    class Movie(EditableBase, TimestampBase):
        year = models.PositiveSmallIntegerField(u"ano", null=True)
        country = models.CharField(u"país", max_length=100, blank=True, null=True)
        release_date = models.DateField(u"data de estréia", blank=True, null=True)
        length = models.PositiveSmallIntegerField(u"duração", blank=True, null=True)
        rating = models.PositiveIntegerField(u"avaliação", blank=True, null=True, choices=RATING_CHOICES)
        suitable_for = models.PositiveSmallIntegerField(u"censura", blank=True, null=True, choices=AGE_CHOICES)
        website_original = models.URLField(u"website", max_length=255, blank=True, null=True)
        website_national = models.URLField(u"website (nacional)", max_length=255, blank=True, null=True)
        synopsis = models.TextField(u"sinopse", blank=True)
        cover = models.ImageField(u"poster", upload_to=u'uploads/fichatecnica/', blank=True, null=True)
        genres = models.ManyToManyField(MovieGenre, verbose_name=u"gêneros")

    class MovieTheaterSession(models.Model):
        start_date = models.DateField(u"início")
        end_date = models.DateField(u"fim")
        movie = models.ForeignKey(Movie, verbose_name=u"filme", related_name='sessions')

If we annotate the query to count the number of sessions:

    >>> Movie.objects.all().annotate(sessions_count=Count('sessions')).order_by('release_date')[:60]

Django generates pretty inneficient SQL by GROUPing BY on all fields from the parent:

    SELECT "movies_movie"."id", "movies_movie"."creation_date", "movies_movie"."modification_date", "movies_movie"."year", "movies_movie"."country", "movies_movie"."release_date", "movies_movie"."length", "movies_movie"."rating", "movies_movie"."suitable_for", "movies_movie"."website_original", "movies_movie"."website_national", "movies_movie"."synopsis", "movies_movie"."cover", COUNT("theaters_movietheatersession"."id") AS "sessions_count" FROM "movies_movie" LEFT OUTER JOIN "theaters_movietheatersession" ON ("movies_movie"."id" = "theaters_movietheatersession"."movie_id") GROUP BY "movies_movie"."id", "movies_movie"."creation_date", "movies_movie"."modification_date", "movies_movie"."year", "movies_movie"."country", "movies_movie"."release_date", "movies_movie"."length", "movies_movie"."rating", "movies_movie"."suitable_for", "movies_movie"."website_original", "movies_movie"."website_national", "movies_movie"."synopsis", "movies_movie"."cover" ORDER BY "movies_movie"."release_date" ASC LIMIT 60
    Time: 892,122 ms

EXPLAIN shows the database is sorting all fields from the GROUP BY clause:

    Limit  (cost=2250.37..2250.52 rows=60 width=502)
      ->  Sort  (cost=2250.37..2256.15 rows=2311 width=502)
            Sort Key: (count(theaters_movietheatersession.id)), movies_movie.release_date
            ->  GroupAggregate  (cost=1975.47..2170.56 rows=2311 width=502)
                  ->  Sort  (cost=1975.47..1986.94 rows=4586 width=502)
                        Sort Key: movies_movie.id, movies_movie.creation_date, movies_movie.modification_date, movies_movie.year, movies_movie.country, movies_movie.release_date, movies_movie.length, movies_movie.rating, movies_movie.suitable_for, movies_movie.website_original, movies_movie.website_national, movies_movie.synopsis, movies_movie.cover
                        ->  Merge Left Join  (cost=0.00..660.58 rows=4586 width=502)
                              Merge Cond: (movies_movie.id = theaters_movietheatersession.movie_id)
                              ->  Index Scan using movies_movie_pkey on movies_movie  (cost=0.00..283.14 rows=2311 width=498)
                              ->  Index Scan using theaters_movietheatersession_movie_id on theaters_movietheatersession  (cost=0.00..314.34 rows=4586 width=8)

It suffices to GROUP BY by the PK field:

    SELECT "movies_movie"."id", "movies_movie"."creation_date", "movies_movie"."modification_date", "movies_movie"."year", "movies_movie"."country", "movies_movie"."release_date", "movies_movie"."length", "movies_movie"."rating", "movies_movie"."suitable_for", "movies_movie"."website_original", "movies_movie"."website_national", "movies_movie"."synopsis", "movies_movie"."cover", COUNT("theaters_movietheatersession"."id") AS "sessions_count" FROM "movies_movie" LEFT OUTER JOIN "theaters_movietheatersession" ON ("movies_movie"."id" = "theaters_movietheatersession"."movie_id") GROUP BY "movies_movie"."id" ORDER BY "movies_movie"."release_date" ASC LIMIT 60
    Time: 16,285 ms

EXPLAIN shows the database doesn't need to sort anymore:

    Limit  (cost=786.42..786.57 rows=60 width=502)
       ->  Sort  (cost=786.42..792.20 rows=2311 width=502)
             Sort Key: movies_movie.release_date
             ->  GroupAggregate  (cost=0.00..706.62 rows=2311 width=502)
                   ->  Merge Left Join  (cost=0.00..660.58 rows=4586 width=502)
                         Merge Cond: (movies_movie.id = theaters_movietheatersession.movie_id)
                         ->  Index Scan using movies_movie_pkey on movies_movie  (cost=0.00..283.14 rows=2311 width=498)
                         ->  Index Scan using theaters_movietheatersession_movie_id on theaters_movietheatersession  (cost=0.00..314.34 rows=4586 width=8)

I've tried fixing it by setting the undocumented group_by attribute, but Django doesn't seem to pick it up:

class MovieManager(models.Manager):
        def with_sessions(self):
            qs = self.get_query_set().annotate(
                sessions_count=Count('sessions'))
            qs.group_by = ['id']
            return qs

I'm using Django 1.4 and PostgreSQL 9.1.

Attachments (1)

19259.diff (748 bytes) - added by hgeerts@… 3 years ago.
group_by pk is supported starting 9.1 http://www.postgresql.org/docs/9.1/static/sql-select.html#SQL-GROUPBY

Download all attachments as: .zip

Change History (25)

comment:1 Changed 3 years ago by Alex

  • Needs documentation unset
  • Needs tests unset
  • Patch needs improvement unset

We already have logic to do this in MySQL, it's just a matter of making the postgresql backend set the right flag (note that this is new in a recent version of postgresql, 9.1?) so we need to make sure it's only set there.

comment:2 Changed 3 years ago by hcarvalhoalves

Would be the case of adding allows_group_by_pk = True to DatabaseFeatures at https://github.com/django/django/blob/master/django/db/backends/postgresql_psycopg2/base.py#L111 ?

comment:3 Changed 3 years ago by Alex

That's the right flag, not sure that's the right place off the top of my head.

Changed 3 years ago by hgeerts@…

comment:4 Changed 3 years ago by hgeerts@…

  • Has patch set

comment:5 Changed 3 years ago by hcarvalhoalves

Thanks. I've tried your patch but it doesn't solve the issue, Django still adds all fields to GROUP BY clause (the SQL is identical to the original one).

comment:6 follow-up: Changed 3 years ago by akaariai

In PostgreSQL you must include all the primary keys in the query, so if you have joins that means you will need to add those PKs too. This isn't true for MySQL, just one pk is enough.

This has a pre-existing ticket, #18016. I am not sure which ticket to close as duplicate at this point.

Also, the MySQL code isn't currently working, see #17144. I should really commit the patch in that ticket...

comment:7 Changed 3 years ago by Harm Geerts <hgeerts@…>

I've been discussing this on irc with hcarvalhoalves and I hit my head on #17144 as well.
In addition the patch attached to this ticket does not work because the sql can be constructed before the connection to the database is established.
This can be worked around by forcing a connection when a new DatabaseWrapper instance is created but that's a bit dirty and I don't think a patch like that would be accepted.

comment:8 in reply to: ↑ 6 Changed 3 years ago by hcarvalhoalves

Replying to akaariai:

In PostgreSQL you must include all the primary keys in the query, so if you have joins that means you will need to add those PKs too. This isn't true for MySQL, just one pk is enough.

This has a pre-existing ticket, #18016. I am not sure which ticket to close as duplicate at this point.

Also, the MySQL code isn't currently working, see #17144. I should really commit the patch in that ticket...

You might want to close #18016 as dup to keep the discussion here since there's an example use case and the output of the query planner so we can compare.

comment:9 Changed 3 years ago by akaariai

#18016 was closed as duplicate.

I will commit the #17144 patch soon unless something unexpected is found.

comment:10 Changed 3 years ago by hcarvalhoalves

  • Cc hcarvalhoalves@… added

comment:11 follow-up: Changed 3 years ago by akaariai

For the version dependant check you could probably use a cached_property (django.utils.functional IIRC). The idea is that when accessed the getter will make sure version information is available. Once the version information has been checked, the value can be cached and there is no longer need to do version information checks. This could mean opening a connection in the getter.

Another approach is to generate inefficient SQL if the version information isn't yet available.

I am leaning towards just generating inefficient SQL if the query happens to be ran first thing in a new ConnectionWrapper (not just as first thing in a connection - first thing for the object). The reason is that otherwise we have to close the connection used for version checking and that is somewhat expensive. I am not feeling strongly at all about this...

The harder problem is that you can't group by the main PK alone. You will need the primary keys of joined tables too, and that information isn't directly available in group by clause generation. (Actually, it might be in select and related_select_cols as "second" part of the tuple, not sure of this). In any case the MySQL code will not work directly.

comment:12 in reply to: ↑ 11 Changed 3 years ago by hcarvalhoalves

Replying to akaariai:

For the version dependant check you could probably use a cached_property (django.utils.functional IIRC). The idea is that when accessed the getter will make sure version information is available. Once the version information has been checked, the value can be cached and there is no longer need to do version information checks. This could mean opening a connection in the getter.

Another approach is to generate inefficient SQL if the version information isn't yet available.

I am leaning towards just generating inefficient SQL if the query happens to be ran first thing in a new ConnectionWrapper (not just as first thing in a connection - first thing for the object). The reason is that otherwise we have to close the connection used for version checking and that is somewhat expensive. I am not feeling strongly at all about this...

Since it's a chicken and egg problem, is it really necessary to "autodetect" the feature then? I couldn't find any other example where Django enables/disables features based on the database version. What about just introducing a django.db.backends.postgresql9_psycopg2 backend? It's a trivial patch.

The harder problem is that you can't group by the main PK alone. You will need the primary keys of joined tables too, and that information isn't directly available in group by clause generation. (Actually, it might be in select and related_select_cols as "second" part of the tuple, not sure of this). In any case the MySQL code will not work directly.

That's a bigger problem, I'm afraid I'm too crud on the ORM code to figure it out.

Last edited 3 years ago by hcarvalhoalves (previous) (diff)

comment:13 Changed 3 years ago by aaugustin

  • Triage Stage changed from Unreviewed to Design decision needed
  • Type changed from Uncategorized to Cleanup/optimization

Adding a new backend might be trivial in code, but not in maintenance, docs, user support, etc.

Marking as DDN because the problem is valid but there isn't a consensus on how to fix it.

comment:14 Changed 2 years ago by jacob

  • Triage Stage changed from Design decision needed to Accepted

Moving back to accepted: we should fix it, at least once we figure out how :)

comment:15 follow-up: Changed 2 years ago by akaariai

I think the GROUP BY clause could be constructed using the select and related_select_cols' field information. If a select column's field is available, then include the field only if field.primary_key is True. If the field isn't available, include it always in the GROUP BY.

For the version dependency there are other example of disabling/enabling functionality based on version, at least regex searches on Oracle are such. In any case, a separate backend for this is a no-go.

comment:16 Changed 21 months ago by ryankask

  • Cc dev@… added

comment:17 in reply to: ↑ 15 Changed 21 months ago by shai

Replying to akaariai:

I think the GROUP BY clause could be constructed using the select and related_select_cols' field information. If a select column's field is available, then include the field only if field.primary_key is True. If the field isn't available, include it always in the GROUP BY.

I think #20971 is related. In particular, when building the GROUP BY clause, also exclude the field if it is deferred.

comment:18 Changed 20 months ago by timo

  • Needs tests set
  • Patch needs improvement set

comment:19 Changed 9 months ago by charettes

  • Cc charettes added

comment:20 Changed 2 months ago by charettes

  • Version changed from 1.4 to master

I started to work on a PR.

comment:21 Changed 2 months ago by charettes

  • Owner changed from nobody to charettes
  • Status changed from new to assigned

comment:22 Changed 2 months ago by charettes

  • Needs tests unset
  • Patch needs improvement unset

Patch passes the full test suite including the adjusted existing tests for allows_group_by_pk.

comment:23 Changed 2 months ago by timgraham

  • Triage Stage changed from Accepted to Ready for checkin

comment:24 Changed 8 weeks ago by Simon Charette <charette.s@…>

  • Resolution set to fixed
  • Status changed from assigned to closed

In dc27f3e:

Fixed #19259 -- Added group by selected primary keys support.

Note: See TracTickets for help on using tickets.
Back to Top