
Opened 20 months ago

Last modified 9 months ago

#19259 new Cleanup/optimization

Annotations generating inefficient SQL on PostgreSQL

Reported by: hcarvalhoalves Owned by: nobody
Component: Database layer (models, ORM) Version: 1.4
Severity: Normal Keywords:
Cc: hcarvalhoalves@…, dev@… Triage Stage: Accepted
Has patch: yes Needs documentation: no
Needs tests: yes Patch needs improvement: yes
Easy pickings: no UI/UX: no

Description

Consider the following models:

    class Movie(EditableBase, TimestampBase):
        year = models.PositiveSmallIntegerField(u"ano", null=True)
        country = models.CharField(u"país", max_length=100, blank=True, null=True)
        release_date = models.DateField(u"data de estréia", blank=True, null=True)
        length = models.PositiveSmallIntegerField(u"duração", blank=True, null=True)
        rating = models.PositiveIntegerField(u"avaliação", blank=True, null=True, choices=RATING_CHOICES)
        suitable_for = models.PositiveSmallIntegerField(u"censura", blank=True, null=True, choices=AGE_CHOICES)
        website_original = models.URLField(u"website", max_length=255, blank=True, null=True)
        website_national = models.URLField(u"website (nacional)", max_length=255, blank=True, null=True)
        synopsis = models.TextField(u"sinopse", blank=True)
        cover = models.ImageField(u"poster", upload_to=u'uploads/fichatecnica/', blank=True, null=True)
        genres = models.ManyToManyField(MovieGenre, verbose_name=u"gêneros")

    class MovieTheaterSession(models.Model):
        start_date = models.DateField(u"início")
        end_date = models.DateField(u"fim")
        movie = models.ForeignKey(Movie, verbose_name=u"filme", related_name='sessions')

If we annotate the query to count the number of sessions:

    >>> Movie.objects.all().annotate(sessions_count=Count('sessions')).order_by('release_date')[:60]

Django generates pretty inefficient SQL by adding every field of the parent model to the GROUP BY clause:

    SELECT "movies_movie"."id", "movies_movie"."creation_date", "movies_movie"."modification_date", "movies_movie"."year", "movies_movie"."country", "movies_movie"."release_date", "movies_movie"."length", "movies_movie"."rating", "movies_movie"."suitable_for", "movies_movie"."website_original", "movies_movie"."website_national", "movies_movie"."synopsis", "movies_movie"."cover", COUNT("theaters_movietheatersession"."id") AS "sessions_count" FROM "movies_movie" LEFT OUTER JOIN "theaters_movietheatersession" ON ("movies_movie"."id" = "theaters_movietheatersession"."movie_id") GROUP BY "movies_movie"."id", "movies_movie"."creation_date", "movies_movie"."modification_date", "movies_movie"."year", "movies_movie"."country", "movies_movie"."release_date", "movies_movie"."length", "movies_movie"."rating", "movies_movie"."suitable_for", "movies_movie"."website_original", "movies_movie"."website_national", "movies_movie"."synopsis", "movies_movie"."cover" ORDER BY "movies_movie"."release_date" ASC LIMIT 60
    Time: 892,122 ms

EXPLAIN shows the database sorting on every field in the GROUP BY clause:

    Limit  (cost=2250.37..2250.52 rows=60 width=502)
      ->  Sort  (cost=2250.37..2256.15 rows=2311 width=502)
            Sort Key: (count(theaters_movietheatersession.id)), movies_movie.release_date
            ->  GroupAggregate  (cost=1975.47..2170.56 rows=2311 width=502)
                  ->  Sort  (cost=1975.47..1986.94 rows=4586 width=502)
                        Sort Key: movies_movie.id, movies_movie.creation_date, movies_movie.modification_date, movies_movie.year, movies_movie.country, movies_movie.release_date, movies_movie.length, movies_movie.rating, movies_movie.suitable_for, movies_movie.website_original, movies_movie.website_national, movies_movie.synopsis, movies_movie.cover
                        ->  Merge Left Join  (cost=0.00..660.58 rows=4586 width=502)
                              Merge Cond: (movies_movie.id = theaters_movietheatersession.movie_id)
                              ->  Index Scan using movies_movie_pkey on movies_movie  (cost=0.00..283.14 rows=2311 width=498)
                              ->  Index Scan using theaters_movietheatersession_movie_id on theaters_movietheatersession  (cost=0.00..314.34 rows=4586 width=8)

It suffices to GROUP BY the PK field:

    SELECT "movies_movie"."id", "movies_movie"."creation_date", "movies_movie"."modification_date", "movies_movie"."year", "movies_movie"."country", "movies_movie"."release_date", "movies_movie"."length", "movies_movie"."rating", "movies_movie"."suitable_for", "movies_movie"."website_original", "movies_movie"."website_national", "movies_movie"."synopsis", "movies_movie"."cover", COUNT("theaters_movietheatersession"."id") AS "sessions_count" FROM "movies_movie" LEFT OUTER JOIN "theaters_movietheatersession" ON ("movies_movie"."id" = "theaters_movietheatersession"."movie_id") GROUP BY "movies_movie"."id" ORDER BY "movies_movie"."release_date" ASC LIMIT 60
    Time: 16,285 ms

EXPLAIN shows the database no longer needs that sort:

    Limit  (cost=786.42..786.57 rows=60 width=502)
       ->  Sort  (cost=786.42..792.20 rows=2311 width=502)
             Sort Key: movies_movie.release_date
             ->  GroupAggregate  (cost=0.00..706.62 rows=2311 width=502)
                   ->  Merge Left Join  (cost=0.00..660.58 rows=4586 width=502)
                         Merge Cond: (movies_movie.id = theaters_movietheatersession.movie_id)
                         ->  Index Scan using movies_movie_pkey on movies_movie  (cost=0.00..283.14 rows=2311 width=498)
                         ->  Index Scan using theaters_movietheatersession_movie_id on theaters_movietheatersession  (cost=0.00..314.34 rows=4586 width=8)

I've tried fixing it by setting the undocumented group_by attribute, but Django doesn't seem to pick it up:

    class MovieManager(models.Manager):
        def with_sessions(self):
            qs = self.get_query_set().annotate(
                sessions_count=Count('sessions'))
            qs.group_by = ['id']
            return qs

I'm using Django 1.4 and PostgreSQL 9.1.

Attachments (1)

19259.diff (748 bytes) - added by hgeerts@… 20 months ago.
GROUP BY the primary key is supported starting with PostgreSQL 9.1: http://www.postgresql.org/docs/9.1/static/sql-select.html#SQL-GROUPBY


Change History (19)

comment:1 Changed 20 months ago by Alex

  • Needs documentation unset
  • Needs tests unset
  • Patch needs improvement unset

We already have logic to do this for MySQL; it's just a matter of making the PostgreSQL backend set the right flag. Note that this capability is new in a recent PostgreSQL version (9.1?), so we need to make sure the flag is only set there.

comment:2 Changed 20 months ago by hcarvalhoalves

Would that be a case of adding allows_group_by_pk = True to DatabaseFeatures at https://github.com/django/django/blob/master/django/db/backends/postgresql_psycopg2/base.py#L111 ?

comment:3 Changed 20 months ago by Alex

That's the right flag, not sure that's the right place off the top of my head.
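
For illustration, a minimal sketch (not the attached patch) of what setting that flag could look like on the PostgreSQL backend's DatabaseFeatures class, hard-coding it with no version check:

    # Illustrative only: django/db/backends/postgresql_psycopg2/base.py
    from django.db.backends import BaseDatabaseFeatures

    class DatabaseFeatures(BaseDatabaseFeatures):
        # Tell the SQL compiler it may collapse GROUP BY to the primary key.
        # Hard-coding this is wrong for PostgreSQL < 9.1, which is why the
        # version check discussed in the later comments is needed.
        allows_group_by_pk = True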

Changed 20 months ago by hgeerts@…

comment:4 Changed 20 months ago by hgeerts@…

  • Has patch set

comment:5 Changed 20 months ago by hcarvalhoalves

Thanks. I've tried your patch, but it doesn't solve the issue; Django still adds all fields to the GROUP BY clause (the SQL is identical to the original one).

comment:6 follow-up: Changed 20 months ago by akaariai

In PostgreSQL you must include all the primary keys in the query, so if you have joins that means you will need to add those PKs too. This isn't true for MySQL, just one pk is enough.

This has a pre-existing ticket, #18016. I am not sure which ticket to close as duplicate at this point.

Also, the MySQL code isn't currently working, see #17144. I should really commit the patch in that ticket...

comment:7 Changed 20 months ago by Harm Geerts <hgeerts@…>

I've been discussing this on IRC with hcarvalhoalves and I ran into #17144 as well.
In addition, the patch attached to this ticket does not work because the SQL can be constructed before the connection to the database is established.
This can be worked around by forcing a connection when a new DatabaseWrapper instance is created, but that's a bit dirty and I don't think a patch like that would be accepted.

comment:8 in reply to: ↑ 6 Changed 20 months ago by hcarvalhoalves

Replying to akaariai:

In PostgreSQL you must include all the primary keys in the query, so if you have joins that means you will need to add those PKs too. This isn't true for MySQL, just one pk is enough.

This has a pre-existing ticket, #18016. I am not sure which ticket to close as duplicate at this point.

Also, the MySQL code isn't currently working, see #17144. I should really commit the patch in that ticket...

You might want to close #18016 as a duplicate and keep the discussion here, since this ticket has an example use case and the query planner output to compare against.

comment:9 Changed 20 months ago by akaariai

#18016 was closed as duplicate.

I will commit the #17144 patch soon unless something unexpected is found.

comment:10 Changed 20 months ago by hcarvalhoalves

  • Cc hcarvalhoalves@… added

comment:11 follow-up: Changed 20 months ago by akaariai

For the version-dependent check you could probably use a cached_property (django.utils.functional, IIRC). The idea is that when accessed, the getter will make sure version information is available. Once the version information has been checked, the value can be cached and there is no longer any need to repeat the check. This could mean opening a connection in the getter.

Another approach is to generate inefficient SQL if the version information isn't yet available.

I am leaning towards just generating inefficient SQL if the query happens to be run as the first thing in a new ConnectionWrapper (not just the first thing in a connection - the first thing for the object). The reason is that otherwise we have to close the connection used for version checking, and that is somewhat expensive. I am not feeling strongly at all about this...

The harder problem is that you can't group by the main PK alone. You will need the primary keys of joined tables too, and that information isn't directly available in GROUP BY clause generation. (Actually, it might be in select and related_select_cols as the "second" part of the tuple, but I'm not sure of this.) In any case the MySQL code will not work directly.
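
For illustration, a rough sketch (not the attached patch) of the cached_property idea, assuming the backend exposes the server version as connection.pg_version once a connection exists:

    from django.db.backends import BaseDatabaseFeatures
    from django.utils.functional import cached_property

    class DatabaseFeatures(BaseDatabaseFeatures):
        @cached_property
        def allows_group_by_pk(self):
            # Assumed attribute: reading pg_version may open a connection
            # to query the server version; 90100 corresponds to 9.1.
            return self.connection.pg_version >= 90100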

comment:12 in reply to: ↑ 11 Changed 20 months ago by hcarvalhoalves

Replying to akaariai:

For the version-dependent check you could probably use a cached_property (django.utils.functional, IIRC). The idea is that when accessed, the getter will make sure version information is available. Once the version information has been checked, the value can be cached and there is no longer any need to repeat the check. This could mean opening a connection in the getter.

Another approach is to generate inefficient SQL if the version information isn't yet available.

I am leaning towards just generating inefficient SQL if the query happens to be run as the first thing in a new ConnectionWrapper (not just the first thing in a connection - the first thing for the object). The reason is that otherwise we have to close the connection used for version checking, and that is somewhat expensive. I am not feeling strongly at all about this...

Since it's a chicken-and-egg problem, is it really necessary to "autodetect" the feature then? I couldn't find any other example where Django enables/disables features based on the database version. What about just introducing a django.db.backends.postgresql9_psycopg2 backend? It's a trivial patch.

The harder problem is that you can't group by the main PK alone. You will need the primary keys of joined tables too, and that information isn't directly available in GROUP BY clause generation. (Actually, it might be in select and related_select_cols as the "second" part of the tuple, but I'm not sure of this.) In any case the MySQL code will not work directly.

That's a bigger problem; I'm afraid I'm not familiar enough with the ORM code to figure it out.

Last edited 20 months ago by hcarvalhoalves (previous) (diff)

comment:13 Changed 20 months ago by aaugustin

  • Triage Stage changed from Unreviewed to Design decision needed
  • Type changed from Uncategorized to Cleanup/optimization

Adding a new backend might be trivial in code, but not in maintenance, docs, user support, etc.

Marking as DDN because the problem is valid but there isn't a consensus on how to fix it.

comment:14 Changed 15 months ago by jacob

  • Triage Stage changed from Design decision needed to Accepted

Moving back to accepted: we should fix it, at least once we figure out how :)

comment:15 follow-up: Changed 15 months ago by akaariai

I think the GROUP BY clause could be constructed using the select and related_select_cols' field information. If a select column's field is available, then include the field only if field.primary_key is True. If the field isn't available, include it always in the GROUP BY.

For the version dependency there are other examples of enabling/disabling functionality based on the database version; at least regex searches on Oracle work that way. In any case, a separate backend for this is a no-go.
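
A hedged pseudo-sketch of that GROUP BY idea (not actual compiler code; the (column, field) pair layout is an assumption based on the comment above):

    def collapse_group_by(select_cols):
        # Keep a column in GROUP BY only if its field is unknown or is a
        # primary key; the remaining columns are functionally dependent on
        # the PK and can be dropped (PostgreSQL 9.1+, MySQL).
        group_by = []
        for column, field in select_cols:
            if field is None or field.primary_key:
                group_by.append(column)
        return group_by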

comment:16 Changed 11 months ago by ryankask

  • Cc dev@… added

comment:17 in reply to: ↑ 15 Changed 10 months ago by shai

Replying to akaariai:

I think the GROUP BY clause could be constructed using the select and related_select_cols' field information. If a select column's field is available, then include the field only if field.primary_key is True. If the field isn't available, include it always in the GROUP BY.

I think #20971 is related. In particular, when building the GROUP BY clause, also exclude the field if it is deferred.

comment:18 Changed 9 months ago by timo

  • Needs tests set
  • Patch needs improvement set
