Opened 9 years ago

Closed 9 years ago

Last modified 6 years ago

#25375 closed Bug (needsinfo)

Paginator generates suboptimal queries

Reported by: Alexandru Mărășteanu Owned by: nobody
Component: Database layer (models, ORM) Version: 1.7
Severity: Normal Keywords:
Cc: Alexandru Mărășteanu Triage Stage: Unreviewed
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

Hello,

I'm using Django Rest Framework and have the following situation:

I have two models, Article and Comment:

class Article(models.Model):
    # ...

class Comment(models.Model):
    article = models.ForeignKey(Article, related_name='comments')
    # ...

And an articles viewset:

class Articles(viewsets.ModelViewSet):
    model = Article
    serializer_class = ArticleSerializer
    paginate_by = 50
    paginate_by_param = 'limit'

    def get_queryset(self):
        return Article.objects.annotate(comments_count=Count('comments'))

When doing a simple listing, the database will be queried two times: one to count the result set, the other to fetch the results:

SELECT
    COUNT(*)
    FROM (
        SELECT
            `article`.`id` AS `id`,
            COUNT(`article_comment`.`id`) AS `comments_count`
        FROM `article`
        LEFT OUTER JOIN `article_comment` ON (`article`.`id` = `article_comment`.`article_id`)
        GROUP BY `article`.`id`
        ORDER BY NULL
    )
    subquery;

And

SELECT
    `article`.`id` AS `id`,
    COUNT(`article_comment`.`id`) AS `comments_count`
FROM `article`
LEFT OUTER JOIN `article_comment` ON (`article`.`id` = `article_comment`.`article_id`)
GROUP BY `article`.`id`
ORDER BY `article`.`id` DESC
LIMIT 50;

respectively.

If you look at (and EXPLAIN) the first query, the one that does the counting, you'll notice it loads up *all* records and then starts counting them, a process which may take up all resources and eventually crash the machine. It did so in my case, with only 50000 records.

I figured this has something to do with Django, not DRF.

Change History (9)

comment:1 by Tim Graham, 9 years ago

Could you please add a test case that demonstrates the problem without DRF?

comment:2 by Simon Charette, 9 years ago

Wonder if this can be related to the changes that caused #25230?

@alexei, are you using MySQL?

comment:3 by Tim Graham, 9 years ago

Resolution: needsinfo
Status: newclosed

comment:4 by Adam Johnson, 9 years ago

I think DRF is probably mutating your QuerySet to use a subquery to get the count when it's not necessary. If you are using MySQL, these tend to be slow - newer versions of MySQL and MariaDB have improvements that make them faster.

comment:5 by Alexandru Mărășteanu, 9 years ago

Cc: Alexandru Mărășteanu added
Resolution: needsinfo
Status: closednew

Hello,

Sorry for the delay. I didn't get notified when you guys replied, and I completely forgot about it.

Yes, in my previous example I was using MySQL.

Hoewever, I created a test app to see how it goes using the default settings (including SQLite db). The result is as follows:

$ python manage.py shell
>>> from django.core.paginator import Paginator
>>> from django.db import connections
>>> from django.db.models import Count
>>> from bug.models import Article
>>> a = Article.objects.annotate(comments_count=Count('comments'))
>>> p = Paginator(a, 25)
>>> print p.object_list
[]
>>> print connections['default'].queries
[{u'time': u'0.001', u'sql': u'QUERY = u\'SELECT "bug_article"."id", "bug_article"."title", "bug_article"."body", COUNT("bug_comment"."id") AS "comments_count" FROM "bug_article" LEFT OUTER JOIN "bug_comment" ON ( "bug_article"."id" = "bug_comment"."article_id" ) GROUP BY "bug_article"."id", "bug_article"."title", "bug_article"."body" LIMIT 21\' - PARAMS = ()'}]
>>> print p.count
0
>>> print connections['default'].queries
[{u'time': u'0.001', u'sql': u'QUERY = u\'SELECT "bug_article"."id", "bug_article"."title", "bug_article"."body", COUNT("bug_comment"."id") AS "comments_count" FROM "bug_article" LEFT OUTER JOIN "bug_comment" ON ( "bug_article"."id" = "bug_comment"."article_id" ) GROUP BY "bug_article"."id", "bug_article"."title", "bug_article"."body" LIMIT 21\' - PARAMS = ()'}, {u'time': u'0.000', u'sql': u'QUERY = u\'SELECT COUNT(*) FROM (SELECT "bug_article"."id" AS "id", "bug_article"."title" AS "title", "bug_article"."body" AS "body", COUNT("bug_comment"."id") AS "comments_count" FROM "bug_article" LEFT OUTER JOIN "bug_comment" ON ( "bug_article"."id" = "bug_comment"."article_id" ) GROUP BY "bug_article"."id", "bug_article"."title", "bug_article"."body") subquery\' - PARAMS = ()'}]
>>> 

This time I'm again on Django 1.7.1.

For a second look, I updated Django to 1.8.4. The result is the same.

As you can see, it's Paginator's count that wraps the query resulting from the QuerySet in a subquery.

Version 0, edited 9 years ago by Alexandru Mărășteanu (next)

comment:6 by Tim Graham, 9 years ago

The queries in your latest comment are difficult to read given they are not formatted as nicely as the ticket description. Perhaps you could edit the description and move your latest comment there.

After that, could you explain what the expected query is? Are you able to propose a fix? Not knowing these particular internal details of Django offhand, it's a bit difficult for me to triage the ticket otherwise. Thanks!

comment:7 by Tim Graham, 9 years ago

Resolution: needsinfo
Status: newclosed

comment:8 by Stefan Wehrmeyer, 7 years ago

The Paginator needs to make two queries: the count query and the query for the page objects.

If the queryset you use in your Paginator satisfies one of these conditions the count query will run as a subquery, so this query:

SELECT COUNT(*) FROM (SELECT ...);

instead of

SELECT COUNT(*) AS "__count" FROM ...;

In my case the first query takes double the amount of time of the second query.

The row count can be independent of annotations. So if you make the Paginator do the count query on a queryset without annotations (resulting in a more performant count query) and then doing the actual pagination with the annotated queryset may result in a small performance win.

Annotations can influence row count (e.g. when filtering by annotation) and you still may have groupby() or distinct() on your queryset that influences counts and forces a count with a subquery. So I'm not sure if there's a general way for the Paginator (or any .count() call) to run the more performant query.

comment:9 by Ramiro Morales, 6 years ago

This is similar to #23771

Note: See TracTickets for help on using tickets.
Back to Top