Opened 4 years ago

Last modified 5 months ago

#19415 new Cleanup/optimization

Clarify how aggregates work with multi-valued relationships and multiple filter() calls

Reported by: svniemeijer@… Owned by: nobody
Component: Documentation Version: master
Severity: Normal Keywords:
Cc: Triage Stage: Accepted
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: yes
Easy pickings: no UI/UX: no

Description

This issue seems related to #16603, but has a slightly different scope, so I am creating a new ticket for it.

Consider the following model:

class Author(models.Model):
    name = models.CharField(max_length=100)

class Book(models.Model):
    author = models.ForeignKey(Author, related_name='books')
    title = models.CharField(max_length=100)
    published = models.DateField()
    pages = models.IntegerField()

then the query:

    qs = Author.objects.filter(books__published__gte=date(2012, 11, 1))
    qs = qs.filter(books__published__lte=date(2012, 11, 30))
    qs = qs.values('name').annotate(num_pages=Sum('books__pages'))
    print qs.query

will result in the following SQL:

SELECT "app_author"."name", SUM("app_book"."pages") AS "num_pages" FROM "app_author" LEFT OUTER JOIN "app_book" ON ("app_author"."id" = "app_book"."author_id") INNER JOIN "app_book" T3 ON ("app_author"."id" = T3."author_id") WHERE ("app_book"."published" >= 2012-11-01  AND T3."published" <= 2012-11-30 ) GROUP BY "app_author"."name"

The problem is that in this case the upper date filter is not taken into account when calculating the 'num_pages' aggregate.

I would have expected to see the following SQL:

SELECT "app_author"."name", SUM("app_book"."pages") AS "num_pages" FROM "app_author" LEFT OUTER JOIN "app_book" ON ("app_author"."id" = "app_book"."author_id") WHERE ("app_book"."published" >= 2012-11-01  AND "app_book"."published" <= 2012-11-30 ) GROUP BY "app_author"."name"

Note that a query with just a single filter seems to work correctly:

    qs = Author.objects.filter(books__published__gte=date(2012, 11, 1))
    qs = qs.values('name').annotate(num_pages=Sum('books__pages'))
    print qs.query

which gives:

SELECT "app_author"."name", SUM("app_book"."pages") AS "num_pages" FROM "app_author" LEFT OUTER JOIN "app_book" ON ("app_author"."id" = "app_book"."author_id") WHERE "app_book"."published" >= 2012-11-01  GROUP BY "app_author"."name"

Attachments (1)

19415Docs.diff (1.0 KB) - added by Benjamin Phillips 13 months ago.

Download all attachments as: .zip

Change History (17)

comment:1 Changed 4 years ago by Russell Keith-Magee

Resolution: invalid
Status: newclosed

As far as I can make out, this is working as expected (for suitably confusing values of expected :-).
The catch is that:

Author.objects.filter(books__published__gte=date(2012, 11, 1)).filter(books__published__lte=date(2012, 11, 30))

and

Author.objects.filter(books__published__gte=date(2012, 11, 1), books__published__lte=date(2012, 11, 30))

are not interchangeable. By using a single filter clause, you're telling Django to use a single join table. Splitting the filter means 2 joins are used (hence the T3 table in the "wrong" query). The involvement of aggregates here is a red herring; the underlying issue is the filters.

comment:2 Changed 4 years ago by svniemeijer@…

Thanks! Combining everything into a single .filter() does indeed solve the problem.

However, the aggregate part is not entirely a red herring and some clarification in the documentation on aggregation could be useful to prevent others making this mistake.

In the documentation topics/db/queries/#spanning-multi-valued-relationships it is already clarified how subsequent filter() calls are treated. However, it may be good to have a similar section in topics/db/aggregation/#filtering-on-annotations that explains how aggregates on multi-valued relationships behave in case multiple filter() statements are used.

There is also no example that shows how to filter using multiple conditions in topics/db/aggregation/#filtering-on-annotations . Adding such an example (using a single .filter() call) would have probably already guided me in the right direction.

comment:3 Changed 4 years ago by Russell Keith-Magee

Component: ORM aggregationDocumentation
Easy pickings: set
Resolution: invalid
Status: closedreopened
Triage Stage: UnreviewedAccepted
Type: BugCleanup/optimization

That's a fair point. Reopening and accepting as a documentation fix.

comment:4 Changed 4 years ago by Aymeric Augustin

Status: reopenednew

comment:7 Changed 3 years ago by Tim Graham

Easy pickings: unset
Summary: Invalid result when using multiple reverse foreign-key filters together with reverse foreign-key aggregateClarify how aggregates work with multi-valued relationships and multiple filter() calls
Version: 1.5-beta-1master

Changed 13 months ago by Benjamin Phillips

Attachment: 19415Docs.diff added

comment:8 Changed 13 months ago by Benjamin Phillips

Has patch: set

Added a documentation patch.

comment:9 Changed 13 months ago by Tim Graham

If you are able to submit a pull request rather than an attachment, that's ideal. Just let me know if not.

comment:10 Changed 13 months ago by Sander Niemeijer

This documentation fragment does indeed get to the core of what is happening, but I still had to read it several times to understand the difference. To me and and as well as mean the same thing.
It may be useful to refer to the concepts of union and intersection, because the first example gives an annotation on the intersection of books matching the two conditions whereas the second one gives an annotation on the union of them.

comment:11 Changed 13 months ago by Benjamin Phillips

I can go ahead and make a PR. I'll also look into changing the wording to be more transparent. Thanks for the feedback!

comment:12 Changed 13 months ago by Benjamin Phillips

Made a PR for the patch (PR 5729 https://github.com/django/django/pull/5729) and attempted to improve the wording.

comment:13 Changed 13 months ago by Sander Niemeijer

Looks good to me. Thanks for the effort.

comment:14 Changed 12 months ago by Tim Graham

Patch needs improvement: set

The proposed example queries use local fields instead of relations, so I think it doesn't quite address the ticket.

comment:15 Changed 12 months ago by Benjamin Phillips

PR has been updated with an example using queries across relationships instead of local fields.

comment:16 Changed 12 months ago by Benjamin Phillips

I did some more digging on this issue as I was working on editing my Doc patch and I learned a bit more about what's going on. Consider the following test case:

class Author(models.Model):
    name = models.CharField(max_length=10)

class Book(models.Model):
    title = models.TextField()
    pages = models.IntegerField(null=True)
    rating = models.IntegerField(null=True)
    authors = models.ManyToManyField(Author)

a1 = Author.objects.create(name='a1')
a2 = Author.objects.create(name='a2')
a3 = Author.objects.create(name='a3')
a4 = Author.objects.create(name='a4')

b1 = Book.objects.create(title='b1', pages=100, rating=4)
b1.authors.set([a1])
b2 = Book.objects.create(title='b2', pages=50, rating=2)
b2.authors.set([a2])
b3 = Book.objects.create(title='b3', pages=150, rating=3)
b3.authors.set([a4])
b4 = Book.objects.create(title='b4', pages=400, rating=5)
b4.authors.set([a4])
b5 = Book.objects.create(title='b5', pages=100, rating=1)
b5.authors.set([a1,a4])

Now, let's look at two queries on that test case:

qs1 = Author.objects.filter(book__pages__lt=200, book__pages__gt=100).annotate(num_book=Count('book'))

qs2 = Author.objects.filter(book__pages__lt=200).filter(book__pages__gt=100).annotate(num_book=Count('book'))

The first query gives us the expected results:

for i in qs1:
    print(i, i.num_book)
>>> a4, 1
print(qs1.query)
>>> SELECT "queries_author"."id", "queries_author"."name", "queries_author"."num", "queries_author"."extra_id", COUNT("queries_book_authors"."book_id") AS "num_book" FROM "queries_author" INNER JOIN "queries_book_authors" ON ("queries_author"."id" = "queries_book_authors"."author_id") INNER JOIN "queries_book" ON ("queries_book_authors"."book_id" = "queries_book"."id") WHERE ("queries_book"."pages" < 200 AND "queries_book"."pages" > 100) GROUP BY "queries_author"."id", "queries_author"."name", "queries_author"."num", "queries_author"."extra_id" ORDER BY "queries_author"."name" ASC

The second query, however, gives some rather strange results (author4 only has 3 books, not 4!):

for i in qs2:
    print(i, i.num_book)
>>> a4, 4
print(qs2.query)
>>> SELECT "queries_author"."id", "queries_author"."name", "queries_author"."num", "queries_author"."extra_id", COUNT("queries_book_authors"."book_id") AS "num_book" FROM "queries_author" INNER JOIN "queries_book_authors" ON ("queries_author"."id" = "queries_book_authors"."author_id") INNER JOIN "queries_book" ON ("queries_book_authors"."book_id" = "queries_book"."id") INNER JOIN "queries_book_authors" T4 ON ("queries_author"."id" = T4."author_id") INNER JOIN "queries_book" T5 ON (T4."book_id" = T5."id") WHERE ("queries_book"."pages" < 200 AND T5."pages" > 100) GROUP BY "queries_author"."id", "queries_author"."name", "queries_author"."num", "queries_author"."extra_id" ORDER BY "queries_author"."name" ASC

So, the problem appears to be that Count is only looking at results from queries_book_authors and cannot see results from the aliased tables (T4 and T5), thus it only counts from the first joins and not the latter joins. This results, rather bizarrely, in the annotated value being the product of the number of results meeting the first condition ("queries_book"."pages" < 200) times the number of results meeting the second condition (T5."pages" > 100). This result is neither intuitive nor helpful, so I would suggest that, going forward, the documentation should provide an example of the first query and a warning not to annotate queries with multiple filters as in the second query since the results are not what would be expected or useful.

NOTE: I have not yet tested this case with just a ForeignKey relationship, so as of now the above statements can only be applied when filtering across M2M relationships.

comment:17 Changed 12 months ago by Tim Graham

Just wanted to drop a quick note to ask if the explanation in https://docs.djangoproject.com/en/dev/topics/db/aggregation/#order-of-annotate-and-filter-clauses about the "query bug" and Count('...', distinct=True) might help.

comment:18 Changed 10 months ago by Benjamin Phillips

That does change the results. Using the same setup I used above but adding in Count('...', distinct=True) results in the following:

for i in qs1:
    print(i, i.num_book)
>>> a4 1
print(qs1.query)
>>> SELECT "queries_author"."id", "queries_author"."name", "queries_author"."num", COUNT(DISTINCT "queries_book_authors"."book_id") AS "num_book" FROM "queries_author" INNER JOIN "queries_book_authors" ON ("queries_author"."id" = "queries_book_authors"."author_id") INNER JOIN "queries_book" ON ("queries_book_authors"."book_id" = "queries_book"."id") WHERE ("queries_book"."pages" < 200 AND "queries_book"."pages" > 100) GROUP BY "queries_author"."id", "queries_author"."name"

for i in qs2:
    print(i, i.num_book)
>>> a4 2
print(qs2.query)
>>>  SELECT "queries_author"."id", "queries_author"."name", "queries_author"."num", COUNT(DISTINCT "queries_book_authors"."book_id") AS "num_book" FROM "queries_author" INNER JOIN "queries_book_authors" ON ("queries_author"."id" = "queries_book_authors"."author_id") INNER JOIN "queries_book" ON ("queries_book_authors"."book_id" = "queries_book"."id") INNER JOIN "queries_book_authors" T4 ON ("queries_author"."id" = T4."author_id") INNER JOIN "queries_book" T5 ON (T4."book_id" = T5."id") WHERE ("queries_book"."pages" < 200 AND T5."pages" > 100) GROUP BY "queries_author"."id", "queries_author"."name"

So, this does appear to be better, but I still don't think it is giving a result that is helpful or intuitive. It looks to me like the second query is counting the same book as the first, just twice. What it does not appear to be doing is giving a list of Authors with books having either fewer than 200 pages or greater than 100 pages, nor is it giving a list of Authors with books having between 100 and 200 pages.

Note: See TracTickets for help on using tickets.
Back to Top