Opened 11 years ago

Last modified 7 years ago

#19415 new Cleanup/optimization

Clarify how aggregates work with multi-valued relationships and multiple filter() calls

Reported by: svniemeijer@… Owned by: nobody
Component: Documentation Version: dev
Severity: Normal Keywords:
Cc: Simon Charette, Josh Smeaton Triage Stage: Accepted
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: yes
Easy pickings: no UI/UX: no

Description

This issue seems related to #16603, but has a slightly different scope, so I am creating a new ticket for it.

Consider the following model:

class Author(models.Model):
    name = models.CharField(max_length=100)

class Book(models.Model):
    author = models.ForeignKey(Author, related_name='books')
    title = models.CharField(max_length=100)
    published = models.DateField()
    pages = models.IntegerField()

then the query:

    qs = Author.objects.filter(books__published__gte=date(2012, 11, 1))
    qs = qs.filter(books__published__lte=date(2012, 11, 30))
    qs = qs.values('name').annotate(num_pages=Sum('books__pages'))
    print qs.query

will result in the following SQL:

SELECT "app_author"."name", SUM("app_book"."pages") AS "num_pages" FROM "app_author" LEFT OUTER JOIN "app_book" ON ("app_author"."id" = "app_book"."author_id") INNER JOIN "app_book" T3 ON ("app_author"."id" = T3."author_id") WHERE ("app_book"."published" >= 2012-11-01  AND T3."published" <= 2012-11-30 ) GROUP BY "app_author"."name"

The problem is that in this case the upper date filter is not taken into account when calculating the 'num_pages' aggregate.

I would have expected to see the following SQL:

SELECT "app_author"."name", SUM("app_book"."pages") AS "num_pages" FROM "app_author" LEFT OUTER JOIN "app_book" ON ("app_author"."id" = "app_book"."author_id") WHERE ("app_book"."published" >= 2012-11-01  AND "app_book"."published" <= 2012-11-30 ) GROUP BY "app_author"."name"

Note that a query with just a single filter seems to work correctly:

    qs = Author.objects.filter(books__published__gte=date(2012, 11, 1))
    qs = qs.values('name').annotate(num_pages=Sum('books__pages'))
    print qs.query

which gives:

SELECT "app_author"."name", SUM("app_book"."pages") AS "num_pages" FROM "app_author" LEFT OUTER JOIN "app_book" ON ("app_author"."id" = "app_book"."author_id") WHERE "app_book"."published" >= 2012-11-01  GROUP BY "app_author"."name"

Attachments (1)

19415Docs.diff (1.0 KB ) - added by Benjamin Phillips 8 years ago.

Download all attachments as: .zip

Change History (18)

comment:1 by Russell Keith-Magee, 11 years ago

Resolution: invalid
Status: newclosed

As far as I can make out, this is working as expected (for suitably confusing values of expected :-).
The catch is that:

Author.objects.filter(books__published__gte=date(2012, 11, 1)).filter(books__published__lte=date(2012, 11, 30))

and

Author.objects.filter(books__published__gte=date(2012, 11, 1), books__published__lte=date(2012, 11, 30))

are not interchangeable. By using a single filter clause, you're telling Django to use a single join table. Splitting the filter means 2 joins are used (hence the T3 table in the "wrong" query). The involvement of aggregates here is a red herring; the underlying issue is the filters.

comment:2 by svniemeijer@…, 11 years ago

Thanks! Combining everything into a single .filter() does indeed solve the problem.

However, the aggregate part is not entirely a red herring and some clarification in the documentation on aggregation could be useful to prevent others making this mistake.

In the documentation topics/db/queries/#spanning-multi-valued-relationships it is already clarified how subsequent filter() calls are treated. However, it may be good to have a similar section in topics/db/aggregation/#filtering-on-annotations that explains how aggregates on multi-valued relationships behave in case multiple filter() statements are used.

There is also no example that shows how to filter using multiple conditions in topics/db/aggregation/#filtering-on-annotations . Adding such an example (using a single .filter() call) would have probably already guided me in the right direction.

comment:3 by Russell Keith-Magee, 11 years ago

Component: ORM aggregationDocumentation
Easy pickings: set
Resolution: invalid
Status: closedreopened
Triage Stage: UnreviewedAccepted
Type: BugCleanup/optimization

That's a fair point. Reopening and accepting as a documentation fix.

comment:4 by Aymeric Augustin, 11 years ago

Status: reopenednew

comment:7 by Tim Graham, 10 years ago

Easy pickings: unset
Summary: Invalid result when using multiple reverse foreign-key filters together with reverse foreign-key aggregateClarify how aggregates work with multi-valued relationships and multiple filter() calls
Version: 1.5-beta-1master

by Benjamin Phillips, 8 years ago

Attachment: 19415Docs.diff added

comment:8 by Benjamin Phillips, 8 years ago

Has patch: set

Added a documentation patch.

comment:9 by Tim Graham, 8 years ago

If you are able to submit a pull request rather than an attachment, that's ideal. Just let me know if not.

comment:10 by Sander Niemeijer, 8 years ago

This documentation fragment does indeed get to the core of what is happening, but I still had to read it several times to understand the difference. To me and and as well as mean the same thing.
It may be useful to refer to the concepts of union and intersection, because the first example gives an annotation on the intersection of books matching the two conditions whereas the second one gives an annotation on the union of them.

comment:11 by Benjamin Phillips, 8 years ago

I can go ahead and make a PR. I'll also look into changing the wording to be more transparent. Thanks for the feedback!

comment:12 by Benjamin Phillips, 8 years ago

Made a PR for the patch (PR 5729 https://github.com/django/django/pull/5729) and attempted to improve the wording.

comment:13 by Sander Niemeijer, 8 years ago

Looks good to me. Thanks for the effort.

comment:14 by Tim Graham, 8 years ago

Patch needs improvement: set

The proposed example queries use local fields instead of relations, so I think it doesn't quite address the ticket.

comment:15 by Benjamin Phillips, 8 years ago

PR has been updated with an example using queries across relationships instead of local fields.

comment:16 by Benjamin Phillips, 8 years ago

I did some more digging on this issue as I was working on editing my Doc patch and I learned a bit more about what's going on. Consider the following test case:

class Author(models.Model):
    name = models.CharField(max_length=10)

class Book(models.Model):
    title = models.TextField()
    pages = models.IntegerField(null=True)
    rating = models.IntegerField(null=True)
    authors = models.ManyToManyField(Author)

a1 = Author.objects.create(name='a1')
a2 = Author.objects.create(name='a2')
a3 = Author.objects.create(name='a3')
a4 = Author.objects.create(name='a4')

b1 = Book.objects.create(title='b1', pages=100, rating=4)
b1.authors.set([a1])
b2 = Book.objects.create(title='b2', pages=50, rating=2)
b2.authors.set([a2])
b3 = Book.objects.create(title='b3', pages=150, rating=3)
b3.authors.set([a4])
b4 = Book.objects.create(title='b4', pages=400, rating=5)
b4.authors.set([a4])
b5 = Book.objects.create(title='b5', pages=100, rating=1)
b5.authors.set([a1,a4])

Now, let's look at two queries on that test case:

qs1 = Author.objects.filter(book__pages__lt=200, book__pages__gt=100).annotate(num_book=Count('book'))

qs2 = Author.objects.filter(book__pages__lt=200).filter(book__pages__gt=100).annotate(num_book=Count('book'))

The first query gives us the expected results:

for i in qs1:
    print(i, i.num_book)
>>> a4, 1
print(qs1.query)
>>> SELECT "queries_author"."id", "queries_author"."name", "queries_author"."num", "queries_author"."extra_id", COUNT("queries_book_authors"."book_id") AS "num_book" FROM "queries_author" INNER JOIN "queries_book_authors" ON ("queries_author"."id" = "queries_book_authors"."author_id") INNER JOIN "queries_book" ON ("queries_book_authors"."book_id" = "queries_book"."id") WHERE ("queries_book"."pages" < 200 AND "queries_book"."pages" > 100) GROUP BY "queries_author"."id", "queries_author"."name", "queries_author"."num", "queries_author"."extra_id" ORDER BY "queries_author"."name" ASC

The second query, however, gives some rather strange results (author4 only has 3 books, not 4!):

for i in qs2:
    print(i, i.num_book)
>>> a4, 4
print(qs2.query)
>>> SELECT "queries_author"."id", "queries_author"."name", "queries_author"."num", "queries_author"."extra_id", COUNT("queries_book_authors"."book_id") AS "num_book" FROM "queries_author" INNER JOIN "queries_book_authors" ON ("queries_author"."id" = "queries_book_authors"."author_id") INNER JOIN "queries_book" ON ("queries_book_authors"."book_id" = "queries_book"."id") INNER JOIN "queries_book_authors" T4 ON ("queries_author"."id" = T4."author_id") INNER JOIN "queries_book" T5 ON (T4."book_id" = T5."id") WHERE ("queries_book"."pages" < 200 AND T5."pages" > 100) GROUP BY "queries_author"."id", "queries_author"."name", "queries_author"."num", "queries_author"."extra_id" ORDER BY "queries_author"."name" ASC

So, the problem appears to be that Count is only looking at results from queries_book_authors and cannot see results from the aliased tables (T4 and T5), thus it only counts from the first joins and not the latter joins. This results, rather bizarrely, in the annotated value being the product of the number of results meeting the first condition ("queries_book"."pages" < 200) times the number of results meeting the second condition (T5."pages" > 100). This result is neither intuitive nor helpful, so I would suggest that, going forward, the documentation should provide an example of the first query and a warning not to annotate queries with multiple filters as in the second query since the results are not what would be expected or useful.

NOTE: I have not yet tested this case with just a ForeignKey relationship, so as of now the above statements can only be applied when filtering across M2M relationships.

comment:17 by Tim Graham, 8 years ago

Just wanted to drop a quick note to ask if the explanation in https://docs.djangoproject.com/en/dev/topics/db/aggregation/#order-of-annotate-and-filter-clauses about the "query bug" and Count('...', distinct=True) might help.

comment:18 by Benjamin Phillips, 8 years ago

That does change the results. Using the same setup I used above but adding in Count('...', distinct=True) results in the following:

for i in qs1:
    print(i, i.num_book)
>>> a4 1
print(qs1.query)
>>> SELECT "queries_author"."id", "queries_author"."name", "queries_author"."num", COUNT(DISTINCT "queries_book_authors"."book_id") AS "num_book" FROM "queries_author" INNER JOIN "queries_book_authors" ON ("queries_author"."id" = "queries_book_authors"."author_id") INNER JOIN "queries_book" ON ("queries_book_authors"."book_id" = "queries_book"."id") WHERE ("queries_book"."pages" < 200 AND "queries_book"."pages" > 100) GROUP BY "queries_author"."id", "queries_author"."name"

for i in qs2:
    print(i, i.num_book)
>>> a4 2
print(qs2.query)
>>>  SELECT "queries_author"."id", "queries_author"."name", "queries_author"."num", COUNT(DISTINCT "queries_book_authors"."book_id") AS "num_book" FROM "queries_author" INNER JOIN "queries_book_authors" ON ("queries_author"."id" = "queries_book_authors"."author_id") INNER JOIN "queries_book" ON ("queries_book_authors"."book_id" = "queries_book"."id") INNER JOIN "queries_book_authors" T4 ON ("queries_author"."id" = T4."author_id") INNER JOIN "queries_book" T5 ON (T4."book_id" = T5."id") WHERE ("queries_book"."pages" < 200 AND T5."pages" > 100) GROUP BY "queries_author"."id", "queries_author"."name"

So, this does appear to be better, but I still don't think it is giving a result that is helpful or intuitive. It looks to me like the second query is counting the same book as the first, just twice. What it does not appear to be doing is giving a list of Authors with books having either fewer than 200 pages or greater than 100 pages, nor is it giving a list of Authors with books having between 100 and 200 pages.

comment:19 by Tim Graham, 7 years ago

Cc: Simon Charette Josh Smeaton added

I don't have much experience constructing complex annotations so I'm not sure if there cases where chains filters as in the second case might be useful. Simon, Josh, others... any input?

Note: See TracTickets for help on using tickets.
Back to Top