Opened 12 years ago
Last modified 8 years ago
#19415 new Cleanup/optimization
Clarify how aggregates work with multi-valued relationships and multiple filter() calls
Reported by: | Owned by: | nobody | |
---|---|---|---|
Component: | Documentation | Version: | dev |
Severity: | Normal | Keywords: | |
Cc: | Simon Charette, Josh Smeaton | Triage Stage: | Accepted |
Has patch: | yes | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | yes |
Easy pickings: | no | UI/UX: | no |
Description
This issue seems related to #16603, but has a slightly different scope, so I am creating a new ticket for it.
Consider the following model:
class Author(models.Model): name = models.CharField(max_length=100) class Book(models.Model): author = models.ForeignKey(Author, related_name='books') title = models.CharField(max_length=100) published = models.DateField() pages = models.IntegerField()
then the query:
qs = Author.objects.filter(books__published__gte=date(2012, 11, 1)) qs = qs.filter(books__published__lte=date(2012, 11, 30)) qs = qs.values('name').annotate(num_pages=Sum('books__pages')) print qs.query
will result in the following SQL:
SELECT "app_author"."name", SUM("app_book"."pages") AS "num_pages" FROM "app_author" LEFT OUTER JOIN "app_book" ON ("app_author"."id" = "app_book"."author_id") INNER JOIN "app_book" T3 ON ("app_author"."id" = T3."author_id") WHERE ("app_book"."published" >= 2012-11-01 AND T3."published" <= 2012-11-30 ) GROUP BY "app_author"."name"
The problem is that in this case the upper date filter is not taken into account when calculating the 'num_pages' aggregate.
I would have expected to see the following SQL:
SELECT "app_author"."name", SUM("app_book"."pages") AS "num_pages" FROM "app_author" LEFT OUTER JOIN "app_book" ON ("app_author"."id" = "app_book"."author_id") WHERE ("app_book"."published" >= 2012-11-01 AND "app_book"."published" <= 2012-11-30 ) GROUP BY "app_author"."name"
Note that a query with just a single filter seems to work correctly:
qs = Author.objects.filter(books__published__gte=date(2012, 11, 1)) qs = qs.values('name').annotate(num_pages=Sum('books__pages')) print qs.query
which gives:
SELECT "app_author"."name", SUM("app_book"."pages") AS "num_pages" FROM "app_author" LEFT OUTER JOIN "app_book" ON ("app_author"."id" = "app_book"."author_id") WHERE "app_book"."published" >= 2012-11-01 GROUP BY "app_author"."name"
Attachments (1)
Change History (18)
comment:1 by , 12 years ago
Resolution: | → invalid |
---|---|
Status: | new → closed |
comment:2 by , 12 years ago
Thanks! Combining everything into a single .filter() does indeed solve the problem.
However, the aggregate part is not entirely a red herring and some clarification in the documentation on aggregation could be useful to prevent others making this mistake.
In the documentation topics/db/queries/#spanning-multi-valued-relationships it is already clarified how subsequent filter() calls are treated. However, it may be good to have a similar section in topics/db/aggregation/#filtering-on-annotations that explains how aggregates on multi-valued relationships behave in case multiple filter() statements are used.
There is also no example that shows how to filter using multiple conditions in topics/db/aggregation/#filtering-on-annotations . Adding such an example (using a single .filter() call) would have probably already guided me in the right direction.
comment:3 by , 12 years ago
Component: | ORM aggregation → Documentation |
---|---|
Easy pickings: | set |
Resolution: | invalid |
Status: | closed → reopened |
Triage Stage: | Unreviewed → Accepted |
Type: | Bug → Cleanup/optimization |
That's a fair point. Reopening and accepting as a documentation fix.
comment:4 by , 12 years ago
Status: | reopened → new |
---|
comment:7 by , 11 years ago
Easy pickings: | unset |
---|---|
Summary: | Invalid result when using multiple reverse foreign-key filters together with reverse foreign-key aggregate → Clarify how aggregates work with multi-valued relationships and multiple filter() calls |
Version: | 1.5-beta-1 → master |
by , 9 years ago
Attachment: | 19415Docs.diff added |
---|
comment:9 by , 9 years ago
If you are able to submit a pull request rather than an attachment, that's ideal. Just let me know if not.
comment:10 by , 9 years ago
This documentation fragment does indeed get to the core of what is happening, but I still had to read it several times to understand the difference. To me and and as well as mean the same thing.
It may be useful to refer to the concepts of union and intersection, because the first example gives an annotation on the intersection of books matching the two conditions whereas the second one gives an annotation on the union of them.
comment:11 by , 9 years ago
I can go ahead and make a PR. I'll also look into changing the wording to be more transparent. Thanks for the feedback!
comment:12 by , 9 years ago
Made a PR for the patch (PR 5729 https://github.com/django/django/pull/5729) and attempted to improve the wording.
comment:14 by , 9 years ago
Patch needs improvement: | set |
---|
The proposed example queries use local fields instead of relations, so I think it doesn't quite address the ticket.
comment:15 by , 9 years ago
PR has been updated with an example using queries across relationships instead of local fields.
comment:16 by , 9 years ago
I did some more digging on this issue as I was working on editing my Doc patch and I learned a bit more about what's going on. Consider the following test case:
class Author(models.Model): name = models.CharField(max_length=10) class Book(models.Model): title = models.TextField() pages = models.IntegerField(null=True) rating = models.IntegerField(null=True) authors = models.ManyToManyField(Author) a1 = Author.objects.create(name='a1') a2 = Author.objects.create(name='a2') a3 = Author.objects.create(name='a3') a4 = Author.objects.create(name='a4') b1 = Book.objects.create(title='b1', pages=100, rating=4) b1.authors.set([a1]) b2 = Book.objects.create(title='b2', pages=50, rating=2) b2.authors.set([a2]) b3 = Book.objects.create(title='b3', pages=150, rating=3) b3.authors.set([a4]) b4 = Book.objects.create(title='b4', pages=400, rating=5) b4.authors.set([a4]) b5 = Book.objects.create(title='b5', pages=100, rating=1) b5.authors.set([a1,a4])
Now, let's look at two queries on that test case:
qs1 = Author.objects.filter(book__pages__lt=200, book__pages__gt=100).annotate(num_book=Count('book')) qs2 = Author.objects.filter(book__pages__lt=200).filter(book__pages__gt=100).annotate(num_book=Count('book'))
The first query gives us the expected results:
for i in qs1: print(i, i.num_book) >>> a4, 1 print(qs1.query) >>> SELECT "queries_author"."id", "queries_author"."name", "queries_author"."num", "queries_author"."extra_id", COUNT("queries_book_authors"."book_id") AS "num_book" FROM "queries_author" INNER JOIN "queries_book_authors" ON ("queries_author"."id" = "queries_book_authors"."author_id") INNER JOIN "queries_book" ON ("queries_book_authors"."book_id" = "queries_book"."id") WHERE ("queries_book"."pages" < 200 AND "queries_book"."pages" > 100) GROUP BY "queries_author"."id", "queries_author"."name", "queries_author"."num", "queries_author"."extra_id" ORDER BY "queries_author"."name" ASC
The second query, however, gives some rather strange results (author4 only has 3 books, not 4!):
for i in qs2: print(i, i.num_book) >>> a4, 4 print(qs2.query) >>> SELECT "queries_author"."id", "queries_author"."name", "queries_author"."num", "queries_author"."extra_id", COUNT("queries_book_authors"."book_id") AS "num_book" FROM "queries_author" INNER JOIN "queries_book_authors" ON ("queries_author"."id" = "queries_book_authors"."author_id") INNER JOIN "queries_book" ON ("queries_book_authors"."book_id" = "queries_book"."id") INNER JOIN "queries_book_authors" T4 ON ("queries_author"."id" = T4."author_id") INNER JOIN "queries_book" T5 ON (T4."book_id" = T5."id") WHERE ("queries_book"."pages" < 200 AND T5."pages" > 100) GROUP BY "queries_author"."id", "queries_author"."name", "queries_author"."num", "queries_author"."extra_id" ORDER BY "queries_author"."name" ASC
So, the problem appears to be that Count
is only looking at results from queries_book_authors
and cannot see results from the aliased tables (T4
and T5
), thus it only counts from the first joins and not the latter joins. This results, rather bizarrely, in the annotated value being the product of the number of results meeting the first condition ("queries_book"."pages" < 200
) times the number of results meeting the second condition (T5."pages" > 100
). This result is neither intuitive nor helpful, so I would suggest that, going forward, the documentation should provide an example of the first query and a warning not to annotate queries with multiple filters as in the second query since the results are not what would be expected or useful.
NOTE: I have not yet tested this case with just a ForeignKey relationship, so as of now the above statements can only be applied when filtering across M2M relationships.
comment:17 by , 9 years ago
Just wanted to drop a quick note to ask if the explanation in https://docs.djangoproject.com/en/dev/topics/db/aggregation/#order-of-annotate-and-filter-clauses about the "query bug" and Count('...', distinct=True)
might help.
comment:18 by , 9 years ago
That does change the results. Using the same setup I used above but adding in Count('...', distinct=True)
results in the following:
for i in qs1: print(i, i.num_book) >>> a4 1 print(qs1.query) >>> SELECT "queries_author"."id", "queries_author"."name", "queries_author"."num", COUNT(DISTINCT "queries_book_authors"."book_id") AS "num_book" FROM "queries_author" INNER JOIN "queries_book_authors" ON ("queries_author"."id" = "queries_book_authors"."author_id") INNER JOIN "queries_book" ON ("queries_book_authors"."book_id" = "queries_book"."id") WHERE ("queries_book"."pages" < 200 AND "queries_book"."pages" > 100) GROUP BY "queries_author"."id", "queries_author"."name" for i in qs2: print(i, i.num_book) >>> a4 2 print(qs2.query) >>> SELECT "queries_author"."id", "queries_author"."name", "queries_author"."num", COUNT(DISTINCT "queries_book_authors"."book_id") AS "num_book" FROM "queries_author" INNER JOIN "queries_book_authors" ON ("queries_author"."id" = "queries_book_authors"."author_id") INNER JOIN "queries_book" ON ("queries_book_authors"."book_id" = "queries_book"."id") INNER JOIN "queries_book_authors" T4 ON ("queries_author"."id" = T4."author_id") INNER JOIN "queries_book" T5 ON (T4."book_id" = T5."id") WHERE ("queries_book"."pages" < 200 AND T5."pages" > 100) GROUP BY "queries_author"."id", "queries_author"."name"
So, this does appear to be better, but I still don't think it is giving a result that is helpful or intuitive. It looks to me like the second query is counting the same book as the first, just twice. What it does not appear to be doing is giving a list of Authors with books having either fewer than 200 pages or greater than 100 pages, nor is it giving a list of Authors with books having between 100 and 200 pages.
comment:19 by , 8 years ago
Cc: | added |
---|
I don't have much experience constructing complex annotations so I'm not sure if there cases where chains filters as in the second case might be useful. Simon, Josh, others... any input?
As far as I can make out, this is working as expected (for suitably confusing values of expected :-).
The catch is that:
and
are not interchangeable. By using a single filter clause, you're telling Django to use a single join table. Splitting the filter means 2 joins are used (hence the T3 table in the "wrong" query). The involvement of aggregates here is a red herring; the underlying issue is the filters.