Opened 7 years ago

Last modified 3 weeks ago

#10060 new Bug

Multiple table annotation failure

Reported by: svsharma@… Owned by:
Component: Database layer (models, ORM) Version: master
Severity: Normal Keywords:
Cc: flosch@…, bendavis78@…, cmutel@…, daniel@…, gabrielhurley, sebleier@…, greg@…, mpjung@…, teemu.kurppa@…, Fak3, michaelB, glic3rinu Triage Stage: Accepted
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description (last modified by ubernostrum)

Annotating across multiple tables results in wrong answers. i.e.

In [110]: total = Branch.objects.all().annotate(total=Sum('center__client__loan__amount'))

In [111]: total[0].total
Out[111]: 3433000

In [112]: repaid = Branch.objects.all().annotate(repaid=Sum('center__client__loan__payment_schedule__payments__principal'))

In [113]: repaid[0].repaid
Out[113]: 1976320.0

In [114]: both =  Branch.objects.all().annotate(total=Sum('center__client__loan__amount'),repaid=Sum('center__client__loan__payment_schedule__payments__principal'))

In [115]: both[0].repaid
Out[115]: 1976320.0

In [116]: both[0].total
Out[116]: 98816000
          ^^^^^^^^^^^

Compare the output of total in 116 vs. 111 (the correct answer).

Attachments (1)

agg_test.tar.gz (7.0 KB) - added by svsharma@… 7 years ago.
Test project for issue.

Download all attachments as: .zip

Change History (52)

comment:1 Changed 7 years ago by russellm

  • Needs documentation unset
  • Needs tests unset
  • Patch needs improvement unset
  • Triage Stage changed from Unreviewed to Accepted

It would help a great deal if you gave us the actual models and sample data. You have highlighted an inconsistency, but you haven't given us the ability to reproduce the failure.

Changed 7 years ago by svsharma@…

Test project for issue.

comment:2 Changed 7 years ago by svsharma@…

if you unzip agg_test.tar.gz, start a django shell in the project and run count.py in the shell, you will see the error being reproduced.

cheers
Sid

comment:3 Changed 7 years ago by ubernostrum

  • Description modified (diff)

(formatting)

comment:4 Changed 7 years ago by anonymous

  • Cc flosch@… added

comment:5 Changed 6 years ago by anonymous

  • milestone post-1.0 deleted

Milestone post-1.0 deleted

comment:6 Changed 6 years ago by russellm

  • milestone set to 1.1

This is a fairly big problem, but it's not going to be an easy one to fix - the solution may just have to be documentation to say "don't do that".

comment:7 Changed 6 years ago by bendavis78

If anyone were to fix this, the first question to ask would be how you would structure the SQL. I'm not an SQL efficiency expert, but I know one way to get the correct result would be to use subqueries in the SELECT clause:

For example, let's say I have a User, PointEarning w/ ForeignKey(User), and PointExpense w/ F!oreignKey(User). Let's say I want a query to get the user's email, total points earned, and total points spent.:

SELECT auth_user.id, auth_user.email, (SELECT SUM(points) FROM app_pointearning WHERE user_id=auth_user.id) AS points_earned,  (SELECT SUM(points) FROM app_pointexpense WHERE user_id=auth_user.id) AS points_spent FROM auth_user

Another way, I believe, is to join on derived tables (sub-queries). Whatever way we decide to write the SQL, django could perhaps detect when we're using aggregates accross multiple joins, and alter the structure of the query accordingly. Seems like it could be kinda messy, but at least the results would be more what the user expected.

comment:8 Changed 6 years ago by anonymous

  • Cc bendavis78@… added

comment:9 Changed 6 years ago by bendavis78

From what I've researched, joining on subqueries would be faster than subqueries in the SELECT clause. So instead of this:

SELECT
  u.email,
  (SELECT SUM(points) FROM point_earning WHERE user_id=u.id) AS points_earned,
  (SELECT SUM(points) FROM point_expense WHERE user_id=u.id) AS points_spent
FROM
  "user" u

...we would want this:

SELECT
  u.email,
  a.points AS points_earned,
  b.points AS points_spent
FROM
  "user" u
  LEFT OUTER JOIN (SELECT user_id, SUM(points) AS points FROM point_earning GROUP BY user_id) a ON a.user_id=u.id
  LEFT OUTER JOIN (SELECT user_id, SUM(points) AS points FROM point_expense GROUP BY user_id) b ON b.user_id=u.id
ORDER BY u.id

What this does, essentially, is move the aggregate function into a derived table which we join onto the main table. I would imagine a solution to this bug would detect if annotations are being used across multiple relations, and if so, would adjust the join and select clauses as necessary. I may attempt to write a patch for this soon, but it seems like it would be tough, so there's no guarantees :-p

Thoughts?

comment:10 Changed 6 years ago by bendavis78

Ok, after spending several hours looking through django/db/models/sql/query.py, I feel like my brain has turned to mush. This is definitely better left up to the ppl that wrote it. I hope that at some point someone who is capable can write a patch for it.

I'm against simply adding a "don't do this" in the documentation, because users would simply expect something like this to work -- it's hard to understand at a high level why simply adding another annotation from a different relation would skew the results.

comment:11 Changed 6 years ago by russellm

  • milestone 1.1 deleted

comment:12 Changed 6 years ago by seocam

Would be nice to know why the milestone was deleted.
If we are pushing this to 1.2 we need to add it in "Known Bugs" in the doc page:
http://docs.djangoproject.com/en/dev/topics/db/aggregation/

comment:13 Changed 6 years ago by russellm

The milestone was deleted because this is a fairly big problem, without an obvious (or particularly easy) solution. There is no potential for data loss, and there is a viable workaround - either "don't do it",or use two queries rather than one. One of the side effects of a time-based release schedule is that some bugs will always exist in releases - unfortunately, this is one of those bugs.

Regarding adding this to a "known bugs" section - we don't document bugs. We document intended behavior. The ticket tracker is for bugs.

comment:14 Changed 6 years ago by anonymous

I'm going to second adding this to the aggregation documentation, because I just wasted two hours re-discovering this bug.

comment:15 Changed 6 years ago by bendavis78

I also agree that this should be documented, but perhaps an exception should be raise when attempting to do multiple annotations on different tables/joins? I can't think of a situation where that would ever guarantee accurate results.

comment:16 Changed 6 years ago by bendavis78

Regarding the example sql queries in comment 9 above, I've discovered that the first query is actually faster. My database has grown fairly large in the past few months, so I've been able to test with more clear results (this is using MySQL, by the way).

Using the JOIN method (where the aggregates are done in a derived join), the query was extremely slow. It took over 30 minutes to execute with 57,000 rows in the users table. When I changed the query to the SELECT method (where the aggregates are done in a subquery in the select clause), the query took only 2 seconds with the exact same results. I spent a whole day on this, and couldn't find a solid explanation for why the derived join method was taking so long. I'd say at this point if anyone moves forward with fixing this bug we should put the aggregate in the subquery of the select clause.

I've looked around for an official stance on how to write a query with multiple aggregations across different joined tables, and haven't come up with a solid answer. If anyone does find "the right way" to do it, I'd be interested to know -- I think that definitely needs to be settled before this bug can be fixed.

comment:17 Changed 6 years ago by anonymous

+1 victim

comment:18 Changed 6 years ago by fitoria

I can confirm this one... It happens with aggregate too wrong values in 2 - 5% range.

comment:19 Changed 6 years ago by anonymous

  • Cc cmutel@… added

comment:20 Changed 5 years ago by dablak

  • Cc daniel@… added

A big bug needs a big shoe...

comment:21 Changed 5 years ago by gabrielhurley

  • Cc gabrielhurley added

chalk up another victim... ::sigh::

comment:22 Changed 5 years ago by sebleier@…

  • Cc sebleier@… added

comment:23 Changed 5 years ago by fas

I also just ran into this bug and it almost went unnoticed. An exception would indeed be very nice indicating that this does not work yet.

comment:24 Changed 5 years ago by anonymous

  • Cc greg@… added

comment:25 Changed 5 years ago by mp

  • Cc mpjung@… added

comment:26 Changed 5 years ago by teemu

  • Cc teemu.kurppa@… added

Another victim here. Just spent several hours with finding this bug, which had existed several days. This definitely should be either fixed or prevented: values that the incorrect query provides are often somewhat in the sensible range, depending on your row structures, which can make this bug hard to spot.

I also tested both queries that bendavis78 outlined above and they give correct results. Didn't do extensive performance testing yet, with 20K rows in the largest table, both queries were between 0.05 and 0.10 secs.

comment:27 Changed 5 years ago by thamedave

Another victim.

This bug means that you can't use annotations at all unless you know exactly what is going to happen to your QuerySet later. Functions cannot expect that any QuerySet passed as a parameter is capable of being annotated as it may already have been done.

I am a reasonably to Django, but it seems odd to me that this has remained a known bug for 2 years without anyone working on it, or a the documentation being changed to reflect the lack of functionality. I understand russell's logic that bugs should not be documented, but if this cannot be fixed, then surely it should be removed from intended behaviour.

comment:28 Changed 4 years ago by SmileyChris

  • Severity set to Normal
  • Type set to Bug

comment:29 Changed 4 years ago by anonymous

  • Easy pickings unset
  • UI/UX unset

Another victim. I have a model Foo. I also have models Bar and Baz, which each have a foreign key pointing to Foo with related_names 'bars', and 'bazs', respectively:

for foo in Foo.objects.all().annotate(bar_count=Count('bars'), baz_count=Count('bazs')):
    print foo.bar_count  # Was correct
    print foo.baz_count  # Was completely wrong, but not by any apparent pattern (ie, the numbers seemed to be almost random)

I'm unable to display the baz_count in my template now, unless I want to introduce an N+1 problem (ie, foo.bazs.count() during each iteration)

comment:30 Changed 4 years ago by anonymous

Whoa, thought I was doing something wrong in my code until I searched and found this. Isn't this kind of a big deal? As in, with this bug, isn't Django's aggregation basically worthless for anyone who's not doing it on a single table, or did I miss something?

comment:31 follow-up: Changed 4 years ago by anonymous

Is this something that will ever be fixed?

comment:32 in reply to: ↑ 31 Changed 4 years ago by lukeplant

Replying to anonymous:

Is this something that will ever be fixed?

It will only ever get fixed if someone cares enough to write a patch (complete with tests). I think at this point a patch which threw an exception for the failure case would be accepted. Obviously ideally this would be replaced with a proper fix at some point.

comment:33 follow-up: Changed 4 years ago by anonymous

Really, it seems there are plenty of ticket with patches (complete with tests), and with offers to do the work, but the core team doesn't care.

comment:34 in reply to: ↑ 33 Changed 4 years ago by carljm

Replying to anonymous:

Really, it seems there are plenty of ticket with patches (complete with tests), and with offers to do the work, but the core team doesn't care.

Sorry you're feeling frustrated.

There are currently 15 tickets in Trac marked "Ready for checkin." The _only_ task in Django's development process that's limited to core developers is committing those 15 tickets. Fifteen tickets doesn't seem like such a terrible backlog for an entirely-volunteer crew, frankly.

Any other step in the process (writing a patch, adding tests to an existing patch, reviewing a patch and marking it "Ready for checkin" if there are tests, they pass, and it fixes the problem for you) can be done by anyone in the community. See https://docs.djangoproject.com/en/dev/internals/contributing/triaging-tickets/

There is a larger backlog of "design decision needed" tickets (298), but anyone can contribute usefully to those as well by researching pros and cons and presenting a balanced summary to the developers mailing list to start a discussion.

comment:36 Changed 4 years ago by Fak3

  • Cc Fak3 added

comment:37 Changed 3 years ago by aaugustin

#19290 is a duplicate.

comment:38 Changed 3 years ago by michael@…

Hmm, another victim here. Spent 2 hours trying to understand why my numbers were so odd to finally find this bug. This is the first time in several django projects I am really feeling left alone :-(
I am no database expert (which is one reason i LOVE django and it's awesome Queryset) so I don't see any way that I could provide implementation help or anything, but seeing that this bug is now open for several years really makes me shake my head. Even if there is no obvious fix available, PLEASE at least update the documentatin to point out this issue!

comment:39 Changed 3 years ago by michaelB

  • Cc michaelB added

comment:40 Changed 3 years ago by akaariai

This is somewhat hard problem. We could do automatic subquery pushdown for aggregates - it isn't exactly easy but most of the pieces are already there. I am not sure how well such queries will perform. Although that isn't perhaps too important. If the results aren't correct speed doesn't matter...

I think a good start to this would be documentation + detecting the problematic situations and throwing an error in those cases. Basically if we have two multijoins in the query (from filtering, aggregation or something else) and aggregation is used then the query is likely wrong. However this condition is a little too restrictive - something like:

qs.filter(translations__lang='fi', translations__name='foo')

introduces a multijoin to translations, yet the query is safe as translations.lang is unique. Detecting if we have a unique multijoin is going to be nasty. So, we need a way to bypass this error checking.

Proposal: aggregates get a new kwarg "inline" - default False. If inline is False, then multiple multijoins in single query will raise an error. If it is True, the aggregate is forced to be joined to the query (that is, to do what aggregates do today). Later on we will likely add another kwarg "subquery" - defaults again to False. If set to True the aggregate is forced to subquery (something like in comment:9). We could also add kwarg "subselect". Later on maybe also "lateral_subquery" and so on.

The proposal allows us to throw an error, yet give users the ability to force the ORM to generate the queries they are using now. It also allows for extending the same API if we get subquery aggregate support some day.

I think we need to add the information about multijoins to JoinInfo, and add the info about used joins to aggregates when they are added to the query. Then at compiler stage, check for all inline=False aggregates if there are multijoins in the query not used by the aggregate itself. If so, throw an error. Changes should be needed at query.setup_joins(), compiler.get_from_clause() and then a way to annotate the used joins into the aggregate (this should not be too hard).

comment:41 Changed 3 years ago by anonymous

At this point the documentation on annotations can be considered to be intentionally misleading.

It is hard not to think this given the severity of the issue and the comments above.

Yes a code fix is hard but at least an obviously tentative documentation "fix" can be pushed fairly easily.

comment:42 Changed 2 years ago by akaariai

  • Component changed from ORM aggregation to Database layer (models, ORM)

comment:43 Changed 2 years ago by anonymous

Bump

comment:44 Changed 14 months ago by anonymous

bump!!!!

comment:45 Changed 14 months ago by timo

Bumping this ticket accomplishes nothing. If you want to see it addressed, please write a patch.

comment:46 Changed 8 months ago by benjaoming

Agreed with anonymous 01/28/2013 that there needs to be at least some documentation on this: https://github.com/django/django/pull/3654

comment:47 Changed 8 months ago by Tim Graham <timograham@…>

In 4b23a6c7a9232cc07ec95fe98be17efbd4449822:

Documented a current limitation of multiple table annotation; refs #10060.

comment:48 Changed 8 months ago by Tim Graham <timograham@…>

In 48bd44670b4c58735d5cdf6762d6b61219ecbb12:

[1.7.x] Documented a current limitation of multiple table annotation; refs #10060.

Backport of 4b23a6c7a9232cc07ec95fe98be17efbd4449822 from master

comment:49 Changed 8 months ago by Tim Graham <timograham@…>

In a87ade7e30c80de93ec7ad8335d8714b494a58fb:

[1.6.x] Documented a current limitation of multiple table annotation; refs #10060.

Backport of 4b23a6c7a9232cc07ec95fe98be17efbd4449822 from master

comment:50 follow-up: Changed 4 months ago by camillobruni

I recently ran into a similar problem. The deep joining causes an explosion of values possibly resulting in many duplicate values. I fixed this in https://github.com/django/django/pull/4388 by implementing the DISTINCT parameter on the Aggregate functions. Hence you can do
Sum('center__client__loan__payment_schedule__payments__principal', distinct=True)
which ignores the duplicate entries.

It might be that the issue at hand is slightly different, but this solves at least one particular case.

comment:51 in reply to: ↑ 50 Changed 4 months ago by camillobruni

Replying to camillobruni:

I recently ran into a similar problem. The deep joining causes an explosion of values possibly resulting in many duplicate values. I fixed this in https://github.com/django/django/pull/4388 by implementing the DISTINCT parameter on the Aggregate functions. Hence you can do
Sum('center__client__loan__payment_schedule__payments__principal', distinct=True)
which ignores the duplicate entries.

It might be that the issue at hand is slightly different, but this solves at least one particular case.

Sorry for the noise, I was wrong (in hindsight: "obviously") http://www.sqlteam.com/article/how-to-use-group-by-with-distinct-aggregates-and-derived-tables describes the problem I thought of dealing with in more detail.

comment:52 Changed 3 weeks ago by glic3rinu

  • Cc glic3rinu added
Note: See TracTickets for help on using tickets.
Back to Top