Opened 8 years ago

Last modified 3 months ago

#10060 new Bug

Multiple table annotation failure

Reported by: svsharma@… Owned by:
Component: Database layer (models, ORM) Version: master
Severity: Normal Keywords:
Cc: flosch@…, bendavis78@…, cmutel@…, daniel@…, Gabriel Hurley, sebleier@…, greg@…, mpjung@…, teemu.kurppa@…, Evstifeev Roman, michaelB, Marc Aymerich, nowak2000@… Triage Stage: Accepted
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description (last modified by James Bennett)

Annotating across multiple tables results in wrong answers. i.e.

In [110]: total = Branch.objects.all().annotate(total=Sum('center__client__loan__amount'))

In [111]: total[0].total
Out[111]: 3433000

In [112]: repaid = Branch.objects.all().annotate(repaid=Sum('center__client__loan__payment_schedule__payments__principal'))

In [113]: repaid[0].repaid
Out[113]: 1976320.0

In [114]: both =  Branch.objects.all().annotate(total=Sum('center__client__loan__amount'),repaid=Sum('center__client__loan__payment_schedule__payments__principal'))

In [115]: both[0].repaid
Out[115]: 1976320.0

In [116]: both[0].total
Out[116]: 98816000
          ^^^^^^^^^^^

Compare the output of total in 116 vs. 111 (the correct answer).

Attachments (1)

agg_test.tar.gz (7.0 KB) - added by svsharma@… 8 years ago.
Test project for issue.

Download all attachments as: .zip

Change History (56)

comment:1 Changed 8 years ago by Russell Keith-Magee

Triage Stage: UnreviewedAccepted

It would help a great deal if you gave us the actual models and sample data. You have highlighted an inconsistency, but you haven't given us the ability to reproduce the failure.

Changed 8 years ago by svsharma@…

Attachment: agg_test.tar.gz added

Test project for issue.

comment:2 Changed 8 years ago by svsharma@…

if you unzip agg_test.tar.gz, start a django shell in the project and run count.py in the shell, you will see the error being reproduced.

cheers
Sid

comment:3 Changed 8 years ago by James Bennett

Description: modified (diff)

(formatting)

comment:4 Changed 8 years ago by anonymous

Cc: flosch@… added

comment:6 Changed 8 years ago by Russell Keith-Magee

milestone: 1.1

This is a fairly big problem, but it's not going to be an easy one to fix - the solution may just have to be documentation to say "don't do that".

comment:7 Changed 8 years ago by Ben Davis

If anyone were to fix this, the first question to ask would be how you would structure the SQL. I'm not an SQL efficiency expert, but I know one way to get the correct result would be to use subqueries in the SELECT clause:

For example, let's say I have a User, PointEarning w/ ForeignKey(User), and PointExpense w/ F!oreignKey(User). Let's say I want a query to get the user's email, total points earned, and total points spent.:

SELECT auth_user.id, auth_user.email, (SELECT SUM(points) FROM app_pointearning WHERE user_id=auth_user.id) AS points_earned,  (SELECT SUM(points) FROM app_pointexpense WHERE user_id=auth_user.id) AS points_spent FROM auth_user

Another way, I believe, is to join on derived tables (sub-queries). Whatever way we decide to write the SQL, django could perhaps detect when we're using aggregates accross multiple joins, and alter the structure of the query accordingly. Seems like it could be kinda messy, but at least the results would be more what the user expected.

comment:8 Changed 8 years ago by anonymous

Cc: bendavis78@… added

comment:9 Changed 8 years ago by Ben Davis

From what I've researched, joining on subqueries would be faster than subqueries in the SELECT clause. So instead of this:

SELECT
  u.email,
  (SELECT SUM(points) FROM point_earning WHERE user_id=u.id) AS points_earned,
  (SELECT SUM(points) FROM point_expense WHERE user_id=u.id) AS points_spent
FROM
  "user" u

...we would want this:

SELECT
  u.email,
  a.points AS points_earned,
  b.points AS points_spent
FROM
  "user" u
  LEFT OUTER JOIN (SELECT user_id, SUM(points) AS points FROM point_earning GROUP BY user_id) a ON a.user_id=u.id
  LEFT OUTER JOIN (SELECT user_id, SUM(points) AS points FROM point_expense GROUP BY user_id) b ON b.user_id=u.id
ORDER BY u.id

What this does, essentially, is move the aggregate function into a derived table which we join onto the main table. I would imagine a solution to this bug would detect if annotations are being used across multiple relations, and if so, would adjust the join and select clauses as necessary. I may attempt to write a patch for this soon, but it seems like it would be tough, so there's no guarantees :-p

Thoughts?

comment:10 Changed 8 years ago by Ben Davis

Ok, after spending several hours looking through django/db/models/sql/query.py, I feel like my brain has turned to mush. This is definitely better left up to the ppl that wrote it. I hope that at some point someone who is capable can write a patch for it.

I'm against simply adding a "don't do this" in the documentation, because users would simply expect something like this to work -- it's hard to understand at a high level why simply adding another annotation from a different relation would skew the results.

comment:11 Changed 8 years ago by Russell Keith-Magee

milestone: 1.1

comment:12 Changed 8 years ago by Sergio Oliveira

Would be nice to know why the milestone was deleted.
If we are pushing this to 1.2 we need to add it in "Known Bugs" in the doc page:
http://docs.djangoproject.com/en/dev/topics/db/aggregation/

comment:13 Changed 8 years ago by Russell Keith-Magee

The milestone was deleted because this is a fairly big problem, without an obvious (or particularly easy) solution. There is no potential for data loss, and there is a viable workaround - either "don't do it",or use two queries rather than one. One of the side effects of a time-based release schedule is that some bugs will always exist in releases - unfortunately, this is one of those bugs.

Regarding adding this to a "known bugs" section - we don't document bugs. We document intended behavior. The ticket tracker is for bugs.

comment:14 Changed 8 years ago by anonymous

I'm going to second adding this to the aggregation documentation, because I just wasted two hours re-discovering this bug.

comment:15 Changed 8 years ago by Ben Davis

I also agree that this should be documented, but perhaps an exception should be raise when attempting to do multiple annotations on different tables/joins? I can't think of a situation where that would ever guarantee accurate results.

comment:16 Changed 8 years ago by Ben Davis

Regarding the example sql queries in comment 9 above, I've discovered that the first query is actually faster. My database has grown fairly large in the past few months, so I've been able to test with more clear results (this is using MySQL, by the way).

Using the JOIN method (where the aggregates are done in a derived join), the query was extremely slow. It took over 30 minutes to execute with 57,000 rows in the users table. When I changed the query to the SELECT method (where the aggregates are done in a subquery in the select clause), the query took only 2 seconds with the exact same results. I spent a whole day on this, and couldn't find a solid explanation for why the derived join method was taking so long. I'd say at this point if anyone moves forward with fixing this bug we should put the aggregate in the subquery of the select clause.

I've looked around for an official stance on how to write a query with multiple aggregations across different joined tables, and haven't come up with a solid answer. If anyone does find "the right way" to do it, I'd be interested to know -- I think that definitely needs to be settled before this bug can be fixed.

comment:17 Changed 8 years ago by anonymous

+1 victim

comment:18 Changed 7 years ago by fitoria

I can confirm this one... It happens with aggregate too wrong values in 2 - 5% range.

comment:19 Changed 7 years ago by anonymous

Cc: cmutel@… added

comment:20 Changed 7 years ago by Daniel Blasco

Cc: daniel@… added

A big bug needs a big shoe...

comment:21 Changed 7 years ago by Gabriel Hurley

Cc: Gabriel Hurley added

chalk up another victim... ::sigh::

comment:22 Changed 7 years ago by sebleier@…

Cc: sebleier@… added

comment:23 Changed 7 years ago by fas

I also just ran into this bug and it almost went unnoticed. An exception would indeed be very nice indicating that this does not work yet.

comment:24 Changed 7 years ago by anonymous

Cc: greg@… added

comment:25 Changed 7 years ago by mp

Cc: mpjung@… added

comment:26 Changed 6 years ago by Teemu Kurppa

Cc: teemu.kurppa@… added

Another victim here. Just spent several hours with finding this bug, which had existed several days. This definitely should be either fixed or prevented: values that the incorrect query provides are often somewhat in the sensible range, depending on your row structures, which can make this bug hard to spot.

I also tested both queries that bendavis78 outlined above and they give correct results. Didn't do extensive performance testing yet, with 20K rows in the largest table, both queries were between 0.05 and 0.10 secs.

comment:27 Changed 6 years ago by Dead Account

Another victim.

This bug means that you can't use annotations at all unless you know exactly what is going to happen to your QuerySet later. Functions cannot expect that any QuerySet passed as a parameter is capable of being annotated as it may already have been done.

I am a reasonably to Django, but it seems odd to me that this has remained a known bug for 2 years without anyone working on it, or a the documentation being changed to reflect the lack of functionality. I understand russell's logic that bugs should not be documented, but if this cannot be fixed, then surely it should be removed from intended behaviour.

comment:28 Changed 6 years ago by Chris Beaven

Severity: Normal
Type: Bug

comment:29 Changed 5 years ago by anonymous

Easy pickings: unset
UI/UX: unset

Another victim. I have a model Foo. I also have models Bar and Baz, which each have a foreign key pointing to Foo with related_names 'bars', and 'bazs', respectively:

for foo in Foo.objects.all().annotate(bar_count=Count('bars'), baz_count=Count('bazs')):
    print foo.bar_count  # Was correct
    print foo.baz_count  # Was completely wrong, but not by any apparent pattern (ie, the numbers seemed to be almost random)

I'm unable to display the baz_count in my template now, unless I want to introduce an N+1 problem (ie, foo.bazs.count() during each iteration)

comment:30 Changed 5 years ago by anonymous

Whoa, thought I was doing something wrong in my code until I searched and found this. Isn't this kind of a big deal? As in, with this bug, isn't Django's aggregation basically worthless for anyone who's not doing it on a single table, or did I miss something?

comment:31 Changed 5 years ago by anonymous

Is this something that will ever be fixed?

comment:32 in reply to:  31 Changed 5 years ago by Luke Plant

Replying to anonymous:

Is this something that will ever be fixed?

It will only ever get fixed if someone cares enough to write a patch (complete with tests). I think at this point a patch which threw an exception for the failure case would be accepted. Obviously ideally this would be replaced with a proper fix at some point.

comment:33 Changed 5 years ago by anonymous

Really, it seems there are plenty of ticket with patches (complete with tests), and with offers to do the work, but the core team doesn't care.

comment:34 in reply to:  33 Changed 5 years ago by Carl Meyer

Replying to anonymous:

Really, it seems there are plenty of ticket with patches (complete with tests), and with offers to do the work, but the core team doesn't care.

Sorry you're feeling frustrated.

There are currently 15 tickets in Trac marked "Ready for checkin." The _only_ task in Django's development process that's limited to core developers is committing those 15 tickets. Fifteen tickets doesn't seem like such a terrible backlog for an entirely-volunteer crew, frankly.

Any other step in the process (writing a patch, adding tests to an existing patch, reviewing a patch and marking it "Ready for checkin" if there are tests, they pass, and it fixes the problem for you) can be done by anyone in the community. See https://docs.djangoproject.com/en/dev/internals/contributing/triaging-tickets/

There is a larger backlog of "design decision needed" tickets (298), but anyone can contribute usefully to those as well by researching pros and cons and presenting a balanced summary to the developers mailing list to start a discussion.

comment:36 Changed 5 years ago by Evstifeev Roman

Cc: Evstifeev Roman added

comment:37 Changed 4 years ago by Aymeric Augustin

#19290 is a duplicate.

comment:38 Changed 4 years ago by michael@…

Hmm, another victim here. Spent 2 hours trying to understand why my numbers were so odd to finally find this bug. This is the first time in several django projects I am really feeling left alone :-(
I am no database expert (which is one reason i LOVE django and it's awesome Queryset) so I don't see any way that I could provide implementation help or anything, but seeing that this bug is now open for several years really makes me shake my head. Even if there is no obvious fix available, PLEASE at least update the documentatin to point out this issue!

comment:39 Changed 4 years ago by michaelB

Cc: michaelB added

comment:40 Changed 4 years ago by Anssi Kääriäinen

This is somewhat hard problem. We could do automatic subquery pushdown for aggregates - it isn't exactly easy but most of the pieces are already there. I am not sure how well such queries will perform. Although that isn't perhaps too important. If the results aren't correct speed doesn't matter...

I think a good start to this would be documentation + detecting the problematic situations and throwing an error in those cases. Basically if we have two multijoins in the query (from filtering, aggregation or something else) and aggregation is used then the query is likely wrong. However this condition is a little too restrictive - something like:

qs.filter(translations__lang='fi', translations__name='foo')

introduces a multijoin to translations, yet the query is safe as translations.lang is unique. Detecting if we have a unique multijoin is going to be nasty. So, we need a way to bypass this error checking.

Proposal: aggregates get a new kwarg "inline" - default False. If inline is False, then multiple multijoins in single query will raise an error. If it is True, the aggregate is forced to be joined to the query (that is, to do what aggregates do today). Later on we will likely add another kwarg "subquery" - defaults again to False. If set to True the aggregate is forced to subquery (something like in comment:9). We could also add kwarg "subselect". Later on maybe also "lateral_subquery" and so on.

The proposal allows us to throw an error, yet give users the ability to force the ORM to generate the queries they are using now. It also allows for extending the same API if we get subquery aggregate support some day.

I think we need to add the information about multijoins to JoinInfo, and add the info about used joins to aggregates when they are added to the query. Then at compiler stage, check for all inline=False aggregates if there are multijoins in the query not used by the aggregate itself. If so, throw an error. Changes should be needed at query.setup_joins(), compiler.get_from_clause() and then a way to annotate the used joins into the aggregate (this should not be too hard).

comment:41 Changed 4 years ago by anonymous

At this point the documentation on annotations can be considered to be intentionally misleading.

It is hard not to think this given the severity of the issue and the comments above.

Yes a code fix is hard but at least an obviously tentative documentation "fix" can be pushed fairly easily.

comment:42 Changed 4 years ago by Anssi Kääriäinen

Component: ORM aggregationDatabase layer (models, ORM)

comment:46 Changed 2 years ago by benjaoming

Agreed with anonymous 01/28/2013 that there needs to be at least some documentation on this: https://github.com/django/django/pull/3654

comment:47 Changed 2 years ago by Tim Graham <timograham@…>

In 4b23a6c7a9232cc07ec95fe98be17efbd4449822:

Documented a current limitation of multiple table annotation; refs #10060.

comment:48 Changed 2 years ago by Tim Graham <timograham@…>

In 48bd44670b4c58735d5cdf6762d6b61219ecbb12:

[1.7.x] Documented a current limitation of multiple table annotation; refs #10060.

Backport of 4b23a6c7a9232cc07ec95fe98be17efbd4449822 from master

comment:49 Changed 2 years ago by Tim Graham <timograham@…>

In a87ade7e30c80de93ec7ad8335d8714b494a58fb:

[1.6.x] Documented a current limitation of multiple table annotation; refs #10060.

Backport of 4b23a6c7a9232cc07ec95fe98be17efbd4449822 from master

comment:50 Changed 2 years ago by Camillo Bruni

I recently ran into a similar problem. The deep joining causes an explosion of values possibly resulting in many duplicate values. I fixed this in https://github.com/django/django/pull/4388 by implementing the DISTINCT parameter on the Aggregate functions. Hence you can do
Sum('center__client__loan__payment_schedule__payments__principal', distinct=True)
which ignores the duplicate entries.

It might be that the issue at hand is slightly different, but this solves at least one particular case.

comment:51 in reply to:  50 Changed 2 years ago by Camillo Bruni

Replying to camillobruni:

I recently ran into a similar problem. The deep joining causes an explosion of values possibly resulting in many duplicate values. I fixed this in https://github.com/django/django/pull/4388 by implementing the DISTINCT parameter on the Aggregate functions. Hence you can do
Sum('center__client__loan__payment_schedule__payments__principal', distinct=True)
which ignores the duplicate entries.

It might be that the issue at hand is slightly different, but this solves at least one particular case.

Sorry for the noise, I was wrong (in hindsight: "obviously") http://www.sqlteam.com/article/how-to-use-group-by-with-distinct-aggregates-and-derived-tables describes the problem I thought of dealing with in more detail.

comment:52 Changed 20 months ago by Marc Aymerich

Cc: Marc Aymerich added

comment:53 Changed 15 months ago by Tim Graham

If I understood Anssi correctly, the existing documentation warning isn't correct, so I submitted a revision.

comment:54 Changed 14 months ago by Tim Graham <timograham@…>

In 3d22367:

Refs #10060 -- Corrected description of multiple annotations bug.

comment:55 Changed 14 months ago by Tim Graham <timograham@…>

In ea63cf8d:

[1.9.x] Refs #10060 -- Corrected description of multiple annotations bug.

Backport of 3d2236773ba88e330841b8c72183b0e978e10909 from master

comment:56 Changed 12 months ago by Tomasz Nowak

Cc: nowak2000@… added

comment:57 in reply to:  9 Changed 10 months ago by powderflask

Replying to bendavis78:
In comment16 bendavis78 recommends this query to solve the issue:

SELECT
  u.email,
  (SELECT SUM(points) FROM point_earning WHERE user_id=u.id) AS points_earned,
  (SELECT SUM(points) FROM point_expense WHERE user_id=u.id) AS points_spent
FROM
  "user" u

For others who might arrive here looking for a reasonable way to work around this issue, the following worked well for me. I added a custom Queryset method to the Manager class, and used a RawSQL() expression to create the annotations. This at least encapsulates the SQL code and allows the annotation to be integrated with a normal django queryset. Here's a sample for the example given above:

def annotate_sum_for_user(user_related_modelClass, field_name, annotation_name):
            raw_query = """
          SELECT SUM({field}) FROM {model} AS model
              WHERE model.user_id = user.id
        """.format(
            field = field_name,
            model = user_related_modelClass._meta.db_table,
        )

        annotation = {annotation_name: RawSQL(raw_query, [])}
        return self.annotate(**annotation)

Usage for above query on presumed User model:

    users = models.User.objects\
        .annotate_sum_for_user(PointEarning, 'points', 'points_earned')\
        .annotate_sum_for_user(PointExpense, 'points', 'points_spent')

Hope someone finds this useful, and a word of HUGE thanks to those who worked to get a statement of this issue and a link to this thread into the django documentation -- probably saved me hours.

comment:58 Changed 7 months ago by Aur Saraf

I just ran into this and submitted a duplicate bug (#26898).

I see a lot of sadness in this thread, joining my own.

My offer to fix if someone knowledgeable about the code can mentor me still stands. I assume at least detecting it and generating a warning shouldn't be too hard?

comment:59 Changed 6 months ago by Danilo Akamine

The solution proposed by powderflask is very interesting and worked well.
This ticket helped me a lot to understand the big problem.

Thank you all, guys.

comment:60 Changed 3 months ago by akermen

No sure this will work for all different cases, at least it worked well for me and the below simplified test case:

Model definitions of test case,

class ModelA(models.Model):
    title = models.CharField(max_length=255, blank=False, null=False, unique=True)

class ModelB(models.Model):
    parent = models.ForeignKey(ModelA, on_delete=models.CASCADE)    
    value = models.IntegerField()

class ModelC(models.Model):
    parent = models.ForeignKey(ModelA, on_delete=models.CASCADE)    

and sample data:

instance_a_1 = ModelA(title="instance1")
instance_a_1.save()
instance_b = ModelB(parent=instance_a_1, value=1)
instance_b.save()
instance_b = ModelB(parent=instance_a_1, value=3)
instance_b.save()
instance_b = ModelB(parent=instance_a_1, value=5)
instance_b.save()
instance_c = ModelC(parent=instance_a_1)
instance_c.save()
instance_c = ModelC(parent=instance_a_1)
instance_c.save()
instance_c = ModelC(parent=instance_a_1)
instance_c.save()
instance_c = ModelC(parent=instance_a_1)
instance_c.save()
instance_c = ModelC(parent=instance_a_1)
instance_c.save()

instance_a_2 = ModelA(title="instance2")
instance_a_2.save()
instance_b = ModelB(parent=instance_a_2, value=7)
instance_b.save()
instance_b = ModelB(parent=instance_a_2, value=11)
instance_b.save()
instance_c = ModelC(parent=instance_a_2)
instance_c.save()
instance_c = ModelC(parent=instance_a_2)
instance_c.save()
instance_c = ModelC(parent=instance_a_2)
instance_c.save()

Trying to get two independent annotations from two different tables (sum of values from ModelB and number of ModelC instances per ModelA instance):

for a in ModelA.objects.all() \
    .annotate(sumB=Sum('modelb__value')) \
    .annotate(countC=Count('modelc', distinct=True)):
    print "%s:  sumB: %s  countC: %s" % (a.title, a.sumB, a.countC)

we get these results

instance1:  sumB: 45  countC: 5
instance2:  sumB: 54  countC: 3

instead of these:

instance1  sumB: 9   countC: 5
instance2  sumB: 18   countC: 3

no surprise until here.

As you can see the 'sumB' values are repeated (multiplied) by a factor, lets account that factor inside single query:

for a in ModelA.objects.all() \
    .annotate(countC=Count('modelc', distinct=True)) \
    .annotate(countB=Count('modelb')) \
    .annotate(countB_distinct=Count('modelb', distinct=True)) \
    .annotate(sumB_multiplied=Sum('modelb__value')) \
    .annotate(sumB=(F('sumB_multiplied') * F('countB_distinct')) / F('countB')):
    print "%s  sumB: %s   countC: %s" % (a.title, a.sumB, a.countC)

and get correct values:

instance1  sumB: 9   countC: 5
instance2  sumB: 18   countC: 3
Note: See TracTickets for help on using tickets.
Back to Top