Opened 16 years ago

Last modified 7 months ago

#10060 new Bug

Multiple table annotation failure

Reported by: svsharma@… Owned by:
Component: Database layer (models, ORM) Version: dev
Severity: Normal Keywords:
Cc: flosch@…, bendavis78@…, cmutel@…, daniel@…, Gabriel Hurley, sebleier@…, greg@…, mpjung@…, teemu.kurppa@…, Evstifeev Roman, michaelB, Marc Aymerich, nowak2000@…, Maxim, Gökhan Sarı, Adam M. Costello, Antoine, David Kwong Triage Stage: Accepted
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description (last modified by James Bennett)

Annotating across multiple tables results in wrong answers. i.e.

In [110]: total = Branch.objects.all().annotate(total=Sum('center__client__loan__amount'))

In [111]: total[0].total
Out[111]: 3433000

In [112]: repaid = Branch.objects.all().annotate(repaid=Sum('center__client__loan__payment_schedule__payments__principal'))

In [113]: repaid[0].repaid
Out[113]: 1976320.0

In [114]: both =  Branch.objects.all().annotate(total=Sum('center__client__loan__amount'),repaid=Sum('center__client__loan__payment_schedule__payments__principal'))

In [115]: both[0].repaid
Out[115]: 1976320.0

In [116]: both[0].total
Out[116]: 98816000
          ^^^^^^^^^^^

Compare the output of total in 116 vs. 111 (the correct answer).

Attachments (1)

agg_test.tar.gz (7.0 KB ) - added by svsharma@… 16 years ago.
Test project for issue.

Download all attachments as: .zip

Change History (70)

comment:1 by Russell Keith-Magee, 16 years ago

Triage Stage: UnreviewedAccepted

It would help a great deal if you gave us the actual models and sample data. You have highlighted an inconsistency, but you haven't given us the ability to reproduce the failure.

by svsharma@…, 16 years ago

Attachment: agg_test.tar.gz added

Test project for issue.

comment:2 by svsharma@…, 16 years ago

if you unzip agg_test.tar.gz, start a django shell in the project and run count.py in the shell, you will see the error being reproduced.

cheers
Sid

comment:3 by James Bennett, 16 years ago

Description: modified (diff)

(formatting)

comment:4 by anonymous, 16 years ago

Cc: flosch@… added

comment:6 by Russell Keith-Magee, 16 years ago

milestone: 1.1

This is a fairly big problem, but it's not going to be an easy one to fix - the solution may just have to be documentation to say "don't do that".

comment:7 by Ben Davis, 16 years ago

If anyone were to fix this, the first question to ask would be how you would structure the SQL. I'm not an SQL efficiency expert, but I know one way to get the correct result would be to use subqueries in the SELECT clause:

For example, let's say I have a User, PointEarning w/ ForeignKey(User), and PointExpense w/ F!oreignKey(User). Let's say I want a query to get the user's email, total points earned, and total points spent.:

SELECT auth_user.id, auth_user.email, (SELECT SUM(points) FROM app_pointearning WHERE user_id=auth_user.id) AS points_earned,  (SELECT SUM(points) FROM app_pointexpense WHERE user_id=auth_user.id) AS points_spent FROM auth_user

Another way, I believe, is to join on derived tables (sub-queries). Whatever way we decide to write the SQL, django could perhaps detect when we're using aggregates accross multiple joins, and alter the structure of the query accordingly. Seems like it could be kinda messy, but at least the results would be more what the user expected.

comment:8 by anonymous, 16 years ago

Cc: bendavis78@… added

comment:9 by Ben Davis, 16 years ago

From what I've researched, joining on subqueries would be faster than subqueries in the SELECT clause. So instead of this:

SELECT
  u.email,
  (SELECT SUM(points) FROM point_earning WHERE user_id=u.id) AS points_earned,
  (SELECT SUM(points) FROM point_expense WHERE user_id=u.id) AS points_spent
FROM
  "user" u

...we would want this:

SELECT
  u.email,
  a.points AS points_earned,
  b.points AS points_spent
FROM
  "user" u
  LEFT OUTER JOIN (SELECT user_id, SUM(points) AS points FROM point_earning GROUP BY user_id) a ON a.user_id=u.id
  LEFT OUTER JOIN (SELECT user_id, SUM(points) AS points FROM point_expense GROUP BY user_id) b ON b.user_id=u.id
ORDER BY u.id

What this does, essentially, is move the aggregate function into a derived table which we join onto the main table. I would imagine a solution to this bug would detect if annotations are being used across multiple relations, and if so, would adjust the join and select clauses as necessary. I may attempt to write a patch for this soon, but it seems like it would be tough, so there's no guarantees :-p

Thoughts?

comment:10 by Ben Davis, 16 years ago

Ok, after spending several hours looking through django/db/models/sql/query.py, I feel like my brain has turned to mush. This is definitely better left up to the ppl that wrote it. I hope that at some point someone who is capable can write a patch for it.

I'm against simply adding a "don't do this" in the documentation, because users would simply expect something like this to work -- it's hard to understand at a high level why simply adding another annotation from a different relation would skew the results.

comment:11 by Russell Keith-Magee, 16 years ago

milestone: 1.1

comment:12 by Sergio Oliveira, 16 years ago

Would be nice to know why the milestone was deleted.
If we are pushing this to 1.2 we need to add it in "Known Bugs" in the doc page:
http://docs.djangoproject.com/en/dev/topics/db/aggregation/

comment:13 by Russell Keith-Magee, 16 years ago

The milestone was deleted because this is a fairly big problem, without an obvious (or particularly easy) solution. There is no potential for data loss, and there is a viable workaround - either "don't do it",or use two queries rather than one. One of the side effects of a time-based release schedule is that some bugs will always exist in releases - unfortunately, this is one of those bugs.

Regarding adding this to a "known bugs" section - we don't document bugs. We document intended behavior. The ticket tracker is for bugs.

comment:14 by anonymous, 16 years ago

I'm going to second adding this to the aggregation documentation, because I just wasted two hours re-discovering this bug.

comment:15 by Ben Davis, 16 years ago

I also agree that this should be documented, but perhaps an exception should be raise when attempting to do multiple annotations on different tables/joins? I can't think of a situation where that would ever guarantee accurate results.

comment:16 by Ben Davis, 16 years ago

Regarding the example sql queries in comment 9 above, I've discovered that the first query is actually faster. My database has grown fairly large in the past few months, so I've been able to test with more clear results (this is using MySQL, by the way).

Using the JOIN method (where the aggregates are done in a derived join), the query was extremely slow. It took over 30 minutes to execute with 57,000 rows in the users table. When I changed the query to the SELECT method (where the aggregates are done in a subquery in the select clause), the query took only 2 seconds with the exact same results. I spent a whole day on this, and couldn't find a solid explanation for why the derived join method was taking so long. I'd say at this point if anyone moves forward with fixing this bug we should put the aggregate in the subquery of the select clause.

I've looked around for an official stance on how to write a query with multiple aggregations across different joined tables, and haven't come up with a solid answer. If anyone does find "the right way" to do it, I'd be interested to know -- I think that definitely needs to be settled before this bug can be fixed.

comment:17 by anonymous, 15 years ago

+1 victim

comment:18 by fitoria, 15 years ago

I can confirm this one... It happens with aggregate too wrong values in 2 - 5% range.

comment:19 by anonymous, 15 years ago

Cc: cmutel@… added

comment:20 by Daniel Blasco, 15 years ago

Cc: daniel@… added

A big bug needs a big shoe...

comment:21 by Gabriel Hurley, 15 years ago

Cc: Gabriel Hurley added

chalk up another victim... ::sigh::

comment:22 by sebleier@…, 15 years ago

Cc: sebleier@… added

comment:23 by fas, 15 years ago

I also just ran into this bug and it almost went unnoticed. An exception would indeed be very nice indicating that this does not work yet.

comment:24 by anonymous, 15 years ago

Cc: greg@… added

comment:25 by mp, 15 years ago

Cc: mpjung@… added

comment:26 by Teemu Kurppa, 14 years ago

Cc: teemu.kurppa@… added

Another victim here. Just spent several hours with finding this bug, which had existed several days. This definitely should be either fixed or prevented: values that the incorrect query provides are often somewhat in the sensible range, depending on your row structures, which can make this bug hard to spot.

I also tested both queries that bendavis78 outlined above and they give correct results. Didn't do extensive performance testing yet, with 20K rows in the largest table, both queries were between 0.05 and 0.10 secs.

comment:27 by Dead Account, 14 years ago

Another victim.

This bug means that you can't use annotations at all unless you know exactly what is going to happen to your QuerySet later. Functions cannot expect that any QuerySet passed as a parameter is capable of being annotated as it may already have been done.

I am a reasonably to Django, but it seems odd to me that this has remained a known bug for 2 years without anyone working on it, or a the documentation being changed to reflect the lack of functionality. I understand russell's logic that bugs should not be documented, but if this cannot be fixed, then surely it should be removed from intended behaviour.

comment:28 by Chris Beaven, 14 years ago

Severity: Normal
Type: Bug

comment:29 by anonymous, 13 years ago

Easy pickings: unset
UI/UX: unset

Another victim. I have a model Foo. I also have models Bar and Baz, which each have a foreign key pointing to Foo with related_names 'bars', and 'bazs', respectively:

for foo in Foo.objects.all().annotate(bar_count=Count('bars'), baz_count=Count('bazs')):
    print foo.bar_count  # Was correct
    print foo.baz_count  # Was completely wrong, but not by any apparent pattern (ie, the numbers seemed to be almost random)

I'm unable to display the baz_count in my template now, unless I want to introduce an N+1 problem (ie, foo.bazs.count() during each iteration)

comment:30 by anonymous, 13 years ago

Whoa, thought I was doing something wrong in my code until I searched and found this. Isn't this kind of a big deal? As in, with this bug, isn't Django's aggregation basically worthless for anyone who's not doing it on a single table, or did I miss something?

comment:31 by anonymous, 13 years ago

Is this something that will ever be fixed?

in reply to:  31 comment:32 by Luke Plant, 13 years ago

Replying to anonymous:

Is this something that will ever be fixed?

It will only ever get fixed if someone cares enough to write a patch (complete with tests). I think at this point a patch which threw an exception for the failure case would be accepted. Obviously ideally this would be replaced with a proper fix at some point.

comment:33 by anonymous, 13 years ago

Really, it seems there are plenty of ticket with patches (complete with tests), and with offers to do the work, but the core team doesn't care.

in reply to:  33 comment:34 by Carl Meyer, 13 years ago

Replying to anonymous:

Really, it seems there are plenty of ticket with patches (complete with tests), and with offers to do the work, but the core team doesn't care.

Sorry you're feeling frustrated.

There are currently 15 tickets in Trac marked "Ready for checkin." The _only_ task in Django's development process that's limited to core developers is committing those 15 tickets. Fifteen tickets doesn't seem like such a terrible backlog for an entirely-volunteer crew, frankly.

Any other step in the process (writing a patch, adding tests to an existing patch, reviewing a patch and marking it "Ready for checkin" if there are tests, they pass, and it fixes the problem for you) can be done by anyone in the community. See https://docs.djangoproject.com/en/dev/internals/contributing/triaging-tickets/

There is a larger backlog of "design decision needed" tickets (298), but anyone can contribute usefully to those as well by researching pros and cons and presenting a balanced summary to the developers mailing list to start a discussion.

comment:36 by Evstifeev Roman, 13 years ago

Cc: Evstifeev Roman added

comment:37 by Aymeric Augustin, 12 years ago

#19290 is a duplicate.

comment:38 by michael@…, 12 years ago

Hmm, another victim here. Spent 2 hours trying to understand why my numbers were so odd to finally find this bug. This is the first time in several django projects I am really feeling left alone :-(
I am no database expert (which is one reason i LOVE django and it's awesome Queryset) so I don't see any way that I could provide implementation help or anything, but seeing that this bug is now open for several years really makes me shake my head. Even if there is no obvious fix available, PLEASE at least update the documentatin to point out this issue!

comment:39 by michaelB, 12 years ago

Cc: michaelB added

comment:40 by Anssi Kääriäinen, 12 years ago

This is somewhat hard problem. We could do automatic subquery pushdown for aggregates - it isn't exactly easy but most of the pieces are already there. I am not sure how well such queries will perform. Although that isn't perhaps too important. If the results aren't correct speed doesn't matter...

I think a good start to this would be documentation + detecting the problematic situations and throwing an error in those cases. Basically if we have two multijoins in the query (from filtering, aggregation or something else) and aggregation is used then the query is likely wrong. However this condition is a little too restrictive - something like:

qs.filter(translations__lang='fi', translations__name='foo')

introduces a multijoin to translations, yet the query is safe as translations.lang is unique. Detecting if we have a unique multijoin is going to be nasty. So, we need a way to bypass this error checking.

Proposal: aggregates get a new kwarg "inline" - default False. If inline is False, then multiple multijoins in single query will raise an error. If it is True, the aggregate is forced to be joined to the query (that is, to do what aggregates do today). Later on we will likely add another kwarg "subquery" - defaults again to False. If set to True the aggregate is forced to subquery (something like in comment:9). We could also add kwarg "subselect". Later on maybe also "lateral_subquery" and so on.

The proposal allows us to throw an error, yet give users the ability to force the ORM to generate the queries they are using now. It also allows for extending the same API if we get subquery aggregate support some day.

I think we need to add the information about multijoins to JoinInfo, and add the info about used joins to aggregates when they are added to the query. Then at compiler stage, check for all inline=False aggregates if there are multijoins in the query not used by the aggregate itself. If so, throw an error. Changes should be needed at query.setup_joins(), compiler.get_from_clause() and then a way to annotate the used joins into the aggregate (this should not be too hard).

comment:41 by anonymous, 12 years ago

At this point the documentation on annotations can be considered to be intentionally misleading.

It is hard not to think this given the severity of the issue and the comments above.

Yes a code fix is hard but at least an obviously tentative documentation "fix" can be pushed fairly easily.

comment:42 by Anssi Kääriäinen, 12 years ago

Component: ORM aggregationDatabase layer (models, ORM)

comment:46 by benjaoming, 10 years ago

Agreed with anonymous 01/28/2013 that there needs to be at least some documentation on this: https://github.com/django/django/pull/3654

comment:47 by Tim Graham <timograham@…>, 10 years ago

In 4b23a6c7a9232cc07ec95fe98be17efbd4449822:

Documented a current limitation of multiple table annotation; refs #10060.

comment:48 by Tim Graham <timograham@…>, 10 years ago

In 48bd44670b4c58735d5cdf6762d6b61219ecbb12:

[1.7.x] Documented a current limitation of multiple table annotation; refs #10060.

Backport of 4b23a6c7a9232cc07ec95fe98be17efbd4449822 from master

comment:49 by Tim Graham <timograham@…>, 10 years ago

In a87ade7e30c80de93ec7ad8335d8714b494a58fb:

[1.6.x] Documented a current limitation of multiple table annotation; refs #10060.

Backport of 4b23a6c7a9232cc07ec95fe98be17efbd4449822 from master

comment:50 by Camillo Bruni, 10 years ago

I recently ran into a similar problem. The deep joining causes an explosion of values possibly resulting in many duplicate values. I fixed this in https://github.com/django/django/pull/4388 by implementing the DISTINCT parameter on the Aggregate functions. Hence you can do
Sum('center__client__loan__payment_schedule__payments__principal', distinct=True)
which ignores the duplicate entries.

It might be that the issue at hand is slightly different, but this solves at least one particular case.

in reply to:  50 comment:51 by Camillo Bruni, 10 years ago

Replying to camillobruni:

I recently ran into a similar problem. The deep joining causes an explosion of values possibly resulting in many duplicate values. I fixed this in https://github.com/django/django/pull/4388 by implementing the DISTINCT parameter on the Aggregate functions. Hence you can do
Sum('center__client__loan__payment_schedule__payments__principal', distinct=True)
which ignores the duplicate entries.

It might be that the issue at hand is slightly different, but this solves at least one particular case.

Sorry for the noise, I was wrong (in hindsight: "obviously") http://www.sqlteam.com/article/how-to-use-group-by-with-distinct-aggregates-and-derived-tables describes the problem I thought of dealing with in more detail.

comment:52 by Marc Aymerich, 10 years ago

Cc: Marc Aymerich added

comment:53 by Tim Graham, 9 years ago

If I understood Anssi correctly, the existing documentation warning isn't correct, so I submitted a revision.

comment:54 by Tim Graham <timograham@…>, 9 years ago

In 3d22367:

Refs #10060 -- Corrected description of multiple annotations bug.

comment:55 by Tim Graham <timograham@…>, 9 years ago

In ea63cf8d:

[1.9.x] Refs #10060 -- Corrected description of multiple annotations bug.

Backport of 3d2236773ba88e330841b8c72183b0e978e10909 from master

comment:56 by Tomasz Nowak, 9 years ago

Cc: nowak2000@… added

in reply to:  9 ; comment:57 by powderflask, 9 years ago

Replying to bendavis78:
In comment16 bendavis78 recommends this query to solve the issue:

SELECT
  u.email,
  (SELECT SUM(points) FROM point_earning WHERE user_id=u.id) AS points_earned,
  (SELECT SUM(points) FROM point_expense WHERE user_id=u.id) AS points_spent
FROM
  "user" u

For others who might arrive here looking for a reasonable way to work around this issue, the following worked well for me. I added a custom Queryset method to the Manager class, and used a RawSQL() expression to create the annotations. This at least encapsulates the SQL code and allows the annotation to be integrated with a normal django queryset. Here's a sample for the example given above:

def annotate_sum_for_user(user_related_modelClass, field_name, annotation_name):
            raw_query = """
          SELECT SUM({field}) FROM {model} AS model
              WHERE model.user_id = user.id
        """.format(
            field = field_name,
            model = user_related_modelClass._meta.db_table,
        )

        annotation = {annotation_name: RawSQL(raw_query, [])}
        return self.annotate(**annotation)

Usage for above query on presumed User model:

    users = models.User.objects\
        .annotate_sum_for_user(PointEarning, 'points', 'points_earned')\
        .annotate_sum_for_user(PointExpense, 'points', 'points_spent')

Hope someone finds this useful, and a word of HUGE thanks to those who worked to get a statement of this issue and a link to this thread into the django documentation -- probably saved me hours.

comment:58 by Aur Saraf, 9 years ago

I just ran into this and submitted a duplicate bug (#26898).

I see a lot of sadness in this thread, joining my own.

My offer to fix if someone knowledgeable about the code can mentor me still stands. I assume at least detecting it and generating a warning shouldn't be too hard?

comment:59 by Danilo Akamine, 8 years ago

The solution proposed by powderflask is very interesting and worked well.
This ticket helped me a lot to understand the big problem.

Thank you all, guys.

comment:60 by akermen, 8 years ago

No sure this will work for all different cases, at least it worked well for me and the below simplified test case:

Model definitions of test case,

class ModelA(models.Model):
    title = models.CharField(max_length=255, blank=False, null=False, unique=True)

class ModelB(models.Model):
    parent = models.ForeignKey(ModelA, on_delete=models.CASCADE)    
    value = models.IntegerField()

class ModelC(models.Model):
    parent = models.ForeignKey(ModelA, on_delete=models.CASCADE)    

and sample data:

instance_a_1 = ModelA(title="instance1")
instance_a_1.save()
instance_b = ModelB(parent=instance_a_1, value=1)
instance_b.save()
instance_b = ModelB(parent=instance_a_1, value=3)
instance_b.save()
instance_b = ModelB(parent=instance_a_1, value=5)
instance_b.save()
instance_c = ModelC(parent=instance_a_1)
instance_c.save()
instance_c = ModelC(parent=instance_a_1)
instance_c.save()
instance_c = ModelC(parent=instance_a_1)
instance_c.save()
instance_c = ModelC(parent=instance_a_1)
instance_c.save()
instance_c = ModelC(parent=instance_a_1)
instance_c.save()

instance_a_2 = ModelA(title="instance2")
instance_a_2.save()
instance_b = ModelB(parent=instance_a_2, value=7)
instance_b.save()
instance_b = ModelB(parent=instance_a_2, value=11)
instance_b.save()
instance_c = ModelC(parent=instance_a_2)
instance_c.save()
instance_c = ModelC(parent=instance_a_2)
instance_c.save()
instance_c = ModelC(parent=instance_a_2)
instance_c.save()

Trying to get two independent annotations from two different tables (sum of values from ModelB and number of ModelC instances per ModelA instance):

for a in ModelA.objects.all() \
    .annotate(sumB=Sum('modelb__value')) \
    .annotate(countC=Count('modelc', distinct=True)):
    print "%s:  sumB: %s  countC: %s" % (a.title, a.sumB, a.countC)

we get these results

instance1:  sumB: 45  countC: 5
instance2:  sumB: 54  countC: 3

instead of these:

instance1  sumB: 9   countC: 5
instance2  sumB: 18   countC: 3

no surprise until here.

As you can see the 'sumB' values are repeated (multiplied) by a factor, lets account that factor inside single query:

for a in ModelA.objects.all() \
    .annotate(countC=Count('modelc', distinct=True)) \
    .annotate(countB=Count('modelb')) \
    .annotate(countB_distinct=Count('modelb', distinct=True)) \
    .annotate(sumB_multiplied=Sum('modelb__value')) \
    .annotate(sumB=(F('sumB_multiplied') * F('countB_distinct')) / F('countB')):
    print "%s  sumB: %s   countC: %s" % (a.title, a.sumB, a.countC)

and get correct values:

instance1  sumB: 9   countC: 5
instance2  sumB: 18   countC: 3

comment:61 by Artem.Bernatskyy, 8 years ago

Just have spent 5 hours debugging this bug.

+1

comment:62 by Maxim, 7 years ago

Cc: Maxim added

I think a note about this "feature" in the docs would be helpful.

comment:63 by Gökhan Sarı, 7 years ago

Cc: Gökhan Sarı added

comment:64 by Israel Saeta Pérez, 7 years ago

FTR: See a workaround using SubQuery for the issue with annotate described in this ticket, in this SO post: https://stackoverflow.com/questions/48598245/multiple-annotate-with-sum-and-display-data-in-admin-django

comment:65 by Adam M. Costello, 7 years ago

Cc: Adam M. Costello added

Does anyone here understand roughly what is safe to do with annotate()?

If I call annotate() only once on a queryset, with only one argument, is that always safe, no matter how many relationships I follow in annotate() and filter()?

Is it safe to pass multiple arguments to annotate() (or call annotate() multiple times) if none of the annotations follow any relationships? Even if there are relationships followed inside filter()?

Do the problems with annotate() also apply to aggregate(), or is it safe to pass multiple arguments to aggregate() ? Is it safe to call annotate() with one argument and then call aggregate()?

comment:66 by Simon Charette, 6 years ago

Since this ticket has more exposure I wanted to mention that #28296 ought to fix this by allowing a subquery argument to be passed to aggregate functions.

If it was to be implemented the original report queryset could be made to return the correct results easily

Branch.objects.annotate(
    total=Sum('center__client__loan__amount'),
    repaid=Sum('center__client__loan__payment_schedule__payments__principal', subquery=True),
)

Once this lands the ORM should be in a position to detect and warn about or even deprecate multiple cross joined annotations.

Given the current existing logic with the Subquery expression it shouldn't be too hard for someone with basic expression knowledge about the ORM to implement a patch if anyone is interested in contributing.

in reply to:  57 comment:67 by wilhelmhb, 6 years ago

Replying to powderflask:
A huge thanks to powderflask: for the ones knowledgeable enough with SQL, this is the way to go =)

I also tried the solution described here: https://stackoverflow.com/questions/48598245/multiple-annotate-with-sum-and-display-data-in-admin-django/48607830#48607830, alas without success...

Replying to bendavis78:
In comment16 bendavis78 recommends this query to solve the issue:

SELECT
  u.email,
  (SELECT SUM(points) FROM point_earning WHERE user_id=u.id) AS points_earned,
  (SELECT SUM(points) FROM point_expense WHERE user_id=u.id) AS points_spent
FROM
  "user" u

For others who might arrive here looking for a reasonable way to work around this issue, the following worked well for me. I added a custom Queryset method to the Manager class, and used a RawSQL() expression to create the annotations. This at least encapsulates the SQL code and allows the annotation to be integrated with a normal django queryset. Here's a sample for the example given above:

def annotate_sum_for_user(user_related_modelClass, field_name, annotation_name):
            raw_query = """
          SELECT SUM({field}) FROM {model} AS model
              WHERE model.user_id = user.id
        """.format(
            field = field_name,
            model = user_related_modelClass._meta.db_table,
        )

        annotation = {annotation_name: RawSQL(raw_query, [])}
        return self.annotate(**annotation)

Usage for above query on presumed User model:

    users = models.User.objects\
        .annotate_sum_for_user(PointEarning, 'points', 'points_earned')\
        .annotate_sum_for_user(PointExpense, 'points', 'points_spent')

Hope someone finds this useful, and a word of HUGE thanks to those who worked to get a statement of this issue and a link to this thread into the django documentation -- probably saved me hours.

comment:68 by Antoine, 6 years ago

Cc: Antoine added

comment:70 by Adam Sołtysik, 4 years ago

It's very surprising that aggregations aren't done in subqueries. The problem arises even with basic, one-field aggregations, when there are duplicates in the original query. So something like this, with a single reverse foreign key relation:

User.objects.filter(orders__date=today).distinct().values('first_name').annotate(Count('pk'))

may produce wrong results (will count orders instead of users), even though it looks safe with .distinct(). From what I've found, the documentation doesn't warn about such cases.

Funnily enough, when I have rewritten one of my not working queries with sum aggregation to use a subquery, it actually became faster (on a table with ~100k rows), but of course YMMV.

comment:71 by Matthijs Kooijman, 3 years ago

For anyone else running into this, a reasonable workaround seems to be to use subqueries for aggregating annotations. This is a bit verbose/hacky in Django currently, as shown by the multitude of approaches in the stackoverflow link from Antoine's comment. However, I've successfully used the django-sql-utils package for this just now. That sounds a bit bulky, but it just has two utilities, one of which is a SubqueryAggregate class (with derived SubqueryCount, SubquerySum, etc.) that make converting a regular joining aggregate into a subquery aggregate easy and concise, without changing the structure much.

For example taking the example from comment 66 and converting the second annotation to a subquery with django-sql-utils, you'd get:

Branch.objects.annotate(
    total=Sum('center__client__loan__amount'),
    repaid=SubquerySum('center__client__loan__payment_schedule__payments__principal'),
)

comment:72 by David Kwong, 2 years ago

Cc: David Kwong added

comment:73 by Natalia Bidart, 18 months ago

#34728 is a (fairly) duplicate of this one.

in reply to:  71 comment:74 by powderflask, 7 months ago

Replying to Matthijs Kooijman:

For anyone else running into this, ... I've successfully used the django-sql-utils package for this...

We've also started using django-sql-utils. Mirrors a home-spun solution we had in production, so great to replace our hacked together version with this nice piece of code.

One thing to watch for: the subquery logic only works (and only really makes sense) when you have joins! If you are aggregating on a single model, just use the regular Django aggregation functions.

Note: See TracTickets for help on using tickets.
Back to Top