Opened 16 years ago
Last modified 7 months ago
#10060 new Bug
Multiple table annotation failure
Reported by: | Owned by: | ||
---|---|---|---|
Component: | Database layer (models, ORM) | Version: | dev |
Severity: | Normal | Keywords: | |
Cc: | flosch@…, bendavis78@…, cmutel@…, daniel@…, Gabriel Hurley, sebleier@…, greg@…, mpjung@…, teemu.kurppa@…, Evstifeev Roman, michaelB, Marc Aymerich, nowak2000@…, Maxim, Gökhan Sarı, Adam M. Costello, Antoine, David Kwong | Triage Stage: | Accepted |
Has patch: | no | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description (last modified by )
Annotating across multiple tables results in wrong answers. i.e.
In [110]: total = Branch.objects.all().annotate(total=Sum('center__client__loan__amount')) In [111]: total[0].total Out[111]: 3433000 In [112]: repaid = Branch.objects.all().annotate(repaid=Sum('center__client__loan__payment_schedule__payments__principal')) In [113]: repaid[0].repaid Out[113]: 1976320.0 In [114]: both = Branch.objects.all().annotate(total=Sum('center__client__loan__amount'),repaid=Sum('center__client__loan__payment_schedule__payments__principal')) In [115]: both[0].repaid Out[115]: 1976320.0 In [116]: both[0].total Out[116]: 98816000 ^^^^^^^^^^^
Compare the output of total in 116 vs. 111 (the correct answer).
Attachments (1)
Change History (70)
comment:1 by , 16 years ago
Triage Stage: | Unreviewed → Accepted |
---|
comment:2 by , 16 years ago
if you unzip agg_test.tar.gz, start a django shell in the project and run count.py in the shell, you will see the error being reproduced.
cheers
Sid
comment:4 by , 16 years ago
Cc: | added |
---|
comment:6 by , 16 years ago
milestone: | → 1.1 |
---|
This is a fairly big problem, but it's not going to be an easy one to fix - the solution may just have to be documentation to say "don't do that".
comment:7 by , 16 years ago
If anyone were to fix this, the first question to ask would be how you would structure the SQL. I'm not an SQL efficiency expert, but I know one way to get the correct result would be to use subqueries in the SELECT clause:
For example, let's say I have a User, PointEarning w/ ForeignKey(User), and PointExpense w/ F!oreignKey(User). Let's say I want a query to get the user's email, total points earned, and total points spent.:
SELECT auth_user.id, auth_user.email, (SELECT SUM(points) FROM app_pointearning WHERE user_id=auth_user.id) AS points_earned, (SELECT SUM(points) FROM app_pointexpense WHERE user_id=auth_user.id) AS points_spent FROM auth_user
Another way, I believe, is to join on derived tables (sub-queries). Whatever way we decide to write the SQL, django could perhaps detect when we're using aggregates accross multiple joins, and alter the structure of the query accordingly. Seems like it could be kinda messy, but at least the results would be more what the user expected.
comment:8 by , 16 years ago
Cc: | added |
---|
follow-up: 57 comment:9 by , 16 years ago
From what I've researched, joining on subqueries would be faster than subqueries in the SELECT clause. So instead of this:
SELECT u.email, (SELECT SUM(points) FROM point_earning WHERE user_id=u.id) AS points_earned, (SELECT SUM(points) FROM point_expense WHERE user_id=u.id) AS points_spent FROM "user" u
...we would want this:
SELECT u.email, a.points AS points_earned, b.points AS points_spent FROM "user" u LEFT OUTER JOIN (SELECT user_id, SUM(points) AS points FROM point_earning GROUP BY user_id) a ON a.user_id=u.id LEFT OUTER JOIN (SELECT user_id, SUM(points) AS points FROM point_expense GROUP BY user_id) b ON b.user_id=u.id ORDER BY u.id
What this does, essentially, is move the aggregate function into a derived table which we join onto the main table. I would imagine a solution to this bug would detect if annotations are being used across multiple relations, and if so, would adjust the join and select clauses as necessary. I may attempt to write a patch for this soon, but it seems like it would be tough, so there's no guarantees :-p
Thoughts?
comment:10 by , 16 years ago
Ok, after spending several hours looking through django/db/models/sql/query.py, I feel like my brain has turned to mush. This is definitely better left up to the ppl that wrote it. I hope that at some point someone who is capable can write a patch for it.
I'm against simply adding a "don't do this" in the documentation, because users would simply expect something like this to work -- it's hard to understand at a high level why simply adding another annotation from a different relation would skew the results.
comment:11 by , 16 years ago
milestone: | 1.1 |
---|
comment:12 by , 16 years ago
Would be nice to know why the milestone was deleted.
If we are pushing this to 1.2 we need to add it in "Known Bugs" in the doc page:
http://docs.djangoproject.com/en/dev/topics/db/aggregation/
comment:13 by , 16 years ago
The milestone was deleted because this is a fairly big problem, without an obvious (or particularly easy) solution. There is no potential for data loss, and there is a viable workaround - either "don't do it",or use two queries rather than one. One of the side effects of a time-based release schedule is that some bugs will always exist in releases - unfortunately, this is one of those bugs.
Regarding adding this to a "known bugs" section - we don't document bugs. We document intended behavior. The ticket tracker is for bugs.
comment:14 by , 16 years ago
I'm going to second adding this to the aggregation documentation, because I just wasted two hours re-discovering this bug.
comment:15 by , 16 years ago
I also agree that this should be documented, but perhaps an exception should be raise when attempting to do multiple annotations on different tables/joins? I can't think of a situation where that would ever guarantee accurate results.
comment:16 by , 16 years ago
Regarding the example sql queries in comment 9 above, I've discovered that the first query is actually faster. My database has grown fairly large in the past few months, so I've been able to test with more clear results (this is using MySQL, by the way).
Using the JOIN method (where the aggregates are done in a derived join), the query was extremely slow. It took over 30 minutes to execute with 57,000 rows in the users table. When I changed the query to the SELECT method (where the aggregates are done in a subquery in the select clause), the query took only 2 seconds with the exact same results. I spent a whole day on this, and couldn't find a solid explanation for why the derived join method was taking so long. I'd say at this point if anyone moves forward with fixing this bug we should put the aggregate in the subquery of the select clause.
I've looked around for an official stance on how to write a query with multiple aggregations across different joined tables, and haven't come up with a solid answer. If anyone does find "the right way" to do it, I'd be interested to know -- I think that definitely needs to be settled before this bug can be fixed.
comment:18 by , 15 years ago
I can confirm this one... It happens with aggregate too wrong values in 2 - 5% range.
comment:19 by , 15 years ago
Cc: | added |
---|
comment:22 by , 15 years ago
Cc: | added |
---|
comment:23 by , 15 years ago
I also just ran into this bug and it almost went unnoticed. An exception would indeed be very nice indicating that this does not work yet.
comment:24 by , 15 years ago
Cc: | added |
---|
comment:25 by , 15 years ago
Cc: | added |
---|
comment:26 by , 14 years ago
Cc: | added |
---|
Another victim here. Just spent several hours with finding this bug, which had existed several days. This definitely should be either fixed or prevented: values that the incorrect query provides are often somewhat in the sensible range, depending on your row structures, which can make this bug hard to spot.
I also tested both queries that bendavis78 outlined above and they give correct results. Didn't do extensive performance testing yet, with 20K rows in the largest table, both queries were between 0.05 and 0.10 secs.
comment:27 by , 14 years ago
Another victim.
This bug means that you can't use annotations at all unless you know exactly what is going to happen to your QuerySet later. Functions cannot expect that any QuerySet passed as a parameter is capable of being annotated as it may already have been done.
I am a reasonably to Django, but it seems odd to me that this has remained a known bug for 2 years without anyone working on it, or a the documentation being changed to reflect the lack of functionality. I understand russell's logic that bugs should not be documented, but if this cannot be fixed, then surely it should be removed from intended behaviour.
comment:28 by , 14 years ago
Severity: | → Normal |
---|---|
Type: | → Bug |
comment:29 by , 13 years ago
Easy pickings: | unset |
---|---|
UI/UX: | unset |
Another victim. I have a model Foo
. I also have models Bar
and Baz
, which each have a foreign key pointing to Foo
with related_names 'bars'
, and 'bazs'
, respectively:
for foo in Foo.objects.all().annotate(bar_count=Count('bars'), baz_count=Count('bazs')): print foo.bar_count # Was correct print foo.baz_count # Was completely wrong, but not by any apparent pattern (ie, the numbers seemed to be almost random)
I'm unable to display the baz_count
in my template now, unless I want to introduce an N+1 problem (ie, foo.bazs.count()
during each iteration)
comment:30 by , 13 years ago
Whoa, thought I was doing something wrong in my code until I searched and found this. Isn't this kind of a big deal? As in, with this bug, isn't Django's aggregation basically worthless for anyone who's not doing it on a single table, or did I miss something?
comment:32 by , 13 years ago
Replying to anonymous:
Is this something that will ever be fixed?
It will only ever get fixed if someone cares enough to write a patch (complete with tests). I think at this point a patch which threw an exception for the failure case would be accepted. Obviously ideally this would be replaced with a proper fix at some point.
follow-up: 34 comment:33 by , 13 years ago
Really, it seems there are plenty of ticket with patches (complete with tests), and with offers to do the work, but the core team doesn't care.
comment:34 by , 13 years ago
Replying to anonymous:
Really, it seems there are plenty of ticket with patches (complete with tests), and with offers to do the work, but the core team doesn't care.
Sorry you're feeling frustrated.
There are currently 15 tickets in Trac marked "Ready for checkin." The _only_ task in Django's development process that's limited to core developers is committing those 15 tickets. Fifteen tickets doesn't seem like such a terrible backlog for an entirely-volunteer crew, frankly.
Any other step in the process (writing a patch, adding tests to an existing patch, reviewing a patch and marking it "Ready for checkin" if there are tests, they pass, and it fixes the problem for you) can be done by anyone in the community. See https://docs.djangoproject.com/en/dev/internals/contributing/triaging-tickets/
There is a larger backlog of "design decision needed" tickets (298), but anyone can contribute usefully to those as well by researching pros and cons and presenting a balanced summary to the developers mailing list to start a discussion.
comment:36 by , 13 years ago
Cc: | added |
---|
comment:38 by , 12 years ago
Hmm, another victim here. Spent 2 hours trying to understand why my numbers were so odd to finally find this bug. This is the first time in several django projects I am really feeling left alone :-(
I am no database expert (which is one reason i LOVE django and it's awesome Queryset) so I don't see any way that I could provide implementation help or anything, but seeing that this bug is now open for several years really makes me shake my head. Even if there is no obvious fix available, PLEASE at least update the documentatin to point out this issue!
comment:39 by , 12 years ago
Cc: | added |
---|
comment:40 by , 12 years ago
This is somewhat hard problem. We could do automatic subquery pushdown for aggregates - it isn't exactly easy but most of the pieces are already there. I am not sure how well such queries will perform. Although that isn't perhaps too important. If the results aren't correct speed doesn't matter...
I think a good start to this would be documentation + detecting the problematic situations and throwing an error in those cases. Basically if we have two multijoins in the query (from filtering, aggregation or something else) and aggregation is used then the query is likely wrong. However this condition is a little too restrictive - something like:
qs.filter(translations__lang='fi', translations__name='foo')
introduces a multijoin to translations, yet the query is safe as translations.lang is unique. Detecting if we have a unique multijoin is going to be nasty. So, we need a way to bypass this error checking.
Proposal: aggregates get a new kwarg "inline" - default False. If inline is False, then multiple multijoins in single query will raise an error. If it is True, the aggregate is forced to be joined to the query (that is, to do what aggregates do today). Later on we will likely add another kwarg "subquery" - defaults again to False. If set to True the aggregate is forced to subquery (something like in comment:9). We could also add kwarg "subselect". Later on maybe also "lateral_subquery" and so on.
The proposal allows us to throw an error, yet give users the ability to force the ORM to generate the queries they are using now. It also allows for extending the same API if we get subquery aggregate support some day.
I think we need to add the information about multijoins to JoinInfo, and add the info about used joins to aggregates when they are added to the query. Then at compiler stage, check for all inline=False aggregates if there are multijoins in the query not used by the aggregate itself. If so, throw an error. Changes should be needed at query.setup_joins(), compiler.get_from_clause() and then a way to annotate the used joins into the aggregate (this should not be too hard).
comment:41 by , 12 years ago
At this point the documentation on annotations can be considered to be intentionally misleading.
It is hard not to think this given the severity of the issue and the comments above.
Yes a code fix is hard but at least an obviously tentative documentation "fix" can be pushed fairly easily.
comment:42 by , 12 years ago
Component: | ORM aggregation → Database layer (models, ORM) |
---|
comment:46 by , 10 years ago
Agreed with anonymous 01/28/2013 that there needs to be at least some documentation on this: https://github.com/django/django/pull/3654
follow-up: 51 comment:50 by , 10 years ago
I recently ran into a similar problem. The deep joining causes an explosion of values possibly resulting in many duplicate values. I fixed this in https://github.com/django/django/pull/4388 by implementing the DISTINCT parameter on the Aggregate functions. Hence you can do
Sum('center__client__loan__payment_schedule__payments__principal', distinct=True)
which ignores the duplicate entries.
It might be that the issue at hand is slightly different, but this solves at least one particular case.
comment:51 by , 10 years ago
Replying to camillobruni:
I recently ran into a similar problem. The deep joining causes an explosion of values possibly resulting in many duplicate values. I fixed this in https://github.com/django/django/pull/4388 by implementing the DISTINCT parameter on the Aggregate functions. Hence you can do
Sum('center__client__loan__payment_schedule__payments__principal', distinct=True)
which ignores the duplicate entries.
It might be that the issue at hand is slightly different, but this solves at least one particular case.
Sorry for the noise, I was wrong (in hindsight: "obviously") http://www.sqlteam.com/article/how-to-use-group-by-with-distinct-aggregates-and-derived-tables describes the problem I thought of dealing with in more detail.
comment:52 by , 10 years ago
Cc: | added |
---|
comment:53 by , 9 years ago
If I understood Anssi correctly, the existing documentation warning isn't correct, so I submitted a revision.
comment:56 by , 9 years ago
Cc: | added |
---|
follow-up: 67 comment:57 by , 9 years ago
Replying to bendavis78:
In comment16 bendavis78 recommends this query to solve the issue:
SELECT u.email, (SELECT SUM(points) FROM point_earning WHERE user_id=u.id) AS points_earned, (SELECT SUM(points) FROM point_expense WHERE user_id=u.id) AS points_spent FROM "user" u
For others who might arrive here looking for a reasonable way to work around this issue, the following worked well for me. I added a custom Queryset method to the Manager class, and used a RawSQL() expression to create the annotations. This at least encapsulates the SQL code and allows the annotation to be integrated with a normal django queryset. Here's a sample for the example given above:
def annotate_sum_for_user(user_related_modelClass, field_name, annotation_name): raw_query = """ SELECT SUM({field}) FROM {model} AS model WHERE model.user_id = user.id """.format( field = field_name, model = user_related_modelClass._meta.db_table, ) annotation = {annotation_name: RawSQL(raw_query, [])} return self.annotate(**annotation)
Usage for above query on presumed User model:
users = models.User.objects\ .annotate_sum_for_user(PointEarning, 'points', 'points_earned')\ .annotate_sum_for_user(PointExpense, 'points', 'points_spent')
Hope someone finds this useful, and a word of HUGE thanks to those who worked to get a statement of this issue and a link to this thread into the django documentation -- probably saved me hours.
comment:58 by , 9 years ago
I just ran into this and submitted a duplicate bug (#26898).
I see a lot of sadness in this thread, joining my own.
My offer to fix if someone knowledgeable about the code can mentor me still stands. I assume at least detecting it and generating a warning shouldn't be too hard?
comment:59 by , 8 years ago
The solution proposed by powderflask is very interesting and worked well.
This ticket helped me a lot to understand the big problem.
Thank you all, guys.
comment:60 by , 8 years ago
No sure this will work for all different cases, at least it worked well for me and the below simplified test case:
Model definitions of test case,
class ModelA(models.Model): title = models.CharField(max_length=255, blank=False, null=False, unique=True) class ModelB(models.Model): parent = models.ForeignKey(ModelA, on_delete=models.CASCADE) value = models.IntegerField() class ModelC(models.Model): parent = models.ForeignKey(ModelA, on_delete=models.CASCADE)
and sample data:
instance_a_1 = ModelA(title="instance1") instance_a_1.save() instance_b = ModelB(parent=instance_a_1, value=1) instance_b.save() instance_b = ModelB(parent=instance_a_1, value=3) instance_b.save() instance_b = ModelB(parent=instance_a_1, value=5) instance_b.save() instance_c = ModelC(parent=instance_a_1) instance_c.save() instance_c = ModelC(parent=instance_a_1) instance_c.save() instance_c = ModelC(parent=instance_a_1) instance_c.save() instance_c = ModelC(parent=instance_a_1) instance_c.save() instance_c = ModelC(parent=instance_a_1) instance_c.save() instance_a_2 = ModelA(title="instance2") instance_a_2.save() instance_b = ModelB(parent=instance_a_2, value=7) instance_b.save() instance_b = ModelB(parent=instance_a_2, value=11) instance_b.save() instance_c = ModelC(parent=instance_a_2) instance_c.save() instance_c = ModelC(parent=instance_a_2) instance_c.save() instance_c = ModelC(parent=instance_a_2) instance_c.save()
Trying to get two independent annotations from two different tables (sum of values from ModelB and number of ModelC instances per ModelA instance):
for a in ModelA.objects.all() \ .annotate(sumB=Sum('modelb__value')) \ .annotate(countC=Count('modelc', distinct=True)): print "%s: sumB: %s countC: %s" % (a.title, a.sumB, a.countC)
we get these results
instance1: sumB: 45 countC: 5 instance2: sumB: 54 countC: 3
instead of these:
instance1 sumB: 9 countC: 5 instance2 sumB: 18 countC: 3
no surprise until here.
As you can see the 'sumB' values are repeated (multiplied) by a factor, lets account that factor inside single query:
for a in ModelA.objects.all() \ .annotate(countC=Count('modelc', distinct=True)) \ .annotate(countB=Count('modelb')) \ .annotate(countB_distinct=Count('modelb', distinct=True)) \ .annotate(sumB_multiplied=Sum('modelb__value')) \ .annotate(sumB=(F('sumB_multiplied') * F('countB_distinct')) / F('countB')): print "%s sumB: %s countC: %s" % (a.title, a.sumB, a.countC)
and get correct values:
instance1 sumB: 9 countC: 5 instance2 sumB: 18 countC: 3
comment:62 by , 7 years ago
Cc: | added |
---|
I think a note about this "feature" in the docs would be helpful.
comment:63 by , 7 years ago
Cc: | added |
---|
comment:64 by , 7 years ago
FTR: See a workaround using SubQuery
for the issue with annotate described in this ticket, in this SO post: https://stackoverflow.com/questions/48598245/multiple-annotate-with-sum-and-display-data-in-admin-django
comment:65 by , 7 years ago
Cc: | added |
---|
Does anyone here understand roughly what is safe to do with annotate()?
If I call annotate() only once on a queryset, with only one argument, is that always safe, no matter how many relationships I follow in annotate() and filter()?
Is it safe to pass multiple arguments to annotate() (or call annotate() multiple times) if none of the annotations follow any relationships? Even if there are relationships followed inside filter()?
Do the problems with annotate() also apply to aggregate(), or is it safe to pass multiple arguments to aggregate() ? Is it safe to call annotate() with one argument and then call aggregate()?
comment:66 by , 6 years ago
Since this ticket has more exposure I wanted to mention that #28296 ought to fix this by allowing a subquery
argument to be passed to aggregate functions.
If it was to be implemented the original report queryset could be made to return the correct results easily
Branch.objects.annotate( total=Sum('center__client__loan__amount'), repaid=Sum('center__client__loan__payment_schedule__payments__principal', subquery=True), )
Once this lands the ORM should be in a position to detect and warn about or even deprecate multiple cross joined annotations.
Given the current existing logic with the Subquery
expression it shouldn't be too hard for someone with basic expression knowledge about the ORM to implement a patch if anyone is interested in contributing.
comment:67 by , 6 years ago
Replying to powderflask:
A huge thanks to powderflask: for the ones knowledgeable enough with SQL, this is the way to go =)
I also tried the solution described here: https://stackoverflow.com/questions/48598245/multiple-annotate-with-sum-and-display-data-in-admin-django/48607830#48607830, alas without success...
Replying to bendavis78:
In comment16 bendavis78 recommends this query to solve the issue:
SELECT u.email, (SELECT SUM(points) FROM point_earning WHERE user_id=u.id) AS points_earned, (SELECT SUM(points) FROM point_expense WHERE user_id=u.id) AS points_spent FROM "user" uFor others who might arrive here looking for a reasonable way to work around this issue, the following worked well for me. I added a custom Queryset method to the Manager class, and used a RawSQL() expression to create the annotations. This at least encapsulates the SQL code and allows the annotation to be integrated with a normal django queryset. Here's a sample for the example given above:
def annotate_sum_for_user(user_related_modelClass, field_name, annotation_name): raw_query = """ SELECT SUM({field}) FROM {model} AS model WHERE model.user_id = user.id """.format( field = field_name, model = user_related_modelClass._meta.db_table, ) annotation = {annotation_name: RawSQL(raw_query, [])} return self.annotate(**annotation)Usage for above query on presumed User model:
users = models.User.objects\ .annotate_sum_for_user(PointEarning, 'points', 'points_earned')\ .annotate_sum_for_user(PointExpense, 'points', 'points_spent')Hope someone finds this useful, and a word of HUGE thanks to those who worked to get a statement of this issue and a link to this thread into the django documentation -- probably saved me hours.
comment:68 by , 6 years ago
Cc: | added |
---|
comment:70 by , 4 years ago
It's very surprising that aggregations aren't done in subqueries. The problem arises even with basic, one-field aggregations, when there are duplicates in the original query. So something like this, with a single reverse foreign key relation:
User.objects.filter(orders__date=today).distinct().values('first_name').annotate(Count('pk'))
may produce wrong results (will count orders instead of users), even though it looks safe with .distinct()
. From what I've found, the documentation doesn't warn about such cases.
Funnily enough, when I have rewritten one of my not working queries with sum aggregation to use a subquery, it actually became faster (on a table with ~100k rows), but of course YMMV.
follow-up: 74 comment:71 by , 3 years ago
For anyone else running into this, a reasonable workaround seems to be to use subqueries for aggregating annotations. This is a bit verbose/hacky in Django currently, as shown by the multitude of approaches in the stackoverflow link from Antoine's comment. However, I've successfully used the django-sql-utils package for this just now. That sounds a bit bulky, but it just has two utilities, one of which is a SubqueryAggregate
class (with derived SubqueryCount
, SubquerySum
, etc.) that make converting a regular joining aggregate into a subquery aggregate easy and concise, without changing the structure much.
For example taking the example from comment 66 and converting the second annotation to a subquery with django-sql-utils, you'd get:
Branch.objects.annotate( total=Sum('center__client__loan__amount'), repaid=SubquerySum('center__client__loan__payment_schedule__payments__principal'), )
comment:72 by , 2 years ago
Cc: | added |
---|
comment:74 by , 7 months ago
Replying to Matthijs Kooijman:
For anyone else running into this, ... I've successfully used the django-sql-utils package for this...
We've also started using django-sql-utils
. Mirrors a home-spun solution we had in production, so great to replace our hacked together version with this nice piece of code.
One thing to watch for: the subquery logic only works (and only really makes sense) when you have joins! If you are aggregating on a single model, just use the regular Django aggregation functions.
It would help a great deal if you gave us the actual models and sample data. You have highlighted an inconsistency, but you haven't given us the ability to reproduce the failure.