Code

Opened 5 years ago

Last modified 6 months ago

#11305 new New feature

Support for "Conditional Aggregates"

Reported by: bendavis78 Owned by:
Component: Database layer (models, ORM) Version: master
Severity: Normal Keywords: aggregate, annotate
Cc: cal@…, akaariai, unknownlighter@…, andreterra@…, iljamaas, mmitar@…, nkryptic@…, flo@…, garrypolley, trbs@… Triage Stage: Someday/Maybe
Has patch: yes Needs documentation: yes
Needs tests: yes Patch needs improvement: yes
Easy pickings: no UI/UX: no

Description

This is really a feature request, but I'm posting it here in the hopes that it gets a little backing from the community.

I had a situation where I needed to write a query that would sum up demographics information for items ordered. I needed to count the total number of orders for each type of ethnicity, grouped by the item ordered. So the results should look something like this:

item        | caucasian_count | african_american_count | ...
------------+-----------------+------------------------+ 
Test item   | 55              | 72                     | 
Test item 2 | 23              | 48                     | 
...

The models look roughly like this:

class Item(models.Model):
    title = CharField(max_length=255)
    #... (snip) ...

class Ethnicity(models.Model):
    name = CharField(max_length=20)

class User(models.Model):
    #... (snip) ...
    ethnicity = ForeignKey(Ethnicity)

class Order(models.Model):
    #... (snip) ...
    user = ForeignKey(User)
    item = ForeignKey(Item)

The best way for me to accomplish this in a single query was to use conditional aggregates. In SQL, this would be done using

SELECT 
  COUNT(IF(ethnicity='caucasian',TRUE,NULL)) AS caucasian_count, 
  COUNT(IF(ethnicity='african_american',TRUE,NULL)) AS african_american_count 
  ...

Now, I could have just as well written a raw SQL query, but I like django's ORM so much it pains me to not take advantage of it as much as I can.

My idea was for a new type of aggregate class called "CountIf()", which would act just like count, except that you could also pass filter conditions. The resulting django query might look something like this:

query = Order.objects.values('item') #Orders grouped by item
for ethnicity in Ethnicity.objects.all():
    query = query.annotate(CountIf(user__ethnicity=ethnicity))

... which would then generate the SQL clause similar to above. Unfortunately, the task of converting the condition itself into SQL was a bit too daunting, so I took a small workaround that only required minimal SQL:

from django.db.models.sql import aggregates

class CountIf(aggregates.Count):
    """ 
    Hack Count() to get a conditional count working.
    """
    is_ordinal = True
    sql_function = 'COUNT'
    sql_template = '%(function)s(IF(%(condition)s,TRUE,NULL))'

which allowed me to do something like this:

query = Order.objects.values('item') #Orders grouped by item
for ethnicity in Ethnicity.objects.all():
    condition = 'myap_user.ethnicity_id=%s' % ethnicity.id
    query = query.annotate(CountIf(user__ethnicity, condition=condition))

If we could figure out a way to convert django-esque filter conditions into the SQL necessary for the condition in the "sql_template" of the aggregate, that would be epic! It seems like someone with a good knowlege of the inner-workings of django.db.models.sql.query would be able to figure it out pretty quickly, but hopefully someone will let me know if this is just a pipe dream :)

Attachments (7)

poc_11305.2.patch (17.9 KB) - added by akaariai 3 years ago.
poc_11305.patch (28.1 KB) - added by akaariai 3 years ago.
Proof of concept (latest, ignore the .2.)
poc_11305-3.patch (27.8 KB) - added by nate_b 2 years ago.
Update for a clean apply.
conditional_agg.patch (29.0 KB) - added by airstrike 2 years ago.
Update to apply cleanly to trunk (1.4+).
conditional_aggregates.1.6.patch (29.3 KB) - added by airstrike 19 months ago.
Updated to work with 1.5+, except for one test which isn't passing.
conditional_aggregates.1.6.cleanup.patch (28.5 KB) - added by akaariai 19 months ago.
11305-2013-09-07-master.patch (23.9 KB) - added by garrypolley 10 months ago.
patch_update_for_Django_1.6-1.7

Download all attachments as: .zip

Change History (43)

comment:1 Changed 5 years ago by russellm

  • Component changed from Uncategorized to ORM aggregation
  • Needs documentation unset
  • Needs tests unset
  • Owner nobody deleted
  • Patch needs improvement unset
  • Triage Stage changed from Unreviewed to Someday/Maybe

I'm slightly concerned that this could a little too specific. What you're proposing isn't a simple aggregate - it's a very specific combination of aggregates. I'm not convinced that it is a good idea to start embedding extremely complex aggregate clauses as part of the Django core.

However, the Django core should be exposing the tools that make it possible for end-users to write their own complex aggregates. It looks like we're most of the way there - you've been able to write a custom aggregate that writes most of the SQL. All we're missing is a clean interface for converting field lookup syntax into joins.

The SQL query interface contains utility functions that do this, but these functions aren't easily digested by end users (hell, they make my head spin whenever I look at them). There is also an advantage to keeping the SQL query interface as internal code - it means we don't have to maintain backwards compatibility guarantees. However, as a 'someday/maybe' goal, making SQL generation a clean interface is certainly something to strive for.

Anyone looking to implement this might want to look for hints in the implementation of F() clauses. This gives a pretty clean code path for lookups and joins.

comment:2 Changed 5 years ago by Alex

I've implemented something similar to this this way: http://paste.pocoo.org/show/123690/

comment:3 Changed 5 years ago by jlantz

I like Alex's solution, but it runs into errors when using a condition on a foreign key field since the where construction part doesn't add joins to the query. I've hacked around this problem to meet my needs. You can read more about the problem and the hackish solution at http://voteruniverse.com/Members/jlantz/blog/conditional-aggregates-in-django

comment:4 Changed 4 years ago by kenkyhuang

Support for conditional aggregates would make Django more appealing for developers who work on projects that primarily serve up numbers. Computation off of annotated aggregates would be used quite often in those cases, i.e. metrics, segmentation analysis, etc.

Examples -

Financial websites, i.e. Google Finance
Personal finance, i.e. Mint
Search analytics, i.e. Google Analytics

comment:5 Changed 3 years ago by julien

  • Severity set to Normal
  • Type set to New feature

comment:6 Changed 3 years ago by foxwhisper

  • Cc cal@… added
  • Easy pickings unset
  • UI/UX unset

Hi all,

I came across this same problem myself today, and spoke to Russ who pointed me to this thread.

I was planning on writing up a patch for this myself (for consideration into the core), which had support for Q() / F(), and such. However, the patch would have needed to be cross database compatible, and sadly, I just don't have the spare time to write this for all the databases.

Full discussion here:
http://groups.google.com/group/django-users/browse_thread/thread/723ce5a13530992d

I would be more than willing to write the MySQL aggregates patch, if others can contribute with the other supported databases and testing etc. If anyone would like to join forces on this to help get it included in the core, let me know via ticket.

Cal Leeming

PS: My use case for this +1 feature vote, is due to having 40 million rows of data, and needing to avoid using layers of filter().aggregate(), which would in turn cause multiple queries needing to be run.

comment:7 Changed 3 years ago by akaariai

  • Cc akaariai added
  • Has patch set
  • Needs documentation set
  • Needs tests set
  • Patch needs improvement set

I implemented a proof of concept patch for this.

Using the patch, one can do queries like (using the modeltests/aggregation models):

# The average age of author's friends, whose age is lower than the authors age.
vals = Author.objects.aggregate(Avg("friends__age", only=Q(friends__age__lt=F('age'))))
vals["friends__age__avg"] == 30.43

This will create the following SQL:

SELECT AVG(CASE WHEN T3."age" < "aggregation_author"."age" THEN T3."age" ELSE null END) 
       AS "friends__age__avg" 
  FROM "aggregation_author" 
  LEFT OUTER JOIN "aggregation_author_friends" 
       ON ("aggregation_author"."id" = "aggregation_author_friends"."from_author_id") 
  LEFT OUTER JOIN "aggregation_author" T3 
       ON ("aggregation_author_friends"."to_author_id" = T3."id")

Every aggregate now can use the "only" kwarg. The only kwarq must be a Q object. Lookups in only are allowed only if they do not generate additional joins.

There are 3 main issues with this patch:

  • The biggest problem is that aggregates need to return query parameters. These parameters should be escaped by the database backend. This causes all sorts of ugliness. I have checked in various places if as_sql returns a tuple, and if so, added the parameters to the query parameters. This isn't an ideal solution, but my skills aren't good enough to work out a better solution.
  • Does the CASE WHEN construction work on all supported databases? I have tested this on PostgreSQL and SQLite3. I think this will work on any DB but I haven't tested it. MySQL 4.x is the main concern.
  • The usage of add_q in add_aggregate isn't a clean solution.

There are some new tests included. I have ran the full test suite both on PostgreSQL 8.4 and SQLite3 version 3.7.4, and this passes the test suite. Annotations should also work.

As said, this is a POC patch, and I do expect it to fail if given a complex enough query. Also there aren't nearly enough tests, and there is no documentation.

comment:8 Changed 3 years ago by akaariai

Fixed one error in sql/query.py: if add_aggregate's call to add_q would add the q to self.having, it would get lost. Fixed this by disallowing touching of self.having in add_aggregate.

Also changed Exception to FieldError in sql/query.py add_aggregate.

Changed 3 years ago by akaariai

comment:9 Changed 3 years ago by akaariai

Updated patch. Now with aggregate.relabel_aliases() and .extra() fixed.

I also included the ability to do F() calculations in aggregate and annotate:

# calculate how much younger the author's friends are on average 
# (only those friends included that are actually younger than the author)
vals = Author.objects.aggregate(
    friends_younger_avg=Avg(F('age') - F("friends__age"), 
    only=Q(friends__age__lt=F('age')))
)
vals["friends_younger_avg"] == 7.29

which generates the following SQL:

SELECT AVG(CASE 
           WHEN T3."age" < "aggregation_author"."age" 
           THEN "aggregation_author"."age" - "T3"."age" 
           ELSE null 
           END) AS "friends_younger_avg" 
  FROM "aggregation_author" 
  LEFT OUTER JOIN "aggregation_author_friends" 
       ON ("aggregation_author"."id" = "aggregation_author_friends"."from_author_id") 
  LEFT OUTER JOIN "aggregation_author" T3 
       ON ("aggregation_author_friends"."to_author_id" = T3."id")

The last part is based on work of Michael Elsdörfer (miracle2k) in ticket #10972.

I would be grateful if somebody has the time to say if this is at all wanted in Django core. It is too soon to review the patch, but on the other hand I am not willing to give it the polish it needs if this patch has no hope of getting committed.

I also have a version which allows annotating fields without aggregation: qs.field_annotate(age_x2=F('age')*2). It is even more in proof-of-concept state than this patch. The idea is from ticket #14030.

Github branches: https://github.com/akaariai/django/tree/conditional_aggregate and https://github.com/akaariai/django/tree/field_annotation

Changed 3 years ago by akaariai

Proof of concept (latest, ignore the .2.)

comment:10 Changed 3 years ago by foxwhisper

+1 on getting this included in the core!!

comment:11 Changed 3 years ago by akaariai

Just a quick note: I have been thinking about this, and I am now -0 about the approach taken in my patch. I would rather work on an approach that would allow creating custom aggregates more easily. Currently creating custom aggregates is painful, and in many cases downright impossible.

Conditional aggregates are really useful, but they are just a special case of an aggregate that take parameters.

comment:12 Changed 3 years ago by crayz_train

I need the equivalent of sumifs (http://office.microsoft.com/en-us/excel-help/sumifs-function-HA010047504.aspx?CTT=3) and countifs (http://office.microsoft.com/en-us/excel-help/countifs-function-HA010047494.aspx) to replace a report that is currently baked into an excel spreadsheet somewhere.

In a nutshell, we have orders which have a quantity of items, a total amount for the order, tax amount, and are related to a specific customer name. I need to output something like this:
Customer, Total Items (per customer), Total Orders (per customer), Total Amount (per customer), Total Tax (per customer)
UserA,6,3,$50,$5
UserB,34,2,$1800,$100

I would think this sort of thing to be a very common use case. It can be done to the queryset, but sql is good at this sort of thing, so I think it would be better to be able to generate this type of output as a queryset via the ORM. I'm debating whether to fall back to raw() for now.

I'm proselytizing for integration of support for conditional aggregates. I like django much better than Excel, but excel has these tools...

Version 1, edited 3 years ago by crayz_train (previous) (next) (diff)

comment:13 Changed 3 years ago by akaariai

  • Owner set to akaariai

I have turned my mind on the -0. The approach is not the cleanest, but the approach could be included as is (minor refactoring needed, though). Later, when there is better support for custom aggregates and aggregates that take parameters, we could use that framework to implement the conditional aggregates. The user visible API would not change.

So, I will try to clean up the patch a bit. I hope there is enough interest from core developers that we can push this forward. IMHO this is very useful feature, and there seems to be others who think so, too.

This still needs documentation, more tests and then review(s). Helping by testing and reviewing the patch is the way to get this included. Though there is not much point until I have time to update the patch. I hope to do it this weekend.

comment:14 Changed 3 years ago by foxwhisper

akaariai,

Thanks for all your hard work on this - I've added this onto my ever growing todo list as it's a feature that would massively benefit the code we write here on a daily basis.

comment:15 Changed 2 years ago by anonymous

Fully support this being added to the core... conditional aggregate was the first type of aggregate I attempted to do using Django :)

Changed 2 years ago by nate_b

Update for a clean apply.

comment:16 Changed 2 years ago by nate_b

I tried to apply this patch, and it would no longer apply cleanly. The updated poc_11305-3.patch resolves these conflicts.

comment:17 Changed 2 years ago by akaariai

I haven't had time to do anything about this patch. And I will not have time in the near future.

There seems to be at least one change I would like to make. Now, aggregates return sql, params from their .as_sql() but some other columns return just sql. That seems a little unclear, it would be better to return sql, params always from .as_sql(). Of course, the params could be empty.

Otherwise what this feature needs is people reviewing it. Just trying it out for your problem case is enough. Report if it worked in the way you expected, and if the API was sane for your use case. Try to also include a brief description of what you tried. That would help to verify that the feature as written is actually useful. Even if there is no documentation, that should not stop you from doing this. There are some tests in the end of the patch, and those should hopefully make it clear how it works.

Again, no knowledge of the ORM is required. Just try it and see if it does something sane. A very good guide to reviewing patches is available from PostgreSQL wiki. Of course, this is not directly applicable to Django, but still worth a read to see for things you can check.

comment:18 Changed 2 years ago by LighteR

  • Cc unknownlighter@… added

Changed 2 years ago by airstrike

Update to apply cleanly to trunk (1.4+).

comment:19 Changed 2 years ago by airstrike

  • Cc andreterra@… added

comment:20 Changed 2 years ago by iljamaas

  • Cc iljamaas added

comment:21 Changed 2 years ago by akaariai

  • Owner akaariai deleted

I am not planning to work on this ticket in the near future, so I will unclaim this one.

comment:22 Changed 2 years ago by mitar

  • Cc mmitar@… added

Changed 19 months ago by airstrike

Updated to work with 1.5+, except for one test which isn't passing.

comment:23 Changed 19 months ago by airstrike

  • Version changed from 1.0 to 1.5-beta-1

I've updated the patch to work with 1.6, except for a single test which isn't passing. If anyone could take a look at it, I'd appreciate it. I lack the expertise and haven't had the time to make up for it, so any help is welcome.

comment:24 Changed 19 months ago by airstrike

C:\tmp\django>tests\runtests.py aggregation --settings=test_sqlite --failfast
Creating test database for alias 'default'...
Creating test database for alias 'other'...
...E
======================================================================
ERROR: test_annotate_basic (modeltests.aggregation.tests.BaseAggregateTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\tmp\django\tests\modeltests\aggregation\tests.py", line 133, in test_annotate_basic
    b = books.get(pk=1)
  File "c:\virtual\exypnos\src\django\django\db\models\query.py", line 374, in get
    num = len(clone)
  File "c:\virtual\exypnos\src\django\django\db\models\query.py", line 90, in __len__
    self._result_cache = list(self.iterator())
  File "c:\virtual\exypnos\src\django\django\db\models\query.py", line 300, in iterator
    for row in compiler.results_iter():
  File "c:\virtual\exypnos\src\django\django\db\models\sql\compiler.py", line 732, in results_iter
    for rows in self.execute_sql(MULTI):
  File "c:\virtual\exypnos\src\django\django\db\models\sql\compiler.py", line 787, in execute_sql
    sql, params = self.as_sql()
  File "c:\virtual\exypnos\src\django\django\db\models\sql\compiler.py", line 85, in as_sql
    where, w_params = self.query.where.as_sql(qn=qn, connection=self.connection)
  File "c:\virtual\exypnos\src\django\django\db\models\sql\where.py", line 97, in as_sql
    sql, params = child.as_sql(qn=qn, connection=connection)
  File "c:\virtual\exypnos\src\django\django\db\models\sql\where.py", line 97, in as_sql
    sql, params = child.as_sql(qn=qn, connection=connection)
  File "c:\virtual\exypnos\src\django\django\db\models\sql\where.py", line 97, in as_sql
    sql, params = child.as_sql(qn=qn, connection=connection)
  File "c:\virtual\exypnos\src\django\django\db\models\sql\where.py", line 100, in as_sql
    sql, params = self.make_atom(child, qn, connection)
  File "c:\virtual\exypnos\src\django\django\db\models\sql\where.py", line 162, in make_atom
    lvalue, params = lvalue.process(lookup_type, params_or_value, connection)
  File "c:\virtual\exypnos\src\django\django\db\models\sql\where.py", line 362, in process
    connection=connection, prepared=True)
  File "c:\virtual\exypnos\src\django\django\db\models\fields\related.py", line 161, in get_db_prep_lookup
    sql, params = value.as_sql()
  File "c:\virtual\exypnos\src\django\django\db\models\sql\compiler.py", line 85, in as_sql
    where, w_params = self.query.where.as_sql(qn=qn, connection=self.connection)
  File "c:\virtual\exypnos\src\django\django\db\models\sql\where.py", line 100, in as_sql
    sql, params = self.make_atom(child, qn, connection)
  File "c:\virtual\exypnos\src\django\django\db\models\sql\where.py", line 162, in make_atom
    lvalue, params = lvalue.process(lookup_type, params_or_value, connection)
  File "c:\virtual\exypnos\src\django\django\db\models\sql\where.py", line 362, in process
    connection=connection, prepared=True)
  File "c:\virtual\exypnos\src\django\django\db\models\fields\related.py", line 163, in get_db_prep_lookup
    sql, params = value._as_sql(connection=connection)
  File "c:\virtual\exypnos\src\django\django\db\models\query.py", line 964, in _as_sql
    return obj.query.get_compiler(connection=connection).as_nested_sql()
  File "c:\virtual\exypnos\src\django\django\db\models\sql\compiler.py", line 162, in as_nested_sql
    return obj.get_compiler(connection=self.connection).as_sql()
  File "c:\virtual\exypnos\src\django\django\db\models\sql\compiler.py", line 86, in as_sql
    having, h_params = self.query.having.as_sql(qn=qn, connection=self.connection)
  File "c:\virtual\exypnos\src\django\django\db\models\sql\where.py", line 100, in as_sql
    sql, params = self.make_atom(child, qn, connection)
  File "c:\virtual\exypnos\src\django\django\db\models\sql\where.py", line 166, in make_atom
    params = lvalue.field.get_db_prep_lookup(lookup_type, params_or_value, connection)
AttributeError: 'NoneType' object has no attribute 'get_db_prep_lookup'

----------------------------------------------------------------------
Ran 4 tests in 0.311s

FAILED (errors=1)
Destroying test database for alias 'default'...
Destroying test database for alias 'other'...

comment:25 Changed 19 months ago by akaariai

The problem was that source = None ended up in a WhereNode -> hence the error. This was the TODO in the add_aggreate() method.

I added a _very_ basic field type resolving strategy, and now the patch passes all tests. If I recall correctly the #14030 work does have a much better solution to the field type resolving, and it might be wise to see if that one can be worked into master first.

Changed 19 months ago by akaariai

comment:26 Changed 18 months ago by nkryptic

  • Cc nkryptic added

comment:27 Changed 17 months ago by nkryptic

  • Cc nkryptic removed

comment:28 Changed 17 months ago by nkryptic

  • Cc nkryptic@… added

comment:29 Changed 17 months ago by akaariai

  • Component changed from ORM aggregation to Database layer (models, ORM)

comment:30 Changed 16 months ago by fhahn

  • Cc flo@… added
  • Keywords aggregate, annotate added
  • Version changed from 1.5-beta-1 to master

comment:31 Changed 11 months ago by rhunwicks

+1 for this

In the meantime there is https://github.com/henriquebastos/django-aggregate-if which is also on pypi at https://pypi.python.org/pypi/django-aggregate-if/0.3.1. I'm not associated with that code, other than as a satisfied user

It matches the API proposed here, so upgrading should be straightforward if this ticket is fixed in trunk.

comment:32 Changed 10 months ago by garrypolley

  • Cc garrypolley added

Changed 10 months ago by garrypolley

patch_update_for_Django_1.6-1.7

comment:33 Changed 10 months ago by garrypolley

The patch I added (11305-2013-09-07-master.patch) is mostly broken. I'm fixing it now and hope to update it. However, the tests currently fail with wrong values, which likely means it's broken. :(

comment:34 Changed 10 months ago by garrypolley

I think the idea that akaariai has is a good one.

I'll continue to try and hack away at this, however, I don't think the current patch can be easily applied. There have been some bug fixes to aggregates that removed some stuff that the original patch was using. I'll keep the idea of Q, F expressions being what modify Sum, Avg, etc.

I like this ticket, and need it, so I'll keep working on it.

comment:35 Changed 9 months ago by trbs

  • Cc trbs@… added

comment:36 Changed 6 months ago by smeatonj

I think that it's worth mentioning here that work on extending the usefulness of aggregates and expressions (#14030) has led to an API capable of constructing conditional aggregates by composing different functions. The support is experimental at the moment, and the 14030 patch needs thorough review, but the implementation might be of interest.

https://github.com/django/django/pull/2184 and specifically https://github.com/jarshwah/django/commit/13696d39d249b011902e0b932ece2bb36ae45c58

Add Comment

Modify Ticket

Change Properties
<Author field>
Action
as new
The owner will be changed from (none) to anonymous. Next status will be 'assigned'
as The resolution will be set. Next status will be 'closed'
Author


E-mail address and user name can be saved in the Preferences.

 
Note: See TracTickets for help on using tickets.