Opened 15 years ago
Closed 10 years ago
#11305 closed New feature (fixed)
Support for "Conditional Aggregates"
Reported by: | Ben Davis | Owned by: | |
---|---|---|---|
Component: | Database layer (models, ORM) | Version: | dev |
Severity: | Normal | Keywords: | aggregate, annotate |
Cc: | cal@…, Anssi Kääriäinen, unknownlighter@…, andreterra@…, Ilja Maas, mmitar@…, nkryptic@…, flo@…, Garry Polley, trbs@… | Triage Stage: | Someday/Maybe |
Has patch: | yes | Needs documentation: | yes |
Needs tests: | yes | Patch needs improvement: | yes |
Easy pickings: | no | UI/UX: | no |
Description
This is really a feature request, but I'm posting it here in the hopes that it gets a little backing from the community.
I had a situation where I needed to write a query that would sum up demographics information for items ordered. I needed to count the total number of orders for each type of ethnicity, grouped by the item ordered. So the results should look something like this:
item | caucasian_count | african_american_count | ... ------------+-----------------+------------------------+ Test item | 55 | 72 | Test item 2 | 23 | 48 | ...
The models look roughly like this:
class Item(models.Model): title = CharField(max_length=255) #... (snip) ... class Ethnicity(models.Model): name = CharField(max_length=20) class User(models.Model): #... (snip) ... ethnicity = ForeignKey(Ethnicity) class Order(models.Model): #... (snip) ... user = ForeignKey(User) item = ForeignKey(Item)
The best way for me to accomplish this in a single query was to use conditional aggregates. In SQL, this would be done using
SELECT COUNT(IF(ethnicity='caucasian',TRUE,NULL)) AS caucasian_count, COUNT(IF(ethnicity='african_american',TRUE,NULL)) AS african_american_count ...
Now, I could have just as well written a raw SQL query, but I like django's ORM so much it pains me to not take advantage of it as much as I can.
My idea was for a new type of aggregate class called "CountIf()", which would act just like count, except that you could also pass filter conditions. The resulting django query might look something like this:
query = Order.objects.values('item') #Orders grouped by item for ethnicity in Ethnicity.objects.all(): query = query.annotate(CountIf(user__ethnicity=ethnicity))
... which would then generate the SQL clause similar to above. Unfortunately, the task of converting the condition itself into SQL was a bit too daunting, so I took a small workaround that only required minimal SQL:
from django.db.models.sql import aggregates class CountIf(aggregates.Count): """ Hack Count() to get a conditional count working. """ is_ordinal = True sql_function = 'COUNT' sql_template = '%(function)s(IF(%(condition)s,TRUE,NULL))'
which allowed me to do something like this:
query = Order.objects.values('item') #Orders grouped by item for ethnicity in Ethnicity.objects.all(): condition = 'myap_user.ethnicity_id=%s' % ethnicity.id query = query.annotate(CountIf(user__ethnicity, condition=condition))
If we could figure out a way to convert django-esque filter conditions into the SQL necessary for the condition in the "sql_template" of the aggregate, that would be epic! It seems like someone with a good knowlege of the inner-workings of django.db.models.sql.query would be able to figure it out pretty quickly, but hopefully someone will let me know if this is just a pipe dream :)
Attachments (7)
Change History (47)
comment:1 by , 15 years ago
Component: | Uncategorized → ORM aggregation |
---|---|
Owner: | removed |
Triage Stage: | Unreviewed → Someday/Maybe |
comment:2 by , 15 years ago
I've implemented something similar to this this way: http://paste.pocoo.org/show/123690/
comment:3 by , 15 years ago
I like Alex's solution, but it runs into errors when using a condition on a foreign key field since the where construction part doesn't add joins to the query. I've hacked around this problem to meet my needs. You can read more about the problem and the hackish solution at http://voteruniverse.com/Members/jlantz/blog/conditional-aggregates-in-django
comment:4 by , 15 years ago
Support for conditional aggregates would make Django more appealing for developers who work on projects that primarily serve up numbers. Computation off of annotated aggregates would be used quite often in those cases, i.e. metrics, segmentation analysis, etc.
Examples -
Financial websites, i.e. Google Finance
Personal finance, i.e. Mint
Search analytics, i.e. Google Analytics
comment:5 by , 14 years ago
Severity: | → Normal |
---|---|
Type: | → New feature |
comment:6 by , 13 years ago
Cc: | added |
---|---|
Easy pickings: | unset |
UI/UX: | unset |
Hi all,
I came across this same problem myself today, and spoke to Russ who pointed me to this thread.
I was planning on writing up a patch for this myself (for consideration into the core), which had support for Q() / F(), and such. However, the patch would have needed to be cross database compatible, and sadly, I just don't have the spare time to write this for all the databases.
Full discussion here:
http://groups.google.com/group/django-users/browse_thread/thread/723ce5a13530992d
I would be more than willing to write the MySQL aggregates patch, if others can contribute with the other supported databases and testing etc. If anyone would like to join forces on this to help get it included in the core, let me know via ticket.
Cal Leeming
PS: My use case for this +1 feature vote, is due to having 40 million rows of data, and needing to avoid using layers of filter().aggregate(), which would in turn cause multiple queries needing to be run.
comment:7 by , 13 years ago
Cc: | added |
---|---|
Has patch: | set |
Needs documentation: | set |
Needs tests: | set |
Patch needs improvement: | set |
I implemented a proof of concept patch for this.
Using the patch, one can do queries like (using the modeltests/aggregation models):
# The average age of author's friends, whose age is lower than the authors age. vals = Author.objects.aggregate(Avg("friends__age", only=Q(friends__age__lt=F('age')))) vals["friends__age__avg"] == 30.43
This will create the following SQL:
SELECT AVG(CASE WHEN T3."age" < "aggregation_author"."age" THEN T3."age" ELSE null END) AS "friends__age__avg" FROM "aggregation_author" LEFT OUTER JOIN "aggregation_author_friends" ON ("aggregation_author"."id" = "aggregation_author_friends"."from_author_id") LEFT OUTER JOIN "aggregation_author" T3 ON ("aggregation_author_friends"."to_author_id" = T3."id")
Every aggregate now can use the "only" kwarg. The only kwarq must be a Q object. Lookups in only are allowed only if they do not generate additional joins.
There are 3 main issues with this patch:
- The biggest problem is that aggregates need to return query parameters. These parameters should be escaped by the database backend. This causes all sorts of ugliness. I have checked in various places if as_sql returns a tuple, and if so, added the parameters to the query parameters. This isn't an ideal solution, but my skills aren't good enough to work out a better solution.
- Does the CASE WHEN construction work on all supported databases? I have tested this on PostgreSQL and SQLite3. I think this will work on any DB but I haven't tested it. MySQL 4.x is the main concern.
- The usage of add_q in add_aggregate isn't a clean solution.
There are some new tests included. I have ran the full test suite both on PostgreSQL 8.4 and SQLite3 version 3.7.4, and this passes the test suite. Annotations should also work.
As said, this is a POC patch, and I do expect it to fail if given a complex enough query. Also there aren't nearly enough tests, and there is no documentation.
comment:8 by , 13 years ago
Fixed one error in sql/query.py: if add_aggregate's call to add_q would add the q to self.having, it would get lost. Fixed this by disallowing touching of self.having in add_aggregate.
Also changed Exception to FieldError in sql/query.py add_aggregate.
by , 13 years ago
Attachment: | poc_11305.2.patch added |
---|
comment:9 by , 13 years ago
Updated patch. Now with aggregate.relabel_aliases() and .extra() fixed.
I also included the ability to do F() calculations in aggregate and annotate:
# calculate how much younger the author's friends are on average # (only those friends included that are actually younger than the author) vals = Author.objects.aggregate( friends_younger_avg=Avg(F('age') - F("friends__age"), only=Q(friends__age__lt=F('age'))) ) vals["friends_younger_avg"] == 7.29
which generates the following SQL:
SELECT AVG(CASE WHEN T3."age" < "aggregation_author"."age" THEN "aggregation_author"."age" - "T3"."age" ELSE null END) AS "friends_younger_avg" FROM "aggregation_author" LEFT OUTER JOIN "aggregation_author_friends" ON ("aggregation_author"."id" = "aggregation_author_friends"."from_author_id") LEFT OUTER JOIN "aggregation_author" T3 ON ("aggregation_author_friends"."to_author_id" = T3."id")
The last part is based on work of Michael Elsdörfer (miracle2k) in ticket #10972.
I would be grateful if somebody has the time to say if this is at all wanted in Django core. It is too soon to review the patch, but on the other hand I am not willing to give it the polish it needs if this patch has no hope of getting committed.
I also have a version which allows annotating fields without aggregation: qs.field_annotate(age_x2=F('age')*2). It is even more in proof-of-concept state than this patch. The idea is from ticket #14030.
Github branches: https://github.com/akaariai/django/tree/conditional_aggregate and https://github.com/akaariai/django/tree/field_annotation
comment:11 by , 13 years ago
Just a quick note: I have been thinking about this, and I am now -0 about the approach taken in my patch. I would rather work on an approach that would allow creating custom aggregates more easily. Currently creating custom aggregates is painful, and in many cases downright impossible.
Conditional aggregates are really useful, but they are just a special case of an aggregate that take parameters.
comment:12 by , 13 years ago
I need the equivalent of sumifs (http://office.microsoft.com/en-us/excel-help/sumifs-function-HA010047504.aspx?CTT=3) and countifs (http://office.microsoft.com/en-us/excel-help/countifs-function-HA010047494.aspx) to replace a report that is currently baked into an excel spreadsheet somewhere.
In a nutshell, we have orders which have a quantity of items, a total amount for the order, tax amount, and are related to a specific customer name. I need to output something like this:
Customer, Total Items (per customer), Total Orders (per customer), Total Amount (per customer), Total Tax (per customer)
UserA,6,3,$50,$5
UserB,34,2,$1800,$100
I would think this sort of thing to be a very common use case. It can be done to the queryset, but sql is good at this sort of thing, so I think it would be better to be able to generate this type of output as a queryset. I'm proselytizing for the integrating the support for conditional aggregates. I like django much better than Excel, but excel has these tools...
comment:13 by , 13 years ago
Owner: | set to |
---|
I have turned my mind on the -0. The approach is not the cleanest, but the approach could be included as is (minor refactoring needed, though). Later, when there is better support for custom aggregates and aggregates that take parameters, we could use that framework to implement the conditional aggregates. The user visible API would not change.
So, I will try to clean up the patch a bit. I hope there is enough interest from core developers that we can push this forward. IMHO this is very useful feature, and there seems to be others who think so, too.
This still needs documentation, more tests and then review(s). Helping by testing and reviewing the patch is the way to get this included. Though there is not much point until I have time to update the patch. I hope to do it this weekend.
comment:14 by , 13 years ago
akaariai,
Thanks for all your hard work on this - I've added this onto my ever growing todo list as it's a feature that would massively benefit the code we write here on a daily basis.
comment:15 by , 13 years ago
Fully support this being added to the core... conditional aggregate was the first type of aggregate I attempted to do using Django :)
comment:16 by , 13 years ago
I tried to apply this patch, and it would no longer apply cleanly. The updated poc_11305-3.patch resolves these conflicts.
comment:17 by , 13 years ago
I haven't had time to do anything about this patch. And I will not have time in the near future.
There seems to be at least one change I would like to make. Now, aggregates return sql, params from their .as_sql() but some other columns return just sql. That seems a little unclear, it would be better to return sql, params always from .as_sql(). Of course, the params could be empty.
Otherwise what this feature needs is people reviewing it. Just trying it out for your problem case is enough. Report if it worked in the way you expected, and if the API was sane for your use case. Try to also include a brief description of what you tried. That would help to verify that the feature as written is actually useful. Even if there is no documentation, that should not stop you from doing this. There are some tests in the end of the patch, and those should hopefully make it clear how it works.
Again, no knowledge of the ORM is required. Just try it and see if it does something sane. A very good guide to reviewing patches is available from PostgreSQL wiki. Of course, this is not directly applicable to Django, but still worth a read to see for things you can check.
comment:18 by , 13 years ago
Cc: | added |
---|
comment:19 by , 12 years ago
Cc: | added |
---|
comment:20 by , 12 years ago
Cc: | added |
---|
comment:21 by , 12 years ago
Owner: | removed |
---|
I am not planning to work on this ticket in the near future, so I will unclaim this one.
comment:22 by , 12 years ago
Cc: | added |
---|
by , 12 years ago
Attachment: | conditional_aggregates.1.6.patch added |
---|
Updated to work with 1.5+, except for one test which isn't passing.
comment:23 by , 12 years ago
Version: | 1.0 → 1.5-beta-1 |
---|
I've updated the patch to work with 1.6, except for a single test which isn't passing. If anyone could take a look at it, I'd appreciate it. I lack the expertise and haven't had the time to make up for it, so any help is welcome.
comment:24 by , 12 years ago
C:\tmp\django>tests\runtests.py aggregation --settings=test_sqlite --failfast Creating test database for alias 'default'... Creating test database for alias 'other'... ...E ====================================================================== ERROR: test_annotate_basic (modeltests.aggregation.tests.BaseAggregateTestCase) ---------------------------------------------------------------------- Traceback (most recent call last): File "C:\tmp\django\tests\modeltests\aggregation\tests.py", line 133, in test_annotate_basic b = books.get(pk=1) File "c:\virtual\exypnos\src\django\django\db\models\query.py", line 374, in get num = len(clone) File "c:\virtual\exypnos\src\django\django\db\models\query.py", line 90, in __len__ self._result_cache = list(self.iterator()) File "c:\virtual\exypnos\src\django\django\db\models\query.py", line 300, in iterator for row in compiler.results_iter(): File "c:\virtual\exypnos\src\django\django\db\models\sql\compiler.py", line 732, in results_iter for rows in self.execute_sql(MULTI): File "c:\virtual\exypnos\src\django\django\db\models\sql\compiler.py", line 787, in execute_sql sql, params = self.as_sql() File "c:\virtual\exypnos\src\django\django\db\models\sql\compiler.py", line 85, in as_sql where, w_params = self.query.where.as_sql(qn=qn, connection=self.connection) File "c:\virtual\exypnos\src\django\django\db\models\sql\where.py", line 97, in as_sql sql, params = child.as_sql(qn=qn, connection=connection) File "c:\virtual\exypnos\src\django\django\db\models\sql\where.py", line 97, in as_sql sql, params = child.as_sql(qn=qn, connection=connection) File "c:\virtual\exypnos\src\django\django\db\models\sql\where.py", line 97, in as_sql sql, params = child.as_sql(qn=qn, connection=connection) File "c:\virtual\exypnos\src\django\django\db\models\sql\where.py", line 100, in as_sql sql, params = self.make_atom(child, qn, connection) File "c:\virtual\exypnos\src\django\django\db\models\sql\where.py", line 162, in make_atom lvalue, params = lvalue.process(lookup_type, params_or_value, connection) File "c:\virtual\exypnos\src\django\django\db\models\sql\where.py", line 362, in process connection=connection, prepared=True) File "c:\virtual\exypnos\src\django\django\db\models\fields\related.py", line 161, in get_db_prep_lookup sql, params = value.as_sql() File "c:\virtual\exypnos\src\django\django\db\models\sql\compiler.py", line 85, in as_sql where, w_params = self.query.where.as_sql(qn=qn, connection=self.connection) File "c:\virtual\exypnos\src\django\django\db\models\sql\where.py", line 100, in as_sql sql, params = self.make_atom(child, qn, connection) File "c:\virtual\exypnos\src\django\django\db\models\sql\where.py", line 162, in make_atom lvalue, params = lvalue.process(lookup_type, params_or_value, connection) File "c:\virtual\exypnos\src\django\django\db\models\sql\where.py", line 362, in process connection=connection, prepared=True) File "c:\virtual\exypnos\src\django\django\db\models\fields\related.py", line 163, in get_db_prep_lookup sql, params = value._as_sql(connection=connection) File "c:\virtual\exypnos\src\django\django\db\models\query.py", line 964, in _as_sql return obj.query.get_compiler(connection=connection).as_nested_sql() File "c:\virtual\exypnos\src\django\django\db\models\sql\compiler.py", line 162, in as_nested_sql return obj.get_compiler(connection=self.connection).as_sql() File "c:\virtual\exypnos\src\django\django\db\models\sql\compiler.py", line 86, in as_sql having, h_params = self.query.having.as_sql(qn=qn, connection=self.connection) File "c:\virtual\exypnos\src\django\django\db\models\sql\where.py", line 100, in as_sql sql, params = self.make_atom(child, qn, connection) File "c:\virtual\exypnos\src\django\django\db\models\sql\where.py", line 166, in make_atom params = lvalue.field.get_db_prep_lookup(lookup_type, params_or_value, connection) AttributeError: 'NoneType' object has no attribute 'get_db_prep_lookup' ---------------------------------------------------------------------- Ran 4 tests in 0.311s FAILED (errors=1) Destroying test database for alias 'default'... Destroying test database for alias 'other'...
comment:25 by , 12 years ago
The problem was that source = None ended up in a WhereNode -> hence the error. This was the TODO in the add_aggreate() method.
I added a _very_ basic field type resolving strategy, and now the patch passes all tests. If I recall correctly the #14030 work does have a much better solution to the field type resolving, and it might be wise to see if that one can be worked into master first.
by , 12 years ago
Attachment: | conditional_aggregates.1.6.cleanup.patch added |
---|
comment:26 by , 12 years ago
Cc: | added |
---|
comment:27 by , 12 years ago
Cc: | removed |
---|
comment:28 by , 12 years ago
Cc: | added |
---|
comment:29 by , 12 years ago
Component: | ORM aggregation → Database layer (models, ORM) |
---|
comment:30 by , 12 years ago
Cc: | added |
---|---|
Keywords: | aggregate annotate added |
Version: | 1.5-beta-1 → master |
comment:31 by , 11 years ago
+1 for this
In the meantime there is https://github.com/henriquebastos/django-aggregate-if which is also on pypi at https://pypi.python.org/pypi/django-aggregate-if/0.3.1. I'm not associated with that code, other than as a satisfied user
It matches the API proposed here, so upgrading should be straightforward if this ticket is fixed in trunk.
comment:32 by , 11 years ago
Cc: | added |
---|
comment:33 by , 11 years ago
The patch I added (11305-2013-09-07-master.patch) is mostly broken. I'm fixing it now and hope to update it. However, the tests currently fail with wrong values, which likely means it's broken. :(
comment:34 by , 11 years ago
I think the idea that akaariai has is a good one.
I'll continue to try and hack away at this, however, I don't think the current patch can be easily applied. There have been some bug fixes to aggregates that removed some stuff that the original patch was using. I'll keep the idea of Q, F expressions being what modify Sum, Avg, etc.
I like this ticket, and need it, so I'll keep working on it.
comment:35 by , 11 years ago
Cc: | added |
---|
comment:36 by , 11 years ago
I think that it's worth mentioning here that work on extending the usefulness of aggregates and expressions (#14030) has led to an API capable of constructing conditional aggregates by composing different functions. The support is experimental at the moment, and the 14030 patch needs thorough review, but the implementation might be of interest.
https://github.com/django/django/pull/2184 and specifically https://github.com/jarshwah/django/commit/13696d39d249b011902e0b932ece2bb36ae45c58
comment:37 by , 10 years ago
It should be noted that the use-case in the ticket description might be better served by allowing better grouping options. That is, the ideal SQL is not
SELECT item, COUNT(IF(ethnicity='caucasian',TRUE,NULL)) AS caucasian_count, COUNT(IF(ethnicity='african_american',TRUE,NULL)) AS african_american_count ...
but rather
SELECT item, ethnicity, COUNT(*) FROM orders JOIN items ON ... JOIN users ON ... GROUP BY item, ethnicity
Such grouping is currently not supported by the Django ORM, and I am not aware of a proposal to add it.
comment:38 by , 10 years ago
Forgive me if I'm missing something, but isn't the suggested query you noted the following:
orders = Order.objects.values('item__name', 'person__ethnicity__name').annotate(Count('id')) >>> for order in orders: ... print order ... {'item__name': u'B', 'person__ethnicity__name': u'caucasian', 'id__count': 212} {'item__name': u'C', 'person__ethnicity__name': u'caucasian', 'id__count': 214} {'item__name': u'A', 'person__ethnicity__name': u'caucasian', 'id__count': 227} {'item__name': u'C', 'person__ethnicity__name': u'black', 'id__count': 124} {'item__name': u'A', 'person__ethnicity__name': u'black', 'id__count': 118} {'item__name': u'B', 'person__ethnicity__name': u'black', 'id__count': 104} >>> print( Order.objects.values('item__name', 'person__ethnicity__name').annotate(Count('id')).query ) SELECT "ticket11305_item"."name", "ticket11305_ethnicity"."name", COUNT("ticket11305_order"."id") AS "id__count" FROM "ticket11305_order" INNER JOIN "ticket11305_item" ON ( "ticket11305_order"."item_id" = "ticket11305_item"."id" ) INNER JOIN "ticket11305_person" ON ( "ticket11305_order"."person_id" = "ticket11305_person"."id" ) INNER JOIN "ticket11305_ethnicity" ON ( "ticket11305_person"."ethnicity_id" = "ticket11305_ethnicity"."id" ) GROUP BY "ticket11305_item"."name", "ticket11305_ethnicity"."name"
comment:39 by , 10 years ago
#14030 has been merged, making it possible to move on here, as far as I understand.
comment:40 by , 10 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
#24031 is now implemented as well as #14030 so it's now possible to use conditional aggregates:
>>> Client.objects.aggregate( ... regular=Sum( ... Case(When(account_type=Client.REGULAR, then=Value(1)), ... output_field=IntegerField()) ... ), ... gold=Sum( ... Case(When(account_type=Client.GOLD, then=Value(1)), ... output_field=IntegerField()) ... ), ... platinum=Sum( ... Case(When(account_type=Client.PLATINUM, then=Value(1)), ... output_field=IntegerField()) ... ) ... )
I'm slightly concerned that this could a little too specific. What you're proposing isn't a simple aggregate - it's a very specific combination of aggregates. I'm not convinced that it is a good idea to start embedding extremely complex aggregate clauses as part of the Django core.
However, the Django core should be exposing the tools that make it possible for end-users to write their own complex aggregates. It looks like we're most of the way there - you've been able to write a custom aggregate that writes most of the SQL. All we're missing is a clean interface for converting field lookup syntax into joins.
The SQL query interface contains utility functions that do this, but these functions aren't easily digested by end users (hell, they make my head spin whenever I look at them). There is also an advantage to keeping the SQL query interface as internal code - it means we don't have to maintain backwards compatibility guarantees. However, as a 'someday/maybe' goal, making SQL generation a clean interface is certainly something to strive for.
Anyone looking to implement this might want to look for hints in the implementation of F() clauses. This gives a pretty clean code path for lookups and joins.