#14030 closed New feature (fixed)
Use F() objects in aggregates(), annotates() and values()
Reported by: | delfick | Owned by: | Josh Smeaton |
---|---|---|---|
Component: | Database layer (models, ORM) | Version: | dev |
Severity: | Normal | Keywords: | aggregate, annotate |
Cc: | Robin, Johannes Bornhold, cg@…, miloslav.pojman@…, taylor.mitchell@…, michael@…, romainr, janos@…, michal.modzelewski@…, Michael Angeletti, nkryptic@…, flo@…, titanjer@…, Artem Skoretskiy, josh.smeaton@…, jorgecarleitao@…, mmanfre@…, Sergey Dobrov, info+coding@… | Triage Stage: | Accepted |
Has patch: | yes | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
Hello,
Would it be possible to be able to use F() objects in aggregate queries ?
(and others such as annotate() and values())
So I can do something like
myModel.objects.aggregate(Sum(F('field1') - F('field2')))
Attachments (2)
Change History (74)
comment:1 by , 14 years ago
Component: | Uncategorized → ORM aggregation |
---|---|
Owner: | removed |
comment:2 by , 14 years ago
Triage Stage: | Unreviewed → Accepted |
---|
comment:3 by , 14 years ago
Cc: | added |
---|
comment:4 by , 14 years ago
Cc: | added |
---|
comment:5 by , 14 years ago
Cc: | added |
---|
comment:6 by , 14 years ago
Cc: | added |
---|
comment:7 by , 13 years ago
Severity: | → Normal |
---|---|
Type: | → New feature |
comment:8 by , 13 years ago
Cc: | added |
---|---|
Easy pickings: | unset |
UI/UX: | unset |
comment:10 by , 13 years ago
For the record, the patch in "Support for Conditional Aggregates" (#11305) implements this for .aggregate() and .annotate().
comment:11 by , 13 years ago
Cc: | added |
---|
comment:12 by , 13 years ago
Cc: | added |
---|
by , 13 years ago
Attachment: | 14030.patch added |
---|
comment:13 by , 13 years ago
Has patch: | set |
---|---|
Owner: | set to |
Status: | new → assigned |
This patch rebases out the changes from akaariai's patch to #11305 necessary to use F() objects inside aggregation functions, as well as adding the functionality required for them to be used directly as a kwarg to aggregate and annotate.
Just as a note: if accepted, a very clean patch can be made from the difference between this and akaariai's #11305 patch.
comment:14 by , 13 years ago
Cc: | added |
---|
comment:15 by , 13 years ago
Patch needs improvement: | set |
---|---|
Version: | 1.2 |
I am wondering about the hack in the patch: that is, adding an empty Aggregate over the given F expression. I guess the reason is to make this work:
>>> q = Entry.objects.annotate(cpb_ratio=F('n_comments')*1.0/F('n_pingbacks'))
Now, I have been wondering about something similar. In fact, I have been thinking of creating expressions you can inject into the query by .annotate().
caseexpr = RawSQL("case when %s > 0 then 'has '||%s||' comments' else 'no comments' end", params=(F('n_comments'), F('n_comments')))
And you could then do
>>> q = Entry.objects.annotate(has_comments=caseexpr).order_by('has_comments') (or, in my dreams the .order_by parameter could be .order_by(RawSQL('%s asc nulls last', params=(F('has_comments'),))
In effect, you could get rid of most .extra() uses by the above annotations. I have some reason to believe that the above changes, if implemented correctly, would make the ORM both faster and easier to understand. Of course, without an actual patch it is easy to claim that... :)
Now, my point here is that I really like the ability to use annotations for other expressions than aggregates. But I really, really do not like the idea of adding an empty aggregate that in fact is not an aggregate. I bet that will break under some complex query, because Django thinks you have an aggregate in the query, but in SQL you do not have one. So, +1 on the idea, -1 on the current implementation.
BTW: I am clearing the version field, from the documentation it seems the version field "Finally, it is possible to use the version attribute to indicate in which version the reported bug was identified.". Now, I read that as if there is no bug, there should not be any version either. Learning this as I write, so please correct if I am wrong.
by , 13 years ago
Attachment: | 14030-2.patch added |
---|
Removed "aggregate that in fact is not an aggregate" hack
comment:16 by , 13 years ago
Admittedly, that was a bit of a lazy work around. I think my second patch is an improvement, and I'd love your feedback. My approach this time was instead of squishing the F()
object into an Aggregate
, I loosened the requirements of the query object so that it could directly handle a SQLEvaluator
.
comment:17 by , 13 years ago
The patches posted here are insufficient to effectively solve this problem.
I have come up with a better solution: refactoring of Aggregate
into ExpressionNode
.
I have uploaded my branch; it can be found here:
https://github.com/NateBragg/django/tree/14030
Some particular points of note:
- I tried to preserve as much interface as possible; I didn't know how much was considered to be more public, so generally I tried to add rather than subtract. However, I did remove a couple things - if you see something missing that shouldn't be, let me know.
- Currently, I'm getting the entire test suite passed on sqlite, postgres, mysql, oracle, and postgis. I was unable to test on oracle spatial - any help with that would be appreciated.
- When fields are combined, they are coerced to a common type;
IntegerField
s are coerced toFloatField
s, which are coerced intoDecimalField
s as needed. Any other kinds of combinations must be of the same types. Also, when coerced to aDecimalField
, the precision is limited by the originalDecimalField
. If this is not correct, or other coercions should be added, I'd like to correct that. - When joins are required, they tend to be
LEFT OUTER
; I'd like some feedback on this, as I'm not 100% sure its always the best behavior. - As the aggregates are a little more complicated, on trivial cases there is a minor reduction in performance; using djangobench, I measured somewhere between a 3% and 8% increase in runtime.
- I don't have enough tests - mostly for a lack of creativity. What kind of composed aggregates and expressions would you like to see? I'll add tests for them.
comment:18 by , 12 years ago
Cc: | added |
---|
comment:19 by , 12 years ago
Now that the Django repo has been moved to github, I have made a pull request of this patch:
comment:20 by , 12 years ago
FYI, I've updated the patch in #11305 to apply cleanly to trunk, but I'm not sure how it would conflict with Nate's branch. Just a heads up!
comment:21 by , 12 years ago
I closed the above pull request because I wanted to first clean up the ORM, then add new stuff. The ORM hasn't been "cleaned up" even if some work has been done.
I now feel it would be a good idea to just work this into master. This should clean up the implementation of aggregates a bit. If we want to refactor other parts of the ORM we can do this even after this gets merged.
I don't expect to get this actually into master anytime soon. Too much to do, too little time. But, I do intend to work on this when time permits.
comment:22 by , 12 years ago
Cc: | added |
---|
comment:23 by , 12 years ago
Cc: | added |
---|
comment:24 by , 12 years ago
Component: | ORM aggregation → Database layer (models, ORM) |
---|
comment:25 by , 12 years ago
Keywords: | aggregate annotate added |
---|
comment:26 by , 12 years ago
Cc: | added |
---|
comment:27 by , 12 years ago
Cc: | added |
---|
comment:28 by , 12 years ago
I have mostly rebased the patch to current master, available from https://github.com/akaariai/django/commit/5e37ecc64610c0582a9c99bf4c276351c528262f
The current code doesn't seem good to me. The biggest problem is in the existing F() <-> SQLEvaluator interaction which isn't too clean (but works OK). Add in aggregates and the code is hard to understand. I don't know if there is a possibility to refactor the whole SQLEvaluator <-> F() structure. It just doesn't feel right currently, but on the other hand I don't know what the right structure is. Maybe carefully inspecting why things are done in the way they are done is the right approach.
As for the current patch - the result type coersion in sql/expressions.py feels scary. It does work for the very limited tests in the patch, but I am pretty sure that it is broken for many use cases. Also, the current patch will make defining custom aggregates very hard, this needs improvement. Still, checking if a node contains aggregates isn't clear currently.
The code isn't ready for commit, even if it manages to pass all tests on sqlite. PostgreSQL passes transactions and expressions tests and postgis passes all gis tests.
comment:29 by , 12 years ago
I'm currently looking into the aggregation stuff, because I want to write a proposal for the "Improve annotation and aggregation" GSOC 2013 idea. I'll post a few things that came to my mind when looking at this issue.
I'd make django.db.models.aggregates.Aggregate a subclass of ExpressionNode as well, but I wouldn't move the SQL creation for the aggregate functions in django.db.backends.BaseDatabaseOperations.combine_expression (not like the patch), because this would make adding custom aggregate functions really hard, because all sql functions must be part of a dict in django.db.backends.BaseDatabaseOperations. When django.db.models.aggregates.Aggregate is a subclass of expression node, it's possible to make django.db.models.sql.aggregates.Aggregate a subclass of SQLEvaluator and remove some duplicated code (as shown in your patch).
I'd rather update BaseDatabaseOperations.combine_expressions to check if the current connector is a aggregate function and generate the SQL accordingly. I think the Aggregate class should be the only place that has to know about the SQL code of the specific function, maybe through a as_sql method or something similar. In order to make it easier to write custom aggregate classes, we could add some callbacks to the Aggregate class to generate SQL for different backends (the existing aggregate functions are the same for all backends, but there are some (like string manipulation functions) that are different on some backends), instead of handling the custom SQL generation in the database backend (like for the mathematical operations, see https://github.com/django/django/blob/master/django/db/backends/oracle/base.py#L436). This would allow cross platform custom aggregations without modifying any backend code, but on the other hand this would be different from the way all the database specific SQL generation is handled at the moment, so I'm not sure about that.
To check if the connector is a aggreate I'd pass the whole node object to combine_expressions instead of the connector (string), to use isinstance to check if the node is a aggregate function. This way, it would keep the aggregation handling more extensible.
After fixing this issue I'd like to tackle #11305 (Support for "Conditional Aggregates"). When Aggregate is a subclass of ExpressionNode, it shouldn't be too hard to implement nesting of aggregates and this would allow adding new aggregate functions like an If()
that could be used inside any existing aggregate, akaariai's example in the discussion about #14030 would look like:
# calculate how much younger the author's friends are on average # (only those friends included that are actually younger than the author) vals = Author.objects.aggregate( friends_younger_avg=Avg(If(condition=Q(friends__age__lt=F('age')), true=F('age') - F('friends__age')) )
I think this would provide a flexible API that could be used for a lot of cases, because it could be combined with all other aggregate functions.
I'm quite new to contributing to Django, so I apologize for any misjudgement!
comment:30 by , 11 years ago
I'm wondering if there's been any traction on this ticket? It appeared like akaariai was looking into it, but after rebasing it hasn't been touched. I can't find a GSOC 2013 proposal linked to this ticket, so I'm unsure if the last update resulted in anything either. This is a ticket I'm deeply interested in, and wouldn't mind investing some time into it myself.
Would it be better to pick up the current branch and work on some of the issues that have been raised ( F() <-> SQLEvaluator ) etc? Or would it perhaps be a better idea to tackle this from the current master, and start anew? I guess I'm looking for a little direction before jumping into it.
comment:31 by , 11 years ago
Needs documentation: | set |
---|
I've made an attempt at refactoring Aggregates into ExpressionNode as nateb originally did. The points brought up by @akaariai were mostly addressed:
- The biggest problem is in the existing F() <-> SQLEvaluator interaction
Addressed by removing the SQLEvaluator structure, and allowing the Expression to evaluate itself. Backends can customise the sql generation using the as_{vendor} approach, introduced in the latest lookups refactor.
- the result type coersion in sql/expressions.py feels scary
Addressed by introducing an output_type
param, which is required for mixed-type expressions.
Discussion on this particular attempt is on the mailing list: https://groups.google.com/forum/#!topic/django-developers/8vEAwSwJGMc
The incomplete pull request is: https://github.com/django/django/pull/2184
comment:32 by , 11 years ago
Cc: | added |
---|
I would add one more use case here (as core devs proposed in #21738):
User.objects.annotate(new_is_active=SQL('CASE WHEN id > 100 THEN 1 ELSE 0 END')).update(is_active=models.F('new_is_active'))
comment:33 by , 11 years ago
Needs tests: | set |
---|
comment:34 by , 11 years ago
Needs tests: | unset |
---|
comment:35 by , 11 years ago
Support for non-aggregate annotations has been partially implemented on-top of the aggregates refactor, details (and diff) available here: https://github.com/django/django/pull/2184
comment:36 by , 11 years ago
Cc: | added |
---|
comment:37 by , 11 years ago
Owner: | changed from | to
---|---|
Patch needs improvement: | unset |
comment:38 by , 11 years ago
Quick note - the implementation above covers ticket #20930. If accepted, both tickets can be closed.
comment:39 by , 10 years ago
https://github.com/django/django/pull/2496
Final PR implementing non-aggregate annotations and aggregate arithmetic
comment:40 by , 10 years ago
Needs documentation: | unset |
---|---|
Triage Stage: | Accepted → Ready for checkin |
comment:41 by , 10 years ago
I am finally having some extra time, so expect review soon.
I'm wondering if I should write a DEP about this change. I am not sure how well DEPs work for this kind of change. I guess I have to try writing one to find that out.
comment:42 by , 10 years ago
Triage Stage: | Ready for checkin → Accepted |
---|---|
Version: | → master |
comment:43 by , 10 years ago
I was about to post to django-dev about this when I discovered it was already being worked on. Very pleased about this. Having an API for commonly used SQL expressions, available anywhere in the query, will really open up the door for flexible queries. I think this could even pave the way to deprecating .extra()
:-)
I've posted some use cases that I'm hoping will be covered by this: https://gist.github.com/bendavis78/865611a9e36c0897cb00 . Would any of these be difficult to implement with the addition of this patch?
comment:44 by , 10 years ago
All of those scenarios in your gist will be covered, yes. There is going to be some overlap with the new Lookups/Transforms API and this patch, but I *think* we can also get them to work together, since they both support the Query Expression API. In particular, your datepart example could be:
.annotate(month=F('the_date__month')) # or .annotate(month=Extract('the_date', MONTH))
Of course this requires implementing the Extract() expression, and registering it with the DateField transform. Also, F() expressions don't yet support the transformlookup syntax, but that'd be the next thing I'd implement once (if) this patch is committed.
The other scenarios you bring up are a lot simpler:
class Coalesce(Func): function = 'COALESCE'
comment:45 by , 10 years ago
Cc: | added |
---|
comment:46 by , 10 years ago
Cc: | added |
---|---|
Patch needs improvement: | set |
comment:47 by , 10 years ago
The problem stems from the following block inside compiler.results_iter()
if has_annotation_select: loaded_fields = self.query.get_loaded_field_names().get(self.query.model, set()) or self.query.select annotation_start = len(self.query.extra_select) + len(loaded_fields) annotation_end = annotation_start + len(self.query.annotation_select)
For certain queries, loaded_fields ends up being empty. The set of fields appears to be fixed underneath the if resolve_columns
block, but we still have a situation where annotation_start and end are effectively wrong. Firstly, is the lack of a populated query.select already broken? I don't think it is, because we branch off the truthiness in resolve_columns. I think there is a bug here (expecting loaded_fields to always be filled) that we've been lucky hasn't bitten us elsewhere.
What I'd like to do is completely remove these annotation_start/end variables, and just add the output_type of the annotation to the set of fields. Is there a particular reason that we aren't including the output_type of the aggregate in the field list? It seems that backends should have the opportunity to use resolve_columns to also inspect aggregates. The only issue with that is I believe this is truly a bug (not respecting loaded_fields), and should maybe be patched in a separate ticket. Thoughts? I'll try and put together a patch on this branch to make sure it's fixed, and we can extract it out if necessary.
comment:48 by , 10 years ago
I implemented a fix and a test that should solve the column alignment issue. Still unsure if this should be part of a separate patch or not, but the specific commit is: https://github.com/jarshwah/django/commit/330ef1213dfee5c92128d764722b9c75b43e9a5b
If someone a little more knowledgable than me can check this out and validate that my assumptions are correct, that'd be great. Annotations and aggregates now include their output_field in the list of fields, so there's an opportunity for backends implementing resolve_columns to parse annotations. I don't think this was such a big deal before as aggregates focused on a limited set of model fields - but annotations can deal with all model field types.
comment:49 by , 10 years ago
Patch needs improvement: | unset |
---|
comment:50 by , 10 years ago
Owner: | changed from | to
---|
comment:51 by , 10 years ago
Cc: | added |
---|
comment:52 by , 10 years ago
Cc: | added |
---|
comment:53 by , 10 years ago
Resolution: | → fixed |
---|---|
Status: | assigned → closed |
comment:54 by , 10 years ago
Resolution: | fixed |
---|---|
Status: | closed → new |
This patch appears to cause a regression in aggregations on DecimalFields. The following exception is raised if, for example, a Sum() aggregation on the field results in more digits than the field itself can store (a perfectly normal situation).
Traceback (most recent call last): File "/home/.../django/django/core/handlers/base.py", line 131, in get_response response = wrapped_callback(request, *callback_args, **callback_kwargs) File "/usr/lib/python3.4/contextlib.py", line 30, in inner return func(*args, **kwds) File "/home/.../main/staff.py", line 24, in func return view(request, employee, *args, **kwargs) File "/home/.../main/staff.py", line 62, in payslip ytd_gross_income = ytd_payslips.aggregate(total=Sum('gross_income'))['total'] File "/home/.../django/django/db/models/query.py", line 342, in aggregate return query.get_aggregation(self.db, kwargs.keys()) File "/home/.../django/django/db/models/sql/query.py", line 422, in get_aggregation result = compiler.apply_converters(result, converters) File "/home/.../django/django/db/models/sql/compiler.py", line 698, in apply_converters value = converter(value, self.connection) File "/home/.../django/django/db/models/expressions.py", line 258, in convert_value return backend_utils.typecast_decimal(field.format_number(value)) File "/home/.../django/django/db/models/fields/__init__.py", line 1556, in format_number return utils.format_number(value, self.max_digits, self.decimal_places) File "/home/.../django/django/db/backends/utils.py", line 197, in format_number return "{0:f}".format(value.quantize(decimal.Decimal(".1") ** decimal_places, context=context)) File "/usr/lib/python3.4/decimal.py", line 2580, in quantize 'quantize result has too many digits for current context') File "/usr/lib/python3.4/decimal.py", line 4050, in _raise_error raise error(explanation) decimal.InvalidOperation: quantize result has too many digits for current context
comment:55 by , 10 years ago
Yeah I can see the problem. As a workaround, you could manually set the output_field yourself:
Sum('gross_income', output_field=DecimalField(max_digits=X, decimal_places=Y)
I'll think on how to fix this so that a workaround isn't necessary. Should probably bump this to a release blocker, but I'm not sure whether to open a new ticket for the regression or to track it here just yet.
comment:57 by , 10 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
I've created a new ticket to track this: #23941
No reason it shouldn't be *possible*, but there's the minor issue of implementing it :-)
There's a related problem hidden in this; annotating extra columns that aren't aggregates. Although the example here is trying to get the aggregate sum of field1-field2, you might also want to annotate field1-field2 as an extra column on a model. This sort of computed column requires the same sort of internal mechanics, except that it would result in an extra alias, rather than being buried in a SUM clause.