Context Navigation

← Previous Ticket
Next Ticket →

#14030 closed New feature (fixed)

Use F() objects in aggregates(), annotates() and values()

Reported by:	delfick	Owned by:	Josh Smeaton
Component:	Database layer (models, ORM)	Version:	dev
Severity:	Normal	Keywords:	aggregate, annotate
Cc:	Robin, Johannes Bornhold, cg@…, miloslav.pojman@…, taylor.mitchell@…, michael@…, romainr, janos@…, michal.modzelewski@…, Michael Angeletti, nkryptic@…, flo@…, titanjer@…, Artem Skoretskiy, josh.smeaton@…, jorgecarleitao@…, mmanfre@…, Sergey Dobrov, info+coding@…	Triage Stage:	Accepted
Has patch:	yes	Needs documentation:	no
Needs tests:	no	Patch needs improvement:	no
Easy pickings:	no	UI/UX:	no

Description

Hello,

Would it be possible to be able to use F() objects in aggregate queries ?
(and others such as annotate() and values())

So I can do something like
myModel.objects.aggregate(Sum(F('field1') - F('field2')))

Attachments (2)

14030.patch (19.8 KB ) - added by Nate Bragg 13 years ago.
14030-2.patch (20.1 KB ) - added by Nate Bragg 13 years ago.: Removed "aggregate that in fact is not an aggregate" hack

Download all attachments as: .zip

Change History (74)

comment:1 by Thejaswi Puthraya, 15 years ago

Component:	Uncategorized → ORM aggregation
Owner:	nobody removed

comment:2 by Russell Keith-Magee, 15 years ago

Triage Stage:	Unreviewed → Accepted

No reason it shouldn't be *possible*, but there's the minor issue of implementing it :-)

There's a related problem hidden in this; annotating extra columns that aren't aggregates. Although the example here is trying to get the aggregate sum of field1-field2, you might also want to annotate field1-field2 as an extra column on a model. This sort of computed column requires the same sort of internal mechanics, except that it would result in an extra alias, rather than being buried in a SUM clause.

comment:3 by Robin, 14 years ago

Cc:	Robin added

comment:4 by Johannes Bornhold, 14 years ago

Cc:	Johannes Bornhold added

comment:5 by cg@…, 14 years ago

Cc:	cg@… added

comment:6 by Miloslav Pojman, 14 years ago

Cc:	miloslav.pojman@… added

comment:7 by Julien Phalip, 14 years ago

Severity:	→ Normal
Type:	→ New feature

comment:8 by Taylor Mitchell, 14 years ago

Cc:	taylor.mitchell@… added
Easy pickings:	unset
UI/UX:	unset

comment:9 by Ramiro Morales, 14 years ago

#10972 and #16775 were closed as duplicates of this ticket.

comment:10 by Anssi Kääriäinen, 14 years ago

For the record, the patch in "Support for Conditional Aggregates" (#11305) implements this for .aggregate() and .annotate().

comment:11 by miracle2k, 14 years ago

Cc:	michael@… added

comment:12 by romainr, 14 years ago

Cc:	romainr added

by Nate Bragg, 13 years ago

Attachment:	14030.patch added

comment:13 by Nate Bragg, 13 years ago

Has patch:	set
Owner:	set to Nate Bragg
Status:	new → assigned

This patch rebases out the changes from akaariai's patch to #11305 necessary to use F() objects inside aggregation functions, as well as adding the functionality required for them to be used directly as a kwarg to aggregate and annotate.

Just as a note: if accepted, a very clean patch can be made from the difference between this and akaariai's #11305 patch.

comment:14 by Janoš, 13 years ago

Cc:	janos@… added

comment:15 by Anssi Kääriäinen, 13 years ago

Patch needs improvement:	set
Version:	1.2

I am wondering about the hack in the patch: that is, adding an empty Aggregate over the given F expression. I guess the reason is to make this work:

>>> q = Entry.objects.annotate(cpb_ratio=F('n_comments')*1.0/F('n_pingbacks'))

Now, I have been wondering about something similar. In fact, I have been thinking of creating expressions you can inject into the query by .annotate().

caseexpr = RawSQL("case when %s > 0 then 'has '||%s||' comments' else 'no comments' end",
                  params=(F('n_comments'), F('n_comments')))

And you could then do

>>> q = Entry.objects.annotate(has_comments=caseexpr).order_by('has_comments') 
(or, in my dreams the .order_by parameter could be .order_by(RawSQL('%s asc nulls last', params=(F('has_comments'),))

In effect, you could get rid of most .extra() uses by the above annotations. I have some reason to believe that the above changes, if implemented correctly, would make the ORM both faster and easier to understand. Of course, without an actual patch it is easy to claim that... :)

Now, my point here is that I really like the ability to use annotations for other expressions than aggregates. But I really, really do not like the idea of adding an empty aggregate that in fact is not an aggregate. I bet that will break under some complex query, because Django thinks you have an aggregate in the query, but in SQL you do not have one. So, +1 on the idea, -1 on the current implementation.

BTW: I am clearing the version field, from the documentation it seems the version field "Finally, it is possible to use the version attribute to indicate in which version the reported bug was identified.". Now, I read that as if there is no bug, there should not be any version either. Learning this as I write, so please correct if I am wrong.

by Nate Bragg, 13 years ago

Attachment:	14030-2.patch added

Removed "aggregate that in fact is not an aggregate" hack

comment:16 by Nate Bragg, 13 years ago

Admittedly, that was a bit of a lazy work around. I think my second patch is an improvement, and I'd love your feedback. My approach this time was instead of squishing the F() object into an Aggregate, I loosened the requirements of the query object so that it could directly handle a SQLEvaluator.

comment:17 by Nate Bragg, 13 years ago

The patches posted here are insufficient to effectively solve this problem.

I have come up with a better solution: refactoring of Aggregate into ExpressionNode.

I have uploaded my branch; it can be found here:

https://github.com/NateBragg/django/tree/14030

Some particular points of note:

I tried to preserve as much interface as possible; I didn't know how much was considered to be more public, so generally I tried to add rather than subtract. However, I did remove a couple things - if you see something missing that shouldn't be, let me know.
Currently, I'm getting the entire test suite passed on sqlite, postgres, mysql, oracle, and postgis. I was unable to test on oracle spatial - any help with that would be appreciated.
When fields are combined, they are coerced to a common type; IntegerFields are coerced to FloatFields, which are coerced into DecimalFields as needed. Any other kinds of combinations must be of the same types. Also, when coerced to a DecimalField, the precision is limited by the original DecimalField. If this is not correct, or other coercions should be added, I'd like to correct that.
When joins are required, they tend to be LEFT OUTER; I'd like some feedback on this, as I'm not 100% sure its always the best behavior.
As the aggregates are a little more complicated, on trivial cases there is a minor reduction in performance; using djangobench, I measured somewhere between a 3% and 8% increase in runtime.
I don't have enough tests - mostly for a lack of creativity. What kind of composed aggregates and expressions would you like to see? I'll add tests for them.

Last edited 13 years ago by Nate Bragg (previous) (diff)

comment:18 by Michał Modzelewski, 13 years ago

Cc:	michal.modzelewski@… added

comment:19 by Nate Bragg, 13 years ago

Now that the Django repo has been moved to github, I have made a pull request of this patch:

https://github.com/django/django/pull/46

comment:20 by Andy Terra, 13 years ago

FYI, I've updated the patch in #11305 to apply cleanly to trunk, but I'm not sure how it would conflict with Nate's branch. Just a heads up!

comment:21 by Anssi Kääriäinen, 13 years ago

I closed the above pull request because I wanted to first clean up the ORM, then add new stuff. The ORM hasn't been "cleaned up" even if some work has been done.

I now feel it would be a good idea to just work this into master. This should clean up the implementation of aggregates a bit. If we want to refactor other parts of the ORM we can do this even after this gets merged.

I don't expect to get this actually into master anytime soon. Too much to do, too little time. But, I do intend to work on this when time permits.

comment:22 by Michael Angeletti, 12 years ago

Cc:	Michael Angeletti added

comment:23 by nkryptic, 12 years ago

Cc:	nkryptic@… added

comment:24 by Anssi Kääriäinen, 12 years ago

Component:	ORM aggregation → Database layer (models, ORM)

comment:25 by fhahn, 12 years ago

Keywords:	aggregate annotate added

comment:26 by fhahn, 12 years ago

Cc:	flo@… added

comment:27 by titanjer@…, 12 years ago

Cc:	titanjer@… added

comment:28 by Anssi Kääriäinen, 12 years ago

I have mostly rebased the patch to current master, available from https://github.com/akaariai/django/commit/5e37ecc64610c0582a9c99bf4c276351c528262f

The current code doesn't seem good to me. The biggest problem is in the existing F() <-> SQLEvaluator interaction which isn't too clean (but works OK). Add in aggregates and the code is hard to understand. I don't know if there is a possibility to refactor the whole SQLEvaluator <-> F() structure. It just doesn't feel right currently, but on the other hand I don't know what the right structure is. Maybe carefully inspecting why things are done in the way they are done is the right approach.

As for the current patch - the result type coersion in sql/expressions.py feels scary. It does work for the very limited tests in the patch, but I am pretty sure that it is broken for many use cases. Also, the current patch will make defining custom aggregates very hard, this needs improvement. Still, checking if a node contains aggregates isn't clear currently.

The code isn't ready for commit, even if it manages to pass all tests on sqlite. PostgreSQL passes transactions and expressions tests and postgis passes all gis tests.

comment:29 by fhahn, 12 years ago

I'm currently looking into the aggregation stuff, because I want to write a proposal for the "Improve annotation and aggregation" GSOC 2013 idea. I'll post a few things that came to my mind when looking at this issue.

I'd make django.db.models.aggregates.Aggregate a subclass of ExpressionNode as well, but I wouldn't move the SQL creation for the aggregate functions in django.db.backends.BaseDatabaseOperations.combine_expression (not like the patch), because this would make adding custom aggregate functions really hard, because all sql functions must be part of a dict in django.db.backends.BaseDatabaseOperations. When django.db.models.aggregates.Aggregate is a subclass of expression node, it's possible to make django.db.models.sql.aggregates.Aggregate a subclass of SQLEvaluator and remove some duplicated code (as shown in your patch).

I'd rather update BaseDatabaseOperations.combine_expressions to check if the current connector is a aggregate function and generate the SQL accordingly. I think the Aggregate class should be the only place that has to know about the SQL code of the specific function, maybe through a as_sql method or something similar. In order to make it easier to write custom aggregate classes, we could add some callbacks to the Aggregate class to generate SQL for different backends (the existing aggregate functions are the same for all backends, but there are some (like string manipulation functions) that are different on some backends), instead of handling the custom SQL generation in the database backend (like for the mathematical operations, see https://github.com/django/django/blob/master/django/db/backends/oracle/base.py#L436). This would allow cross platform custom aggregations without modifying any backend code, but on the other hand this would be different from the way all the database specific SQL generation is handled at the moment, so I'm not sure about that.

To check if the connector is a aggreate I'd pass the whole node object to combine_expressions instead of the connector (string), to use isinstance to check if the node is a aggregate function. This way, it would keep the aggregation handling more extensible.

After fixing this issue I'd like to tackle #11305 (Support for "Conditional Aggregates"). When Aggregate is a subclass of ExpressionNode, it shouldn't be too hard to implement nesting of aggregates and this would allow adding new aggregate functions like an If() that could be used inside any existing aggregate, akaariai's example in the discussion about #14030 would look like:

# calculate how much younger the author's friends are on average 
# (only those friends included that are actually younger than the author)
vals = Author.objects.aggregate(
    friends_younger_avg=Avg(If(condition=Q(friends__age__lt=F('age')), true=F('age') - F('friends__age'))
)

I think this would provide a flexible API that could be used for a lot of cases, because it could be combined with all other aggregate functions.

I'm quite new to contributing to Django, so I apologize for any misjudgement!

comment:30 by jarshwah, 12 years ago

I'm wondering if there's been any traction on this ticket? It appeared like akaariai was looking into it, but after rebasing it hasn't been touched. I can't find a GSOC 2013 proposal linked to this ticket, so I'm unsure if the last update resulted in anything either. This is a ticket I'm deeply interested in, and wouldn't mind investing some time into it myself.

Would it be better to pick up the current branch and work on some of the issues that have been raised ( F() <-> SQLEvaluator ) etc? Or would it perhaps be a better idea to tackle this from the current master, and start anew? I guess I'm looking for a little direction before jumping into it.

comment:31 by jarshwah, 11 years ago

Needs documentation:	set

I've made an attempt at refactoring Aggregates into ExpressionNode as nateb originally did. The points brought up by @akaariai were mostly addressed:

The biggest problem is in the existing F() <-> SQLEvaluator interaction

Addressed by removing the SQLEvaluator structure, and allowing the Expression to evaluate itself. Backends can customise the sql generation using the as_{vendor} approach, introduced in the latest lookups refactor.

the result type coersion in sql/expressions.py feels scary

Addressed by introducing an output_type param, which is required for mixed-type expressions.

Discussion on this particular attempt is on the mailing list: https://groups.google.com/forum/#!topic/django-developers/8vEAwSwJGMc

The incomplete pull request is: https://github.com/django/django/pull/2184

comment:32 by Artem Skoretskiy, 11 years ago

Cc:	Artem Skoretskiy added

I would add one more use case here (as core devs proposed in #21738):

User.objects.annotate(new_is_active=SQL('CASE WHEN id > 100 THEN 1 ELSE 0 END')).update(is_active=models.F('new_is_active'))

comment:33 by anonymous, 11 years ago

Needs tests:	set

comment:34 by anonymous, 11 years ago

Needs tests:	unset

comment:35 by jarshwah, 11 years ago

Support for non-aggregate annotations has been partially implemented on-top of the aggregates refactor, details (and diff) available here: https://github.com/django/django/pull/2184

comment:36 by jarshwah, 11 years ago

Cc:	josh.smeaton@… added

comment:37 by jarshwah, 11 years ago

Owner:	changed from Nate Bragg to jarshwah
Patch needs improvement:	unset

comment:38 by jarshwah, 11 years ago

Quick note - the implementation above covers ticket #20930. If accepted, both tickets can be closed.

comment:39 by jarshwah, 11 years ago

https://github.com/django/django/pull/2496

Final PR implementing non-aggregate annotations and aggregate arithmetic

comment:40 by jarshwah, 11 years ago

Needs documentation:	unset
Triage Stage:	Accepted → Ready for checkin

comment:41 by Anssi Kääriäinen, 11 years ago

I am finally having some extra time, so expect review soon.

I'm wondering if I should write a DEP about this change. I am not sure how well DEPs work for this kind of change. I guess I have to try writing one to find that out.

comment:42 by Tim Graham, 11 years ago

Triage Stage:	Ready for checkin → Accepted
Version:	→ master

comment:43 by Ben Davis, 11 years ago

I was about to post to django-dev about this when I discovered it was already being worked on. Very pleased about this. Having an API for commonly used SQL expressions, available anywhere in the query, will really open up the door for flexible queries. I think this could even pave the way to deprecating .extra() :-)

I've posted some use cases that I'm hoping will be covered by this: https://gist.github.com/bendavis78/865611a9e36c0897cb00 . Would any of these be difficult to implement with the addition of this patch?

comment:44 by jarshwah, 11 years ago

All of those scenarios in your gist will be covered, yes. There is going to be some overlap with the new Lookups/Transforms API and this patch, but I *think* we can also get them to work together, since they both support the Query Expression API. In particular, your datepart example could be:

.annotate(month=F('the_date__month')) 
# or 
.annotate(month=Extract('the_date', MONTH))

Of course this requires implementing the Extract() expression, and registering it with the DateField transform. Also, F() expressions don't yet support the transformlookup syntax, but that'd be the next thing I'd implement once (if) this patch is committed.

The other scenarios you bring up are a lot simpler:

class Coalesce(Func):
    function = 'COALESCE'

comment:45 by jorgecarleitao, 11 years ago

Cc:	jorgecarleitao@… added

comment:46 by Michael Manfre, 11 years ago

Cc:	mmanfre@… added
Patch needs improvement:	set

Patch introduces a field alignment issue for resolve_columns. This is similar to #21126 and #21203.

comment:47 by jarshwah, 11 years ago

The problem stems from the following block inside compiler.results_iter()

                if has_annotation_select:
                    loaded_fields = self.query.get_loaded_field_names().get(self.query.model, set()) or self.query.select
                    annotation_start = len(self.query.extra_select) + len(loaded_fields)
                    annotation_end = annotation_start + len(self.query.annotation_select)

For certain queries, loaded_fields ends up being empty. The set of fields appears to be fixed underneath the if resolve_columns block, but we still have a situation where annotation_start and end are effectively wrong. Firstly, is the lack of a populated query.select already broken? I don't think it is, because we branch off the truthiness in resolve_columns. I think there is a bug here (expecting loaded_fields to always be filled) that we've been lucky hasn't bitten us elsewhere.

What I'd like to do is completely remove these annotation_start/end variables, and just add the output_type of the annotation to the set of fields. Is there a particular reason that we aren't including the output_type of the aggregate in the field list? It seems that backends should have the opportunity to use resolve_columns to also inspect aggregates. The only issue with that is I believe this is truly a bug (not respecting loaded_fields), and should maybe be patched in a separate ticket. Thoughts? I'll try and put together a patch on this branch to make sure it's fixed, and we can extract it out if necessary.

comment:48 by jarshwah, 11 years ago

I implemented a fix and a test that should solve the column alignment issue. Still unsure if this should be part of a separate patch or not, but the specific commit is: https://github.com/jarshwah/django/commit/330ef1213dfee5c92128d764722b9c75b43e9a5b

If someone a little more knowledgable than me can check this out and validate that my assumptions are correct, that'd be great. Annotations and aggregates now include their output_field in the list of fields, so there's an opportunity for backends implementing resolve_columns to parse annotations. I don't think this was such a big deal before as aggregates focused on a limited set of model fields - but annotations can deal with all model field types.

comment:49 by jarshwah, 11 years ago

Patch needs improvement:	unset

comment:50 by Josh Smeaton, 11 years ago

Owner:	changed from jarshwah to Josh Smeaton

comment:51 by Sergey Dobrov, 11 years ago

Cc:	Sergey Dobrov added

comment:52 by Markus Holtermann, 11 years ago

Cc:	info+coding@… added

comment:53 by Marc Tamlyn <marc.tamlyn@…>, 11 years ago

Resolution:	→ fixed
Status:	assigned → closed

In f59fd15c4928caf3dfcbd50f6ab47be409a43b01:

Fixed #14030 -- Allowed annotations to accept all expressions

comment:54 by Jasper Bryant-Greene, 11 years ago

Resolution:	fixed
Status:	closed → new

This patch appears to cause a regression in aggregations on DecimalFields. The following exception is raised if, for example, a Sum() aggregation on the field results in more digits than the field itself can store (a perfectly normal situation).

Traceback (most recent call last):
  File "/home/.../django/django/core/handlers/base.py", line 131, in get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/usr/lib/python3.4/contextlib.py", line 30, in inner
    return func(*args, **kwds)
  File "/home/.../main/staff.py", line 24, in func
    return view(request, employee, *args, **kwargs)
  File "/home/.../main/staff.py", line 62, in payslip
    ytd_gross_income = ytd_payslips.aggregate(total=Sum('gross_income'))['total']
  File "/home/.../django/django/db/models/query.py", line 342, in aggregate
    return query.get_aggregation(self.db, kwargs.keys())
  File "/home/.../django/django/db/models/sql/query.py", line 422, in get_aggregation
    result = compiler.apply_converters(result, converters)
  File "/home/.../django/django/db/models/sql/compiler.py", line 698, in apply_converters
    value = converter(value, self.connection)
  File "/home/.../django/django/db/models/expressions.py", line 258, in convert_value
    return backend_utils.typecast_decimal(field.format_number(value))
  File "/home/.../django/django/db/models/fields/__init__.py", line 1556, in format_number
    return utils.format_number(value, self.max_digits, self.decimal_places)
  File "/home/.../django/django/db/backends/utils.py", line 197, in format_number
    return "{0:f}".format(value.quantize(decimal.Decimal(".1") ** decimal_places, context=context))
  File "/usr/lib/python3.4/decimal.py", line 2580, in quantize
    'quantize result has too many digits for current context')
  File "/usr/lib/python3.4/decimal.py", line 4050, in _raise_error
    raise error(explanation)
decimal.InvalidOperation: quantize result has too many digits for current context

comment:55 by Josh Smeaton, 11 years ago

Yeah I can see the problem. As a workaround, you could manually set the output_field yourself:

Sum('gross_income', output_field=DecimalField(max_digits=X, decimal_places=Y)

I'll think on how to fix this so that a workaround isn't necessary. Should probably bump this to a release blocker, but I'm not sure whether to open a new ticket for the regression or to track it here just yet.

Last edited 11 years ago by Josh Smeaton (previous) (diff)