Opened 5 years ago

Closed 9 months ago

Last modified 7 months ago

#14030 closed New feature (fixed)

Use F() objects in aggregates(), annotates() and values()

Reported by: delfick Owned by: jarshwah
Component: Database layer (models, ORM) Version: master
Severity: Normal Keywords: aggregate, annotate
Cc: robin, joh, cg@…, miloslav.pojman@…, taylor.mitchell@…, michael@…, romainr, janos@…, michal.modzelewski@…, yoyoma, nkryptic@…, flo@…, titanjer@…, tonnzor, josh.smeaton@…, jorgecarleitao@…, mmanfre@…, jbinary, info+coding@… Triage Stage: Accepted
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

Hello,

Would it be possible to be able to use F() objects in aggregate queries ?
(and others such as annotate() and values())

So I can do something like
myModel.objects.aggregate(Sum(F('field1') - F('field2')))

Attachments (2)

14030.patch (19.8 KB) - added by nate_b 4 years ago.
14030-2.patch (20.1 KB) - added by nate_b 4 years ago.
Removed "aggregate that in fact is not an aggregate" hack

Download all attachments as: .zip

Change History (69)

comment:1 Changed 5 years ago by thejaswi_puthraya

  • Component changed from Uncategorized to ORM aggregation
  • Needs documentation unset
  • Needs tests unset
  • Owner nobody deleted
  • Patch needs improvement unset

comment:2 Changed 5 years ago by russellm

  • Triage Stage changed from Unreviewed to Accepted

No reason it shouldn't be *possible*, but there's the minor issue of implementing it :-)

There's a related problem hidden in this; annotating extra columns that aren't aggregates. Although the example here is trying to get the aggregate sum of field1-field2, you might also want to annotate field1-field2 as an extra column on a model. This sort of computed column requires the same sort of internal mechanics, except that it would result in an extra alias, rather than being buried in a SUM clause.

comment:3 Changed 5 years ago by robin

  • Cc robin added

comment:4 Changed 5 years ago by joh

  • Cc joh added

comment:5 Changed 5 years ago by cg@…

  • Cc cg@… added

comment:6 Changed 4 years ago by mila

  • Cc miloslav.pojman@… added

comment:7 Changed 4 years ago by julien

  • Severity set to Normal
  • Type set to New feature

comment:8 Changed 4 years ago by tmitchell

  • Cc taylor.mitchell@… added
  • Easy pickings unset
  • UI/UX unset

comment:9 Changed 4 years ago by ramiro

#10972 and #16775 were closed as duplicates of this ticket.

comment:10 Changed 4 years ago by akaariai

For the record, the patch in "Support for Conditional Aggregates" (#11305) implements this for .aggregate() and .annotate().

comment:11 Changed 4 years ago by miracle2k

  • Cc michael@… added

comment:12 Changed 4 years ago by romainr

  • Cc romainr added

Changed 4 years ago by nate_b

comment:13 Changed 4 years ago by nate_b

  • Has patch set
  • Owner set to nate_b
  • Status changed from new to assigned

This patch rebases out the changes from akaariai's patch to #11305 necessary to use F() objects inside aggregation functions, as well as adding the functionality required for them to be used directly as a kwarg to aggregate and annotate.

Just as a note: if accepted, a very clean patch can be made from the difference between this and akaariai's #11305 patch.

comment:14 Changed 4 years ago by janos

  • Cc janos@… added

comment:15 Changed 4 years ago by akaariai

  • Patch needs improvement set
  • Version 1.2 deleted

I am wondering about the hack in the patch: that is, adding an empty Aggregate over the given F expression. I guess the reason is to make this work:

>>> q = Entry.objects.annotate(cpb_ratio=F('n_comments')*1.0/F('n_pingbacks')) 

Now, I have been wondering about something similar. In fact, I have been thinking of creating expressions you can inject into the query by .annotate().

caseexpr = RawSQL("case when %s > 0 then 'has '||%s||' comments' else 'no comments' end",
                  params=(F('n_comments'), F('n_comments')))

And you could then do

>>> q = Entry.objects.annotate(has_comments=caseexpr).order_by('has_comments') 
(or, in my dreams the .order_by parameter could be .order_by(RawSQL('%s asc nulls last', params=(F('has_comments'),))

In effect, you could get rid of most .extra() uses by the above annotations. I have some reason to believe that the above changes, if implemented correctly, would make the ORM both faster and easier to understand. Of course, without an actual patch it is easy to claim that... :)

Now, my point here is that I really like the ability to use annotations for other expressions than aggregates. But I really, really do not like the idea of adding an empty aggregate that in fact is not an aggregate. I bet that will break under some complex query, because Django thinks you have an aggregate in the query, but in SQL you do not have one. So, +1 on the idea, -1 on the current implementation.

BTW: I am clearing the version field, from the documentation it seems the version field "Finally, it is possible to use the version attribute to indicate in which version the reported bug was identified.". Now, I read that as if there is no bug, there should not be any version either. Learning this as I write, so please correct if I am wrong.

Changed 4 years ago by nate_b

Removed "aggregate that in fact is not an aggregate" hack

comment:16 Changed 4 years ago by nate_b

Admittedly, that was a bit of a lazy work around. I think my second patch is an improvement, and I'd love your feedback. My approach this time was instead of squishing the F() object into an Aggregate, I loosened the requirements of the query object so that it could directly handle a SQLEvaluator.

comment:17 Changed 3 years ago by nate_b

The patches posted here are insufficient to effectively solve this problem.

I have come up with a better solution: refactoring of Aggregate into ExpressionNode.

I have uploaded my branch; it can be found here:

https://github.com/NateBragg/django/tree/14030

Some particular points of note:

  • I tried to preserve as much interface as possible; I didn't know how much was considered to be more public, so generally I tried to add rather than subtract. However, I did remove a couple things - if you see something missing that shouldn't be, let me know.
  • Currently, I'm getting the entire test suite passed on sqlite, postgres, mysql, oracle, and postgis. I was unable to test on oracle spatial - any help with that would be appreciated.
  • When fields are combined, they are coerced to a common type; IntegerFields are coerced to FloatFields, which are coerced into DecimalFields as needed. Any other kinds of combinations must be of the same types. Also, when coerced to a DecimalField, the precision is limited by the original DecimalField. If this is not correct, or other coercions should be added, I'd like to correct that.
  • When joins are required, they tend to be LEFT OUTER; I'd like some feedback on this, as I'm not 100% sure its always the best behavior.
  • As the aggregates are a little more complicated, on trivial cases there is a minor reduction in performance; using djangobench, I measured somewhere between a 3% and 8% increase in runtime.
  • I don't have enough tests - mostly for a lack of creativity. What kind of composed aggregates and expressions would you like to see? I'll add tests for them.
Last edited 3 years ago by nate_b (previous) (diff)

comment:18 Changed 3 years ago by michalm

  • Cc michal.modzelewski@… added

comment:19 Changed 3 years ago by nate_b

Now that the Django repo has been moved to github, I have made a pull request of this patch:

https://github.com/django/django/pull/46

comment:20 Changed 3 years ago by airstrike

FYI, I've updated the patch in #11305 to apply cleanly to trunk, but I'm not sure how it would conflict with Nate's branch. Just a heads up!

comment:21 Changed 3 years ago by akaariai

I closed the above pull request because I wanted to first clean up the ORM, then add new stuff. The ORM hasn't been "cleaned up" even if some work has been done.

I now feel it would be a good idea to just work this into master. This should clean up the implementation of aggregates a bit. If we want to refactor other parts of the ORM we can do this even after this gets merged.

I don't expect to get this actually into master anytime soon. Too much to do, too little time. But, I do intend to work on this when time permits.

comment:22 Changed 3 years ago by yoyoma

  • Cc yoyoma added

comment:23 Changed 3 years ago by nkryptic

  • Cc nkryptic@… added

comment:24 Changed 3 years ago by akaariai

  • Component changed from ORM aggregation to Database layer (models, ORM)

comment:25 Changed 2 years ago by fhahn

  • Keywords aggregate annotate added

comment:26 Changed 2 years ago by fhahn

  • Cc flo@… added

comment:27 Changed 2 years ago by titanjer@…

  • Cc titanjer@… added

comment:28 Changed 2 years ago by akaariai

I have mostly rebased the patch to current master, available from https://github.com/akaariai/django/commit/5e37ecc64610c0582a9c99bf4c276351c528262f

The current code doesn't seem good to me. The biggest problem is in the existing F() <-> SQLEvaluator interaction which isn't too clean (but works OK). Add in aggregates and the code is hard to understand. I don't know if there is a possibility to refactor the whole SQLEvaluator <-> F() structure. It just doesn't feel right currently, but on the other hand I don't know what the right structure is. Maybe carefully inspecting why things are done in the way they are done is the right approach.

As for the current patch - the result type coersion in sql/expressions.py feels scary. It does work for the very limited tests in the patch, but I am pretty sure that it is broken for many use cases. Also, the current patch will make defining custom aggregates very hard, this needs improvement. Still, checking if a node contains aggregates isn't clear currently.

The code isn't ready for commit, even if it manages to pass all tests on sqlite. PostgreSQL passes transactions and expressions tests and postgis passes all gis tests.

comment:29 Changed 2 years ago by fhahn

I'm currently looking into the aggregation stuff, because I want to write a proposal for the "Improve annotation and aggregation" GSOC 2013 idea. I'll post a few things that came to my mind when looking at this issue.

I'd make django.db.models.aggregates.Aggregate a subclass of ExpressionNode as well, but I wouldn't move the SQL creation for the aggregate functions in django.db.backends.BaseDatabaseOperations.combine_expression (not like the patch), because this would make adding custom aggregate functions really hard, because all sql functions must be part of a dict in django.db.backends.BaseDatabaseOperations. When django.db.models.aggregates.Aggregate is a subclass of expression node, it's possible to make django.db.models.sql.aggregates.Aggregate a subclass of SQLEvaluator and remove some duplicated code (as shown in your patch).

I'd rather update BaseDatabaseOperations.combine_expressions to check if the current connector is a aggregate function and generate the SQL accordingly. I think the Aggregate class should be the only place that has to know about the SQL code of the specific function, maybe through a as_sql method or something similar. In order to make it easier to write custom aggregate classes, we could add some callbacks to the Aggregate class to generate SQL for different backends (the existing aggregate functions are the same for all backends, but there are some (like string manipulation functions) that are different on some backends), instead of handling the custom SQL generation in the database backend (like for the mathematical operations, see https://github.com/django/django/blob/master/django/db/backends/oracle/base.py#L436). This would allow cross platform custom aggregations without modifying any backend code, but on the other hand this would be different from the way all the database specific SQL generation is handled at the moment, so I'm not sure about that.

To check if the connector is a aggreate I'd pass the whole node object to combine_expressions instead of the connector (string), to use isinstance to check if the node is a aggregate function. This way, it would keep the aggregation handling more extensible.

After fixing this issue I'd like to tackle #11305 (Support for "Conditional Aggregates"). When Aggregate is a subclass of ExpressionNode, it shouldn't be too hard to implement nesting of aggregates and this would allow adding new aggregate functions like an If() that could be used inside any existing aggregate, akaariai's example in the discussion about #14030 would look like:

# calculate how much younger the author's friends are on average 
# (only those friends included that are actually younger than the author)
vals = Author.objects.aggregate(
    friends_younger_avg=Avg(If(condition=Q(friends__age__lt=F('age')), true=F('age') - F('friends__age'))
)

I think this would provide a flexible API that could be used for a lot of cases, because it could be combined with all other aggregate functions.

I'm quite new to contributing to Django, so I apologize for any misjudgement!

comment:30 Changed 22 months ago by smeatonj

I'm wondering if there's been any traction on this ticket? It appeared like akaariai was looking into it, but after rebasing it hasn't been touched. I can't find a GSOC 2013 proposal linked to this ticket, so I'm unsure if the last update resulted in anything either. This is a ticket I'm deeply interested in, and wouldn't mind investing some time into it myself.

Would it be better to pick up the current branch and work on some of the issues that have been raised ( F() <-> SQLEvaluator ) etc? Or would it perhaps be a better idea to tackle this from the current master, and start anew? I guess I'm looking for a little direction before jumping into it.

comment:31 Changed 20 months ago by smeatonj

  • Needs documentation set

I've made an attempt at refactoring Aggregates into ExpressionNode as nateb originally did. The points brought up by @akaariai were mostly addressed:

  • The biggest problem is in the existing F() <-> SQLEvaluator interaction

Addressed by removing the SQLEvaluator structure, and allowing the Expression to evaluate itself. Backends can customise the sql generation using the as_{vendor} approach, introduced in the latest lookups refactor.

  • the result type coersion in sql/expressions.py feels scary

Addressed by introducing an output_type param, which is required for mixed-type expressions.

Discussion on this particular attempt is on the mailing list: https://groups.google.com/forum/#!topic/django-developers/8vEAwSwJGMc

The incomplete pull request is: https://github.com/django/django/pull/2184

comment:32 Changed 20 months ago by tonnzor

  • Cc tonnzor added

I would add one more use case here (as core devs proposed in #21738):

User.objects.annotate(new_is_active=SQL('CASE WHEN id > 100 THEN 1 ELSE 0 END')).update(is_active=models.F('new_is_active'))

comment:33 Changed 19 months ago by anonymous

  • Needs tests set

comment:34 Changed 19 months ago by anonymous

  • Needs tests unset

comment:35 Changed 19 months ago by smeatonj

Support for non-aggregate annotations has been partially implemented on-top of the aggregates refactor, details (and diff) available here: https://github.com/django/django/pull/2184

comment:36 Changed 19 months ago by smeatonj

  • Cc josh.smeaton@… added

comment:37 Changed 18 months ago by smeatonj

  • Owner changed from nate_b to smeatonj
  • Patch needs improvement unset

comment:38 Changed 18 months ago by smeatonj

Quick note - the implementation above covers ticket #20930. If accepted, both tickets can be closed.

comment:39 Changed 17 months ago by smeatonj

https://github.com/django/django/pull/2496

Final PR implementing non-aggregate annotations and aggregate arithmetic

comment:40 Changed 17 months ago by smeatonj

  • Needs documentation unset
  • Triage Stage changed from Accepted to Ready for checkin

comment:41 Changed 16 months ago by akaariai

I am finally having some extra time, so expect review soon.

I'm wondering if I should write a DEP about this change. I am not sure how well DEPs work for this kind of change. I guess I have to try writing one to find that out.

comment:42 Changed 16 months ago by timo

  • Triage Stage changed from Ready for checkin to Accepted
  • Version set to master

comment:43 Changed 15 months ago by bendavis78

I was about to post to django-dev about this when I discovered it was already being worked on. Very pleased about this. Having an API for commonly used SQL expressions, available anywhere in the query, will really open up the door for flexible queries. I think this could even pave the way to deprecating .extra() :-)

I've posted some use cases that I'm hoping will be covered by this: https://gist.github.com/bendavis78/865611a9e36c0897cb00 . Would any of these be difficult to implement with the addition of this patch?

comment:44 Changed 15 months ago by smeatonj

All of those scenarios in your gist will be covered, yes. There is going to be some overlap with the new Lookups/Transforms API and this patch, but I *think* we can also get them to work together, since they both support the Query Expression API. In particular, your datepart example could be:

.annotate(month=F('the_date__month')) 
# or 
.annotate(month=Extract('the_date', MONTH))

Of course this requires implementing the Extract() expression, and registering it with the DateField transform. Also, F() expressions don't yet support the transformlookup syntax, but that'd be the next thing I'd implement once (if) this patch is committed.

The other scenarios you bring up are a lot simpler:

class Coalesce(Func):
    function = 'COALESCE'

comment:45 Changed 15 months ago by jorgecarleitao

  • Cc jorgecarleitao@… added

comment:46 Changed 15 months ago by manfre

  • Cc mmanfre@… added
  • Patch needs improvement set

Patch introduces a field alignment issue for resolve_columns. This is similar to #21126 and #21203.

comment:47 Changed 15 months ago by smeatonj

The problem stems from the following block inside compiler.results_iter()

                if has_annotation_select:
                    loaded_fields = self.query.get_loaded_field_names().get(self.query.model, set()) or self.query.select
                    annotation_start = len(self.query.extra_select) + len(loaded_fields)
                    annotation_end = annotation_start + len(self.query.annotation_select)

For certain queries, loaded_fields ends up being empty. The set of fields appears to be fixed underneath the if resolve_columns block, but we still have a situation where annotation_start and end are effectively wrong. Firstly, is the lack of a populated query.select already broken? I don't think it is, because we branch off the truthiness in resolve_columns. I think there is a bug here (expecting loaded_fields to always be filled) that we've been lucky hasn't bitten us elsewhere.

What I'd like to do is completely remove these annotation_start/end variables, and just add the output_type of the annotation to the set of fields. Is there a particular reason that we aren't including the output_type of the aggregate in the field list? It seems that backends should have the opportunity to use resolve_columns to also inspect aggregates. The only issue with that is I believe this is truly a bug (not respecting loaded_fields), and should maybe be patched in a separate ticket. Thoughts? I'll try and put together a patch on this branch to make sure it's fixed, and we can extract it out if necessary.

comment:48 Changed 15 months ago by smeatonj

I implemented a fix and a test that should solve the column alignment issue. Still unsure if this should be part of a separate patch or not, but the specific commit is: https://github.com/jarshwah/django/commit/330ef1213dfee5c92128d764722b9c75b43e9a5b

If someone a little more knowledgable than me can check this out and validate that my assumptions are correct, that'd be great. Annotations and aggregates now include their output_field in the list of fields, so there's an opportunity for backends implementing resolve_columns to parse annotations. I don't think this was such a big deal before as aggregates focused on a limited set of model fields - but annotations can deal with all model field types.

comment:49 Changed 13 months ago by smeatonj

  • Patch needs improvement unset

comment:50 Changed 13 months ago by jarshwah

  • Owner changed from smeatonj to jarshwah

comment:51 Changed 12 months ago by jbinary

  • Cc jbinary added

comment:52 Changed 12 months ago by Markush2010

  • Cc info+coding@… added

comment:53 Changed 10 months ago by Marc Tamlyn <marc.tamlyn@…>

  • Resolution set to fixed
  • Status changed from assigned to closed

In f59fd15c4928caf3dfcbd50f6ab47be409a43b01:

Fixed #14030 -- Allowed annotations to accept all expressions

comment:54 Changed 9 months ago by jbg

  • Resolution fixed deleted
  • Status changed from closed to new

This patch appears to cause a regression in aggregations on DecimalFields. The following exception is raised if, for example, a Sum() aggregation on the field results in more digits than the field itself can store (a perfectly normal situation).

Traceback (most recent call last):
  File "/home/.../django/django/core/handlers/base.py", line 131, in get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/usr/lib/python3.4/contextlib.py", line 30, in inner
    return func(*args, **kwds)
  File "/home/.../main/staff.py", line 24, in func
    return view(request, employee, *args, **kwargs)
  File "/home/.../main/staff.py", line 62, in payslip
    ytd_gross_income = ytd_payslips.aggregate(total=Sum('gross_income'))['total']
  File "/home/.../django/django/db/models/query.py", line 342, in aggregate
    return query.get_aggregation(self.db, kwargs.keys())
  File "/home/.../django/django/db/models/sql/query.py", line 422, in get_aggregation
    result = compiler.apply_converters(result, converters)
  File "/home/.../django/django/db/models/sql/compiler.py", line 698, in apply_converters
    value = converter(value, self.connection)
  File "/home/.../django/django/db/models/expressions.py", line 258, in convert_value
    return backend_utils.typecast_decimal(field.format_number(value))
  File "/home/.../django/django/db/models/fields/__init__.py", line 1556, in format_number
    return utils.format_number(value, self.max_digits, self.decimal_places)
  File "/home/.../django/django/db/backends/utils.py", line 197, in format_number
    return "{0:f}".format(value.quantize(decimal.Decimal(".1") ** decimal_places, context=context))
  File "/usr/lib/python3.4/decimal.py", line 2580, in quantize
    'quantize result has too many digits for current context')
  File "/usr/lib/python3.4/decimal.py", line 4050, in _raise_error
    raise error(explanation)
decimal.InvalidOperation: quantize result has too many digits for current context

comment:55 Changed 9 months ago by jarshwah

Yeah I can see the problem. As a workaround, you could manually set the output_field yourself:

Sum('gross_income', output_field=DecimalField(max_digits=X, decimal_places=Y)

I'll think on how to fix this so that a workaround isn't necessary. Should probably bump this to a release blocker, but I'm not sure whether to open a new ticket for the regression or to track it here just yet.

Last edited 9 months ago by jarshwah (previous) (diff)

comment:56 Changed 9 months ago by akaariai

New ticket referencing this ticket is the way to go.

comment:57 Changed 9 months ago by jarshwah

  • Resolution set to fixed
  • Status changed from new to closed

I've created a new ticket to track this: #23941

comment:58 Changed 7 months ago by Loic Bistuer <loic.bistuer@…>

In d450af8a2687ca2e90a8790eb567f9a25ebce85b:

Fixed small inconsistency when handling aggregate's default_alias.

Refs #14030.

comment:59 Changed 7 months ago by Loic Bistuer <loic.bistuer@…>

In 0580133971e42c322a9c0b451d3bbf72eff2ca24:

[1.8.x] Fixed small inconsistency when handling aggregate's default_alias.

Refs #14030.

Backport of d450af8a26 from master

comment:60 Changed 7 months ago by Josh Smeaton <josh.smeaton@…>

In 7171bf755b0c4be85ddbcc164eaf87164c131021:

Refs #14030 -- Added repr methods to all expressions

comment:61 Changed 7 months ago by Josh Smeaton <josh.smeaton@…>

In 14d0bd67d4bcf55f8a0a2b01433571a8b714121f:

Refs #14030 -- Renamed CombinableMixin to Combinable

Removed unused method and updated docstrings.

comment:62 Changed 7 months ago by Josh Smeaton <josh.smeaton@…>

In 6c68e40e6e7f3ba36fa0e629d5724c7f4b279bb8:

[1.8.x] Refs #14030 -- Added repr methods to all expressions

Backport of 7171bf755b0c4be85ddbcc164eaf87164c131021 from master

comment:63 Changed 7 months ago by Josh Smeaton <josh.smeaton@…>

In f858b51ee31314225e82812f58f353721f06101a:

[1.8.x] Refs #14030 -- Renamed CombinableMixin to Combinable

Removed unused method and updated docstrings.

Backport of 14d0bd67d4bcf55f8a0a2b01433571a8b714121f from master

comment:64 Changed 7 months ago by Josh Smeaton <josh.smeaton@…>

In e2d6e14662d780383e18066a3182155fb5b7747b:

Refs #14030 -- Improved expression support for python values

comment:65 Changed 7 months ago by Josh Smeaton <josh.smeaton@…>

In a6ea62aeafd4512f6d13aeda908f7622776a4537:

[1.8.x] Refs #14030 -- Improved expression support for python values

Backport of e2d6e14662d780383e18066a3182155fb5b7747b from master

comment:66 Changed 7 months ago by Josh Smeaton <josh.smeaton@…>

In bd4afef98490198e325654952d38a5735ee7e358:

Refs #14030 -- Added tests for Value aggregates

comment:67 Changed 7 months ago by Josh Smeaton <josh.smeaton@…>

In 47b23ca2ee4d26a50895adf0fba19020b27d8d84:

[1.8.x] Refs #14030 -- Added tests for Value aggregates

Backport of bd4afef98490198e325654952d38a5735ee7e358 from master

Note: See TracTickets for help on using tickets.
Back to Top