Code

Opened 7 years ago

Closed 5 years ago

#3566 closed (fixed)

Proposal: ORM aggregation support

Reported by: Honza Král <Honza.Kral@…> Owned by: russellm
Component: Database layer (models, ORM) Version: master
Severity: Keywords: orm aggregation group by
Cc: nick.lane.au@…, jm.bugtracking@…, dan.fairs@…, lau@…, wmealing@…, frankie@…, alexander.solovyov@…, lurker86@…, densetsu.no.ero.sennin@…, arthur.case@…, remco@…, martin.paquette@…, nicolaslara@…, jarthur@…, simon@…, gajon@…, gonz@…, marc@…, john.syrinek@…, jiminy@…, nreilly@…, cortland@…, me@…, omat@…, kyle.fox@…, flosch@…, daevaorn@…, amlau@…, especkman@…, kristjan@…, robin@…, vbmendes@…, marinho, robin, wallenfe@… Triage Stage: Someday/Maybe
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: UI/UX:

Description

For origin and some discussion see http://groups.google.com/group/django-users/browse_frm/thread/fc58fbe409a6f098/1110c688e1af58d2

I would like to see an addition to the django ORM - many people has asked for aggregation. Do you think Django should have aggregation support?

I would like to add it in a similar fashion that ValuesQuerySet is implemented (a subclass to QuerySet with minor changes to how query is constructed and results are returned) - if we add a AgreggateQuerySet that would:

  • be create from simple QuerySet by a call to aggregate(group_by=(), max=(), min=(), avg=(), sum=(), count=False) , where:
    • group_by is a list of fields on which to aggregate, these will be part of the result, if empty the whole query will be grouped
    • If count, then COUNT(*) will be computed and included
    • sum, max, min and avg will contain a list of fields on which to call SUM(), MAX(), MIN() or AVG() respectively
  • will be a valid queryset
    • filter() will result in limiting the result by appending conditions to the HAVING clause etc.
    • aggregate() will probably add fields to individual aggregation functions.

queryset.count() would become identical to queryset.aggregate(count=True)['count']

Example:

>>> quseryset = Model.objects.all()
>>> queryset.aggregate( group_by=( 'name', 'city' ), sum=( 'pay', 'some_other_field' ), avg=( 'pay', 'age' ), count=True )
[
 {
   'group_by' : { 'name' : 'John Doe', 'city' : 'Prague', }
   'sum' : { 'pay' : 50000, 'some_other_field' : 10000 },
   'avg' : { 'pay' : 10000, 'age' : 30 },
   'count' : 5,
 },
 {
   'group_by' : { 'name' : 'Jane Doe', 'city' : 'Paris', }
   'sum' : { 'pay' : 300000, 'some_other_field' : 10 },
   'avg' : { 'pay' : 100000, 'age' : 32 },
   'count' : 3,
 },
]

There are a few problems I haven't thought through and would like comments:

  • how to deal with grouping by foreign key? the best seems to include the whole object in the result, not just the ID. (isn't it too magical?)
  • how to specify ordering?
  • how to best present the results? originally I thought that fields in group_by will be on the same level as min or max, but Tim Chase pointed out that it could clash if the model has a field named min .

I will start working on this as soon as I am sure about the interface. Any comments and help would be greatly appreciated.

Attachments (6)

aggregate.diff (4.9 KB) - added by Nicolas Lara <nicolaslara@…> 6 years ago.
Patch that adds aggregate functionality (no annotate yet)
aggregate.2.diff (8.6 KB) - added by nicolas 6 years ago.
aggregates.diff (63.9 KB) - added by nicolas 6 years ago.
Updated with docs
aggregates.patch (64.5 KB) - added by nicolas 6 years ago.
Removed a development comment. Sorry for the noise.
geoquery_aggregation_fix.diff (1.2 KB) - added by jbronn 6 years ago.
queryset_modular_aggregates.diff (5.4 KB) - added by jbronn 6 years ago.
Makes aggregation functionality more amenable to subclassing.

Download all attachments as: .zip

Change History (67)

comment:1 Changed 7 years ago by django.ticket@…

  • Needs documentation unset
  • Needs tests unset
  • Patch needs improvement unset

I would add that the resulting structure should be accessible within a template. Thus, one should be able to have in one's view

context = {

'stats': model.objects.aggregate(group_by=('field0',), sum=('field1', 'field2')),
'items': model.objects.all(),

}
return render_to_response('template.html', context)

one should be able to refer to the results in a template as

{{ stats.sum.field1 }}
{{ stats.group_by.field0 }}

just as one can refer to

{% for item in items %}
{{ item.field3 }}
{% endif %}

-tkc

comment:2 Changed 7 years ago by jacob

  • Summary changed from Proposal of ORM extension -- aggregationu to Proposal: ORM aggregation support

FWIW: I really like this proposal.

comment:3 Changed 7 years ago by Marc Fargas <telenieko@…>

  • Triage Stage changed from Unreviewed to Design decision needed

comment:4 Changed 7 years ago by nick.lane.au@…

  • Cc nick.lane.au@… added

comment:5 Changed 7 years ago by anonymous

  • Cc jm.bugtracking@… added

comment:6 follow-up: Changed 7 years ago by Ciantic

"how to deal with grouping by foreign key? the best seems to include the whole object in the result, not just the ID. (isn't it too magical?)"
Surely, aren't they lazy after all? So they will be not queried until you access them?

"how to specify ordering?"
How about queryset.aggregate( group_by=( 'name', 'city' ), sum=( 'pay', 'some_other_field' ), avg=( 'pay', 'age' ), count=True ).order_by(group_by='name', sum='-pay', avg='pay')
(- sign represents the reversed order as it represents it in normal querysets too)

"how to best present the results? originally I thought that fields in group_by will be on the same level as min or max, but Tim Chase pointed out that it could clash if the model has a field named min ."
I see no collision in current result.

comment:7 in reply to: ↑ 6 Changed 7 years ago by Honza Král <Honza.Kral@…>

Replying to Ciantic:

"how to deal with grouping by foreign key? the best seems to include the whole object in the result, not just the ID. (isn't it too magical?)"
Surely, aren't they lazy after all? So they will be not queried until you access them?

either it will be the whole object, or just an id, having a lazy object doesn't seem to have any effect here

"how to specify ordering?"
How about queryset.aggregate( group_by=( 'name', 'city' ), sum=( 'pay', 'some_other_field' ), avg=( 'pay', 'age' ), count=True ).order_by(group_by='name', sum='-pay', avg='pay')
(- sign represents the reversed order as it represents it in normal querysets too)

nice, but what field should be first in your example? name, SUM( pay ) or AVG( pay ) ??

I am sorry, I didn't have the time to advance with this, plus the queryset refactoring came around and that really needs to be done first.

comment:8 follow-up: Changed 7 years ago by Ciantic

either it will be the whole object, or just an id, having a lazy object doesn't seem to have any effect here

Returning just an ID is more harmful than returning object. When you return just the ID the people who needs the object is facing frustrating problem, fetching those objects externally. I suggest we use object, and those who need ID just uses the object.id to get it (notice: in django one shouldn't change the id fieldname, I've learnt, same way they shouldn't change the primarykey type... even if it is possible, there is stuff that simply does not work after changing it).

nice, but what field should be first in your example? name, SUM( pay ) or AVG( pay ) ??

name is first, then comes sub(pay) with descending in place and finally avg(pay), same way as in normal order_by, the leftmost has biggest priority.

comment:9 in reply to: ↑ 8 Changed 7 years ago by Honza Král <Honza.Kral@…>

Replying to Ciantic:

nice, but what field should be first in your example? name, SUM( pay ) or AVG( pay ) ??

name is first, then comes sub(pay) with descending in place and finally avg(pay), same way as in normal order_by, the leftmost has biggest priority.

and what if I want to sort on sum( pay ), name, sum( other_field ), that wouldn't work in your syntax, since order_by( sum='pay', group_by='name', sum='other_field' ) is a syntax error....

I would like to avoid things like order_by( 'sum(pay)', 'avg(other_field)', 'name' ), but so far its the only way I can think of...

comment:10 Changed 7 years ago by russellm

Earlier counterproposal here. Current discussion of these proposals
here

comment:11 Changed 7 years ago by anonymous

  • Cc dan.fairs@… added

comment:12 Changed 7 years ago by (removed)

  • Cc ferringb@… added

comment:13 Changed 7 years ago by anonymous

  • Cc lau@… added

comment:14 Changed 7 years ago by anonymous

  • Cc wmealing@… added

comment:15 Changed 7 years ago by anonymous

  • Cc frankie@… added

comment:16 Changed 7 years ago by Alexander Solovyov <alexander.solovyov@…>

  • Cc alexander.solovyov@… added

comment:17 Changed 7 years ago by russellm

  • Owner changed from adrian to russellm

comment:18 Changed 7 years ago by anonymous

  • Cc lurker86@… added

comment:19 Changed 7 years ago by Densetsu no Ero-sennin <densetsu.no.ero.sennin@…>

  • Cc densetsu.no.ero.sennin@… added

comment:20 Changed 7 years ago by russellm

  • Owner changed from nobody to russellm

For the record - I'm taking ownership of this ticket, although work probably won't commence until after the queryset refactor and newforms-admin land in trunk. There are still some design issues to sort out, and I want to make sure we have everyone's attention.

comment:21 Changed 7 years ago by russellm

  • Status changed from new to assigned

comment:22 Changed 7 years ago by arthur.case@…

  • Cc arthur.case@… added

comment:23 Changed 7 years ago by anonymous

  • Cc remco@… added

comment:24 Changed 7 years ago by jacob

  • Triage Stage changed from Design decision needed to Someday/Maybe

comment:25 Changed 6 years ago by anonymous

  • Cc martin.paquette@… added

Changed 6 years ago by Nicolas Lara <nicolaslara@…>

Patch that adds aggregate functionality (no annotate yet)

comment:26 Changed 6 years ago by Nicolas Lara <nicolaslara@…>

  • Cc nicolaslara@… added
  • Has patch set
  • Patch needs improvement set

I added a patch that provides the aggregate functionality as proposed in http://groups.google.com/group/django-developers/browse_thread/thread/3d4039161f6d2f5e/7485e59f449f9bf4?hl=en&
No anotate function yet. This is not a final implementation, just some of the functionality. The aggregate modifier already follows joins and group by functionality can be specified by using values on the QuerySet before calling aggregate(). Aggregate is a terminal modifier.
To-Do: test aggregate(), annotate(), AggregationQueryset? (like ValuesQuerySet, so it doesn't need to be called as the last modifier), un-aliased syntax, custom

Changed 6 years ago by nicolas

comment:27 Changed 6 years ago by nicolas

Finnally got some time to work on this. I added a new update to the patch. It is still very rough but implements both annotate and aggregate functionalities. Currently the modifiers take both *agrs and kwargs and generate the standard alias in the case of *args. When applying the modifiers to a ValuesQuerySet the field used for GROUP BY are restricted to the fields specified in 'values'. Annotate returns a new (Values)QuerySet and aggregate a dict. Currently there is a 'bug' in which I'd like to get feedback: when aggregating on both a field of the model and a foreign field the result of the aggregation on the 'local' field is 'wrong' (not the same as if aggregating only in local fields or only in foreign fields) this is because we are doing the aggregation in columns that are repeated in the SQL (due to the nature of joins). I am not sure if this should be prohibited (doing both foreign and local aggregates) because, though it provides some functionality, generates results that are unexpected and might lead to confusion. I am not going to touch on this for now as I think it needs discusion. I will continue to work in making the HAVING clause work when filtering on aggregated fields.

comment:28 Changed 6 years ago by russellm

For the benefit of those that haven't heard - ORM aggregation support has been accepted as a 2008 Google Summer of Code project. Nicolas Lara will be doing the heavy lifting, I (Russell Keith-Magee) will be mentoring.

comment:29 Changed 6 years ago by nicolas

  • Needs tests set

I've added a new patch with the latests implementation of aggregates so it is easier to test it, I'll try to keep this up to date every time a new feature is implemented.

comment:30 Changed 6 years ago by JustinTArthur

  • Cc jarthur@… added

comment:31 Changed 6 years ago by sfllaw

  • Cc simon@… added

Changed 6 years ago by nicolas

Updated with docs

comment:32 Changed 6 years ago by gajon

  • Cc gajon@… added

comment:33 Changed 6 years ago by gonz

  • Cc gonz@… added

comment:34 Changed 6 years ago by anonymous

  • Cc henry@… added

comment:35 Changed 6 years ago by marc

  • Cc marc@… added

comment:36 Changed 6 years ago by Andre

  • Cc andre.miras@… added

comment:37 Changed 6 years ago by Johntron

  • Cc john.syrinek@… added

Changed 6 years ago by nicolas

Removed a development comment. Sorry for the noise.

comment:38 Changed 6 years ago by Richard

  • Cc jiminy@… added

comment:39 Changed 6 years ago by (removed)

  • Cc ferringb@… removed

comment:40 Changed 6 years ago by anonymous

Possibility of creating a supported branch of this to test?

comment:41 Changed 6 years ago by anonymous

  • Cc nreilly@… added

comment:42 Changed 6 years ago by pixelcort

  • Cc cortland@…, me@… added

comment:43 Changed 6 years ago by omat

  • Cc omat@… added

comment:44 Changed 6 years ago by Andre

  • Cc andre.miras@… removed

comment:45 Changed 6 years ago by anonymous

  • Cc henry@… removed

comment:46 Changed 6 years ago by Kyle

  • Cc kyle.fox@… added

comment:47 Changed 6 years ago by flosch

  • Cc flosch@… added

comment:48 Changed 6 years ago by alexkoshelev

  • Cc daevaorn@… added

comment:49 Changed 6 years ago by russellm

Status update: this ticket is now pretty much guaranteed to be on the "must have" list for v1.1.

I've been working on this patch in my own private sandbox in preparation for committing to trunk. In the interests of public disclosure of my progress, I've just pushed my updates onto github.

Big, flashing light warning: While the basic shape of the API has been finalized, the code itself isn't final. Changes are expected before this reaches trunk. In particular:

  • Return types for aggregate columns are still in flux
  • The output format for ValuesQuerySet is still under development
  • The internal structure of Aggregate classes is subject to change
  • The handling of Having clauses will undergo some major changes.

There will probably be other changes. In short - consider yourself warned. This is a development branch. It is subject to change. It will occasionally break. Backwards compatibility is in no way guaranteed until this ticket is committed to the Django trunk.

comment:50 Changed 6 years ago by amlau

  • Cc amlau@… added

comment:51 Changed 6 years ago by eas

  • Cc especkman@… added

comment:52 Changed 6 years ago by anonymous

  • Cc kristjan@… added

comment:53 Changed 6 years ago by isometry

  • Cc robin@… added

comment:54 Changed 6 years ago by vbmendes

  • Cc vbmendes@… added

comment:55 Changed 6 years ago by anonymous

  • Cc marinho added

Changed 6 years ago by jbronn

Changed 6 years ago by jbronn

Makes aggregation functionality more amenable to subclassing.

comment:56 Changed 6 years ago by robin

  • Cc robin added

Will this account for grouping by date or time on a db column with datetime/timestamp type?

eg. 
1. select * from table group by year(timestamp)
2. select * from table group by hour(timestamp)

*note that the sql examples above probably does not work as proper sql.

comment:57 Changed 6 years ago by wallenfe

  • Cc wallenfe@… added

comment:58 Changed 6 years ago by russellm

  • Needs tests unset
  • Patch needs improvement unset

Call for final testing!

Aggregations are almost ready for trunk, so I would greatly appreciate any bug reports on the code as it currently stands. Pull a copy of the repository from my github repository; if you're not a git user, you can download a snapshot tarball from the github page.

comment:59 follow-up: Changed 6 years ago by underbluewaters

I would be great if this feature supported Standard Deviation to do some simple statistics. I'm currently doing this with plain sql and it is a hassle.

comment:60 in reply to: ↑ 59 Changed 6 years ago by russellm

Replying to underbluewaters:

I would be great if this feature supported Standard Deviation to do some simple statistics.

I've just updated the github repository with StdDev and Variance, including an optional flag to do sample stats instead of population stats. Be warned, however: SQlite doesn't provide these aggregates out of the box - you will need to install an SQLite extension module to get StdDev and Variance.

comment:61 Changed 5 years ago by russellm

  • Resolution set to fixed
  • Status changed from assigned to closed

(In [9742]) Fixed #3566 -- Added support for aggregation to the ORM. See the documentation for details on usage.

Many thanks to:

  • Nicolas Lara, who worked on this feature during the 2008 Google Summer of Code.
  • Alex Gaynor for his help debugging and fixing a number of issues.
  • Justin Bronn for his help integrating with contrib.gis.
  • Karen Tracey for her help with cross-platform testing.
  • Ian Kelly for his help testing and fixing Oracle support.
  • Malcolm Tredinnick for his invaluable review notes.

Add Comment

Modify Ticket

Change Properties
<Author field>
Action
as closed
as The resolution will be set. Next status will be 'closed'
The resolution will be deleted. Next status will be 'new'
Author


E-mail address and user name can be saved in the Preferences.

 
Note: See TracTickets for help on using tickets.