Opened 18 years ago

Closed 16 years ago

#3566 closed (fixed)

Proposal: ORM aggregation support

Reported by: Honza Král <Honza.Kral@…> Owned by: Russell Keith-Magee
Component: Database layer (models, ORM) Version: dev
Severity: Keywords: orm aggregation group by
Cc: nick.lane.au@…, jm.bugtracking@…, dan.fairs@…, lau@…, wmealing@…, frankie@…, alexander.solovyov@…, lurker86@…, densetsu.no.ero.sennin@…, arthur.case@…, remco@…, martin.paquette@…, nicolaslara@…, jarthur@…, simon@…, gajon@…, gonz@…, marc@…, john.syrinek@…, jiminy@…, nreilly@…, cortland@…, me@…, omat@…, kyle.fox@…, flosch@…, daevaorn@…, amlau@…, especkman@…, kristjan@…, robin@…, vbmendes@…, Marinho Brandão, Robin, wallenfe@… Triage Stage: Someday/Maybe
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

For origin and some discussion see http://groups.google.com/group/django-users/browse_frm/thread/fc58fbe409a6f098/1110c688e1af58d2

I would like to see an addition to the django ORM - many people has asked for aggregation. Do you think Django should have aggregation support?

I would like to add it in a similar fashion that ValuesQuerySet is implemented (a subclass to QuerySet with minor changes to how query is constructed and results are returned) - if we add a AgreggateQuerySet that would:

  • be create from simple QuerySet by a call to aggregate(group_by=(), max=(), min=(), avg=(), sum=(), count=False) , where:
    • group_by is a list of fields on which to aggregate, these will be part of the result, if empty the whole query will be grouped
    • If count, then COUNT(*) will be computed and included
    • sum, max, min and avg will contain a list of fields on which to call SUM(), MAX(), MIN() or AVG() respectively
  • will be a valid queryset
    • filter() will result in limiting the result by appending conditions to the HAVING clause etc.
    • aggregate() will probably add fields to individual aggregation functions.

queryset.count() would become identical to queryset.aggregate(count=True)['count']

Example:

>>> quseryset = Model.objects.all()
>>> queryset.aggregate( group_by=( 'name', 'city' ), sum=( 'pay', 'some_other_field' ), avg=( 'pay', 'age' ), count=True )
[
 {
   'group_by' : { 'name' : 'John Doe', 'city' : 'Prague', }
   'sum' : { 'pay' : 50000, 'some_other_field' : 10000 },
   'avg' : { 'pay' : 10000, 'age' : 30 },
   'count' : 5,
 },
 {
   'group_by' : { 'name' : 'Jane Doe', 'city' : 'Paris', }
   'sum' : { 'pay' : 300000, 'some_other_field' : 10 },
   'avg' : { 'pay' : 100000, 'age' : 32 },
   'count' : 3,
 },
]

There are a few problems I haven't thought through and would like comments:

  • how to deal with grouping by foreign key? the best seems to include the whole object in the result, not just the ID. (isn't it too magical?)
  • how to specify ordering?
  • how to best present the results? originally I thought that fields in group_by will be on the same level as min or max, but Tim Chase pointed out that it could clash if the model has a field named min .

I will start working on this as soon as I am sure about the interface. Any comments and help would be greatly appreciated.

Attachments (6)

aggregate.diff (4.9 KB ) - added by Nicolas Lara <nicolaslara@…> 17 years ago.
Patch that adds aggregate functionality (no annotate yet)
aggregate.2.diff (8.6 KB ) - added by Nicolas Lara 17 years ago.
aggregates.diff (63.9 KB ) - added by Nicolas Lara 16 years ago.
Updated with docs
aggregates.patch (64.5 KB ) - added by Nicolas Lara 16 years ago.
Removed a development comment. Sorry for the noise.
geoquery_aggregation_fix.diff (1.2 KB ) - added by jbronn 16 years ago.
queryset_modular_aggregates.diff (5.4 KB ) - added by jbronn 16 years ago.
Makes aggregation functionality more amenable to subclassing.

Download all attachments as: .zip

Change History (67)

comment:1 by django.ticket@…, 18 years ago

I would add that the resulting structure should be accessible within a template. Thus, one should be able to have in one's view

context = {

'stats': model.objects.aggregate(group_by=('field0',), sum=('field1', 'field2')),
'items': model.objects.all(),

}
return render_to_response('template.html', context)

one should be able to refer to the results in a template as

{{ stats.sum.field1 }}
{{ stats.group_by.field0 }}

just as one can refer to

{% for item in items %}
{{ item.field3 }}
{% endif %}

-tkc

comment:2 by Jacob, 18 years ago

Summary: Proposal of ORM extension -- aggregationuProposal: ORM aggregation support

FWIW: I really like this proposal.

comment:3 by Marc Fargas <telenieko@…>, 18 years ago

Triage Stage: UnreviewedDesign decision needed

comment:4 by nick.lane.au@…, 18 years ago

Cc: nick.lane.au@… added

comment:5 by anonymous, 18 years ago

Cc: jm.bugtracking@… added

comment:6 by Jari Pennanen, 18 years ago

"how to deal with grouping by foreign key? the best seems to include the whole object in the result, not just the ID. (isn't it too magical?)"
Surely, aren't they lazy after all? So they will be not queried until you access them?

"how to specify ordering?"
How about queryset.aggregate( group_by=( 'name', 'city' ), sum=( 'pay', 'some_other_field' ), avg=( 'pay', 'age' ), count=True ).order_by(group_by='name', sum='-pay', avg='pay')
(- sign represents the reversed order as it represents it in normal querysets too)

"how to best present the results? originally I thought that fields in group_by will be on the same level as min or max, but Tim Chase pointed out that it could clash if the model has a field named min ."
I see no collision in current result.

in reply to:  6 comment:7 by Honza Král <Honza.Kral@…>, 18 years ago

Replying to Ciantic:

"how to deal with grouping by foreign key? the best seems to include the whole object in the result, not just the ID. (isn't it too magical?)"
Surely, aren't they lazy after all? So they will be not queried until you access them?

either it will be the whole object, or just an id, having a lazy object doesn't seem to have any effect here

"how to specify ordering?"
How about queryset.aggregate( group_by=( 'name', 'city' ), sum=( 'pay', 'some_other_field' ), avg=( 'pay', 'age' ), count=True ).order_by(group_by='name', sum='-pay', avg='pay')
(- sign represents the reversed order as it represents it in normal querysets too)

nice, but what field should be first in your example? name, SUM( pay ) or AVG( pay ) ??

I am sorry, I didn't have the time to advance with this, plus the queryset refactoring came around and that really needs to be done first.

comment:8 by Jari Pennanen, 18 years ago

either it will be the whole object, or just an id, having a lazy object doesn't seem to have any effect here

Returning just an ID is more harmful than returning object. When you return just the ID the people who needs the object is facing frustrating problem, fetching those objects externally. I suggest we use object, and those who need ID just uses the object.id to get it (notice: in django one shouldn't change the id fieldname, I've learnt, same way they shouldn't change the primarykey type... even if it is possible, there is stuff that simply does not work after changing it).

nice, but what field should be first in your example? name, SUM( pay ) or AVG( pay ) ??

name is first, then comes sub(pay) with descending in place and finally avg(pay), same way as in normal order_by, the leftmost has biggest priority.

in reply to:  8 comment:9 by Honza Král <Honza.Kral@…>, 18 years ago

Replying to Ciantic:

nice, but what field should be first in your example? name, SUM( pay ) or AVG( pay ) ??

name is first, then comes sub(pay) with descending in place and finally avg(pay), same way as in normal order_by, the leftmost has biggest priority.

and what if I want to sort on sum( pay ), name, sum( other_field ), that wouldn't work in your syntax, since order_by( sum='pay', group_by='name', sum='other_field' ) is a syntax error....

I would like to avoid things like order_by( 'sum(pay)', 'avg(other_field)', 'name' ), but so far its the only way I can think of...

comment:10 by Russell Keith-Magee, 18 years ago

Earlier counterproposal here. Current discussion of these proposals
here

comment:11 by anonymous, 18 years ago

Cc: dan.fairs@… added

comment:12 by (removed), 18 years ago

Cc: ferringb@… added

comment:13 by anonymous, 18 years ago

Cc: lau@… added

comment:14 by anonymous, 18 years ago

Cc: wmealing@… added

comment:15 by anonymous, 18 years ago

Cc: frankie@… added

comment:16 by Alexander Solovyov <alexander.solovyov@…>, 18 years ago

Cc: alexander.solovyov@… added

comment:17 by Russell Keith-Magee, 17 years ago

Owner: changed from Adrian Holovaty to Russell Keith-Magee

comment:18 by anonymous, 17 years ago

Cc: lurker86@… added

comment:19 by Densetsu no Ero-sennin <densetsu.no.ero.sennin@…>, 17 years ago

Cc: densetsu.no.ero.sennin@… added

comment:20 by Russell Keith-Magee, 17 years ago

Owner: changed from nobody to Russell Keith-Magee

For the record - I'm taking ownership of this ticket, although work probably won't commence until after the queryset refactor and newforms-admin land in trunk. There are still some design issues to sort out, and I want to make sure we have everyone's attention.

comment:21 by Russell Keith-Magee, 17 years ago

Status: newassigned

comment:22 by arthur.case@…, 17 years ago

Cc: arthur.case@… added

comment:23 by anonymous, 17 years ago

Cc: remco@… added

comment:24 by Jacob, 17 years ago

Triage Stage: Design decision neededSomeday/Maybe

comment:25 by anonymous, 17 years ago

Cc: martin.paquette@… added

by Nicolas Lara <nicolaslara@…>, 17 years ago

Attachment: aggregate.diff added

Patch that adds aggregate functionality (no annotate yet)

comment:26 by Nicolas Lara <nicolaslara@…>, 17 years ago

Cc: nicolaslara@… added
Has patch: set
Patch needs improvement: set

I added a patch that provides the aggregate functionality as proposed in http://groups.google.com/group/django-developers/browse_thread/thread/3d4039161f6d2f5e/7485e59f449f9bf4?hl=en&
No anotate function yet. This is not a final implementation, just some of the functionality. The aggregate modifier already follows joins and group by functionality can be specified by using values on the QuerySet before calling aggregate(). Aggregate is a terminal modifier.
To-Do: test aggregate(), annotate(), AggregationQueryset? (like ValuesQuerySet, so it doesn't need to be called as the last modifier), un-aliased syntax, custom

by Nicolas Lara, 17 years ago

Attachment: aggregate.2.diff added

comment:27 by Nicolas Lara, 17 years ago

Finnally got some time to work on this. I added a new update to the patch. It is still very rough but implements both annotate and aggregate functionalities. Currently the modifiers take both *agrs and kwargs and generate the standard alias in the case of *args. When applying the modifiers to a ValuesQuerySet the field used for GROUP BY are restricted to the fields specified in 'values'. Annotate returns a new (Values)QuerySet and aggregate a dict. Currently there is a 'bug' in which I'd like to get feedback: when aggregating on both a field of the model and a foreign field the result of the aggregation on the 'local' field is 'wrong' (not the same as if aggregating only in local fields or only in foreign fields) this is because we are doing the aggregation in columns that are repeated in the SQL (due to the nature of joins). I am not sure if this should be prohibited (doing both foreign and local aggregates) because, though it provides some functionality, generates results that are unexpected and might lead to confusion. I am not going to touch on this for now as I think it needs discusion. I will continue to work in making the HAVING clause work when filtering on aggregated fields.

comment:28 by Russell Keith-Magee, 17 years ago

For the benefit of those that haven't heard - ORM aggregation support has been accepted as a 2008 Google Summer of Code project. Nicolas Lara will be doing the heavy lifting, I (Russell Keith-Magee) will be mentoring.

comment:29 by Nicolas Lara, 17 years ago

Needs tests: set

I've added a new patch with the latests implementation of aggregates so it is easier to test it, I'll try to keep this up to date every time a new feature is implemented.

comment:30 by JustinTArthur, 16 years ago

Cc: jarthur@… added

comment:31 by Simon Law, 16 years ago

Cc: simon@… added

by Nicolas Lara, 16 years ago

Attachment: aggregates.diff added

Updated with docs

comment:32 by Jorge Gajon, 16 years ago

Cc: gajon@… added

comment:33 by Gonzalo Saavedra, 16 years ago

Cc: gonz@… added

comment:34 by anonymous, 16 years ago

Cc: henry@… added

comment:35 by marc, 16 years ago

Cc: marc@… added

comment:36 by Andre Miras, 16 years ago

Cc: andre.miras@… added

comment:37 by Johntron, 16 years ago

Cc: john.syrinek@… added

by Nicolas Lara, 16 years ago

Attachment: aggregates.patch added

Removed a development comment. Sorry for the noise.

comment:38 by Richard, 16 years ago

Cc: jiminy@… added

comment:39 by (removed), 16 years ago

Cc: ferringb@… removed

comment:40 by anonymous, 16 years ago

Possibility of creating a supported branch of this to test?

comment:41 by anonymous, 16 years ago

Cc: nreilly@… added

comment:42 by pixelcort, 16 years ago

Cc: cortland@… me@… added

comment:43 by omat, 16 years ago

Cc: omat@… added

comment:44 by Andre Miras, 16 years ago

Cc: andre.miras@… removed

comment:45 by anonymous, 16 years ago

Cc: henry@… removed

comment:46 by Kyle, 16 years ago

Cc: kyle.fox@… added

comment:47 by flosch, 16 years ago

Cc: flosch@… added

comment:48 by Alexander Koshelev, 16 years ago

Cc: daevaorn@… added

comment:49 by Russell Keith-Magee, 16 years ago

Status update: this ticket is now pretty much guaranteed to be on the "must have" list for v1.1.

I've been working on this patch in my own private sandbox in preparation for committing to trunk. In the interests of public disclosure of my progress, I've just pushed my updates onto github.

Big, flashing light warning: While the basic shape of the API has been finalized, the code itself isn't final. Changes are expected before this reaches trunk. In particular:

  • Return types for aggregate columns are still in flux
  • The output format for ValuesQuerySet is still under development
  • The internal structure of Aggregate classes is subject to change
  • The handling of Having clauses will undergo some major changes.

There will probably be other changes. In short - consider yourself warned. This is a development branch. It is subject to change. It will occasionally break. Backwards compatibility is in no way guaranteed until this ticket is committed to the Django trunk.

comment:50 by , 16 years ago

Cc: amlau@… added

comment:51 by eas, 16 years ago

Cc: especkman@… added

comment:52 by anonymous, 16 years ago

Cc: kristjan@… added

comment:53 by Robin Breathe, 16 years ago

Cc: robin@… added

comment:54 by vbmendes, 16 years ago

Cc: vbmendes@… added

comment:55 by anonymous, 16 years ago

Cc: Marinho Brandão added

by jbronn, 16 years ago

by jbronn, 16 years ago

Makes aggregation functionality more amenable to subclassing.

comment:56 by Robin, 16 years ago

Cc: Robin added

Will this account for grouping by date or time on a db column with datetime/timestamp type?

eg. 
1. select * from table group by year(timestamp)
2. select * from table group by hour(timestamp)

*note that the sql examples above probably does not work as proper sql.

comment:57 by wallenfe, 16 years ago

Cc: wallenfe@… added

comment:58 by Russell Keith-Magee, 16 years ago

Needs tests: unset
Patch needs improvement: unset

Call for final testing!

Aggregations are almost ready for trunk, so I would greatly appreciate any bug reports on the code as it currently stands. Pull a copy of the repository from my github repository; if you're not a git user, you can download a snapshot tarball from the github page.

comment:59 by underbluewaters, 16 years ago

I would be great if this feature supported Standard Deviation to do some simple statistics. I'm currently doing this with plain sql and it is a hassle.

in reply to:  59 comment:60 by Russell Keith-Magee, 16 years ago

Replying to underbluewaters:

I would be great if this feature supported Standard Deviation to do some simple statistics.

I've just updated the github repository with StdDev and Variance, including an optional flag to do sample stats instead of population stats. Be warned, however: SQlite doesn't provide these aggregates out of the box - you will need to install an SQLite extension module to get StdDev and Variance.

comment:61 by Russell Keith-Magee, 16 years ago

Resolution: fixed
Status: assignedclosed

(In [9742]) Fixed #3566 -- Added support for aggregation to the ORM. See the documentation for details on usage.

Many thanks to:

  • Nicolas Lara, who worked on this feature during the 2008 Google Summer of Code.
  • Alex Gaynor for his help debugging and fixing a number of issues.
  • Justin Bronn for his help integrating with contrib.gis.
  • Karen Tracey for her help with cross-platform testing.
  • Ian Kelly for his help testing and fixing Oracle support.
  • Malcolm Tredinnick for his invaluable review notes.
Note: See TracTickets for help on using tickets.
Back to Top