Opened 18 years ago
Closed 16 years ago
#3566 closed (fixed)
Proposal: ORM aggregation support
Reported by: | Owned by: | Russell Keith-Magee | |
---|---|---|---|
Component: | Database layer (models, ORM) | Version: | dev |
Severity: | Keywords: | orm aggregation group by | |
Cc: | nick.lane.au@…, jm.bugtracking@…, dan.fairs@…, lau@…, wmealing@…, frankie@…, alexander.solovyov@…, lurker86@…, densetsu.no.ero.sennin@…, arthur.case@…, remco@…, martin.paquette@…, nicolaslara@…, jarthur@…, simon@…, gajon@…, gonz@…, marc@…, john.syrinek@…, jiminy@…, nreilly@…, cortland@…, me@…, omat@…, kyle.fox@…, flosch@…, daevaorn@…, amlau@…, especkman@…, kristjan@…, robin@…, vbmendes@…, Marinho Brandão, Robin, wallenfe@… | Triage Stage: | Someday/Maybe |
Has patch: | yes | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
For origin and some discussion see http://groups.google.com/group/django-users/browse_frm/thread/fc58fbe409a6f098/1110c688e1af58d2
I would like to see an addition to the django ORM - many people has asked for aggregation. Do you think Django should have aggregation support?
I would like to add it in a similar fashion that ValuesQuerySet
is implemented (a subclass to QuerySet
with minor changes to how query is constructed and results are returned) - if we add a AgreggateQuerySet
that would:
- be create from simple
QuerySet
by a call toaggregate(group_by=(), max=(), min=(), avg=(), sum=(), count=False)
, where:group_by
is a list of fields on which to aggregate, these will be part of the result, if empty the whole query will be grouped- If
count
, thenCOUNT(*)
will be computed and included sum, max, min
andavg
will contain a list of fields on which to callSUM(), MAX(), MIN()
orAVG()
respectively
- will be a valid queryset
filter()
will result in limiting the result by appending conditions to theHAVING
clause etc.aggregate()
will probably add fields to individual aggregation functions.
queryset.count()
would become identical to queryset.aggregate(count=True)['count']
Example:
>>> quseryset = Model.objects.all() >>> queryset.aggregate( group_by=( 'name', 'city' ), sum=( 'pay', 'some_other_field' ), avg=( 'pay', 'age' ), count=True ) [ { 'group_by' : { 'name' : 'John Doe', 'city' : 'Prague', } 'sum' : { 'pay' : 50000, 'some_other_field' : 10000 }, 'avg' : { 'pay' : 10000, 'age' : 30 }, 'count' : 5, }, { 'group_by' : { 'name' : 'Jane Doe', 'city' : 'Paris', } 'sum' : { 'pay' : 300000, 'some_other_field' : 10 }, 'avg' : { 'pay' : 100000, 'age' : 32 }, 'count' : 3, }, ]
There are a few problems I haven't thought through and would like comments:
- how to deal with grouping by foreign key? the best seems to include the whole object in the result, not just the ID. (isn't it too magical?)
- how to specify ordering?
- how to best present the results? originally I thought that fields in
group_by
will be on the same level asmin
ormax
, but Tim Chase pointed out that it could clash if the model has a field namedmin
.
I will start working on this as soon as I am sure about the interface. Any comments and help would be greatly appreciated.
Attachments (6)
Change History (67)
comment:1 by , 18 years ago
comment:2 by , 18 years ago
Summary: | Proposal of ORM extension -- aggregationu → Proposal: ORM aggregation support |
---|
FWIW: I really like this proposal.
comment:3 by , 18 years ago
Triage Stage: | Unreviewed → Design decision needed |
---|
comment:4 by , 18 years ago
Cc: | added |
---|
comment:5 by , 18 years ago
Cc: | added |
---|
follow-up: 7 comment:6 by , 18 years ago
"how to deal with grouping by foreign key? the best seems to include the whole object in the result, not just the ID. (isn't it too magical?)"
Surely, aren't they lazy after all? So they will be not queried until you access them?
"how to specify ordering?"
How about queryset.aggregate( group_by=( 'name', 'city' ), sum=( 'pay', 'some_other_field' ), avg=( 'pay', 'age' ), count=True ).order_by(group_by='name', sum='-pay', avg='pay')
(- sign represents the reversed order as it represents it in normal querysets too)
"how to best present the results? originally I thought that fields in group_by will be on the same level as min or max, but Tim Chase pointed out that it could clash if the model has a field named min ."
I see no collision in current result.
comment:7 by , 18 years ago
Replying to Ciantic:
"how to deal with grouping by foreign key? the best seems to include the whole object in the result, not just the ID. (isn't it too magical?)"
Surely, aren't they lazy after all? So they will be not queried until you access them?
either it will be the whole object, or just an id, having a lazy object doesn't seem to have any effect here
"how to specify ordering?"
How about queryset.aggregate( group_by=( 'name', 'city' ), sum=( 'pay', 'some_other_field' ), avg=( 'pay', 'age' ), count=True ).order_by(group_by='name', sum='-pay', avg='pay')
(- sign represents the reversed order as it represents it in normal querysets too)
nice, but what field should be first in your example? name, SUM( pay ) or AVG( pay ) ??
I am sorry, I didn't have the time to advance with this, plus the queryset refactoring came around and that really needs to be done first.
follow-up: 9 comment:8 by , 18 years ago
either it will be the whole object, or just an id, having a lazy object doesn't seem to have any effect here
Returning just an ID is more harmful than returning object. When you return just the ID the people who needs the object is facing frustrating problem, fetching those objects externally. I suggest we use object, and those who need ID just uses the object.id to get it (notice: in django one shouldn't change the id fieldname, I've learnt, same way they shouldn't change the primarykey type... even if it is possible, there is stuff that simply does not work after changing it).
nice, but what field should be first in your example? name, SUM( pay ) or AVG( pay ) ??
name is first, then comes sub(pay) with descending in place and finally avg(pay), same way as in normal order_by, the leftmost has biggest priority.
comment:9 by , 18 years ago
Replying to Ciantic:
nice, but what field should be first in your example? name, SUM( pay ) or AVG( pay ) ??name is first, then comes sub(pay) with descending in place and finally avg(pay), same way as in normal order_by, the leftmost has biggest priority.
and what if I want to sort on sum( pay ), name, sum( other_field ), that wouldn't work in your syntax, since order_by( sum='pay', group_by='name', sum='other_field' ) is a syntax error....
I would like to avoid things like order_by( 'sum(pay)', 'avg(other_field)', 'name' ), but so far its the only way I can think of...
comment:10 by , 18 years ago
comment:11 by , 18 years ago
Cc: | added |
---|
comment:12 by , 17 years ago
Cc: | added |
---|
comment:13 by , 17 years ago
Cc: | added |
---|
comment:14 by , 17 years ago
Cc: | added |
---|
comment:15 by , 17 years ago
Cc: | added |
---|
comment:16 by , 17 years ago
Cc: | added |
---|
comment:17 by , 17 years ago
Owner: | changed from | to
---|
comment:18 by , 17 years ago
Cc: | added |
---|
comment:19 by , 17 years ago
Cc: | added |
---|
comment:20 by , 17 years ago
Owner: | changed from | to
---|
For the record - I'm taking ownership of this ticket, although work probably won't commence until after the queryset refactor and newforms-admin land in trunk. There are still some design issues to sort out, and I want to make sure we have everyone's attention.
comment:21 by , 17 years ago
Status: | new → assigned |
---|
comment:22 by , 17 years ago
Cc: | added |
---|
comment:23 by , 17 years ago
Cc: | added |
---|
comment:24 by , 17 years ago
Triage Stage: | Design decision needed → Someday/Maybe |
---|
comment:25 by , 17 years ago
Cc: | added |
---|
by , 17 years ago
Attachment: | aggregate.diff added |
---|
Patch that adds aggregate functionality (no annotate yet)
comment:26 by , 17 years ago
Cc: | added |
---|---|
Has patch: | set |
Patch needs improvement: | set |
I added a patch that provides the aggregate functionality as proposed in http://groups.google.com/group/django-developers/browse_thread/thread/3d4039161f6d2f5e/7485e59f449f9bf4?hl=en&
No anotate function yet. This is not a final implementation, just some of the functionality. The aggregate modifier already follows joins and group by functionality can be specified by using values on the QuerySet before calling aggregate(). Aggregate is a terminal modifier.
To-Do: test aggregate(), annotate(), AggregationQueryset? (like ValuesQuerySet, so it doesn't need to be called as the last modifier), un-aliased syntax, custom
by , 17 years ago
Attachment: | aggregate.2.diff added |
---|
comment:27 by , 17 years ago
Finnally got some time to work on this. I added a new update to the patch. It is still very rough but implements both annotate and aggregate functionalities. Currently the modifiers take both *agrs and kwargs and generate the standard alias in the case of *args. When applying the modifiers to a ValuesQuerySet the field used for GROUP BY are restricted to the fields specified in 'values'. Annotate returns a new (Values)QuerySet and aggregate a dict. Currently there is a 'bug' in which I'd like to get feedback: when aggregating on both a field of the model and a foreign field the result of the aggregation on the 'local' field is 'wrong' (not the same as if aggregating only in local fields or only in foreign fields) this is because we are doing the aggregation in columns that are repeated in the SQL (due to the nature of joins). I am not sure if this should be prohibited (doing both foreign and local aggregates) because, though it provides some functionality, generates results that are unexpected and might lead to confusion. I am not going to touch on this for now as I think it needs discusion. I will continue to work in making the HAVING clause work when filtering on aggregated fields.
comment:28 by , 17 years ago
For the benefit of those that haven't heard - ORM aggregation support has been accepted as a 2008 Google Summer of Code project. Nicolas Lara will be doing the heavy lifting, I (Russell Keith-Magee) will be mentoring.
comment:29 by , 16 years ago
Needs tests: | set |
---|
I've added a new patch with the latests implementation of aggregates so it is easier to test it, I'll try to keep this up to date every time a new feature is implemented.
comment:30 by , 16 years ago
Cc: | added |
---|
comment:31 by , 16 years ago
Cc: | added |
---|
comment:32 by , 16 years ago
Cc: | added |
---|
comment:33 by , 16 years ago
Cc: | added |
---|
comment:34 by , 16 years ago
Cc: | added |
---|
comment:35 by , 16 years ago
Cc: | added |
---|
comment:36 by , 16 years ago
Cc: | added |
---|
comment:37 by , 16 years ago
Cc: | added |
---|
by , 16 years ago
Attachment: | aggregates.patch added |
---|
Removed a development comment. Sorry for the noise.
comment:38 by , 16 years ago
Cc: | added |
---|
comment:39 by , 16 years ago
Cc: | removed |
---|
comment:41 by , 16 years ago
Cc: | added |
---|
comment:42 by , 16 years ago
Cc: | added |
---|
comment:43 by , 16 years ago
Cc: | added |
---|
comment:44 by , 16 years ago
Cc: | removed |
---|
comment:45 by , 16 years ago
Cc: | removed |
---|
comment:46 by , 16 years ago
Cc: | added |
---|
comment:47 by , 16 years ago
Cc: | added |
---|
comment:48 by , 16 years ago
Cc: | added |
---|
comment:49 by , 16 years ago
Status update: this ticket is now pretty much guaranteed to be on the "must have" list for v1.1.
I've been working on this patch in my own private sandbox in preparation for committing to trunk. In the interests of public disclosure of my progress, I've just pushed my updates onto github.
Big, flashing light warning: While the basic shape of the API has been finalized, the code itself isn't final. Changes are expected before this reaches trunk. In particular:
- Return types for aggregate columns are still in flux
- The output format for ValuesQuerySet is still under development
- The internal structure of Aggregate classes is subject to change
- The handling of Having clauses will undergo some major changes.
There will probably be other changes. In short - consider yourself warned. This is a development branch. It is subject to change. It will occasionally break. Backwards compatibility is in no way guaranteed until this ticket is committed to the Django trunk.
comment:50 by , 16 years ago
Cc: | added |
---|
comment:51 by , 16 years ago
Cc: | added |
---|
comment:52 by , 16 years ago
Cc: | added |
---|
comment:53 by , 16 years ago
Cc: | added |
---|
comment:54 by , 16 years ago
Cc: | added |
---|
comment:55 by , 16 years ago
Cc: | added |
---|
by , 16 years ago
Attachment: | geoquery_aggregation_fix.diff added |
---|
by , 16 years ago
Attachment: | queryset_modular_aggregates.diff added |
---|
Makes aggregation functionality more amenable to subclassing.
comment:56 by , 16 years ago
Cc: | added |
---|
Will this account for grouping by date or time on a db column with datetime/timestamp type?
eg. 1. select * from table group by year(timestamp) 2. select * from table group by hour(timestamp)
*note that the sql examples above probably does not work as proper sql.
comment:57 by , 16 years ago
Cc: | added |
---|
comment:58 by , 16 years ago
Needs tests: | unset |
---|---|
Patch needs improvement: | unset |
Call for final testing!
Aggregations are almost ready for trunk, so I would greatly appreciate any bug reports on the code as it currently stands. Pull a copy of the repository from my github repository; if you're not a git user, you can download a snapshot tarball from the github page.
follow-up: 60 comment:59 by , 16 years ago
I would be great if this feature supported Standard Deviation to do some simple statistics. I'm currently doing this with plain sql and it is a hassle.
comment:60 by , 16 years ago
Replying to underbluewaters:
I would be great if this feature supported Standard Deviation to do some simple statistics.
I've just updated the github repository with StdDev and Variance, including an optional flag to do sample stats instead of population stats. Be warned, however: SQlite doesn't provide these aggregates out of the box - you will need to install an SQLite extension module to get StdDev and Variance.
comment:61 by , 16 years ago
Resolution: | → fixed |
---|---|
Status: | assigned → closed |
(In [9742]) Fixed #3566 -- Added support for aggregation to the ORM. See the documentation for details on usage.
Many thanks to:
- Nicolas Lara, who worked on this feature during the 2008 Google Summer of Code.
- Alex Gaynor for his help debugging and fixing a number of issues.
- Justin Bronn for his help integrating with contrib.gis.
- Karen Tracey for her help with cross-platform testing.
- Ian Kelly for his help testing and fixing Oracle support.
- Malcolm Tredinnick for his invaluable review notes.
I would add that the resulting structure should be accessible within a template. Thus, one should be able to have in one's view
context = {
}
return render_to_response('template.html', context)
one should be able to refer to the results in a template as
{{ stats.sum.field1 }}
{{ stats.group_by.field0 }}
just as one can refer to
{% for item in items %}
{{ item.field3 }}
{% endif %}
-tkc