Opened 9 years ago
Closed 9 years ago
#25834 closed Cleanup/optimization (invalid)
Better expose ORM grouping semantics
Reported by: | Raphael Gaschignard | Owned by: | nobody |
---|---|---|---|
Component: | Database layer (models, ORM) | Version: | 1.8 |
Severity: | Normal | Keywords: | |
Cc: | Triage Stage: | Unreviewed | |
Has patch: | no | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
Currently, what QuerySet's annotate
does is entirely dependent on opaque semantics that don't hold up to well when working on more complex situations.
Let's say I have a LineItem
model (the line items of an invoice, for example) that I want to calculate some stats on, with a price
, and a price_currency
. Some pointy haired boss wants to know what the totals on the line items are depending on currency. Pull out my nifty annotate
tool:
In [33]: print LineItem.objects.values('price_currency').annotate(total=Sum('price')) [{'price_currency': u'USD', 'total': Decimal('9001')}]
So there annotate
ends up putting things into the GROUP BY
clause to do the sum (namely the previous values
call)
Sometimes I just want to rename columns to make my life easier, so I'll make a pc
field as a shortcut to price:
In[37]: print LineItem.objects.values('price_currency').annotate(pc=F('price')) [{'price_currency': u'USD', 'pc': Decimal('0')}, {'price_currency': u'USD', 'pc': Decimal('34')}, ...etc...]
Here annotate
adds a SELECT
clause without touching the GROUP BY
clause
When I end up working on a bigger query, I'll try to simplify my queryset expression after I have something that computes what I wanted, so doing things like merging filter
s and, sometimes annotate
:
In[39]: print LineItem.objects.values('price_currency').annotate(total=Sum('price'), pc=F('price')).query SELECT "item_lineitem"."price_currency", SUM("item_lineitem"."price") AS "total" FROM "item_lineitem" GROUP BY "item_lineitem"."price_currency", "item_lineitem"."price" ORDER BY "item_lineitem"."position" ASC
So when I annotate with a mixture of "grouping required" and "adding a select" , then everything becomes "grouping required", and in the previous example, suddenly I'm grouping by currency AND price. I think this is the only valid behaviour if you want to return something, but shouldn't this just fail?
(Some crazier stuff is when you annotate
with Sum and it simply groups by _all_ of your model's fields. I think the reason this doesn't fail is for the motivating example in the documentation (of counting the min price of the books in a store), but this seems like something that should fail)
My understanding is that Django's ORM is coming more to terms with the fact that it's basically always going to be paired with a RDBMS, so this multi-meaning annotate being the only reasonable way to use grouping is a hard sell to me. I very much want to get rid of all of our custom SQL, but I have to spend a day working on one query, all because annotate
ends up being 2 separate functions:
- "group by previous
values
call and add aggregation function to select" - "add a select clause"
Not to mention that the pairing with values
means that if I want to group by a complex query, I have to do something like:
Model.objects.values('foo').annotate(grouping_property=complex_thing).annotate(Sum('stuff')).annotate(grouping_property=complex_thing)
in order to actually have the grouping_property
to show up in the select (values
doesn't work on some of my annotate
d values for some reason... there might be a bug that I haven't successfully isolated in my code or the ORM).
Anyways, I'm complaining about this because I would much rather have errors show up for a lot of these cases, so that I can at least fix them, but in the current API that would be impossible...
In my ideal world, you would have group(by=columns_or_expressions, into=aggregation_things)
which would be somewhat equivalent to "group by the columns in by
(something like values
), and annotate your select
with the into
clause)", so when someone asks how to do a grouping you can point to that directly. And annotate
would be relegated to only touching the SELECT
part of a query. With backwards compatability considerations, you could probably make a separate method just for SELECT
.
I'm not sure of the implications with regards to joins and things like annotate(Sum('books__price'))
, but my impression is that all this functionality is in the ORM already, just all within annotate
Change History (4)
comment:1 by , 9 years ago
comment:2 by , 9 years ago
TL;DR. Just dropped in to say that this sort of discussion belongs on the DevelopersMailingList where many more people will see it.
comment:3 by , 9 years ago
I think we have room for improvement here. Sometimes you need to do constructs like .values().annotate().values(), where the first values() is used to set the group by, the second one to select the columns you want. This is a somewhat non-obvious API. Also, in some cases you know exactly the group by you want, but there is no way to actually set the group by.
I know some users use the private qs.query.group_by directly. Maybe we could expose some API to do exactly that? The API would need to come with a warning that you are completely responsible for setting the correct group by, if you set it manually.
Agreed that this belongs to the mailing list, so lets continue there.
comment:4 by , 9 years ago
Resolution: | → invalid |
---|---|
Status: | new → closed |
Closing as "invalid" since this is a high level discussion that needs smaller, more actionable tickets pending the outcome of the mailing list discussions.
Hi rtpg, thanks for taking the time to write this all out. Let me respond to some of this in line, and we can keep the conversation going from there.
This was actually pretty surprising to me, but not for the reason you mention. I would have thought that the extra alias you were annotating would have been added to the SELECT list AND the GROUP BY clauses. GROUPing is usually required for any non-aggregate columns used in the SELECT list (and sometimes the ORDER BY too).
Grouping by all the fields is useful for aggregating over a related model. This shouldn't be an error condition.
Fair enough, and I can sympathise. I've got a lot of experience with the ORM and with aggregation in general, and sometimes even I can mess up these queries. I find it easier to think in terms of SQL, and then work back from there. This isn't what an ORM should be going, but I'm not sure how others approach aggregation with Django's ORM.
I think it helps to think of annotate as the following: "adds an expression to the query" which in most cases simply adds a new column/calculation to the SELECT list. Annotate isn't responsible for generating the GROUP BY. The SQL Compiler generates the GROUP BY based on the existence of an "aggregate" (an expression with
contains_aggregate=True
) in the SELECT list.First, I think you've added an extra
.annotate
clause at the end by mistake. Second, I don't think this query is correct. You should be annotating the grouping property, and then usingvalues
to include this new property:The idea is to support expressions within
values
though, so the ideal queryset using existing API (and an updated values() method) would be:I don't think this API would be a good one for a few reasons. Figuring out what to put in a GROUP BY clause is not trivial, is different on most backends, and would require specialised knowledge of SQL. The Django ORM is designed to be easy to construct queries, but also abstracted from the database enough that users should not need to know SQL.
Personally I think everyone using an ORM should be proficient with SQL, but that's not always the case.
Going back to GROUP BY though, most compliant databases require every non-aggregate select column to be in the group by. Most also require any ORDER BY columns to be added to the GROUP BY, else the ordering has no effect. Some databases will allow you to group by the unique constraint (like Primary Key or related Primary Keys) which is equivalent to all fields on the model.
The ORM should always construct the correct GROUP BY clause. If the API is clumsy, we should first ask if documentation can help. If the API is non-intuitive, we can try to make that behaviour more clear or fix any major inconsistencies. At the moment though, aggregating should be thought of as the following steps:
.annotate()
(or with .values() in the future).values()
.annotate(Sum('...'))
or.aggregate(Sum('...'))
.filter()
If this isn't clear or obvious, then we need to document it.
The django orm follows the builder pattern. The order of operations matter in nearly all queries, so you really do have to pay attention to the order of queryset methods applied. This is why
values()
and thenannotate()
will apply GROUPING, butannotate()
and thenvalues()
will just limit the columns returned to Django.If you'd like to propose a different API idea or syntax change, then please show a few examples of a full queryset, and the associated SQL that it'd generate. Then we can discuss the merits of an actual implementation rather than the theoretical brain dump I wrote above :)
For what it's worth, I think there's room for improvement in these APIs. I don't know what these improvements would look like, but I should be able to help you identify problems with any proposals that may be put forward.