Opened 8 years ago

Closed 8 years ago

#25834 closed Cleanup/optimization (invalid)

Better expose ORM grouping semantics

Reported by: Raphael Gaschignard Owned by: nobody
Component: Database layer (models, ORM) Version: 1.8
Severity: Normal Keywords:
Cc: Triage Stage: Unreviewed
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

Currently, what QuerySet's annotate does is entirely dependent on opaque semantics that don't hold up to well when working on more complex situations.

Let's say I have a LineItem model (the line items of an invoice, for example) that I want to calculate some stats on, with a price, and a price_currency. Some pointy haired boss wants to know what the totals on the line items are depending on currency. Pull out my nifty annotate tool:

In [33]: print LineItem.objects.values('price_currency').annotate(total=Sum('price'))
[{'price_currency': u'USD', 'total': Decimal('9001')}]

So there annotate ends up putting things into the GROUP BY clause to do the sum (namely the previous values call)

Sometimes I just want to rename columns to make my life easier, so I'll make a pc field as a shortcut to price:

In[37]: print LineItem.objects.values('price_currency').annotate(pc=F('price'))
[{'price_currency': u'USD', 'pc': Decimal('0')}, {'price_currency': u'USD', 'pc': Decimal('34')}, ...etc...]

Here annotate adds a SELECT clause without touching the GROUP BY clause

When I end up working on a bigger query, I'll try to simplify my queryset expression after I have something that computes what I wanted, so doing things like merging filters and, sometimes annotate:

In[39]: print LineItem.objects.values('price_currency').annotate(total=Sum('price'), pc=F('price')).query

SELECT "item_lineitem"."price_currency", SUM("item_lineitem"."price") AS "total" FROM "item_lineitem" GROUP BY "item_lineitem"."price_currency", "item_lineitem"."price" ORDER BY "item_lineitem"."position" ASC

So when I annotate with a mixture of "grouping required" and "adding a select" , then everything becomes "grouping required", and in the previous example, suddenly I'm grouping by currency AND price. I think this is the only valid behaviour if you want to return something, but shouldn't this just fail?

(Some crazier stuff is when you annotate with Sum and it simply groups by _all_ of your model's fields. I think the reason this doesn't fail is for the motivating example in the documentation (of counting the min price of the books in a store), but this seems like something that should fail)



My understanding is that Django's ORM is coming more to terms with the fact that it's basically always going to be paired with a RDBMS, so this multi-meaning annotate being the only reasonable way to use grouping is a hard sell to me. I very much want to get rid of all of our custom SQL, but I have to spend a day working on one query, all because annotate ends up being 2 separate functions:

  • "group by previous values call and add aggregation function to select"
  • "add a select clause"

Not to mention that the pairing with values means that if I want to group by a complex query, I have to do something like:

Model.objects.values('foo').annotate(grouping_property=complex_thing).annotate(Sum('stuff')).annotate(grouping_property=complex_thing)

in order to actually have the grouping_property to show up in the select (values doesn't work on some of my annotated values for some reason... there might be a bug that I haven't successfully isolated in my code or the ORM).

Anyways, I'm complaining about this because I would much rather have errors show up for a lot of these cases, so that I can at least fix them, but in the current API that would be impossible...

In my ideal world, you would have group(by=columns_or_expressions, into=aggregation_things) which would be somewhat equivalent to "group by the columns in by (something like values), and annotate your select with the into clause)", so when someone asks how to do a grouping you can point to that directly. And annotate would be relegated to only touching the SELECT part of a query. With backwards compatability considerations, you could probably make a separate method just for SELECT.

I'm not sure of the implications with regards to joins and things like annotate(Sum('books__price')), but my impression is that all this functionality is in the ORM already, just all within annotate

Change History (4)

comment:1 by Josh Smeaton, 8 years ago

Hi rtpg, thanks for taking the time to write this all out. Let me respond to some of this in line, and we can keep the conversation going from there.

So when I annotate with a mixture of "grouping required" and "adding a select" , then everything becomes "grouping required", and in the previous example, suddenly I'm grouping by currency AND price. I think this is the only valid behaviour if you want to return something, but shouldn't this just fail?

This was actually pretty surprising to me, but not for the reason you mention. I would have thought that the extra alias you were annotating would have been added to the SELECT list AND the GROUP BY clauses. GROUPing is usually required for any non-aggregate columns used in the SELECT list (and sometimes the ORDER BY too).

(Some crazier stuff is when you annotate with Sum and it simply groups by _all_ of your model's fields. I think the reason this doesn't fail is for the motivating example in the documentation (of counting the min price of the books in a store), but this seems like something that should fail)

Grouping by all the fields is useful for aggregating over a related model. This shouldn't be an error condition.

My understanding is that Django's ORM is coming more to terms with the fact that it's basically always going to be paired with a RDBMS, so this multi-meaning annotate being the only reasonable way to use grouping is a hard sell to me.

Fair enough, and I can sympathise. I've got a lot of experience with the ORM and with aggregation in general, and sometimes even I can mess up these queries. I find it easier to think in terms of SQL, and then work back from there. This isn't what an ORM should be going, but I'm not sure how others approach aggregation with Django's ORM.

I very much want to get rid of all of our custom SQL, but I have to spend a day working on one query, all because annotate ends up being 2 separate functions:

  • "group by previous values call and add aggregation function to select"
  • "add a select clause"

I think it helps to think of annotate as the following: "adds an expression to the query" which in most cases simply adds a new column/calculation to the SELECT list. Annotate isn't responsible for generating the GROUP BY. The SQL Compiler generates the GROUP BY based on the existence of an "aggregate" (an expression with contains_aggregate=True) in the SELECT list.

Not to mention that the pairing with values means that if I want to group by a complex query, I have to do something like:

Model.objects.values('foo')
.annotate(grouping_property=complex_thing)
.annotate(Sum('stuff'))
.annotate(grouping_property=complex_thing)`

First, I think you've added an extra .annotate clause at the end by mistake. Second, I don't think this query is correct. You should be annotating the grouping property, and then using values to include this new property:

Model.objects
.annotate(grouping_property=complex_thing)
.values('foo', 'grouping_property') # grouping_property will also be in the SELECT list now
.annotate(Sum('stuff'))

The idea is to support expressions within values though, so the ideal queryset using existing API (and an updated values() method) would be:

Model.objects
.values('foo', grouping_property=complex_thing)
.annotate(Sum('stuff'))

In my ideal world, you would have group(by=columns_or_expressions, into=aggregation_things) which would be somewhat equivalent to "group by the columns in by (something like values), and annotate your select with the into clause)",

I don't think this API would be a good one for a few reasons. Figuring out what to put in a GROUP BY clause is not trivial, is different on most backends, and would require specialised knowledge of SQL. The Django ORM is designed to be easy to construct queries, but also abstracted from the database enough that users should not need to know SQL.

Personally I think everyone using an ORM should be proficient with SQL, but that's not always the case.

Going back to GROUP BY though, most compliant databases require every non-aggregate select column to be in the group by. Most also require any ORDER BY columns to be added to the GROUP BY, else the ordering has no effect. Some databases will allow you to group by the unique constraint (like Primary Key or related Primary Keys) which is equivalent to all fields on the model.

The ORM should always construct the correct GROUP BY clause. If the API is clumsy, we should first ask if documentation can help. If the API is non-intuitive, we can try to make that behaviour more clear or fix any major inconsistencies. At the moment though, aggregating should be thought of as the following steps:

  1. Introduce all the expressions and columns you'll need with .annotate() (or with .values() in the future)
  2. Limit the SELECT/GROUPING fields with .values()
  3. Perform the aggregation with .annotate(Sum('...')) or .aggregate(Sum('...'))
  4. (Optionally) Filter the result of the aggregation with .filter()

If this isn't clear or obvious, then we need to document it.

The django orm follows the builder pattern. The order of operations matter in nearly all queries, so you really do have to pay attention to the order of queryset methods applied. This is why values() and then annotate() will apply GROUPING, but annotate() and then values() will just limit the columns returned to Django.

If you'd like to propose a different API idea or syntax change, then please show a few examples of a full queryset, and the associated SQL that it'd generate. Then we can discuss the merits of an actual implementation rather than the theoretical brain dump I wrote above :)

For what it's worth, I think there's room for improvement in these APIs. I don't know what these improvements would look like, but I should be able to help you identify problems with any proposals that may be put forward.

comment:2 by Shai Berger, 8 years ago

TL;DR. Just dropped in to say that this sort of discussion belongs on the DevelopersMailingList where many more people will see it.

comment:3 by Anssi Kääriäinen, 8 years ago

I think we have room for improvement here. Sometimes you need to do constructs like .values().annotate().values(), where the first values() is used to set the group by, the second one to select the columns you want. This is a somewhat non-obvious API. Also, in some cases you know exactly the group by you want, but there is no way to actually set the group by.

I know some users use the private qs.query.group_by directly. Maybe we could expose some API to do exactly that? The API would need to come with a warning that you are completely responsible for setting the correct group by, if you set it manually.

Agreed that this belongs to the mailing list, so lets continue there.

comment:4 by Tim Graham, 8 years ago

Resolution: invalid
Status: newclosed

Closing as "invalid" since this is a high level discussion that needs smaller, more actionable tickets pending the outcome of the mailing list discussions.

Note: See TracTickets for help on using tickets.
Back to Top