Opened 8 years ago

Last modified 11 months ago

#28296 assigned New feature

Add support for aggregation through subqueries

Reported by: László Károlyi Owned by: B Martsberger
Component: Database layer (models, ORM) Version: dev
Severity: Normal Keywords:
Cc: Simon Charette, Evgeny Arshinov, şuayip üzülmez, Ionel Cristian Mărieș Triage Stage: Accepted
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: yes
Easy pickings: no UI/UX: no

Description

Hey guys,

after having spent a couple of hours figuring out why a subquery annotated with Count fails, I came to the conclusion that Count in a query causes GROUP BY clauses added to it, which in turn renders the SQL subquery useless.

Please investigate this example which I just added to Stack Overflow as a resolution, and see what I'm talking about:

https://stackoverflow.com/a/44474268/1067833

Change History (21)

comment:1 by Simon Charette, 8 years ago

Hello László,

It's not clear to me why you not expecting a GROUP BY statement to be present in the subquery when you are explicitly grouping by id by using values('id') before annotating. Please see the section about aggregation and the usage of values() to GROUP BY in the aggregation documentation.

Also, is there a reason you are not simply using conditional aggregation for this query instead?

Performance.objects.annotate(
    reserved_seats=Count(
        Case(When(seat_reservations__status__in=TAKEN_TYPES, then='seat_reservations')),
    ),
)

Note that I'm assuming SeatReservation.performance.related_name == 'seat_reservations' above.

in reply to:  1 comment:2 by László Károlyi, 8 years ago

Hello Simon, thanks for getting back to me.

Replying to Simon Charette:

It's not clear to me why you not expecting a GROUP BY statement to be present in the subquery when you are explicitly grouping by id by using values('id') before annotating. Please see the section about aggregation and the usage of values() to GROUP BY in the

aggregation documentation.

The reason for why I needed to get rid of the GROUP BY clause was because it broke the COUNT() on the SQL level. I recognized this after going down to the SQL level and fiddling around with native SQL queries to find out why I'm getting NULL values at the counts. So here's a simplified version how the clause with GROUP BY looks, and what results it brings, copying from the MariaDB command line:

MariaDB [ticketshop]> select ticketshop_event_performance.id, (select count(u0.`id`) as count from ticketshop_booking_seatreservation u0 where (ticketshop_event_performance.id=(u0.performance_id)) and status='xxx' group by u0.id) as count from ticketshop_event_performance limit 1;                                           +-----+-------+
| id  | count |
+-----+-------+
| 134 |  NULL |
+-----+-------+
1 row in set (0.00 sec)

As you can see, it returns NULL in the count column. And so I removed the GROUP BY to see if it works that way:

MariaDB [ticketshop]> select ticketshop_event_performance.id, (select count(u0.`id`) as count from ticketshop_booking_seatreservation u0 where (ticketshop_event_performance.id=(u0.performance_id)) and status='xxx') as count from ticketshop_event_performance limit 1;
+-----+-------+
| id  | count |
+-----+-------+
| 134 |     0 |
+-----+-------+
1 row in set (0.00 sec)

See, it's returning proper values now. So that is why I went back to the ORM level and played around with it until I managed to remove the GROUP BY clause from the subquery. Basically, empirical testing.

Also it's worth mentioning that when I managed to remove the GROUP BY on the ORM level, executing the query without specifying the output field resulted in an exception when evaluating the query the first time, but not the second time:

In [8]: from ticketshop.booking.choices import TAKEN_TYPES

In [9]: rq = SeatReservation.objects.filter(
   ...:             performance=OuterRef(name='pk'), status__in=TAKEN_TYPES
   ...: ).values('id').annotate(count=Count('id')).values('count')

In [10]: rq.query.group_by = []

In [11]: a = Performance.objects.annotate(count=Subquery(rq))

In [12]: a[0].count
### LOTS OF TRACEBACK CUT
# ~/Work/venv/lib/python3.6/site-packages/django/db/models/expressions.py in _resolve_output_field(self) line 280:
# FieldError: Expression contains mixed types. You must set output_field
In [13]: a[0].count
Out[13]: 0

Also, is there a reason you are not simply using conditional aggregation for this query instead?

Performance.objects.annotate(
    reserved_seats=Count(
        Case(When(seat_reservations__status__in=TAKEN_TYPES, then='seat_reservations')),
    ),
)

Note that I'm assuming SeatReservation.performance.related_name == 'seat_reservations' above.

Thanks for the hint, I wasn't aware of this method, I'll look into it later.

comment:3 by Simon Charette, 8 years ago

Summary: Subquery with annotation fails without tweakingAdd support for aggregation through subqueries
Triage Stage: UnreviewedAccepted
Type: BugNew feature
Version: 1.11master

Hello László, thanks for taking the time to provide all these details.

I think that what you'd like to do use is the aggregate method instead of the annotate one to avoid the group by. Something along the lines of

Performance.objects.annotate(
    reserved_seats=Subquery(
        SeatReservation.objects.filter(
            performance=OuterRef(name='pk'),
            status__in=TAKEN_TYPES,
        ).aggregate(Count('pk'))
    ),
)

Unfortunately the Subquery API doesn't allow that yet.

It would require to either make aggregate return a lazy object that Subquery can deal with (right now it returns a dict on call) which has the potential of breaking backward compatibility or introducing a new kind of expression to deal with this case (e.g. AggregateSubquery(query, Count('pk'))).

In an ideal world we'd be able to call count() directly and avoid the Subquery wrapper

Performance.objects.annotate(
    reserved_seats=SeatReservation.objects.filter(
        performance=OuterRef(name='pk'),
        status__in=TAKEN_TYPES,
    ).count()
)

I'm converting the ticket to a feature request for subquery aggregation. It has a bit of overlap with #10060 as it would add a way to work around the issue by explicitly using subqueries as opposed to having the ORM do automatic pushdown but I still consider this a feature on it's own.

comment:4 by Matthew Schinckel, 8 years ago

I attempted to comment in the SO post, but comments are hardly conducive to code blocks.

I've written several Subquery subclasses to do aggregation of the subquery: perhaps that will work here.

class Count(Subquery):
    template = "(SELECT COUNT(*) FROM (%(subquery)s) _count)"
    
    @property
    def output_field(self):
        return models.IntegerField()

comment:5 by Sylvain Zimmer, 6 years ago

Hello !

I solved this with a custom class:

class SubQueryCount(Subquery):
    output_field = models.IntegerField()
    def __init__(self, *args, **kwargs):
        Subquery.__init__(self, *args, **kwargs)
        self.queryset = self.queryset.annotate(cnt=Count("*")).values("cnt")
        self.queryset.query.set_group_by()  # values() adds a GROUP BY we don't want here

subquery = Media.objects.filter(artist=OuterRef('id'))
artists_with_count_medias = Artist.objects.all().annotate(count_medias=SubQueryCount(subquery))

comment:6 by Mark Gensler, 6 years ago

I would like to offer the following solution using the new SQL Window class. This allows the aggregation of annotated values by calculating the aggregate over partitions based on the outer query model (in the GROUP BY clause), then annotating that data to every row in the subquery queryset. The subquery can then use the aggregated data from the first row returned and ignore the other rows.

I'd like some opinions and feedback on whether this is the optimal solution for aggregating arbitrary annotated data in a subquery. If so I'd be happy to write something up for the documentation. I couldn't find any other way to do this in the ORM! This also works well for calling Sum, Avg etc. rather than Count on annotated values on the subquery queryset.

Performance.objects.annotate(
    reserved_seats=Subquery(
        SeatReservation.objects.filter(
            performance=OuterRef(name='pk'),
            status__in=TAKEN_TYPES,
        ).annotate(
            reserved_seat_count=Window(
                expression=Count('pk'),
                partition_by=[F('performance')]
            ),
        ).values('reserved_seat_count')[:1],
        output_field=FloatField()
    )
)

comment:7 by Diederik van der Boor, 6 years ago

... either make aggregate return a lazy object that Subquery can deal with (right now it returns a dict on call)

I second that! It would make subqueries much easier to use.
A lazy dict or AggregateResult doesn't have to be too complicated. Providing __getitem__, __contains__, etc.. like UserDict would do the trick.
The same could happen with .count(). By letting it return a lazy CountResult object that casts to an int, it doesn't change the existing code, but also offers some .query attribute that Subquery() can use.

This would allow things like:

Performance.objects.annotate(
    reserved_seats=Subquery(
        SeatReservation.objects.filter(
            performance=OuterRef(name='pk'), status__in=TAKEN_TYPES
        ).aggregate(Count('pk'))
    )
)

Or even:

Performance.objects.annotate(
    reserved_seats=Subquery(
        SeatReservation.objects.filter(
            performance=OuterRef(name='pk'), status__in=TAKEN_TYPES
        ).count()
    )
)

comment:8 by Simon Charette, 6 years ago

A lazy dict or AggregateResult doesn't have to be too complicated. Providing getitem, contains, etc.. like UserDict would do the trick.

The main challenge here is that it changes the nature of count() and aggregate() methods so that they no longer perform a query automatically. This is certainly going to break a few tests but it should be possible to provide shims through __int__ for count() and dict subclassing for aggregate.

comment:9 by Simon Charette, 6 years ago

Cc: Simon Charette added

comment:10 by Simon Charette, 6 years ago

I did a bit of investigation to determine the length of changes this would require. I started with trying to make count() return a lazy object as it's a bit easier to implement than aggregate(). I didn't implement the resolve_expression or as_sql for now.

As expected making count() lazy breaks a few tests that expect the query to be immediately executed but but only a few adjustments to LazyObject are required to get most of the suite passing.

Still this breaks backward compatiblity and would certainly break more than just test code out there. For example, a view a could be getting a count, perfoming some alterations and retrieving a new count and that would break if the first query is not immediately executed.

The work I've done can be found here https://github.com/django/django/compare/master...charettes:ticket-28296

What I suggest we do here instead is to introduce a new AggregateSubQuery(queryset, aggregate) expression instead. It's certainly not as elegant as using count() or aggregate() but it does maintain backward compatiblity.

e.g.

Performance.objects.annotate(
    reserved_seats=AggregateSubQuery(SeatReservation.objects.filter(
        performance=OuterRef(name='pk'),
        status__in=TAKEN_TYPES,
    ), Count('pk')),
)

comment:11 by Simon Charette, 6 years ago

Another API alternative could be allow queries to be passed directly to aggregation functions

# Implicit Count(*)
Performance.objects.annotate(
    reserved_seats=Count(SeatReservation.objects.filter(
        performance=OuterRef('pk'),
        status__in=TAKEN_TYPES,
    ))
)

# Explicit Count('pk')
Performance.objects.annotate(
    reserved_seats=Count(SeatReservation.objects.filter(
        performance=OuterRef('pk'),
        status__in=TAKEN_TYPES,
    ).values('pk'))
)

# Sum('seat__amount')
Performance.objects.annotate(
    reserved_seats_total_amount=Sum(SeatReservation.objects.filter(
        performance=OuterRef('pk'),
        status__in=TAKEN_TYPES,
    ).values('seat__amount')
)

Or to allow passing a subquery kwarg to aggregation expressions. It can either be True or a query

# subquery=True
Performance.objects.annotate(
    reserved_seats=Count(
        'seat_reservations',
        filter=Q(status__in=TAKEN_TYPES),
        subquery=True,
    ),
)

# explicit subquery
Performance.objects.annotate(
    reserved_seats=Count(
        'seat_reservations',
        subquery=SeatReservation.objects.filter(
            status__in=TAKEN_TYPES,
        ),
    ),
)

# Sum('seat_reservations__seat__amount')
Performance.objects.annotate(
    reserved_seats_total_amount=Sum(
        'seat_reservations__seat__amount'',
        subquery=True,
    ),
)

I think I like thesubquery kwarg alternative better. It blends naturally with the newly added filter kwarg and result in less verbose expressions as the subquery shouldn't have to be provided most of the time.

comment:12 by Simon Charette, 6 years ago

I looks like someone went ahead and implemented the AggregateSubquery API.

I'd like to throw another one into the mix which I believe would be more suitable than the AggregateSubquery and subquery kwarg idea; an as_subquery() method that returns a Subquery instance. It would offer a nicer interface through composition and allow encapsulation of most of the subquery creation logic instead of doing it at Aggregate expression resolving and compiling time. The as_* pattern is also coherent with the Queryset.as_manager interface so it shouldn't be completely alien to most of our users.

comment:13 by B Martsberger, 5 years ago

Owner: changed from nobody to B Martsberger
Status: newassigned

comment:14 by Mariusz Felisiak, 5 years ago

Has patch: set

comment:15 by Evgeny Arshinov, 5 years ago

Cc: Evgeny Arshinov added

comment:16 by Mariusz Felisiak, 5 years ago

Patch needs improvement: set

Marking as "needs improvement" due to Simon's comments.

comment:17 by Matt Hegarty, 4 years ago

I wanted to comment in case it helps anyone else.

My use case was to use Django's subquery API to retrieve a count of "active" days. That is, the number of days on which events were recorded.

The (postgres) SQL for this query would be:

SELECT "player_player"."id",
       (SELECT count(1) from (select date_trunc('day', evt.timestamp) "day"
                 FROM "player_event" evt
                 WHERE (evt."player_id" = "player_player"."id")
                 GROUP BY evt."player_id", "day") a) AS "day_count"

FROM "player_player"

I had to solve this using a subclass of Subquery:

class EventDayCountSubquery(Subquery):
    template = "(SELECT COUNT(1) FROM (%(subquery)s) _count)"
    output_field = IntegerField()

I could then use this as a regular annotation:

event_day_count = (
    Event.objects.filter(player=OuterRef("pk"))
        .annotate(day=TruncDay("timestamp")).values("day")
        .annotate(count=Count("pk"))
)

Player.objects.annotate(
    event_day_count_90_day=EventDayCountSubquery(event_day_count)
)

in reply to:  17 comment:18 by B Martsberger, 4 years ago

Matt, I'm not sure your use case will be improved by this ticket due to the need to group by a computed field. But it also looks like it's not necessary to subclass Subquery with a new template.

event_day_subquery = Subquery(
    Event.objects.annotate(day=TruncDay('timestamp'))
         .values('project_id', 'day')
         .annotate(cnt=Count(Value(1)))
         .values('cnt')
         .filter(project_id=OuterRef('id'))

Player.objects.annotate(day_count=event_day_count).values('id', 'day_count')

Produces SQL like

SELECT "player_player"."id",
    (SELECT COUNT(1) AS "cnt"
     FROM "player_event" U0
     WHERE U0."player_id" = "player_player"."id"
     GROUP BY U0."player_id",
              DATE_TRUNC('day', U0."timestamp" )) AS "day_count"
FROM "player_player"

Replying to Matt Hegarty:

comment:19 by Simon Charette, 2 years ago

#10621 was a 13 years old duplicate.

comment:20 by şuayip üzülmez, 12 months ago

Cc: şuayip üzülmez added

comment:21 by Ionel Cristian Mărieș, 11 months ago

Cc: Ionel Cristian Mărieș added
Note: See TracTickets for help on using tickets.
Back to Top