Opened 6 months ago

Last modified 6 months ago

#28296 new New feature

Add support for aggregation through subqueries

Reported by: László Károlyi Owned by: nobody
Component: Database layer (models, ORM) Version: master
Severity: Normal Keywords:
Cc: Triage Stage: Accepted
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

Hey guys,

after having spent a couple of hours figuring out why a subquery annotated with Count fails, I came to the conclusion that Count in a query causes GROUP BY clauses added to it, which in turn renders the SQL subquery useless.

Please investigate this example which I just added to Stack Overflow as a resolution, and see what I'm talking about:

https://stackoverflow.com/a/44474268/1067833

Change History (4)

comment:1 Changed 6 months ago by Simon Charette

Hello László,

It's not clear to me why you not expecting a GROUP BY statement to be present in the subquery when you are explicitly grouping by id by using values('id') before annotating. Please see the section about aggregation and the usage of values() to GROUP BY in the aggregation documentation.

Also, is there a reason you are not simply using conditional aggregation for this query instead?

Performance.objects.annotate(
    reserved_seats=Count(
        Case(When(seat_reservations__status__in=TAKEN_TYPES, then='seat_reservations')),
    ),
)

Note that I'm assuming SeatReservation.performance.related_name == 'seat_reservations' above.

comment:2 in reply to:  1 Changed 6 months ago by László Károlyi

Hello Simon, thanks for getting back to me.

Replying to Simon Charette:

It's not clear to me why you not expecting a GROUP BY statement to be present in the subquery when you are explicitly grouping by id by using values('id') before annotating. Please see the section about aggregation and the usage of values() to GROUP BY in the

aggregation documentation.

The reason for why I needed to get rid of the GROUP BY clause was because it broke the COUNT() on the SQL level. I recognized this after going down to the SQL level and fiddling around with native SQL queries to find out why I'm getting NULL values at the counts. So here's a simplified version how the clause with GROUP BY looks, and what results it brings, copying from the MariaDB command line:

MariaDB [ticketshop]> select ticketshop_event_performance.id, (select count(u0.`id`) as count from ticketshop_booking_seatreservation u0 where (ticketshop_event_performance.id=(u0.performance_id)) and status='xxx' group by u0.id) as count from ticketshop_event_performance limit 1;                                           +-----+-------+
| id  | count |
+-----+-------+
| 134 |  NULL |
+-----+-------+
1 row in set (0.00 sec)

As you can see, it returns NULL in the count column. And so I removed the GROUP BY to see if it works that way:

MariaDB [ticketshop]> select ticketshop_event_performance.id, (select count(u0.`id`) as count from ticketshop_booking_seatreservation u0 where (ticketshop_event_performance.id=(u0.performance_id)) and status='xxx') as count from ticketshop_event_performance limit 1;
+-----+-------+
| id  | count |
+-----+-------+
| 134 |     0 |
+-----+-------+
1 row in set (0.00 sec)

See, it's returning proper values now. So that is why I went back to the ORM level and played around with it until I managed to remove the GROUP BY clause from the subquery. Basically, empirical testing.

Also it's worth mentioning that when I managed to remove the GROUP BY on the ORM level, executing the query without specifying the output field resulted in an exception when evaluating the query the first time, but not the second time:

In [8]: from ticketshop.booking.choices import TAKEN_TYPES

In [9]: rq = SeatReservation.objects.filter(
   ...:             performance=OuterRef(name='pk'), status__in=TAKEN_TYPES
   ...: ).values('id').annotate(count=Count('id')).values('count')

In [10]: rq.query.group_by = []

In [11]: a = Performance.objects.annotate(count=Subquery(rq))

In [12]: a[0].count
### LOTS OF TRACEBACK CUT
# ~/Work/venv/lib/python3.6/site-packages/django/db/models/expressions.py in _resolve_output_field(self) line 280:
# FieldError: Expression contains mixed types. You must set output_field
In [13]: a[0].count
Out[13]: 0

Also, is there a reason you are not simply using conditional aggregation for this query instead?

Performance.objects.annotate(
    reserved_seats=Count(
        Case(When(seat_reservations__status__in=TAKEN_TYPES, then='seat_reservations')),
    ),
)

Note that I'm assuming SeatReservation.performance.related_name == 'seat_reservations' above.

Thanks for the hint, I wasn't aware of this method, I'll look into it later.

comment:3 Changed 6 months ago by Simon Charette

Summary: Subquery with annotation fails without tweakingAdd support for aggregation through subqueries
Triage Stage: UnreviewedAccepted
Type: BugNew feature
Version: 1.11master

Hello László, thanks for taking the time to provide all these details.

I think that what you'd like to do use is the aggregate method instead of the annotate one to avoid the group by. Something along the lines of

Performance.objects.annotate(
    reserved_seats=Subquery(
        SeatReservation.objects.filter(
            performance=OuterRef(name='pk'),
            status__in=TAKEN_TYPES,
        ).aggregate(Count('pk'))
    ),
)

Unfortunately the Subquery API doesn't allow that yet.

It would require to either make aggregate return a lazy object that Subquery can deal with (right now it returns a dict on call) which has the potential of breaking backward compatibility or introducing a new kind of expression to deal with this case (e.g. AggregateSubquery(query, Count('pk'))).

In an ideal world we'd be able to call count() directly and avoid the Subquery wrapper

Performance.objects.annotate(
    reserved_seats=SeatReservation.objects.filter(
        performance=OuterRef(name='pk'),
        status__in=TAKEN_TYPES,
    ).count()
)

I'm converting the ticket to a feature request for subquery aggregation. It has a bit of overlap with #10060 as it would add a way to work around the issue by explicitly using subqueries as opposed to having the ORM do automatic pushdown but I still consider this a feature on it's own.

comment:4 Changed 6 months ago by Matthew Schinckel

I attempted to comment in the SO post, but comments are hardly conducive to code blocks.

I've written several Subquery subclasses to do aggregation of the subquery: perhaps that will work here.

class Count(Subquery):
    template = "(SELECT COUNT(*) FROM (%(subquery)s) _count)"
    
    @property
    def output_field(self):
        return models.IntegerField()
Note: See TracTickets for help on using tickets.
Back to Top