Opened 3 years ago
Last modified 3 years ago
#33706 closed Cleanup/optimization
Implement annotations as (lateral) joins — at Version 1
Reported by: | Bálint Balina | Owned by: | nobody |
---|---|---|---|
Component: | Database layer (models, ORM) | Version: | 4.0 |
Severity: | Normal | Keywords: | annotate, performance, subquery, postgresql |
Cc: | Triage Stage: | Unreviewed | |
Has patch: | no | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description (last modified by )
The ORM annotations are a very good and flexible tool for dynamic data. However there is a huge performance loss because of the way they are implemented. See the following example:
Product.objects.annotate( on_stock=Subquery( Stock.objects.filter( product_id=OuterRef('id'), ).values('product_id').annotate( on_stock=Sum('quantity'), ).values('on_stock').order_by(), output_field=DecimalField(), ), outgoing=Subquery( OrderItem.objects.filter( product_id=OuterRef('id'), ).values('product_id').annotate( outgoing=Sum('quantity'), ).values('outgoing').order_by(), output_field=DecimalField(), ), available=F('on_stock') - F('outgoing'), ).filter( available__gt=0, ).values('id', 'on_stock', 'outgoing', 'available')
When aggregates are needed from multiple tables annotation can be used as subqueries. Annotations can also be filtered against and reused. However django implements this query with the following SQL:
SELECT "product"."id", (SELECT SUM(u0."quantity") AS "on_stock" FROM "stock" u0 WHERE ( u0."product_id" = ("product"."id")) GROUP BY u0."product_id") AS "on_stock", (SELECT SUM(u0."quantity") AS "outgoing" FROM "order_item" u0 WHERE u0."product_id" = ("product"."id") GROUP BY u0."product_id") AS "outgoing", ((SELECT SUM(u0."quantity") AS "on_stock" FROM "stock" u0 WHERE (u0."product_id" = ("product"."id")) GROUP BY u0."product_id") - (SELECT SUM(u0."quantity") AS "outgoing" FROM "order_item" u0 WHERE u0."product_id" = ("product"."id") GROUP BY u0."product_id")) AS "available" FROM "product" WHERE (((SELECT SUM(u0."quantity") AS "on_stock" FROM "stock" u0 WHERE (u0."product_id" = ("product"."id")) GROUP BY u0."product_id") - (SELECT SUM(u0."quantity") AS "outgoing" FROM "order_item" u0 WHERE u0."product_id" = ("product"."id") GROUP BY u0."product_id")) > 0)
Both the on_stock and outgoing annotations are inlined 3 times and calculated 3 times by the database, which is 6 subquery instead of 2.
See the following PostgreSQL query adjusted by hand:
SELECT "product"."id", "on_stock", "outgoing", "on_stock" - "outgoing" as available FROM "product" , LATERAL (SELECT SUM(u0."quantity") AS "on_stock" FROM "stock" u0 WHERE (u0."product_id" = ("product"."id")) GROUP BY u0."product_id") AS "on_stock" , LATERAL (SELECT SUM(u0."quantity") AS "outgoing" FROM "order_item" u0 WHERE u0."product_id" = ("product"."id") GROUP BY u0."product_id") AS "outgoing" WHERE (("on_stock" - "outgoing") > 0)
The exact same subquery can be joined with the lateral keyword, producing the exact same results, but reusing calculations. The performance benefit is huge, it can be hundreds of milliseconds even for this simple query, even more so the more complicated and reused these calculations are.
I like the way annotate works, there are many small pieces to put together to get a complex result. I'm planning to look into implementing this behaviour, but wanted to check first about your opinions:
- do you think it is possible to patch the ORM to generate sql like above?
- do you see any pitfalls or caveats to this method of performing the query?
- if there are no pitfalls, or they can be resolved do you think that this could be the default way of annotating calculations in django? The examples are in postgresql, not sure about the other supported database engines compatibility.
Some links: