Opened 3 years ago

Last modified 3 years ago

#33706 closed Cleanup/optimization

Implement annotations as (lateral) joins — at Initial Version

Reported by: Bálint Balina Owned by: nobody
Component: Database layer (models, ORM) Version: 4.0
Severity: Normal Keywords: annotate, performance, subquery, postgresql
Cc: Triage Stage: Unreviewed
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

The ORM annotations are a very good and flexible tool for dynamic data. However there is a huge performance loss because of the way they are implemented. See the following example:

Product.objects.annotate(
    on_stock=Subquery(
        Stock.objects.filter(
            product_id=OuterRef('id'),
        ).values('product_id').annotate(
            on_stock=Sum('quantity'),
        ).values('on_stock').order_by(),
        output_field=DecimalField(),
    ),
    outgoing=Subquery(
        OrderItem.objects.filter(
            product_id=OuterRef('id'),
        ).values('product_id').annotate(
            outgoing=Sum('quantity'),
        ).values('outgoing').order_by(),
        output_field=DecimalField(),
    ),
    available=F('on_stock') - F('outgoing'),
).filter(
    available__gt=0,
).values('id', 'on_stock', 'outgoing', 'available')

When aggregates are needed from multiple tables annotation can be used as subqueries. Annotations can also be filtered against and reused. However django implements this query with the following SQL:

SELECT "product"."id",
       (SELECT SUM(u0."quantity") AS "on_stock"
        FROM "stock" u0
        WHERE ( u0."product_id" = ("product"."id"))
        GROUP BY u0."product_id")                                AS "on_stock",
       (SELECT SUM(u0."quantity") AS "outgoing"
        FROM "order_item" u0
        WHERE u0."product_id" = ("product"."id")
        GROUP BY u0."product_id")                                AS "outgoing",
       ((SELECT SUM(u0."quantity") AS "on_stock"
         FROM "stock" u0
         WHERE (u0."product_id" = ("product"."id"))
         GROUP BY u0."product_id") - (SELECT SUM(u0."quantity") AS "outgoing"
                                      FROM "order_item" u0
                                      WHERE u0."product_id" = ("product"."id")
                                      GROUP BY u0."product_id")) AS "available"
FROM "product"
WHERE (((SELECT SUM(u0."quantity") AS "on_stock"
         FROM "stock" u0
         WHERE (u0."product_id" = ("product"."id"))
         GROUP BY u0."product_id") - (SELECT SUM(u0."quantity") AS "outgoing"
                                      FROM "order_item" u0
                                      WHERE u0."product_id" = ("product"."id")
                                      GROUP BY u0."product_id")) > 0)

Both the on_stock and outgoing annotations are inlined 3 times and calculated 3 times by the database, which is 6 subquery instead of 2.

See the following PostgreSQL query adjusted by hand:

SELECT "product"."id",
       "on_stock",
       "outgoing",
       "on_stock" - "outgoing" as available
FROM "product"
   , LATERAL (SELECT SUM(u0."quantity") AS "on_stock"
              FROM "stock" u0
              WHERE (u0."product_id" = ("product"."id"))
              GROUP BY u0."product_id") AS "on_stock"
   , LATERAL (SELECT SUM(u0."quantity") AS "outgoing"
              FROM "order_item" u0
              WHERE u0."product_id" = ("product"."id")
              GROUP BY u0."product_id") AS "outgoing"
WHERE (("on_stock" - "outgoing") > 0)

The exact same subquery can be joined with the lateral keyword, producing the exact same results, but reusing calculations. The performance benefit is huge, it can be hundreds of milliseconds even for this simple query, even more so the more complicated and reused these calculations are.

I like the way annotate works, there are many small pieces to put together to get a complex result. I'm planning to look into implementing this behaviour, but wanted to check first about your opinions:

  • do you think it is possible to patch the ORM to generate sql like above?
  • do you see any pitfalls or caveats to this method of performing the query?
  • if there are no pitfalls, or they can be resolved do you think that this could be the default way of annotating calculations in django? The examples are in postgresql, not sure about the other supported database engines compatibility.

Change History (0)

Note: See TracTickets for help on using tickets.
Back to Top