Opened 7 years ago

Last modified 17 months ago

#28333 assigned New feature

Filter and subquery for window expressions

Reported by: Mads Jensen Owned by: Simon Charette
Component: Database layer (models, ORM) Version: dev
Severity: Normal Keywords: window orm filter subquery GSoC
Cc: Alexandr Artemyev, Andy Terra, Étienne Beaulé, Michael Wheeler, şuayip üzülmez, John Speno, Alex Scott, Ad Timmering, Hannes Ljungberg, Dave Johansen, Simon Charette Triage Stage: Accepted
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description (last modified by Mads Jensen)

#26608 will introduce window function expressions, but will disallow filtering on the result of them, e.g.:

Window.objects.annotate(row=Window(expression=RowNumber())).filter(row__gt=1)

is not allowed. Instead, the window function expression should be wrapped in an inner query, and the filtering should be done in an outer query.

Change History (34)

comment:1 by Mads Jensen, 7 years ago

Description: modified (diff)
Triage Stage: UnreviewedAccepted

comment:2 by Bernd Wechner, 5 years ago

This is 2 years old with no action and I am very keen to see it implemented (need it rather badly).

It strikes me as an aside that a more general approach may kill more birds with one stone. I noticed the rather excellent ExpressionWrapper(), and it struck me that a QueryWrapper() would be a more general solution that covers this particular need and will cover others as well, known and unknown at present.

In short QueryWrapper would simply make an inner query of the QuerySet to date so that subsequent operations act upon it as if it were a table.

Last edited 5 years ago by Bernd Wechner (previous) (diff)

comment:3 by Alexandr Artemyev, 5 years ago

Cc: Alexandr Artemyev added

comment:4 by Andy Terra, 5 years ago

Cc: Andy Terra added

comment:5 by Étienne Beaulé, 4 years ago

Cc: Étienne Beaulé added

comment:6 by Mariusz Felisiak, 4 years ago

Keywords: GSoC added

comment:7 by Manav Agarwal, 4 years ago

Owner: changed from nobody to Manav Agarwal
Status: newassigned

comment:9 by Manav Agarwal, 4 years ago

I was doing some research on this issue and found a few solutions to the problem. (all these are vague ideas. Any suggestions/feedback would be appreciated to make the idea worth implementing)

  1. A separate QueryWrapper Class which will have syntax like this
    1. Window.objects.annotate(row=QueryWrapper(Window(expression=RowNumber())).filter(row__gt=1) OR
    2. QueryWrapper class internally implemented for window function to automatically generate SQL subquery for all window expressions.
  2. Use the subquery class internally to make it execute all window expression related queries as subqueries.
  3. Passing an alias to window expression and then in spite of generating half query with just over and order by clause we may generate a separate select statement when will further be used as a select statement for a separate table in the query.

I personally feel that implementing 1.a would be a good option but as I mentioned above this is just a vague idea and to implement it I need some guidance from someone who is more experienced.

Last edited 4 years ago by Manav Agarwal (previous) (diff)

comment:10 by Michael Wheeler, 4 years ago

Cc: Michael Wheeler added

comment:11 by Manav Agarwal, 4 years ago

Owner: Manav Agarwal removed

comment:12 by şuayip üzülmez, 4 years ago

Cc: şuayip üzülmez added

comment:13 by John Speno, 3 years ago

Cc: John Speno added

comment:14 by Alex Scott, 3 years ago

Cc: Alex Scott added

Is there a recommend workaround for not being able to filter on a window function result? Can you wrap it somehow manually?

comment:15 by Mariusz Felisiak, 3 years ago

Status: assignednew

comment:16 by Ad Timmering, 3 years ago

Cc: Ad Timmering added

comment:17 by rossm6, 3 years ago

I have this problem as well.

According to this article https://learnsql.com/blog/window-functions-not-allowed-in-where/ the solution is either to use a common table expression or a subquery which is the FROM clause in the sql query. Neither unfortunately is supported by django it seems. Although I did find this package for the first option - https://docs.djangoproject.com/en/3.2/ref/models/querysets/#extra.

Both of these should be options in the Django ORM right? Each would be a big win for the power of the ORM.

comment:18 by Hannes Ljungberg, 3 years ago

Cc: Hannes Ljungberg added

comment:19 by Dave Johansen, 3 years ago

Cc: Dave Johansen added

comment:20 by Safa Alfulaij, 3 years ago

What I'm doing currently is this hack:

Before:

queryset = MyModel.objects.annotate(row=Window(expression=RowNumber())).filter(row__gt=1)

After:

queryset = MyModel.objects.annotate(row=Window(expression=RowNumber()))
sql, params = queryset.query.sql_with_params()
queryset = queryset.raw(f"SELECT * FROM ({sql}) AS full WHERE row >= 1", params)

I don't see that it's bad to have this currently, with whatever limitations of raw documented in the Window filtering section, then add in more features if real use cases are provided.

comment:21 by Simon Charette, 2 years ago

#26780 which is about adding support for slices prefetching (think top-n results per category) to core would benefit from this feature being implemented at least partially.

The most difficult part of this issue is not the subquery pushdown itself (see #24462) but making sure that union filters of the form filter(Q(window__lookup=foo) | Q(aggregate__lookup=bar) | Q(field__lookup=baz)) are resulting in the proper usage of inner query WHERE and HAVING and outer query usage of WHERE (see the Where.split_having method for the current implementation).

If we were to start by focusing this ticket on the simple intersection use cases of the form filter(window__lookup=foo) (as reported here and required by #26780) I suspect we'd cover most of the use cases while deferring most of the complexity. If someone would like to give this a shot I'd start by doing the following:

  1. Make Window.filterable = True for now
  2. Adjust Where.split_having to properly deal with self.contains_over_clause by returning a triple of the form (where: Where, having: Where, window: Where) and error out when self.connector != AND and self.contains_over_clause. Possibly rename to split_having_window?
  3. Adjust SQLCompiler.pre_sql_setup to assign self.over_where and use it in SQLCompiler.as_sql to wrap the query in a subquery that SELECT * FROM ({subquery_sql}) subquery WHERE {over_where_sql}
  4. Add tests for new supported use cases and disallowed ones.
  5. Make Q.filterable return False when self.connector != AND and self.contains_over_clause but that will result in weird error messages of the form Q is disallowed in the filter clause. so maybe we'll want to deprecate Q.filterable in favour of a BaseExpression.check_filterable method instead that defaults to raise the current message and is overridden in Q to raise a proper message with regards to complex filters window functions.

Happy to review a PR that attempts the above or provide feedback here if that means this ticket is partially fixed and allows for #26780 to benefit from this work.

comment:22 by Simon Charette, 2 years ago

Cc: Simon Charette added

comment:23 by Simon Charette, 2 years ago

Had a first stab at the above and it seems to be working relatively well, not too intrusive of a change. I'll give a shot at implementing #26780 on to of it now to confirm it could work.

As a side note it seems that the Snowflake database has an SQL extension to filter against window functions, the QUALIFY clause.

comment:24 by Simon Charette, 2 years ago

Submit a PR that adds support for jointed predicates but still disallowed disjointed ones.

For example, given the following model and queryset

class Employee(models.Model):
    name = models.CharField(max_length=50)
    department = models.CharField(max_length=50)
    salary = models.IntegerField()

class PastEmployeeDepartment(models.Model):
    employee = models.ForeignKey(Employee, related_name="past_departments")
    department = models.CharField(max_length=50)

queryset = Employee.objects.annotate(
    dept_max_salary=Window(Max(), partition_by="department"),
    dept_salary_rank=Window(Rank(), partition_by="department", order_by="-salary"),
    past_depths_cnt=Count("past_departments"),
)

All of the following is supported

# window predicate will be pushed to outer query
queryset.filter(dept_max_salary__gte=F("salary")) 
SELECT * FROM (...) "quantify" WHERE dept_max_salary >= "quantify"."salary"

# department predicate will be applied in inner query
queryset.filter(department="IT", dept_max_salary__gte=F("salary"))
SELECT * FROM (... WHERE "department" = 'IT') "quantify" WHERE dept_max_salary >= "quantify"."salary"

# aggregate predicate will be applied in the inner query
queryset.filter(past_depths_cnt__gte=1, dept_max_salary__gte=F("salary"))
SELECT * FROM (... HAVING COUNT("pastemployeedepartment"."id" >= 1) "quantify" WHERE dept_max_salary >= "quantify"."salary"

Some form of disjointed predicates against window functions (using OR) are also supported as long as they are only against window functions

# Disjointed predicates only about window functions is supported
queryset.filter(Q(dept_max_salary__gte=F("salary")) | Q(dept_salary_rank__lte=2))
SELECT * FROM (...) "quantify" WHERE "dept_max_salary" >= "quantify"."salary" OR "dept_salary_rank" <= 2

And limits are only applied on the outer query, once all window function filters are applied.

The following is not supported

  1. Disjointed filters mixing predicates against window functions and aggregates and/or column references as it's really hard to emulate without getting in multiple level of subquery pushdown particularly if aggregation is involved.
  2. Filtering against columns masked by the usage of values, values_list, or alias. This one could be to solved by adding another layer of subquery pushdown that avoids applying the mask in the subquery but does so in an outermost query over the one used for window filtering.
  3. Passing window functions instances directly to filter and exclude instead of referencing annotated window functions.

Feedback about the proposed supported feature set and implementation is very welcome.

Version 0, edited 2 years ago by Simon Charette (next)

comment:25 by Simon Charette, 2 years ago

Has patch: set

comment:26 by Mariusz Felisiak, 2 years ago

Needs tests: set
Owner: set to Simon Charette
Patch needs improvement: set
Status: newassigned

comment:27 by Simon Charette, 2 years ago

The latest version of the patch now supports filtering against annotations masked by the usage of values and friends.

queryset.filter(dept_max_salary__gte=2000).values("id") now results in

SELECT "col1" FROM (
    SELECT * FROM (
        SELECT "id" AS "col1", MAX OVER (...) AS "depth_max_salary" FROM ...
    ) "qualify" WHERE "dept_max_salary" >= 2000
) "qualify_mask"

comment:28 by Mariusz Felisiak <felisiak.mariusz@…>, 2 years ago

In 35911078:

Replaced Expression.replace_references() with .replace_expressions().

The latter allows for more generic use cases beyond the currently
limited ones constraints validation has.

Refs #28333, #30581.

comment:29 by Mariusz Felisiak <felisiak.mariusz@…>, 2 years ago

In 8c3046da:

Refs #28333 -- Moved SQLCompiler's forced column aliasing logic to get_select().

This extends query composability possibilities when dealing with
subqueries which is necessary to implement window function filtering.

comment:30 by Mariusz Felisiak <felisiak.mariusz@…>, 2 years ago

In f387d024:

Refs #28333 -- Added partial support for filtering against window functions.

Adds support for joint predicates against window annotations through
subquery wrapping while maintaining errors for disjointed filter
attempts.

The "qualify" wording was used to refer to predicates against window
annotations as it's the name of a specialized Snowflake extension to
SQL that is to window functions what HAVING is to aggregates.

While not complete the implementation should cover most of the common
use cases for filtering against window functions without requiring
the complex subquery pushdown and predicate re-aliasing machinery to
deal with disjointed predicates against columns, aggregates, and window
functions.

A complete disjointed filtering implementation should likely be
deferred until proper QUALIFY support lands or the ORM gains a proper
subquery pushdown interface.

comment:31 by Mariusz Felisiak, 2 years ago

Has patch: unset
Needs tests: unset
Patch needs improvement: unset

comment:32 by GitHub <noreply@…>, 2 years ago

In f210de76:

Refs #28333 -- Fixed NonQueryWindowTests.test_invalid_filter() on databases that don't support window expressions.

comment:33 by GitHub <noreply@…>, 2 years ago

In 3ba7f2e:

Refs #28333 -- Explicitly ordered outer qualify query on window filtering.

While most backends will propagate derived table ordering as long as
the outer query doesn't perform additional processing the SQL specs
doesn't explicitly state the ordering must be maintained.

comment:34 by Mariusz Felisiak, 21 months ago

Referencing outer window expressions in subqueries should also be supported, see #34368.

comment:35 by master, 17 months ago

My code was working till 3.2, but with https://code.djangoproject.com/changeset/8c3046daade8d9b019928f96e53629b03060fe73, it doesn't anymore.

Here is a simplified (no filter(), less annotate(), fake Value(), ...) demonstrator:

>>> from postman.models import Message
>>> qs1=Message.objects.values_list('id').order_by()
>>> print(qs1.query) # correct
SELECT "postman_message"."id" FROM "postman_message"
>>> qs2=Message.objects.values('thread').annotate(id=Value(2, IntegerField())).values_list('id').order_by()
>>> print(qs2.query) # correct
SELECT 2 AS "id" FROM "postman_message"
>>> print(qs1.union(qs2).query) # will cause my problem
SELECT "postman_message"."id" AS "col1" FROM "postman_message" UNION SELECT 2 AS "id" FROM "postman_message"

The noticeable point is the introduction of the alias AS "col1", not compatible with the id in the second part.

In the full code, the union is injected in another query, of the form SELECT ... FROM ... INNER JOIN (the union) PM ON (... = PM.id)
So it leads to the error: django.db.utils.OperationalError: no such column: PM.id

In db/models/sql/compiler.py, get_combinator_sql(), the call is imposed as: as_sql(with_col_aliases=True)

I don't know how to solve this problem.
Any advice?

Note: See TracTickets for help on using tickets.
Back to Top