Opened 5 months ago

Closed 5 months ago

Last modified 5 months ago

#35396 closed Bug (wontfix)

QuerySet filters incorrectly pushed to the inner query when applied after a window function filter

Reported by: Gary Chen Owned by: nobody
Component: Database layer (models, ORM) Version: 4.2
Severity: Normal Keywords:
Cc: Simon Charette Triage Stage: Unreviewed
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

I may be doing something funky here with my window function, but I'm trying to get the first row in each partition of a window, then filter the results by another column. I expect the last filter to be applied to the outer query created by the window function filter, but it's "pushed" up to the inner query leading to incorrect results. Filtering on window functions was introduced in 4.2 and I wonder if this is a case that wasn't caught.

Example

A simple model:

class Player(Model):
    name = CharField()
    city = CharField()
    score = IntegerField()
    active = BooleanField()

Some data:

id name city score active
0CaryPhoenix17false
1JoePhoenix15true
2KatiePhoenix13true
3BobSpringfield12true
4AliceSpringfield10true

The queryset:

Player.objects.annotate(
    first=Window(
        expression=functions.FirstValue("id"),
        partition_by=[F("city")],
        order_by=("-score"),
    ),
).filter(id=F("first"), active=True)

The generated sql looks like this:

SELECT 
  * 
FROM 
  (
    SELECT 
      `myapp_player`.`id` AS `col1`, 
      `myapp_player`.`name` AS `col2`, 
      `myapp_player`.`city` AS `col3`, 
      `myapp_player`.`score` AS `col4`, 
      `myapp_player`.`active` AS `col5`, 
      FIRST_VALUE(`myapp_player`.`id`) OVER (
        PARTITION BY `myapp_player`.`city` 
        ORDER BY 
          `myapp_player`.`score` DESC
      ) AS `first` 
    FROM 
      `myapp_player` 
    WHERE 
      `myapp_player`.`active` = True
  ) `qualify` 
WHERE 
  `col1` = (`first`)

This would return this result:

id name city score active
1JoePhoenix15true
3BobSpringfield12true

Expected

I would expect the generated SQL from that queryset to look like this:

SELECT 
  * 
FROM 
  (
    SELECT 
      `myapp_player`.`id` AS `col1`, 
      `myapp_player`.`name` AS `col2`, 
      `myapp_player`.`city` AS `col3`, 
      `myapp_player`.`score` AS `col4`, 
      `myapp_player`.`active` AS `col5`, 
      FIRST_VALUE(`myapp_player`.`id`) OVER (
        PARTITION BY `myapp_player`.`city` 
        ORDER BY 
          `myapp_player`.`score` DESC
      ) AS `first` 
    FROM 
      `myapp_player` 
  ) `qualify` 
WHERE 
  `col1` = (`first`) AND `col5` = True

With a result of:

id name city score active
3BobSpringfield12true

Change History (2)

comment:1 by Simon Charette, 5 months ago

Cc: Simon Charette added
Resolution: wontfix
Status: newclosed

Changing the behavior would be trivial, adjust Where.split_having_qualify to always return everything in qualify_node when it's present and nothing in where_node or having_node but I don't think it's the right thing to do.

The real problem here is the ambiguity of what should be done when users filter against non-windowed and windowed expressions at the same time.

The current implementation defaults to respecting what fetching the result set without the qualify outer query emulation would do to make sure non-window referencing filters are applied against the set of rows windowered over.

In other words if you do

objects = list(Player.objects.annotate(
    first=Window(
        expression=functions.FirstValue("id"),
        partition_by=[F("city")],
        order_by=("-score"),
    ),
).filter(active=True))

And you iterate over objects you'll get two matches of id and first. The way it's implemented is coherent with how QUALIFY is implemented in backends that support it.

What I suspect you want here is more control over subquery wrapping instead (see #24462).

If you want to want to filter out highest score by city you should also window by active and filter against it

Player.objects.annotate(
    first=Window(
        expression=functions.FirstValue("id"),
        partition_by=[F("city")],
        order_by=("-score"),
    ),
    first_active=Window(
        expression=functions.FirstValue("active"),
        partition_by=[F("city")],
        order_by=("-score"),
    ),
).filter(id=F("first"), first_active=True)
SELECT 
  * 
FROM 
  (
    SELECT 
      `myapp_player`.`id` AS `col1`, 
      `myapp_player`.`name` AS `col2`, 
      `myapp_player`.`city` AS `col3`, 
      `myapp_player`.`score` AS `col4`, 
      `myapp_player`.`active` AS `col5`, 
      FIRST_VALUE(`myapp_player`.`id`) OVER (
        PARTITION BY `myapp_player`.`city` 
        ORDER BY 
          `myapp_player`.`score` DESC
      ) AS `first`,
      FIRST_VALUE(`myapp_player`.`active`) OVER (
        PARTITION BY `myapp_player`.`city` 
        ORDER BY 
          `myapp_player`.`score` DESC
      ) AS `first_active`  
    FROM 
      `myapp_player` 
  ) `qualify` 
WHERE 
  `col1` = (`first`) AND `first_active` = True

It's a common pattern when you're interested in multiple values over the same window and why there's even a way to define window aliases that can be references by multiple window expressions.

Last edited 5 months ago by Simon Charette (previous) (diff)

comment:2 by Gary Chen, 5 months ago

The real problem here is the ambiguity of what should be done when users filter against non-windowed and windowed expressions at the same time.

I would think the ordering of the filter statements would at least make this clear. The django docs call out that annotation/filter ordering matters, so I'd expect filters before the window annotation to apply to the inner query, and after to apply to the outer query.

Version 0, edited 5 months ago by Gary Chen (next)
Note: See TracTickets for help on using tickets.
Back to Top