Opened 4 years ago

Closed 4 years ago

Last modified 4 years ago

#32717 closed Bug (fixed)

Incorrect SQL generation filtering OR-combined queries

Reported by: Shaheed Haque Owned by: Simon Charette
Component: Database layer (models, ORM) Version: 3.2
Severity: Release blocker Keywords:
Cc: Simon Charette Triage Stage: Ready for checkin
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

I'm running the just-released Django 3.2.1 and am seeing what I think is incorrect SQL generation involving this model (cut down for brevity):

from django.db import models as db_models

class Buss(db_models.Model):
    MAX_LENGTH = 25
    CHOICES = [('Universal', 'Universal'), ('GB', 'GB'), ('US', 'US'), ('Company', 'Company')]

    jurisdiction = db_models.CharField(max_length=MAX_LENGTH, choices=CHOICES)
    name = db_models.CharField(max_length=MAX_LENGTH)

    class Meta:
        unique_together = [('jurisdiction', 'name')]

I have a function which returns a queryset by combining 3 sets of busses using the "|" OR operator:

from paiyroll.models import Buss

def jurisdiction_qs(for_jurisdiction):
    # Get busses identified by "jurisdiction_for", and add other busses from 'Universal' and 'Company' where they don't clash.
    qs = Buss.objects.filter(jurisdiction=for_jurisdiction)
    if for_jurisdiction != 'Universal':
        qs = qs | Buss.objects.filter(jurisdiction='Universal'). \
            exclude(name__in=qs.values_list('name', flat=True))
    if for_jurisdiction != 'Company':
        qs = qs | Buss.objects.filter(jurisdiction='Company'). \
            exclude(name__in=qs.values_list('name', flat=True))
    return qs

In use, the function seems to work as expected:

In [7]: Buss.objects.filter(jurisdiction='GB').count()
Out[7]: 8

In [11]: Buss.objects.filter(jurisdiction__in=['GB','Universal','Company']).count()
Out[11]: 37

In [12]: jurisdiction_qs('GB').count()
Out[12]: 34

However, if the OR'd queryset is further filtered, the results are unpredictable. For example, this works:

In [13]: jurisdiction_qs('GB').filter(jurisdiction='US').count()
Out[13]: 0

but this - where the filter is by the original "GB" - returns 34 instead of 8:

In [14]: jurisdiction_qs('GB').filter(jurisdiction='GB').count()
Out[14]: 34

I can see that the SQL from the function looks OK:

str(jurisdiction_qs('GB').query)

SELECT "paiyroll_buss"."id", "paiyroll_buss"."jurisdiction", "paiyroll_buss"."name", "paiyroll_buss"."description" FROM "paiyroll_buss" WHERE (
    "paiyroll_buss"."jurisdiction" = GB OR 
    ("paiyroll_buss"."jurisdiction" = Universal AND NOT 
        ("paiyroll_buss"."name" IN (SELECT U0."name" FROM "paiyroll_buss" U0 WHERE U0."jurisdiction" = GB))
    ) OR 
    ("paiyroll_buss"."jurisdiction" = Company AND NOT 
        ("paiyroll_buss"."name" IN (SELECT V0."name" FROM "paiyroll_buss" V0 WHERE (V0."jurisdiction" = GB OR (V0."jurisdiction" = Universal AND NOT 
            (V0."name" IN (SELECT U0."name" FROM "paiyroll_buss" U0 WHERE U0."jurisdiction" = GB))
        ))))
    )
)

In the working case, the above SQL is changed to end as follows:

str(jurisdiction_qs('GB').filter(jurisdiction='US').query)
SELECT ...WHERE (... AND "paiyroll_buss"."jurisdiction" = US)

but in the broken case, the original SQL is returned!

str(jurisdiction_qs('GB').filter(jurisdiction='GB').query)

SELECT "paiyroll_buss"."id", "paiyroll_buss"."jurisdiction", "paiyroll_buss"."name", "paiyroll_buss"."description" FROM "paiyroll_buss" WHERE ("paiyroll_buss"."jurisdiction" = GB OR ("paiyroll_buss"."jurisdiction" = Universal AND NOT ("paiyroll_buss"."name" IN (SELECT U0."name" FROM "paiyroll_buss" U0 WHERE U0."jurisdiction" = GB))) OR ("paiyroll_buss"."jurisdiction" = Company AND NOT ("paiyroll_buss"."name" IN (SELECT V0."name" FROM "paiyroll_buss" V0 WHERE (V0."jurisdiction" = GB OR (V0."jurisdiction" = Universal AND NOT (V0."name" IN (SELECT U0."name" FROM "paiyroll_buss" U0 WHERE U0."jurisdiction" = GB))))))))

AFAIK, it is legal to add a .filter() to this kind of query, so I think this is a bug. On the mailing list (https://groups.google.com/g/django-users/c/iR6ArOi9OlY/m/bk0JDF_nDwAJ), there was a suggestion that using Q() might have helped but I could not see how to use Q() with "exclude".

Change History (12)

comment:1 by Shaheed Haque, 4 years ago

Type: UncategorizedBug

comment:2 by Mariusz Felisiak, 4 years ago

Cc: Simon Charette added
Severity: NormalRelease blocker
Triage Stage: UnreviewedAccepted

Thanks for the report.

Regression in bbf141bcdc31f1324048af9233583a523ac54c94.
Reproduced at a0a5e0f4c83acdfc6eab69754e245354689c7185.

comment:3 by Simon Charette, 4 years ago

Owner: changed from nobody to Simon Charette
Status: newassigned

comment:4 by Simon Charette, 4 years ago

I started looking into this one and it seems the issue is two fold.

One the ''optimization'' in Node.add strikes again this time for Lookup equality as it was introduced in bbf141bcdc31f1324048af9233583a523ac54c94. As we already know this code is poorly tested and this issue just uncovered that it's actually flawed as it should also make sure that connectors are matching.

What's happening here is that the query has a WHERE tree of the form (jurisdiction = 'US' OR ...) and it wrongfully consider that an addition of a AND jurisdiction = 'US' is noop because it doesn't compare the connector AND != OR. The following patch should address that.

  • django/utils/tree.py

    diff --git a/django/utils/tree.py b/django/utils/tree.py
    index 302cd37d5f..f8d3e3586a 100644
    a b def add(self, data, conn_type, squash=True):  
    9090        If `squash` is False the data is prepared and added as a child to
    9191        this tree without further logic.
    9292        """
    93         if data in self.children:
     93        if (self.connector == conn_type or len(self) == 1) and data in self.children:
    9494            return data
    9595        if not squash:
    9696            self.children.append(data)

The reason why this is not an issue for most cases though is that the only way to craft a query with a root Where node with an OR connector is through a | combination of query like described here. That explains why queries of the form filter(Q(foo="bar") | Q(baz="foo")).filter(foo="bar") are not affected by this Node.add bug and resulting in a large number of failures in the test suite.

I personally think that we should prevent Query.combine from ever generating queries with root Where(connector='OR') as I wouldn't be surprised if it caused other subtle issues down the line given the poor coverage we have for queryset combinations. If that makes sense to you I'll prepare a patch with two commits, one for the Node.add adjustments and a following one with the Query.combine proposed adjustments meant to be backported.

I also think that we should just nuke the if data in self.children entirely in the main branch as I've expressed in #32632 as it's wasteful 99% of the time and I'm willing to shepherd an optimization ticket if that's deemed appropriate.

Version 0, edited 4 years ago by Simon Charette (next)

comment:5 by Shaheed Haque, 4 years ago

Hi,

I'm not qualified to speak to most of this, but I'm wanted to check if "we should prevent Query.combine from ever generating queries with root Where(connector='OR')" means that the original OR'd query I wrote would not be supported?

If so, then how else could one write the query? (As I said, I don't see how Q() can be used with .exclude()...but I might have missed something).

comment:6 by Simon Charette, 4 years ago

I'm not qualified to speak to most of this, but I'm wanted to check if "we should prevent Query.combine from ever generating queries with root Where(connector='OR')" means that the original OR'd query I wrote would not be supported?

You queries would still be supported, it's more of how internal data structures are used under the hood. Basically doing a <QuerySet> | <QuerySet> is the only circumstances in the ORM where the returned QuerySet.query, which is an instance of sql.Query, will have a .where property, which is an instance of sql.where.Where, with a .connector == 'OR' instead of AND. What I'm proposing we do here is that instead of having query.where.connector be of the (connector='OR', nodes=[jurisdiction = 'GB', ...]) form we ensure that it's always nested in a top-level AND instead so (connector='AND', nodes=[(connector='OR', nodes=[jurisdiction = 'GB', ...])]).

Not totally necessary as the above patch that adjusts the optimization should address the issue, it'd be great if you could confirm it does, but it does make the behaviour of Query.combine more in line with the heavily tested paths the ORM usually takes.

comment:7 by Simon Charette, 4 years ago

To answer your initial question though your query would likely be more easily expressed using Q

def jurisdiction_qs(for_jurisdiction):
    filter_ = Q(jurisdiction=for_jurisdiction)
    if for_jurisdiction != 'Universal':
        filter_ |= Q(jurisdiction='Universal') & ~Q(
            name__in=Buss.objects.filter(filter_).values_list('name', flat=True)
        )
    if for_jurisdiction != 'Company':
       filter_ |= Q(jurisdiction='Company') & ~Q(
            name__in=Buss.objects.filter(filter_).values_list('name', flat=True)
        )
    return Buss.objects.filter(filter_)

in reply to:  6 comment:8 by Shaheed Haque, 4 years ago

Replying to Simon Charette:

Not totally necessary as the above patch that adjusts the optimization should address the issue, it'd be great if you could confirm it does, but it does make the behaviour of Query.combine more in line with the heavily tested paths the ORM usually takes.

Yes, in testing here, I can confirm with the patch applied, that the broken case (post filter by 'GB') and the working case (post filter by 'US') now generate similar SQL and - the right query results.

And thank you for clarifying the use of "~Q()" for exclude...I'm not sure how I missed that.

comment:9 by Simon Charette, 4 years ago

Has patch: set

comment:10 by Mariusz Felisiak, 4 years ago

Triage Stage: AcceptedReady for checkin

comment:11 by Mariusz Felisiak <felisiak.mariusz@…>, 4 years ago

Resolution: fixed
Status: assignedclosed

In b81c7562:

Fixed #32717 -- Fixed filtering of querysets combined with the | operator.

Address a long standing bug in a Where.add optimization to discard
equal nodes that was surfaced by implementing equality for Lookup
instances in bbf141bcdc31f1324048af9233583a523ac54c94.

Thanks Shaheed Haque for the report.

comment:12 by Mariusz Felisiak <felisiak.mariusz@…>, 4 years ago

In 386caa5:

[3.2.x] Fixed #32717 -- Fixed filtering of querysets combined with the | operator.

Address a long standing bug in a Where.add optimization to discard
equal nodes that was surfaced by implementing equality for Lookup
instances in bbf141bcdc31f1324048af9233583a523ac54c94.

Thanks Shaheed Haque for the report.

Backport of b81c7562fc33f50166d5120138d6398dc42b13c3 from main

Note: See TracTickets for help on using tickets.
Back to Top