#32717 closed Bug (fixed)
Incorrect SQL generation filtering OR-combined queries
Reported by: | Shaheed Haque | Owned by: | Simon Charette |
---|---|---|---|
Component: | Database layer (models, ORM) | Version: | 3.2 |
Severity: | Release blocker | Keywords: | |
Cc: | Simon Charette | Triage Stage: | Ready for checkin |
Has patch: | yes | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
I'm running the just-released Django 3.2.1 and am seeing what I think is incorrect SQL generation involving this model (cut down for brevity):
from django.db import models as db_models class Buss(db_models.Model): MAX_LENGTH = 25 CHOICES = [('Universal', 'Universal'), ('GB', 'GB'), ('US', 'US'), ('Company', 'Company')] jurisdiction = db_models.CharField(max_length=MAX_LENGTH, choices=CHOICES) name = db_models.CharField(max_length=MAX_LENGTH) class Meta: unique_together = [('jurisdiction', 'name')]
I have a function which returns a queryset by combining 3 sets of busses using the "|" OR operator:
from paiyroll.models import Buss def jurisdiction_qs(for_jurisdiction): # Get busses identified by "jurisdiction_for", and add other busses from 'Universal' and 'Company' where they don't clash. qs = Buss.objects.filter(jurisdiction=for_jurisdiction) if for_jurisdiction != 'Universal': qs = qs | Buss.objects.filter(jurisdiction='Universal'). \ exclude(name__in=qs.values_list('name', flat=True)) if for_jurisdiction != 'Company': qs = qs | Buss.objects.filter(jurisdiction='Company'). \ exclude(name__in=qs.values_list('name', flat=True)) return qs
In use, the function seems to work as expected:
In [7]: Buss.objects.filter(jurisdiction='GB').count() Out[7]: 8 In [11]: Buss.objects.filter(jurisdiction__in=['GB','Universal','Company']).count() Out[11]: 37 In [12]: jurisdiction_qs('GB').count() Out[12]: 34
However, if the OR'd queryset is further filtered, the results are unpredictable. For example, this works:
In [13]: jurisdiction_qs('GB').filter(jurisdiction='US').count() Out[13]: 0
but this - where the filter is by the original "GB" - returns 34 instead of 8:
In [14]: jurisdiction_qs('GB').filter(jurisdiction='GB').count() Out[14]: 34
I can see that the SQL from the function looks OK:
str(jurisdiction_qs('GB').query) SELECT "paiyroll_buss"."id", "paiyroll_buss"."jurisdiction", "paiyroll_buss"."name", "paiyroll_buss"."description" FROM "paiyroll_buss" WHERE ( "paiyroll_buss"."jurisdiction" = GB OR ("paiyroll_buss"."jurisdiction" = Universal AND NOT ("paiyroll_buss"."name" IN (SELECT U0."name" FROM "paiyroll_buss" U0 WHERE U0."jurisdiction" = GB)) ) OR ("paiyroll_buss"."jurisdiction" = Company AND NOT ("paiyroll_buss"."name" IN (SELECT V0."name" FROM "paiyroll_buss" V0 WHERE (V0."jurisdiction" = GB OR (V0."jurisdiction" = Universal AND NOT (V0."name" IN (SELECT U0."name" FROM "paiyroll_buss" U0 WHERE U0."jurisdiction" = GB)) )))) ) )
In the working case, the above SQL is changed to end as follows:
str(jurisdiction_qs('GB').filter(jurisdiction='US').query) SELECT ...WHERE (... AND "paiyroll_buss"."jurisdiction" = US)
but in the broken case, the original SQL is returned!
str(jurisdiction_qs('GB').filter(jurisdiction='GB').query) SELECT "paiyroll_buss"."id", "paiyroll_buss"."jurisdiction", "paiyroll_buss"."name", "paiyroll_buss"."description" FROM "paiyroll_buss" WHERE ("paiyroll_buss"."jurisdiction" = GB OR ("paiyroll_buss"."jurisdiction" = Universal AND NOT ("paiyroll_buss"."name" IN (SELECT U0."name" FROM "paiyroll_buss" U0 WHERE U0."jurisdiction" = GB))) OR ("paiyroll_buss"."jurisdiction" = Company AND NOT ("paiyroll_buss"."name" IN (SELECT V0."name" FROM "paiyroll_buss" V0 WHERE (V0."jurisdiction" = GB OR (V0."jurisdiction" = Universal AND NOT (V0."name" IN (SELECT U0."name" FROM "paiyroll_buss" U0 WHERE U0."jurisdiction" = GB))))))))
AFAIK, it is legal to add a .filter() to this kind of query, so I think this is a bug. On the mailing list (https://groups.google.com/g/django-users/c/iR6ArOi9OlY/m/bk0JDF_nDwAJ), there was a suggestion that using Q() might have helped but I could not see how to use Q() with "exclude".
Change History (12)
comment:1 by , 3 years ago
Type: | Uncategorized → Bug |
---|
comment:2 by , 3 years ago
Cc: | added |
---|---|
Severity: | Normal → Release blocker |
Triage Stage: | Unreviewed → Accepted |
comment:3 by , 3 years ago
Owner: | changed from | to
---|---|
Status: | new → assigned |
comment:4 by , 3 years ago
I started looking into this one and it seems the issue is two fold.
One the ''optimization'' in Node.add
strikes again this time for Lookup
equality as it was introduced in bbf141bcdc31f1324048af9233583a523ac54c94. As we already know this code is poorly tested and this issue just uncovered that it's actually flawed as it should also make sure that connectors are matching.
What's happening here is that the query has a WHERE
tree of the form (jurisdiction = 'GB' OR ...)
and it wrongfully consider that an addition of a AND jurisdiction = 'GB'
is noop because it doesn't compare the connector AND != OR
. The following patch should address that.
-
django/utils/tree.py
diff --git a/django/utils/tree.py b/django/utils/tree.py index 302cd37d5f..f8d3e3586a 100644
a b def add(self, data, conn_type, squash=True): 90 90 If `squash` is False the data is prepared and added as a child to 91 91 this tree without further logic. 92 92 """ 93 if data in self.children:93 if (self.connector == conn_type or len(self) == 1) and data in self.children: 94 94 return data 95 95 if not squash: 96 96 self.children.append(data)
The reason why this is not an issue for most cases though is that the only way to craft a query with a root Where
node with an OR
connector is through a |
combination of query like described here. That explains why queries of the form filter(Q(foo="bar") | Q(baz="foo")).filter(foo="bar")
are not affected by this Node.add
bug and resulting in a large number of failures in the test suite.
I personally think that we should prevent Query.combine
from ever generating queries with root Where(connector='OR')
as I wouldn't be surprised if it caused other subtle issues down the line given the poor coverage we have for queryset combinations. If that makes sense to you I'll prepare a patch with two commits, one for the Node.add
adjustments and a following one with the Query.combine
proposed adjustments meant to be backported.
I also think that we should just nuke the if data in self.children
entirely in the main branch as I've expressed in #32632 as it's wasteful 99% of the time and I'm willing to shepherd an optimization ticket if that's deemed appropriate.
comment:5 by , 3 years ago
Hi,
I'm not qualified to speak to most of this, but I'm wanted to check if "we should prevent Query.combine from ever generating queries with root Where(connector='OR')" means that the original OR'd query I wrote would not be supported?
If so, then how else could one write the query? (As I said, I don't see how Q() can be used with .exclude()...but I might have missed something).
follow-up: 8 comment:6 by , 3 years ago
I'm not qualified to speak to most of this, but I'm wanted to check if "we should prevent Query.combine from ever generating queries with root Where(connector='OR')" means that the original OR'd query I wrote would not be supported?
You queries would still be supported, it's more of how internal data structures are used under the hood. Basically doing a <QuerySet> | <QuerySet>
is the only circumstances in the ORM where the returned QuerySet.query
, which is an instance of sql.Query
, will have a .where
property, which is an instance of sql.where.Where
, with a .connector == 'OR'
instead of AND
. What I'm proposing we do here is that instead of having query.where.connector
be of the (connector='OR', nodes=[jurisdiction = 'GB', ...])
form we ensure that it's always nested in a top-level AND
instead so (connector='AND', nodes=[(connector='OR', nodes=[jurisdiction = 'GB', ...])])
.
Not totally necessary as the above patch that adjusts the optimization should address the issue, it'd be great if you could confirm it does, but it does make the behaviour of Query.combine
more in line with the heavily tested paths the ORM usually takes.
comment:7 by , 3 years ago
To answer your initial question though your query would likely be more easily expressed using Q
def jurisdiction_qs(for_jurisdiction): filter_ = Q(jurisdiction=for_jurisdiction) if for_jurisdiction != 'Universal': filter_ |= Q(jurisdiction='Universal') & ~Q( name__in=Buss.objects.filter(filter_).values_list('name', flat=True) ) if for_jurisdiction != 'Company': filter_ |= Q(jurisdiction='Company') & ~Q( name__in=Buss.objects.filter(filter_).values_list('name', flat=True) ) return Buss.objects.filter(filter_)
comment:8 by , 3 years ago
Replying to Simon Charette:
Not totally necessary as the above patch that adjusts the optimization should address the issue, it'd be great if you could confirm it does, but it does make the behaviour of
Query.combine
more in line with the heavily tested paths the ORM usually takes.
Yes, in testing here, I can confirm with the patch applied, that the broken case (post filter by 'GB') and the working case (post filter by 'US') now generate similar SQL and - the right query results.
And thank you for clarifying the use of "~Q()" for exclude...I'm not sure how I missed that.
comment:9 by , 3 years ago
Has patch: | set |
---|
comment:10 by , 3 years ago
Triage Stage: | Accepted → Ready for checkin |
---|
Thanks for the report.
Regression in bbf141bcdc31f1324048af9233583a523ac54c94.
Reproduced at a0a5e0f4c83acdfc6eab69754e245354689c7185.