Opened 9 years ago
Last modified 2 years ago
#25789 new Cleanup/optimization
Inefficient Queries Generated due to not using WHERE EXISTS
Reported by: | Alex Rothberg | Owned by: | nobody |
---|---|---|---|
Component: | Database layer (models, ORM) | Version: | 1.8 |
Severity: | Normal | Keywords: | QuerySet.extra |
Cc: | Anssi Kääriäinen, Marc Tamlyn, James Howe, David Wobrock, John Speno | Triage Stage: | Accepted |
Has patch: | no | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
Reposting question from SO with some more details.
I believe that the Django ORM is generating seriously inefficient SQL due to it not using WHERE EXISTS
but instead using a DISTINCT
with a LEFT JOIN
. by comparison, SQLAlchemy will use WHERE EXISTS
.
I have two models, Exam
and Series
. Series
objects have a foreign key to an Exam
object. Both of the models contain a field description_user
. I am trying to search for all Exam
s that have a search term in description_user
or have a child Series
with that term in its description_user
. I want to do this for a number of search terms (requiring all of them). I also want to de-duplicate the results (ie not get the same Exam multiple times).
This is roughly what the filter looks like:
a = (Q(**{'series__description_user__icontains': 'bar'}) | Q(**{'description_user__icontains': 'bar'})) b = (Q(**{'series__description_user__icontains': 'foo'}) | Q(**{'description_user__icontains': 'foo'})) c = (Q(**{'series__description_user__icontains': 'baz'}) | Q(**{'description_user__icontains': 'baz'})) Exam.objects.filter(a & b & c).distinct()
with corresponding SQL:
SELECT DISTINCT "exam_storage_exam"."id", "exam_storage_exam"."description_user" FROM "exam_storage_exam" LEFT OUTER JOIN "exam_storage_series" ON ( "exam_storage_exam"."id" = "exam_storage_series"."exam_id" AND ( "exam_storage_series"."removed" IS NULL) ) WHERE ( "exam_storage_exam"."removed" IS NULL AND ( "exam_storage_series"."description_user" LIKE %s ESCAPE \'\\\' OR "exam_storage_exam"."description_user" LIKE %s ESCAPE \'\\\') AND ( "exam_storage_series"."description_user" LIKE %s ESCAPE \'\\\' OR "exam_storage_exam"."description_user" LIKE %s ESCAPE \'\\\') AND ( "exam_storage_series"."description_user" LIKE %s ESCAPE \'\\\' OR "exam_storage_exam"."description_user" LIKE %s ESCAPE \'\\\'))
The issue is that as the number of search terms grows, the size of the intermediate data set before the DISTINCT operation grows as well.
Ideally the SQL would look like:
SELECT * FROM exam WHERE (EXISTS (SELECT 1 FROM exam_storage_series WHERE exam.id = series.exam_id AND ( series.description_user LIKE '%foo%' )) or exam.description_user LIKE '%foo%') AND (EXISTS (SELECT 1 FROM exam_storage_series WHERE exam.id = series.exam_id AND ( series.description_user LIKE '%bar%' )) or exam.description_user LIKE '%bar%') AND (EXISTS (SELECT 1 FROM exam_storage_series WHERE exam.id = series.exam_id AND ( series.description_user LIKE '%baz%' )) or exam.description_user LIKE '%baz%')
Currently the performance of Django query is terrible. This style searching comes up for example in how DRF generates search queries.
Change History (13)
comment:1 by , 9 years ago
Cc: | added |
---|---|
Type: | Uncategorized → Cleanup/optimization |
comment:2 by , 9 years ago
Yes, this is possible, and something Django's ORM should do.
Unfortunately this is hard to implement correctly. The big problem is aggregation, when doing .filter().annotate(), where both operations target the same relation, the aggregation must use results of the join generated by the filter. But if the filter doesn't generate a join, then we have a problem.
It would be Djangoic if the exists query would be generated automatically. For the above mentioned reason this is hard. Maybe it would be easier if we had something like .filter(models.Exists(somerel__col=val))
, where the Exists()
class would inform Django that an exists subquery should be generated.
comment:3 by , 9 years ago
Triage Stage: | Unreviewed → Accepted |
---|
comment:4 by , 9 years ago
Cc: | added |
---|
comment:5 by , 9 years ago
Keywords: | QuerySet.extra added |
---|
comment:6 by , 9 years ago
It's not totally relevant (at least not yet), but I have a version of this that works with .annotate():
https://github.com/django/django/pull/6478
There's also a django-developers thread that hopefully may get some discussion.
comment:7 by , 6 years ago
Similarly, I have a use-case where this query structure has the only acceptable performance (in PostgreSQL):
... FROM model_1 WHERE NOT EXISTS ( SELECT TRUE FROM model_2 WHERE model_1.pkey = model_2.fkey AND (model_2.property IS NULL OR model_2.property >= 42) )
Currently using extra()
to do it.
comment:8 by , 6 years ago
Cc: | added |
---|
comment:9 by , 6 years ago
James, since this ticket was filled support for Exists
expression was added so you should be able to avoid using extra
queryset.annotate( foo_exists=Exists(subqueryset.filter(foo_id=OuterRef('pk')) ).filter( foo_exists=False, )
Once #25367 lands it'll be possible to pass ~Exists
directly to filter.
comment:10 by , 2 years ago
Another nice to have that could be added here would be allow the usage of transforms on related fields so we could have lookups of the form __exists
or __count
Author.objects.filter(books__exists=Q(title__icontains="ring")) Author.objects.filter(books__count__gt=4)
instead of
Author.objects.filter(Exist(Book.objects.filter(author=OuterRef("pk"), title__icontains="ring"))) Author.objects.filter( GreaterThan( Book.objects.filter( author_id=OutRef("pk"), ).values( "author_id" ).values(Count("id")), 4 ) )
I think it would go a long way in making it easier for users to work around this issue and #10060.
comment:11 by , 2 years ago
Cc: | added |
---|
comment:12 by , 2 years ago
FWIW I had a very limited try at this in this branch. Mostly some test and debugging shims but I figured I'd provide them if someone is interesting in looking this issue.
I'd add that this work has a bit of overlap with #28296 as it basically needs to turn a series of transforms and lookups against a related field into a subquery without causing JOINs on the outer query which the default behaviour of using the __
syntax.
comment:13 by , 2 years ago
Cc: | added |
---|
Anssi, is this feasible? The closest existing ticket I could find is #16603 but not sure if it should be considered a duplicate.