Opened 10 years ago
Last modified 3 years ago
#25789 new Cleanup/optimization
Inefficient Queries Generated due to not using WHERE EXISTS
| Reported by: | Alex Rothberg | Owned by: | nobody |
|---|---|---|---|
| Component: | Database layer (models, ORM) | Version: | 1.8 |
| Severity: | Normal | Keywords: | QuerySet.extra |
| Cc: | Anssi Kääriäinen, Marc Tamlyn, James Howe, David Wobrock, John Speno | Triage Stage: | Accepted |
| Has patch: | no | Needs documentation: | no |
| Needs tests: | no | Patch needs improvement: | no |
| Easy pickings: | no | UI/UX: | no |
Description
Reposting question from SO with some more details.
I believe that the Django ORM is generating seriously inefficient SQL due to it not using WHERE EXISTS but instead using a DISTINCT with a LEFT JOIN. by comparison, SQLAlchemy will use WHERE EXISTS.
I have two models, Exam and Series. Series objects have a foreign key to an Exam object. Both of the models contain a field description_user. I am trying to search for all Exams that have a search term in description_user or have a child Series with that term in its description_user. I want to do this for a number of search terms (requiring all of them). I also want to de-duplicate the results (ie not get the same Exam multiple times).
This is roughly what the filter looks like:
a = (Q(**{'series__description_user__icontains': 'bar'}) | Q(**{'description_user__icontains': 'bar'}))
b = (Q(**{'series__description_user__icontains': 'foo'}) | Q(**{'description_user__icontains': 'foo'}))
c = (Q(**{'series__description_user__icontains': 'baz'}) | Q(**{'description_user__icontains': 'baz'}))
Exam.objects.filter(a & b & c).distinct()
with corresponding SQL:
SELECT DISTINCT
"exam_storage_exam"."id",
"exam_storage_exam"."description_user"
FROM "exam_storage_exam"
LEFT OUTER JOIN "exam_storage_series"
ON (
"exam_storage_exam"."id" = "exam_storage_series"."exam_id"
AND (
"exam_storage_series"."removed" IS NULL) )
WHERE (
"exam_storage_exam"."removed" IS NULL
AND (
"exam_storage_series"."description_user" LIKE %s ESCAPE \'\\\'
OR "exam_storage_exam"."description_user" LIKE %s ESCAPE \'\\\')
AND (
"exam_storage_series"."description_user" LIKE %s ESCAPE \'\\\'
OR "exam_storage_exam"."description_user" LIKE %s ESCAPE \'\\\')
AND (
"exam_storage_series"."description_user" LIKE %s ESCAPE \'\\\'
OR "exam_storage_exam"."description_user" LIKE %s ESCAPE \'\\\'))
The issue is that as the number of search terms grows, the size of the intermediate data set before the DISTINCT operation grows as well.
Ideally the SQL would look like:
SELECT *
FROM exam
WHERE (EXISTS (SELECT 1
FROM exam_storage_series
WHERE exam.id = series.exam_id AND (
series.description_user LIKE '%foo%'
)) or exam.description_user LIKE '%foo%') AND
(EXISTS (SELECT 1
FROM exam_storage_series
WHERE exam.id = series.exam_id AND (
series.description_user LIKE '%bar%'
)) or exam.description_user LIKE '%bar%') AND
(EXISTS (SELECT 1
FROM exam_storage_series
WHERE exam.id = series.exam_id AND (
series.description_user LIKE '%baz%'
)) or exam.description_user LIKE '%baz%')
Currently the performance of Django query is terrible. This style searching comes up for example in how DRF generates search queries.
Change History (13)
comment:1 by , 10 years ago
| Cc: | added |
|---|---|
| Type: | Uncategorized → Cleanup/optimization |
comment:2 by , 10 years ago
Yes, this is possible, and something Django's ORM should do.
Unfortunately this is hard to implement correctly. The big problem is aggregation, when doing .filter().annotate(), where both operations target the same relation, the aggregation must use results of the join generated by the filter. But if the filter doesn't generate a join, then we have a problem.
It would be Djangoic if the exists query would be generated automatically. For the above mentioned reason this is hard. Maybe it would be easier if we had something like .filter(models.Exists(somerel__col=val)), where the Exists() class would inform Django that an exists subquery should be generated.
comment:3 by , 10 years ago
| Triage Stage: | Unreviewed → Accepted |
|---|
comment:4 by , 10 years ago
| Cc: | added |
|---|
comment:5 by , 10 years ago
| Keywords: | QuerySet.extra added |
|---|
comment:6 by , 10 years ago
It's not totally relevant (at least not yet), but I have a version of this that works with .annotate():
https://github.com/django/django/pull/6478
There's also a django-developers thread that hopefully may get some discussion.
comment:7 by , 7 years ago
Similarly, I have a use-case where this query structure has the only acceptable performance (in PostgreSQL):
... FROM model_1 WHERE NOT EXISTS ( SELECT TRUE FROM model_2 WHERE model_1.pkey = model_2.fkey AND (model_2.property IS NULL OR model_2.property >= 42) )
Currently using extra() to do it.
comment:8 by , 7 years ago
| Cc: | added |
|---|
comment:9 by , 7 years ago
James, since this ticket was filled support for Exists expression was added so you should be able to avoid using extra
queryset.annotate( foo_exists=Exists(subqueryset.filter(foo_id=OuterRef('pk')) ).filter( foo_exists=False, )
Once #25367 lands it'll be possible to pass ~Exists directly to filter.
comment:10 by , 3 years ago
Another nice to have that could be added here would be allow the usage of transforms on related fields so we could have lookups of the form __exists or __count
Author.objects.filter(books__exists=Q(title__icontains="ring")) Author.objects.filter(books__count__gt=4)
instead of
Author.objects.filter(Exist(Book.objects.filter(author=OuterRef("pk"), title__icontains="ring"))) Author.objects.filter( GreaterThan( Book.objects.filter( author_id=OutRef("pk"), ).values( "author_id" ).values(Count("id")), 4 ) )
I think it would go a long way in making it easier for users to work around this issue and #10060.
comment:11 by , 3 years ago
| Cc: | added |
|---|
comment:12 by , 3 years ago
FWIW I had a very limited try at this in this branch. Mostly some test and debugging shims but I figured I'd provide them if someone is interesting in looking this issue.
I'd add that this work has a bit of overlap with #28296 as it basically needs to turn a series of transforms and lookups against a related field into a subquery without causing JOINs on the outer query which the default behaviour of using the __ syntax.
comment:13 by , 3 years ago
| Cc: | added |
|---|
Anssi, is this feasible? The closest existing ticket I could find is #16603 but not sure if it should be considered a duplicate.