Opened 7 years ago

Last modified 6 months ago

#25789 new Cleanup/optimization

Inefficient Queries Generated due to not using WHERE EXISTS

Reported by: Alex Rothberg Owned by: nobody
Component: Database layer (models, ORM) Version: 1.8
Severity: Normal Keywords: QuerySet.extra
Cc: Anssi Kääriäinen, Marc Tamlyn, James Howe, David Wobrock, John Speno Triage Stage: Accepted
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

Reposting question from SO with some more details.

I believe that the Django ORM is generating seriously inefficient SQL due to it not using WHERE EXISTS but instead using a DISTINCT with a LEFT JOIN. by comparison, SQLAlchemy will use WHERE EXISTS.

I have two models, Exam and Series. Series objects have a foreign key to an Exam object. Both of the models contain a field description_user. I am trying to search for all Exams that have a search term in description_user or have a child Series with that term in its description_user. I want to do this for a number of search terms (requiring all of them). I also want to de-duplicate the results (ie not get the same Exam multiple times).

This is roughly what the filter looks like:

a = (Q(**{'series__description_user__icontains': 'bar'}) | Q(**{'description_user__icontains': 'bar'})) 
b = (Q(**{'series__description_user__icontains': 'foo'}) | Q(**{'description_user__icontains': 'foo'}))
c = (Q(**{'series__description_user__icontains': 'baz'}) | Q(**{'description_user__icontains': 'baz'}))
Exam.objects.filter(a & b & c).distinct()

with corresponding SQL:

SELECT DISTINCT 
                "exam_storage_exam"."id", 
                "exam_storage_exam"."description_user"
FROM            "exam_storage_exam" 
LEFT OUTER JOIN "exam_storage_series" 
ON              ( 
                                "exam_storage_exam"."id" = "exam_storage_series"."exam_id" 
                AND             ( 
                                                "exam_storage_series"."removed" IS NULL) ) 
WHERE           ( 
                                "exam_storage_exam"."removed" IS NULL 
                AND             ( 
                                                "exam_storage_series"."description_user" LIKE %s ESCAPE \'\\\'
                                OR              "exam_storage_exam"."description_user" LIKE %s ESCAPE \'\\\')
                AND             ( 
                                                "exam_storage_series"."description_user" LIKE %s ESCAPE \'\\\'
                                OR              "exam_storage_exam"."description_user" LIKE %s ESCAPE \'\\\')
                AND             ( 
                                                "exam_storage_series"."description_user" LIKE %s ESCAPE \'\\\'
                                OR              "exam_storage_exam"."description_user" LIKE %s ESCAPE \'\\\'))

The issue is that as the number of search terms grows, the size of the intermediate data set before the DISTINCT operation grows as well.

Ideally the SQL would look like:

SELECT *                                                
 FROM exam                                             
   WHERE (EXISTS (SELECT 1             
        FROM exam_storage_series              
          WHERE exam.id = series.exam_id AND (
            series.description_user LIKE '%foo%'
          )) or exam.description_user LIKE '%foo%') AND 
(EXISTS (SELECT 1             
        FROM exam_storage_series              
          WHERE exam.id = series.exam_id AND (
            series.description_user LIKE '%bar%'
          )) or exam.description_user LIKE '%bar%') AND 
(EXISTS (SELECT 1             
        FROM exam_storage_series              
          WHERE exam.id = series.exam_id AND (
            series.description_user LIKE '%baz%'
          )) or exam.description_user LIKE '%baz%')

Currently the performance of Django query is terrible. This style searching comes up for example in how DRF generates search queries.

Change History (13)

comment:1 Changed 7 years ago by Tim Graham

Cc: Anssi Kääriäinen added
Type: UncategorizedCleanup/optimization

Anssi, is this feasible? The closest existing ticket I could find is #16603 but not sure if it should be considered a duplicate.

comment:2 Changed 7 years ago by Anssi Kääriäinen

Yes, this is possible, and something Django's ORM should do.

Unfortunately this is hard to implement correctly. The big problem is aggregation, when doing .filter().annotate(), where both operations target the same relation, the aggregation must use results of the join generated by the filter. But if the filter doesn't generate a join, then we have a problem.

It would be Djangoic if the exists query would be generated automatically. For the above mentioned reason this is hard. Maybe it would be easier if we had something like .filter(models.Exists(somerel__col=val)), where the Exists() class would inform Django that an exists subquery should be generated.

Last edited 7 years ago by Tim Graham (previous) (diff)

comment:3 Changed 7 years ago by Tim Graham

Triage Stage: UnreviewedAccepted

comment:4 Changed 7 years ago by Marc Tamlyn

Cc: Marc Tamlyn added

comment:5 Changed 7 years ago by Tim Graham

Keywords: QuerySet.extra added

comment:6 Changed 7 years ago by Matthew Schinckel

It's not totally relevant (at least not yet), but I have a version of this that works with .annotate():

https://github.com/django/django/pull/6478

There's also a django-developers thread that hopefully may get some discussion.

Last edited 7 years ago by Tim Graham (previous) (diff)

comment:7 Changed 5 years ago by James Howe

Similarly, I have a use-case where this query structure has the only acceptable performance (in PostgreSQL):

... FROM model_1
WHERE NOT EXISTS (
  SELECT TRUE FROM model_2
  WHERE model_1.pkey = model_2.fkey
    AND (model_2.property IS NULL OR model_2.property >= 42)
)

Currently using extra() to do it.

comment:8 Changed 5 years ago by James Howe

Cc: James Howe added

comment:9 Changed 5 years ago by Simon Charette

James, since this ticket was filled support for Exists expression was added so you should be able to avoid using extra

queryset.annotate(
    foo_exists=Exists(subqueryset.filter(foo_id=OuterRef('pk'))
).filter(
    foo_exists=False,
)

Once #25367 lands it'll be possible to pass ~Exists directly to filter.

comment:10 Changed 7 months ago by Simon Charette

Another nice to have that could be added here would be allow the usage of transforms on related fields so we could have lookups of the form __exists or __count

Author.objects.filter(books__exists=Q(title__icontains="ring"))
Author.objects.filter(books__count__gt=4)

instead of

Author.objects.filter(Exist(Book.objects.filter(author=OuterRef("pk"), title__icontains="ring")))
Author.objects.filter(
     GreaterThan(
         Book.objects.filter(
             author_id=OutRef("pk"),
         ).values(
             "author_id"
         ).values(Count("id")), 4
     )
)

I think it would go a long way in making it easier for users to work around this issue and #10060.

comment:11 Changed 6 months ago by David Wobrock

Cc: David Wobrock added

comment:12 Changed 6 months ago by Simon Charette

FWIW I had a very limited try at this in this branch. Mostly some test and debugging shims but I figured I'd provide them if someone is interesting in looking this issue.

I'd add that this work has a bit of overlap with #28296 as it basically needs to turn a series of transforms and lookups against a related field into a subquery without causing JOINs on the outer query which the default behaviour of using the __ syntax.

comment:13 Changed 6 months ago by John Speno

Cc: John Speno added
Note: See TracTickets for help on using tickets.
Back to Top