Opened 3 years ago

Last modified 2 years ago

#25789 new Cleanup/optimization

Inefficient Queries Generated due to not using WHERE EXISTS

Reported by: Alex Rothberg Owned by: nobody
Component: Database layer (models, ORM) Version: 1.8
Severity: Normal Keywords: QuerySet.extra
Cc: Anssi Kääriäinen, Marc Tamlyn Triage Stage: Accepted
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

Reposting question from SO with some more details.

I believe that the Django ORM is generating seriously inefficient SQL due to it not using WHERE EXISTS but instead using a DISTINCT with a LEFT JOIN. by comparison, SQLAlchemy will use WHERE EXISTS.

I have two models, Exam and Series. Series objects have a foreign key to an Exam object. Both of the models contain a field description_user. I am trying to search for all Exams that have a search term in description_user or have a child Series with that term in its description_user. I want to do this for a number of search terms (requiring all of them). I also want to de-duplicate the results (ie not get the same Exam multiple times).

This is roughly what the filter looks like:

a = (Q(**{'series__description_user__icontains': 'bar'}) | Q(**{'description_user__icontains': 'bar'})) 
b = (Q(**{'series__description_user__icontains': 'foo'}) | Q(**{'description_user__icontains': 'foo'}))
c = (Q(**{'series__description_user__icontains': 'baz'}) | Q(**{'description_user__icontains': 'baz'}))
Exam.objects.filter(a & b & c).distinct()

with corresponding SQL:

SELECT DISTINCT 
                "exam_storage_exam"."id", 
                "exam_storage_exam"."description_user"
FROM            "exam_storage_exam" 
LEFT OUTER JOIN "exam_storage_series" 
ON              ( 
                                "exam_storage_exam"."id" = "exam_storage_series"."exam_id" 
                AND             ( 
                                                "exam_storage_series"."removed" IS NULL) ) 
WHERE           ( 
                                "exam_storage_exam"."removed" IS NULL 
                AND             ( 
                                                "exam_storage_series"."description_user" LIKE %s ESCAPE \'\\\'
                                OR              "exam_storage_exam"."description_user" LIKE %s ESCAPE \'\\\')
                AND             ( 
                                                "exam_storage_series"."description_user" LIKE %s ESCAPE \'\\\'
                                OR              "exam_storage_exam"."description_user" LIKE %s ESCAPE \'\\\')
                AND             ( 
                                                "exam_storage_series"."description_user" LIKE %s ESCAPE \'\\\'
                                OR              "exam_storage_exam"."description_user" LIKE %s ESCAPE \'\\\'))

The issue is that as the number of search terms grows, the size of the intermediate data set before the DISTINCT operation grows as well.

Ideally the SQL would look like:

SELECT *                                                
 FROM exam                                             
   WHERE (EXISTS (SELECT 1             
        FROM exam_storage_series              
          WHERE exam.id = series.exam_id AND (
            series.description_user LIKE '%foo%'
          )) or exam.description_user LIKE '%foo%') AND 
(EXISTS (SELECT 1             
        FROM exam_storage_series              
          WHERE exam.id = series.exam_id AND (
            series.description_user LIKE '%bar%'
          )) or exam.description_user LIKE '%bar%') AND 
(EXISTS (SELECT 1             
        FROM exam_storage_series              
          WHERE exam.id = series.exam_id AND (
            series.description_user LIKE '%baz%'
          )) or exam.description_user LIKE '%baz%')

Currently the performance of Django query is terrible. This style searching comes up for example in how DRF generates search queries.

Change History (6)

comment:1 Changed 3 years ago by Tim Graham

Cc: Anssi Kääriäinen added
Type: UncategorizedCleanup/optimization

Anssi, is this feasible? The closest existing ticket I could find is #16603 but not sure if it should be considered a duplicate.

comment:2 Changed 3 years ago by Anssi Kääriäinen

Yes, this is possible, and something Django's ORM should do.

Unfortunately this is hard to implement correctly. The big problem is aggregation, when doing .filter().annotate(), where both operations target the same relation, the aggregation must use results of the join generated by the filter. But if the filter doesn't generate a join, then we have a problem.

It would be Djangoic if the exists query would be generated automatically. For the above mentioned reason this is hard. Maybe it would be easier if we had something like .filter(models.Exists(somerel__col=val)), where the Exists() class would inform Django that an exists subquery should be generated.

Last edited 3 years ago by Tim Graham (previous) (diff)

comment:3 Changed 3 years ago by Tim Graham

Triage Stage: UnreviewedAccepted

comment:4 Changed 3 years ago by Marc Tamlyn

Cc: Marc Tamlyn added

comment:5 Changed 2 years ago by Tim Graham

Keywords: QuerySet.extra added

comment:6 Changed 2 years ago by Matthew Schinckel

It's not totally relevant (at least not yet), but I have a version of this that works with .annotate():

https://github.com/django/django/pull/6478

There's also a django-developers thread that hopefully may get some discussion.

Last edited 2 years ago by Tim Graham (previous) (diff)
Note: See TracTickets for help on using tickets.
Back to Top