Opened 13 months ago

Last modified 3 months ago

#30637 assigned Bug

SearchVectorField and SearchVector should be combinable.

Reported by: Dani Hodovic Owned by: Ehsan Poursaeed
Component: contrib.postgres Version: master
Severity: Normal Keywords: db, postgres, full-text, search
Cc: Triage Stage: Accepted
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description (last modified by Dani Hodovic)

When using django.contrib.postgres to perform full text search it's not possible to combine SearchVectors and SearchVectorfields. Doing so impacts the performance of the query.

Suppose we have a model with a small "role" field and a much larger "text" field. The text field is large enough that it warrants indexing in a separate column as a tsvector with a Gin index to ensure our queries are fast.

Code highlighting:

class JobPosting(models.Model):
    role = models.CharField(max_length=170, null=True)
    text = models.TextField(max_length=8000, default="")
    # Large field optimized for full text search
    text_search = SearchVectorField(null=True)

    class Meta:
        indexes = [GinIndex(fields=["text_search"])]

If we need to perform a search on all columns we need to combine them into a common tsvector. The problem is that Django casts the large and search optimized text_search field to text and then back into a tsvector. This results in Postgres not using the existing Gin index and having to cast between types which makes for a very slow query.

Code highlighting:

JobPosting.objects.annotate(full_text=(SearchVector("role") + SearchVector("text_search"))).filter(full_text=SearchQuery("foo"))

Code highlighting:

EXPLAIN ANALYZE SELECT "jobs_jobposting"."id",
       "jobs_jobposting"."role",
       "jobs_jobposting"."text",
       (to_tsvector(COALESCE("jobs_jobposting"."role", '')) || to_tsvector(COALESCE(("jobs_jobposting"."text_search")::text, ''))) AS "full_text"
  FROM "jobs_jobposting"
 WHERE (to_tsvector(COALESCE("jobs_jobposting"."role", '')) || to_tsvector(COALESCE(("jobs_jobposting"."text_search")::text, ''))) @@ (plainto_tsquery('foo')) = true
 LIMIT 21;
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..35.54 rows=1 width=809) (actual time=40.085..40.085 rows=0 loops=1)
   ->  Seq Scan on jobs_jobposting  (cost=0.00..35.54 rows=1 width=809) (actual time=40.082..40.082 rows=0 loops=1)
         Filter: ((to_tsvector((COALESCE(role, ''::character varying))::text) || to_tsvector(COALESCE((text_search)::text, ''::text))) @@ plainto_tsquery('foo'::text))
         Rows Removed by Filter: 42
 Planning Time: 3.140 ms
 Execution Time: 40.273 ms

If you compare this to using the text_search field directly we can see that the query is much faster, presumably due to the use of index and lack of casting to ::text

Code highlighting:

JobPosting.objects.filter(text_search=SearchQuery("foo"))

Code highlighting:

EXPLAIN ANALYZE SELECT "jobs_jobposting"."id",
       "jobs_jobposting"."text",
       "jobs_jobposting"."role"
  FROM "jobs_jobposting"
 WHERE "jobs_jobposting"."text_search" @@ (plainto_tsquery('foo')) = true
 LIMIT 21;
------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..15.24 rows=1 width=371) (actual time=1.165..1.166 rows=0 loops=1)
   ->  Seq Scan on jobs_jobposting  (cost=0.00..15.24 rows=1 width=371) (actual time=1.163..1.163 rows=0 loops=1)
         Filter: (text_search @@ plainto_tsquery('foo'::text))
         Rows Removed by Filter: 42
 Planning Time: 0.699 ms
 Execution Time: 1.209 ms

Compare the execution times: from 40.3ms to 1.2ms.

You could technically concatenate the role and the text field into one SearchVectorField, but then you would be unable to search rank different fields differently. Perhaps we would like to rank the information in the role column as A, but the text column as C.

I have tried to use F expressions to concatenate the columns, but then Django complains.

Code highlighting:

JobPosting.objects.annotate(full_text=(SearchVector("role") + F("text_search"))).filter(full_text=SearchQuery("foo"))
TypeError: SearchVector can only be combined with other SearchVectors

I have also tried casting the "text_search" field into a SearchVector, but that also fails.

Code highlighting:

JobPosting.objects.annotate(full_text=(SearchVector("role") + Cast("text_search", SearchVector()))).filter(full_text=SearchQuery("foo"))
TypeError: SearchVector can only be combined with other SearchVectors

Change History (4)

comment:1 Changed 13 months ago by Dani Hodovic

Description: modified (diff)

comment:2 Changed 13 months ago by felixxm

Summary: Django is unable to combine SearchVectorField and SearchVectorSearchVectorField and SearchVector should be combinable.
Triage Stage: UnreviewedAccepted
Type: Cleanup/optimizationBug
Version: 2.2master

Thanks for the report. Agreed, SearchVector and SearchVectorField should be combinable, e.g.

JobPosting.objects.annotate(full_text=(SearchVector('role') + F('text_search'))).filter(full_text=SearchQuery('foo'))

comment:3 Changed 13 months ago by Ehsan Poursaeed

Owner: set to Ehsan Poursaeed
Status: newassigned

comment:4 Changed 3 months ago by Alexandr Tatarinov

I have taken a look, and not sure about the desired result. Should we
a) avoid ::text cast
b) allow SearchVectorField and SearchVector combination without changing the produced query
c) both?

I doubt the database can use the index when two fields are combined, and I can't see the index being used anyway in both queries provided.
So the performance problems should come from ::text cast?
Because SearchVectorField can be null, we still need to wrap it with Coalesce. So the resulting query is something like this, correct me if I'm wrong.
Please, test whether this will change the execution time.

EXPLAIN ANALYZE SELECT "jobs_jobposting"."id",
       "jobs_jobposting"."role",
       "jobs_jobposting"."text",
       (to_tsvector(COALESCE("jobs_jobposting"."role", '')) || COALESCE("jobs_jobposting"."text_search", to_tsvector(''))) AS "full_text"
  FROM "jobs_jobposting"
 WHERE (to_tsvector(COALESCE("jobs_jobposting"."role", '')) || COALESCE("jobs_jobposting"."text_search", to_tsvector(''))) @@ (plainto_tsquery('foo')) = true
 LIMIT 21;
Note: See TracTickets for help on using tickets.
Back to Top