Opened 5 months ago

Closed 5 months ago

Last modified 4 months ago

#35539 closed Bug (invalid)

SearchVector GinIndex raises IMMUTABLE error

Reported by: Alastair D'Silva Owned by:
Component: contrib.postgres Version: 5.0
Severity: Normal Keywords:
Cc: Mariusz Felisiak, Simon Charette Triage Stage: Accepted
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

I have a Document model with the following subclass:

    class Meta:
        indexes = [
            GinIndex(SearchVector('search_vector'), name='Document search vector'),  # Create a GIN index on the search vector
        ]

It creates the following migration:

        migrations.AddIndex(
            model_name='document',
            index=django.contrib.postgres.indexes.GinIndex(django.contrib.postgres.search.SearchVector('search_vector'), name='Document search vector'),
        ),

which in turn attempts to create the following index:

CREATE INDEX "Document search vector" ON "documents_document" USING gin ((to_tsvector(COALESCE(("search_vector")::text, ''))))

This fails with:

ERROR:  functions in index expression must be marked IMMUTABLE

If I update contrib/postgres/search.py as follows:
91:

function = ""

and 116-128:

#        clone.set_source_expressions(
#            [
#               Coalesce(
#                    (
#                        expression
#                        if isinstance(expression.output_field, (CharField, TextField))
#                        else Cast(expression, TextField())
#                    ),
#                    Value(""),
#                )
#                for expression in clone.get_source_expressions()
#            ]
#        )

I can successfully create the index, and searches using SearchRank successfully use it.

Change History (5)

comment:1 by Sarah Boyce, 5 months ago

Cc: Mariusz Felisiak Simon Charette added
Summary: contrib/postgres/search.py COALESCE breaks GIN Index creation on PostgreSQL 15SearchVector GinIndex raises IMMUTABLE error
Triage Stage: UnreviewedAccepted

Thank you for the report, was able to replicate the error 👍
I believe this is a valid bug, I've never used these before myself but looks like it's being used as documented

Here is my "rough" test if it gives anyone a starting point

  • tests/postgres_tests/test_indexes.py

    a b class SchemaTests(PostgreSQLTestCase):  
    379379            editor.remove_index(Scene, index)
    380380        self.assertNotIn(index_name, self.get_constraints(Scene._meta.db_table))
    381381
     382    def test_search_vector_gin_index(self):
     383        index_name = "search_vector_gin"
     384        index = GinIndex(SearchVector("field"), name=index_name)
     385        with connection.schema_editor() as editor:
     386            editor.add_index(TextFieldModel, index)
     387        table = TextFieldModel._meta.db_table
     388        constraints = self.get_constraints(table)
     389        self.assertIn(index_name, constraints)
     390        self.assertIn(constraints[index_name]["type"], GinIndex.suffix)
     391        with connection.schema_editor() as editor:
     392            editor.remove_index(TextFieldModel, index)
     393        self.assertNotIn(index_name, self.get_constraints(table))
     394
    382395    def test_cast_search_vector_gin_index(self):

comment:2 by Sarah Boyce, 5 months ago

Has patch: unset

comment:3 by Simon Charette, 5 months ago

Resolution: invalid
Status: newclosed

Blaming the surrounding code leads to #30488 and #30385 which have plenty of context. It also has ties to #34955.

The summary is that SearchVector was designed to support any field and expression thrown at it even the ones that are not of textual type and / or are nullable. This encouraged the usage of Postgres CONCAT function over the || operator as the former properly deals with both

psql (16.3, server 15.6 (Debian 15.6-1.pgdg120+2))
Type "help" for help.

django=# SELECT CONCAT('foo', NULL, 1);
 concat
--------
 foo1
(1 row)

django=# SELECT 'foo' || NULL || 1;
 ?column?
----------

(1 row)

The problem with CONCAT however is that it produces non-IMMUTABLE output (the resulting data can change based on some connection settings that relate to collation for example) and thus is not suitable in index creation which creates a pickle about how NULL-able values should be dealt with.


With all that said this ticket has little to do with this complexity. If you comment out the code reported to solve the issue by the reporter your test still fails Sarah and the reason is simple: to_tsvector is not IMMUTABLE unless a regconfig is specified which is covered at length in the PostgreSQL documentation and included in the Django docs as well in the example.

Notice that the 2-argument version of to_tsvector is used. Only text search functions that specify a configuration name can be used in expression indexes (Section 11.7). This is because the index contents must be unaffected by default_text_search_config. If they were affected, the index contents might be inconsistent because different entries could contain tsvectors that were created with different text search configurations, and there would be no way to guess which was which. It would be impossible to dump and restore such an index correctly.

So for the same reason Concat cannot be used in indices because it relies on alterable global configuration SearchVector cannot be used in indices without specifying a config because they do not use the same stopwords and stoplists

django=# SELECT to_tsvector('english', 'The Web framework for perfectionists with deadlines.');
                     to_tsvector
-----------------------------------------------------
 'deadlin':7 'framework':3 'perfectionist':5 'web':2
(1 row)

django=# SELECT to_tsvector('french', 'The Web framework for perfectionists with deadlines.');
                                 to_tsvector
------------------------------------------------------------------------------
 'deadlin':7 'for':4 'framework':3 'perfectionist':5 'the':1 'web':2 'with':6
(1 row)

I'm not sure how the reporter got that removing these particular lines of code would help but in its current form the report appears to me to be invalid.

Last edited 5 months ago by Simon Charette (previous) (diff)

comment:4 by Alastair D'Silva, 4 months ago

Having the NULL propagate doesn't seem to big deal when searching, it still means that items that don't have a populated search vector won't be found.

I have the following patch which is working for me, but does need a bit more effort, in particular, the bit which removes to_tsvector if we already have a ts_vector (you can't call to_tsvector on a ts_vector).

--- search.py.orig	2024-07-09 15:10:14.092901831 +1000
+++ search.py	2024-07-09 09:46:15.962716177 +1000
@@ -4,6 +4,7 @@
     Field,
     FloatField,
     Func,
+    JSONField,
     Lookup,
     TextField,
     Value,
@@ -113,19 +114,26 @@
 
     def as_sql(self, compiler, connection, function=None, template=None):
         clone = self.copy()
-        clone.set_source_expressions(
-            [
-                Coalesce(
+
+        new_expressions = []
+        for expression in clone.get_source_expressions():
+            if isinstance(expression.output_field, SearchVectorField):
+                new_expressions.append(expression)
+                function = ''
+            elif isinstance(expression.output_field, JSONField):
+                new_expressions.append(Coalesce(expression, Value("{}"))),
+            else:
+                new_expressions.append(Coalesce(
                     (
                         expression
                         if isinstance(expression.output_field, (CharField, TextField))
                         else Cast(expression, TextField())
                     ),
                     Value(""),
-                )
-                for expression in clone.get_source_expressions()
-            ]
-        )
+                ))
+
+        clone.set_source_expressions(new_expressions)
+
         config_sql = None
         config_params = []
         if template is None:

Here is my usage:

class Document(models.Model):
    title = models.CharField(max_length=255)
    metadata = models.JSONField(null=True)
    uploaded_at = models.DateTimeField(auto_now_add=True)
    tags = models.ManyToManyField(Tag, related_name='documents')
    type_fields = models.JSONField(default=dict)
    search_vector = SearchVectorField(null=True)

    def save(self, *args, **kwargs):
        super().save(*args, **kwargs)  # Save the instance first
        self.update_search_vector()

    def update_search_vector(self):
        self.search_vector = SearchVector('title', 'metadata')
        
        Document.objects.filter(pk=self.pk).update(search_vector=self.search_vector)

    class Meta:
        indexes = [
            GinIndex(SearchVector('search_vector'), name='Document search'),
        ]

Last edited 4 months ago by Alastair D'Silva (previous) (diff)

comment:5 by Simon Charette, 4 months ago

You don't need this patch at all of the save() shenanigans. All you are doing here is circumventing the fact that must specify a language configuration for ts_vector to be immutable by running the update(search_vector=self.search_vector) that will use the session configured default_text_search_config.

You can simply do

class Document(models.Model):
    title = models.CharField(max_length=255)
    metadata = models.JSONField(null=True)

    class Meta:
        indexes = [
            GinIndex(
                SearchVector('title', Coalesce('metadata', Value('')), config='english'),
                name='document_search',
            ),
        ]

Or use a GeneratedField if you want to materialize the search vector

class Document(models.Model):
    title = models.CharField(max_length=255)
    metadata = models.JSONField(null=True)
    search_vector = GeneratedField(
        SearchVector('title', Coalesce('metadata', Value('')), config='english')
        output_field=SearchVectorField()
    )

    class Meta:
        indexes = [
            GinIndex(
                'search_vector', name='search_vector_idx'
            ),
        ]

The key part here is that a config must be specified if you want to use SearchVector in an index or generated field to make it IMMUTABLE

Note: See TracTickets for help on using tickets.
Back to Top