﻿id	summary	reporter	owner	description	type	status	component	version	severity	resolution	keywords	cc	stage	has_patch	needs_docs	needs_tests	needs_better_patch	easy	ui_ux
32803	Not obvious interaction between Postgresql trgm index and icontains lookup	Dmitri Emelianov	nobody	"Hi! Long post ahead.

tldr; standard lookup is not working with standard index as expected

Using LIKE UPPER in icontains is not an accident, it is even stated that it's [https://github.com/django/django/blob/ca9872905559026af82000e46cde6f7dedc897b6/django/db/backends/postgresql/operations.py#L102 faster]

{{{
# Use UPPER(x) for case-insensitive lookups; it's faster.
if lookup_type in ('iexact', 'icontains', 'istartswith', 'iendswith'):
    lookup = 'UPPER(%s)' % lookup
}}}

However, the main reason I've created the ticket is not the speed, but usage icontains together with trgm index.

Let's create a simple test model:

{{{
class Book(models.Model):
    name = models.TextField()
}}}

and prefil it with kind of random data.

Random function (idea taken from https://stackoverflow.com/a/5478750/1185696):
{{{
CREATE OR REPLACE FUNCTION get_random_name() RETURNS text as $$
SELECT array_to_string(array 
       ( 
              select substr('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789', trunc(random() * 62)::integer + 1, 1)
              FROM   generate_series(1, (random() * 15)::integer)), '') || ' ' ||  array_to_string(array 
       ( 
              select substr('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789', trunc(random() * 62)::integer + 1, 1)
              from generate_series(1, (random() * 15)::integer)), '');
$$
VOLATILE
LANGUAGE SQL;

-- Example output: M2Fib gJr4CHLnZHcr
-- I want data to be kind of ""real"" if it is could be said about random string
}}}

->
{{{
INSERT INTO book_book(name) SELECT get_random_name() FROM generate_series(1, 10000);
}}}

Now let's do some search.
{{{
Book.objects.filter(name__icontains='orc')

# SELECT ""book_book"".""id"", ""book_book"".""name"" FROM ""book_book"" WHERE UPPER(""book_book"".""name""::text) LIKE UPPER('%orc%') ORDER BY ""book_book"".""id"" ASC
# ->  Seq Scan on book_book  (cost=0.00..220.00 rows=80 width=24) (actual time=5.755..5.755 rows=0 loops=1)
#        Filter: (upper(name) ~~ '%ORC%'::text)
#        Rows Removed by Filter: 9996
}}}

Seq scan, totally understandable, no complains here. Now let's do 1m entries:
{{{
->  Parallel Seq Scan on book_book  (cost=0.00..8865.51 rows=3098 width=24) (actual time=2.024..341.389 rows=143 loops=3)
               Filter: (upper(name) ~~ '%ORC%'::text)
               Rows Removed by Filter: 369857
 Planning Time: 0.214 ms
 Execution Time: 374.926 ms
}}}


Still OK, but let's do better by adding indices. Since pg_trgm is very popular solution for text search, let's use it (I am not sure I know the alternatives aside from tsvector for such kind of queries):
{{{
indexes = [
            GinIndex(name='test_index', fields=['name'], opclasses=['gin_trgm_ops']),
        ]
}}}
Repeat query:
{{{
->  Parallel Seq Scan on book_book  (cost=0.00..14641.50 rows=46 width=24) (actual time=331.265..331.266 rows=0 loops=3)
               Filter: (upper(name) ~~ '%ORC%'::text)
               Rows Removed by Filter: 370000
 Planning Time: 1.524 ms
 Execution Time: 368.272 ms
}}}
Looks like index is not working.

Change query a bit:
{{{
EXPLAIN ANALYZE SELECT ""book_book"".""id"", ""book_book"".""name"" FROM ""book_book"" WHERE ""book_book"".""name""::text ILIKE '%orc%' ORDER BY ""book_book"".""id"" ASC;
}}}
->
{{{
  ->  Bitmap Heap Scan on book_book  (cost=20.86..426.28 rows=111 width=24) (actual time=0.158..1.441 rows=430 loops=1)
         Recheck Cond: (name ~~* '%orc%'::text)
         Heap Blocks: exact=414
         ->  Bitmap Index Scan on test_index  (cost=0.00..20.83 rows=111 width=0) (actual time=0.075..0.076 rows=430 loops=1)
               Index Cond: (name ~~* '%orc%'::text)
 Planning Time: 0.178 ms
 Execution Time: 1.553 ms
}}}


I call ""UPPPER"" function usage as ""implementation detail"", because postgres have standard operator. And the main issue is that implementation detail adds additional complexity to index and you basically have to remember to use ""upper"" in all trgm indices. icontains is standard lookup. trgm index is kind of standard these days. I believe, one would expect pretty straightforward interaction between lookup and index (lookup should become faster). Current behavior, when default index is basically not working for default lookup is not obvious for me. This can be easily avoided just by using standard, out of the box operator.

Now let's do some measures to check if it is faster or not. I will use same data generator, I will use ""orc"" query always, I will always pick average of 3 measurements. No indices applied.
{{{
PostgreSQL 12.6 (Ubuntu 12.6-0ubuntu0.20.04.1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0, 64-bit
+
en_US.UTF-8 collation
}}}

Results:
{{{
                         10k                     100k                     1m                      10m
ILIKE                     9.0                     111                    629                     4247 
UPPER                     6.6                     72                     448                     3175 
LOWER                     7.1                     71                     462                     2962
}}}

https://stackoverflow.com/questions/20336665/lower-like-vs-ilike question shows similar difference, but notes that everything could be more complicated.

Indead, in my environment, ILIKE shows lower performance and I have no explanation about that.

However, I would call this results implementation detail as well, because the main question is using ""standard"" operators and predictable interactions rather than performance. If one is concerned about performance, it could use an index. I also understand, that starting from 3.2 it could be solved by custom operator in Index, but how many people would expect that default won't work? And what about indices created before 3.2? Especially when docs clearly tells (with shady SQL equivalent term) that ""icontains"" will use ILIKE. I guess this either could be changed to ILIKE (with performance penalty and broken existing custom UPPER indices), either heavily documented - icontains lookup should clearly tells which query it will use and which index should be used. Also GinIndex/GistIndex could use a note about TEXT fields (that you probably want to use UPPER function for text)."	New feature	closed	Database layer (models, ORM)	dev	Normal	duplicate	postgresql icontains lookup search		Unreviewed	0	0	0	0	0	0
