Opened 4 years ago
Closed 4 years ago
#32803 closed New feature (duplicate)
Not obvious interaction between Postgresql trgm index and icontains lookup
Reported by: | Dmitri Emelianov | Owned by: | nobody |
---|---|---|---|
Component: | Database layer (models, ORM) | Version: | dev |
Severity: | Normal | Keywords: | postgresql icontains lookup search |
Cc: | Triage Stage: | Unreviewed | |
Has patch: | no | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
Hi! Long post ahead.
tldr; standard lookup is not working with standard index as expected
Using LIKE UPPER in icontains is not an accident, it is even stated that it's faster
# Use UPPER(x) for case-insensitive lookups; it's faster. if lookup_type in ('iexact', 'icontains', 'istartswith', 'iendswith'): lookup = 'UPPER(%s)' % lookup
However, the main reason I've created the ticket is not the speed, but usage icontains together with trgm index.
Let's create a simple test model:
class Book(models.Model): name = models.TextField()
and prefil it with kind of random data.
Random function (idea taken from https://stackoverflow.com/a/5478750/1185696):
CREATE OR REPLACE FUNCTION get_random_name() RETURNS text as $$ SELECT array_to_string(array ( select substr('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789', trunc(random() * 62)::integer + 1, 1) FROM generate_series(1, (random() * 15)::integer)), '') || ' ' || array_to_string(array ( select substr('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789', trunc(random() * 62)::integer + 1, 1) from generate_series(1, (random() * 15)::integer)), ''); $$ VOLATILE LANGUAGE SQL; -- Example output: M2Fib gJr4CHLnZHcr -- I want data to be kind of "real" if it is could be said about random string
->
INSERT INTO book_book(name) SELECT get_random_name() FROM generate_series(1, 10000);
Now let's do some search.
Book.objects.filter(name__icontains='orc') # SELECT "book_book"."id", "book_book"."name" FROM "book_book" WHERE UPPER("book_book"."name"::text) LIKE UPPER('%orc%') ORDER BY "book_book"."id" ASC # -> Seq Scan on book_book (cost=0.00..220.00 rows=80 width=24) (actual time=5.755..5.755 rows=0 loops=1) # Filter: (upper(name) ~~ '%ORC%'::text) # Rows Removed by Filter: 9996
Seq scan, totally understandable, no complains here. Now let's do 1m entries:
-> Parallel Seq Scan on book_book (cost=0.00..8865.51 rows=3098 width=24) (actual time=2.024..341.389 rows=143 loops=3) Filter: (upper(name) ~~ '%ORC%'::text) Rows Removed by Filter: 369857 Planning Time: 0.214 ms Execution Time: 374.926 ms
Still OK, but let's do better by adding indices. Since pg_trgm is very popular solution for text search, let's use it (I am not sure I know the alternatives aside from tsvector for such kind of queries):
indexes = [ GinIndex(name='test_index', fields=['name'], opclasses=['gin_trgm_ops']), ]
Repeat query:
-> Parallel Seq Scan on book_book (cost=0.00..14641.50 rows=46 width=24) (actual time=331.265..331.266 rows=0 loops=3) Filter: (upper(name) ~~ '%ORC%'::text) Rows Removed by Filter: 370000 Planning Time: 1.524 ms Execution Time: 368.272 ms
Looks like index is not working.
Change query a bit:
EXPLAIN ANALYZE SELECT "book_book"."id", "book_book"."name" FROM "book_book" WHERE "book_book"."name"::text ILIKE '%orc%' ORDER BY "book_book"."id" ASC;
->
-> Bitmap Heap Scan on book_book (cost=20.86..426.28 rows=111 width=24) (actual time=0.158..1.441 rows=430 loops=1) Recheck Cond: (name ~~* '%orc%'::text) Heap Blocks: exact=414 -> Bitmap Index Scan on test_index (cost=0.00..20.83 rows=111 width=0) (actual time=0.075..0.076 rows=430 loops=1) Index Cond: (name ~~* '%orc%'::text) Planning Time: 0.178 ms Execution Time: 1.553 ms
I call "UPPPER" function usage as "implementation detail", because postgres have standard operator. And the main issue is that implementation detail adds additional complexity to index and you basically have to remember to use "upper" in all trgm indices. icontains is standard lookup. trgm index is kind of standard these days. I believe, one would expect pretty straightforward interaction between lookup and index (lookup should become faster). Current behavior, when default index is basically not working for default lookup is not obvious for me. This can be easily avoided just by using standard, out of the box operator.
Now let's do some measures to check if it is faster or not. I will use same data generator, I will use "orc" query always, I will always pick average of 3 measurements. No indices applied.
PostgreSQL 12.6 (Ubuntu 12.6-0ubuntu0.20.04.1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0, 64-bit + en_US.UTF-8 collation
Results:
10k 100k 1m 10m ILIKE 9.0 111 629 4247 UPPER 6.6 72 448 3175 LOWER 7.1 71 462 2962
https://stackoverflow.com/questions/20336665/lower-like-vs-ilike question shows similar difference, but notes that everything could be more complicated.
Indead, in my environment, ILIKE shows lower performance and I have no explanation about that.
However, I would call this results implementation detail as well, because the main question is using "standard" operators and predictable interactions rather than performance. If one is concerned about performance, it could use an index. I also understand, that starting from 3.2 it could be solved by custom operator in Index, but how many people would expect that default won't work? And what about indices created before 3.2? Especially when docs clearly tells (with shady SQL equivalent term) that "icontains" will use ILIKE. I guess this either could be changed to ILIKE (with performance penalty and broken existing custom UPPER indices), either heavily documented - icontains lookup should clearly tells which query it will use and which index should be used. Also GinIndex/GistIndex could use a note about TEXT fields (that you probably want to use UPPER function for text).
Change History (1)
comment:1 by , 4 years ago
Component: | contrib.postgres → Database layer (models, ORM) |
---|---|
Owner: | set to |
Resolution: | → duplicate |
Status: | new → closed |
Type: | Uncategorized → New feature |
Closing as duplicate of #32485 and #20775.