Opened 3 years ago

Closed 3 years ago

#32492 closed New feature (fixed)

Add django API for Postgres pg_trgm word_similarity, fuzzy full-text search.

Reported by: Taneli Owned by: Nikita Marchant
Component: contrib.postgres Version: dev
Severity: Normal Keywords:
Cc: Paolo Melchiorre Triage Stage: Ready for checkin
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description (last modified by Taneli)

Use case: filter queryset by an inexact substring. For example, retrieve an object with the following headline Dogecoin is following bitcoin in its dramatic rise when searching for dogge.

Currently it is possible to filter a queryset on the basis of trigram similarity between the search string and full text stored in a column.

Author.objects.annotate(similarity=TrigramSimilarity('name', test),).filter(similarity__gt=0.3)

This is a wrapper around the similarity function of the pg_trgm extension. While it allows comparing full strings, i.e. searching for 'doge' would find 'dogs' or 'dogge' it is useless for fuzzy searching of substrings.

SELECT similarity('dogge', 'doge');
---------
0.57

SELECT similarity('dogge', 'dogecoin is following bitcoin');
------------
0.1

word_similarity does take into account the word boundaries

SELECT word_similarity('doge', 'dogecoin is following bitcoin');
--------------
0.5

Adding a django API to word_similarity would allow for better fuzzy fulltext search without a need to use either raw SQL or external tools like elasticsearch.

Change History (13)

comment:1 by Taneli, 3 years ago

Description: modified (diff)

comment:2 by Taneli, 3 years ago

Owner: changed from nobody to Taneli
Status: newassigned

comment:3 by Mariusz Felisiak, 3 years ago

Component: Uncategorizedcontrib.postgres
Triage Stage: UnreviewedAccepted
Type: UncategorizedNew feature
Version: 3.1master

Thanks. We should add also TrigramWordDistance() with the <<-> operator.

comment:4 by Matthew Schinckel, 3 years ago

FWIW, I use pg_trgm, and I also have a connection_created receiver that sets pg_trgm.similarity_threshold. It might be nice to include the ability to set this in some way, so that other operations that use this can be included.

comment:5 by Nikita Marchant, 3 years ago

Owner: Taneli removed

comment:6 by Nikita Marchant, 3 years ago

Owner: set to Nikita Marchant

comment:7 by Nikita Marchant, 3 years ago

Hi 👋

I've open a pull request (https://github.com/django/django/pull/14833) implementing the trigram_word_similar lookup and updating the documentation.

I also wrote a test, but for it to be meaningful, i needed a longer string that the 16 chars available on CharFieldModel.field in the tests. I did increase the limit to 64 and updated the migration at the same time. Is this the right approach or should it create a new model or a new migration ? (As the test database is recreated almost each time, i guessed that modifying a migration is not a big deal but i could be wrong).
I ran the tests on PostgreSQL 13.4 on Mac with Python 3.9.1

I also have some code (adding TrigramWordSimilarity and TrigramWordDistance) almost ready for the rest of the ticket but i still have some troubles/questions because the order of the arguments of WORD_SIMILARITY() is meaningful, unlike for SIMILARITY(). Should we discuss it here or on GitHub ?

comment:8 by Nikita Marchant, 3 years ago

Has patch: set

comment:9 by Mariusz Felisiak, 3 years ago

Needs documentation: set
Patch needs improvement: set

comment:10 by Nikita Marchant, 3 years ago

Needs documentation: unset
Patch needs improvement: unset

comment:11 by Mariusz Felisiak, 3 years ago

Triage Stage: AcceptedReady for checkin

comment:12 by Paolo Melchiorre, 3 years ago

Cc: Paolo Melchiorre added

comment:13 by Mariusz Felisiak <felisiak.mariusz@…>, 3 years ago

Resolution: fixed
Status: assignedclosed

In 4e4082f9:

Fixed #32492 -- Added TrigramWordSimilarity() and TrigramWordDistance() on PostgreSQL.

Note: See TracTickets for help on using tickets.
Back to Top