Full Text Search in Django with PostgreSQL

June 22, 2020 • Oscar Cortez

Search. The most successful company in the history of the Internet became famous by solving just that, search ...

Although it's a simple concept, search is something that many of us (developers) try to skimp on and therefore lower the overall quality of the user experience due to irrelevant search results and illogical search ranking (me culpa?).

In this post I am going to avoid using alternatives such as: Apache Solr, Elasticsearch, Whoosh, Xapian, etc. Instead, I will explain how to do searches with the Django ORM using PostgreSQL, keep in mind that Full Text Search is only supported if we use the PostgreSQL database backend with Django 1.10 or later.

Introduction

You are most be likely familiar with pattern searching, which has been part of the SQL standard from the beginning, and is available to all SQL-based databases, as follows:

SELECT column_name FROM table_name WHERE column_name LIKE 'pattern';

That will return the rows where column_name matches the pattern, the closest way to translate that SQL statement to the Django ORM is as follows:

In [1]: Product.objects.filter(name__contains='Marinela') 
# --> SELECT ... WHERE name LIKE '%Marinela%';
Out[2]: [<Product: Pinguinos Marinela>, <Product: Gansito Marinela>]

If you have used Django for a long period of time, or understand the basics of the ORM, you will know that this is the classic way of searching. There's nothing new here.

Then you could start playing around with the ORM API to build a better query, making use of the advanced functions in PostgreSQL:

In [1]: Author.objects.filter(name__unaccent__icontains='Helen')
Out[2]:[<Author: Helen Mirren>, <Author: Helena Bonham Carter>, <Author: Hélène Joy>]

It should be noted that queries that use this filter (unaccent) will generally perform full table scans, which can be slow on tables with a considerable number of records.

Some of you might be thinking of using a regular expression (regexp) query. Yes they could be used and can be incredibly powerful. However, implementing search queries with regexp can be a painful route. Regexp queries are generally difficult to be done correctly and usually suffer notoriously slow performance. Also, conceptually, they don't translate very well into the types of natural language queries that most users expect to use in their searches.

Full-Text Search

"(...) full-text search refers to techniques for searching a single
computer-stored document or a collection in a full text database; (...)
distinguished from searches based on metadata or on parts of the original
texts represented in databases (such as titles, abstracts, selected
sections, or bibliographical references). Wikipedia"

In other words, if you have a set of documents stored in a database, these documents may have metadata such as the author's name, the language used, a summary of the document, or the document in its entirety, and you want to know whether or not certain words (including their variations) are present in them.

Using search forms such as those mentioned above (LIKE) can be very complex to execute, which is in a more effective way to tackle this problem is to obtain a semantic vector for all the words contained in a document, a specific representation of the language of such words.

A more effective way to search this type of document is to obtain a "semantic vector" (a specific representation of the language) for all the words contained in the document.

So when you search for a word like "run", it will match all instances of the word and its different conjugated forms (ie, "ran" or "running"). However, it will not search the entire document itself (slow), but the vector (fast).

The easiest way to use full-text search is through the <field_name>__search method to search for a single term in a single column of the database, for example:

In [1]: Product.objects.filter(name__search='Shiny')
Out[2]: [<Product: Shiny Shoes>, <Product: Very Shiny LED>]

And if the information you are looking for is not very complex or you only need to search on a single column, perhaps this is the best option. Of course, if you could use all the tools you have from PostgreSQL to execute a full text query and why not? You're writing the query anyway so might as well take the extra time and get more bang for your buck. Django has 3 new functions to help you use typical PostgreSQL for full text search.

  • SearchVector (optionally can be used with SearchVectorField)
  • SearchQuery
  • SearchRank

Next, I will explain, from the PostgreSQL point of view, what each of these functions does, and what their objective is in the search process.

SearchVector

SearchVector is the abstraction of the to_tsvector function in PostgreSQL, which is responsible for returning a vector, where each word is translated to a lexema (unit of lexical meaning) with pointers (the positions in the document), and where words that have little meaning, such as articles (the) and conjunctions (and, or) are omitted:

SELECT to_tsvector('Thanks so much for cooking dinner. I really appreciate it.');  

                      to_tsvector
-------------------------------------------------------
 'appreci':9 'cook':5 'dinner':6 'much':3 'realli':8 'thank':1

What we get is the normalization of each word to an English lexeme (for example, "cooking" becomes "cook") with their respective vector positions, in this case the number 9 next to the lexeme appreci is nothing more than the position of that lexeme in the sentence. Note that this could vary depending on the localization settings of your PostgreSQL installation, or if you work with a language other than English. Either way PostgreSQL can handle it with an additional argument.

You can read more Postgres docs.

SearchQuery

The to_tsquery accept a set of words to search within the normalized vector we created using to_tsvector, these words can be combined using the Boolean operators & (AND), | (OR), and ! (NOT), you can also use parentheses to impose the grouping of operators, let's see the following example:

SELECT to_tsvector('Thanks so much for cooking dinner. I really appreciate it.') @@ to_tsquery('cook');

 ?column?
----------
 t

Here we use the @@ operator to check that our search query (tsquery) matches our text (tsvector), if it is true it will return the value t i.e.true.

You can also do more search combinations, for example:

SELECT to_tsvector('Thanks so much for cooking dinner. I really appreciate it.') @@ to_tsquery('cook | dinner');

You can read more in the Postgres docs.

SearchRank

Rank using ts_rank or ts_rank_cd tries to measure how relevant documents are to a particular query so that when there are many matches, the most relevant ones can be displayed first. This is done by considering how often the query terms appear in the document, how close the terms are in the document, and how important is the part of the document where they appear. However, the concept of relevance is vague and very application specific. Different applications may require additional information for classification, for example the document modification time.

SearchVectorField

To perform full-text searches efficiently, a database must pre-process the data and summarize it as search vectors. Because it takes time to convert strings to search vectors, it would also be better to save these search vectors in the database. Django 1.10 introduced the SearchVectorField model field to save the search vector in this column, which will be converted to TSVECTOR which is a PostgreSQL built-in text search type. You will need to keep this field updated in your app logic as it's not automatically done by Django or Postgres.

Setup

Now that you have understood the basic concepts about searching in PostgreSQL, let's see how to do it in Django. We're going to create an Article model as follows:

from django.db import models
from django.contrib.postgres.search import SearchVectorField


class Article(models.Model):
    headline = models.TextField()
    content = models.TextField()
    search_vector = SearchVectorField(null=True)

Remember that we need to keep our search_vector field updated. Here you could use different techniques such as overwriting the save method of the model, using a post_save signal, creating a task with celery so you'll not block the main application thread, or even use a function together with an SQL trigger. For demonstration purposes, we will use the save method in the following way:

...
from django.contrib.postgres.search import SearchVector


class Article(models.Model):
    ...

    def save(self, *args, **kwargs):
        self.search_vector = (
            SearchVector('headline', weight='A') 
            + SearchVector('content', weight='B')
        )
    	super().save(*args, **kwargs)

This way we make sure that our search_vector field will be kept updated every time an Article object is created or updated. Notice that we also added the "weight" to ourSearchVector in order to improve search results depending on the relevance of these once they are ranked. It should be noted that the argument that SearchVector takes is the name of a field in the model.

Model manager

In the likely event you'll want to use Article search in many places, it's a good idea to centralize the search logic. In this case we will use a custom model manager with a search method:

from django.contrib.postgres.aggregates import StringAgg
from django.contrib.postgres.search import (
    SearchQuery, SearchRank, SearchVector, TrigramSimilarity,
)
from django.db import models


class ArticleManager(models.Manager):
    def search(self, search_text):
        search_vectors = (
            SearchVector(
                'headline', weight='A', config='english'
            )
            + SearchVector(
                StringAgg('content', delimiter=' '),
                weight='B', 
                config='english',
            )
        ) 
        search_query = SearchQuery(
            search_text, config='english'
        )
        search_rank = SearchRank(search_vectors, search_query)
        trigram_similarity = TrigramSimilarity(
            'headline', search_text
        )
        qs = (
            self.get_queryset()
            .filter(search_vector=search_query)
            .annotate(rank=search_rank + trigram_similarity)
            .order_by('-rank')
        )
        return qs

And now what we have to do is simply import our manager and add it to the model:

from myapp.managers import ArticleManager


class Article(models.Model):
    ....
    objects = ArticleManager()

TrigramSimilarity

A search method that is tolerant to typographical errors is trigram similarity. Compare the number of trigrams, or three consecutive characters, shared between search terms and target text. Unlike other features, we need to make sure that an extension called pg_trgm is activated first in PostgreSQL, you can create this extension using SQL with CREATE EXTENSION pg_trgm; or create a data migration in Django:

from django.contrib.postgres.operations import TrigramExtension


class Migration(migrations.Migration):
    ...

    operations = [
        TrigramExtension(),
        ...
    ]

Searching...

And now that we have everything completed, we can only use the search method that we added in our manager. Here's an example function-based view:

import json
from django.http import Http404, HttpResponse
from myapp.models import Article

def search_articles(request):
    search_term = request.GET.get('q', None)
    if not search_term:
        raise Http404('Send a search term')

    articles = Articles.objects.search(search_term)

    response_data = [
        {
            'rank': art.rank,
            'headline': art.headline,
            'url': art.get_absolute_url(),
        } for art in articles
    ]

    return HttpResponse(
        json.dumps(response_data), 
        content_type='application/json',
    ) 

And done, with that you would get search results, ranked by order of relevance, without the need to configure other services. If you you want to investigate more on this topic, I recommend that you check the TrigramDistance function to see the difference between the terms to search and SearchHeadline in case you want to highlight the terms that match your search.

Conclusion

The full-text search functions in PostgreSQL are very powerful and fast. And while setting up a search engine will take some work, remember that this is a fairly advanced feature and not too long ago it used to require a full team of programmers and an extensive code base. PostgreSQL already did the heavy lifting, now you just need to tweak minor aspects to tailor it as needed.

Hope this gave you a basic introduction to the myriad of full-text search features that PostgreSQL and Django have to offer.

Happy searching!

References:


Have a response?

Start a discussion on the mailing list by sending an email to
~netlandish/blog-discussion@lists.code.netlandish.com.