June 22, 2020 • Oscar Cortez
Search. The most successful company in the history of the Internet became famous by solving just that, search ...
Although it's a simple concept, search is something that many of us (developers) try to skimp on and therefore lower the overall quality of the user experience due to irrelevant search results and illogical search ranking (me culpa?).
In this post I am going to avoid using alternatives such as: Apache Solr, Elasticsearch, Whoosh, Xapian, etc. Instead, I will explain how to do searches with the Django ORM using PostgreSQL, keep in mind that Full Text Search is only supported if we use the PostgreSQL database backend with Django 1.10 or later.
You are most be likely familiar with pattern searching, which has been part of the SQL standard from the beginning, and is available to all SQL-based databases, as follows:
SELECT column_name FROM table_name WHERE column_name LIKE 'pattern';
That will return the rows where column_name
matches the pattern
, the
closest way to translate that SQL statement to the Django ORM is as follows:
In [1]: Product.objects.filter(name__contains='Marinela')
# --> SELECT ... WHERE name LIKE '%Marinela%';
Out[2]: [<Product: Pinguinos Marinela>, <Product: Gansito Marinela>]
If you have used Django for a long period of time, or understand the basics of the ORM, you will know that this is the classic way of searching. There's nothing new here.
Then you could start playing around with the ORM API to build a better query, making use of the advanced functions in PostgreSQL:
In [1]: Author.objects.filter(name__unaccent__icontains='Helen')
Out[2]:[<Author: Helen Mirren>, <Author: Helena Bonham Carter>, <Author: Hélène Joy>]
It should be noted that queries that use this filter (unaccent
) will generally
perform full table scans, which can be slow on tables with a considerable
number of records.
Some of you might be thinking of using a regular expression (regexp) query. Yes they could be used and can be incredibly powerful. However, implementing search queries with regexp can be a painful route. Regexp queries are generally difficult to be done correctly and usually suffer notoriously slow performance. Also, conceptually, they don't translate very well into the types of natural language queries that most users expect to use in their searches.
"(...) full-text search refers to techniques for searching a single
computer-stored document or a collection in a full text database; (...)
distinguished from searches based on metadata or on parts of the original
texts represented in databases (such as titles, abstracts, selected
sections, or bibliographical references). Wikipedia"
In other words, if you have a set of documents stored in a database, these documents may have metadata such as the author's name, the language used, a summary of the document, or the document in its entirety, and you want to know whether or not certain words (including their variations) are present in them.
Using search forms such as those mentioned above (LIKE) can be very complex to execute, which is in a more effective way to tackle this problem is to obtain a semantic vector for all the words contained in a document, a specific representation of the language of such words.
A more effective way to search this type of document is to obtain a "semantic vector" (a specific representation of the language) for all the words contained in the document.
So when you search for a word like "run", it will match all instances of the word and its different conjugated forms (ie, "ran" or "running"). However, it will not search the entire document itself (slow), but the vector (fast).
The easiest way to use full-text search is through the <field_name>__search
method to search for a single term in a single column of the database, for
example:
In [1]: Product.objects.filter(name__search='Shiny')
Out[2]: [<Product: Shiny Shoes>, <Product: Very Shiny LED>]
And if the information you are looking for is not very complex or you only need to search on a single column, perhaps this is the best option. Of course, if you could use all the tools you have from PostgreSQL to execute a full text query and why not? You're writing the query anyway so might as well take the extra time and get more bang for your buck. Django has 3 new functions to help you use typical PostgreSQL for full text search.
SearchVector
(optionally can be used with SearchVectorField
)SearchQuery
SearchRank
Next, I will explain, from the PostgreSQL point of view, what each of these functions does, and what their objective is in the search process.
SearchVector
is the abstraction of the
to_tsvector
function in PostgreSQL, which is responsible for returning a vector, where
each word is translated to a lexema (unit of lexical meaning) with pointers
(the positions in the document), and where words that have little meaning, such
as articles (the) and conjunctions (and, or) are omitted:
SELECT to_tsvector('Thanks so much for cooking dinner. I really appreciate it.');
to_tsvector
-------------------------------------------------------
'appreci':9 'cook':5 'dinner':6 'much':3 'realli':8 'thank':1
What we get is the normalization of each word to an English lexeme (for example, "cooking" becomes "cook") with their respective vector positions, in this case the number 9 next to the lexeme appreci is nothing more than the position of that lexeme in the sentence. Note that this could vary depending on the localization settings of your PostgreSQL installation, or if you work with a language other than English. Either way PostgreSQL can handle it with an additional argument.
You can read more Postgres docs.
The to_tsquery
accept a set of words to search within the normalized vector
we created using to_tsvector
, these words can be combined using the Boolean
operators &
(AND), |
(OR), and !
(NOT), you can also use parentheses to
impose the grouping of operators, let's see the following example:
SELECT to_tsvector('Thanks so much for cooking dinner. I really appreciate it.') @@ to_tsquery('cook');
?column?
----------
t
Here we use the @@
operator to check that our search query (tsquery
)
matches our text (tsvector
), if it is true it will return the value t
i.e.true
.
You can also do more search combinations, for example:
SELECT to_tsvector('Thanks so much for cooking dinner. I really appreciate it.') @@ to_tsquery('cook | dinner');
You can read more in the Postgres docs.
Rank using ts_rank
or ts_rank_cd
tries to measure how relevant documents
are to a particular query so that when there are many matches, the most relevant
ones can be displayed first. This is done by considering how often the query
terms appear in the document, how close the terms are in the document, and how
important is the part of the document where they appear. However, the concept
of relevance is vague and very application specific. Different applications may
require additional information for classification, for example the document
modification time.
To perform full-text searches efficiently, a database must pre-process the
data and summarize it as search vectors. Because it takes time to convert
strings to search vectors, it would also be better to save these search vectors
in the database. Django 1.10 introduced the SearchVectorField
model field to
save the search vector in this column, which will be converted to TSVECTOR
which is a PostgreSQL built-in text search type. You will need to keep this
field updated in your app logic as it's not automatically done by Django or
Postgres.
Now that you have understood the basic concepts about searching in PostgreSQL,
let's see how to do it in Django. We're going to create an Article
model
as follows:
from django.db import models
from django.contrib.postgres.search import SearchVectorField
class Article(models.Model):
headline = models.TextField()
content = models.TextField()
search_vector = SearchVectorField(null=True)
Remember that we need to keep our search_vector
field updated. Here you could
use different techniques such as overwriting the save
method of the model,
using a post_save
signal, creating a task with celery
so you'll not block
the main application thread, or even use a function together with an SQL
trigger. For demonstration purposes, we will use the save
method in the
following way:
...
from django.contrib.postgres.search import SearchVector
class Article(models.Model):
...
def save(self, *args, **kwargs):
self.search_vector = (
SearchVector('headline', weight='A')
+ SearchVector('content', weight='B')
)
super().save(*args, **kwargs)
This way we make sure that our search_vector
field will be kept updated every
time an Article object is created or updated. Notice that we also added the
"weight" to ourSearchVector
in order to improve search results depending
on the relevance of these once they are ranked. It should be noted that the
argument that SearchVector
takes is the name of a field in the model.
In the likely event you'll want to use Article
search in many places, it's a
good idea to centralize the search logic. In this case we will use a custom
model manager with a search
method:
from django.contrib.postgres.aggregates import StringAgg
from django.contrib.postgres.search import (
SearchQuery, SearchRank, SearchVector, TrigramSimilarity,
)
from django.db import models
class ArticleManager(models.Manager):
def search(self, search_text):
search_vectors = (
SearchVector(
'headline', weight='A', config='english'
)
+ SearchVector(
StringAgg('content', delimiter=' '),
weight='B',
config='english',
)
)
search_query = SearchQuery(
search_text, config='english'
)
search_rank = SearchRank(search_vectors, search_query)
trigram_similarity = TrigramSimilarity(
'headline', search_text
)
qs = (
self.get_queryset()
.filter(search_vector=search_query)
.annotate(rank=search_rank + trigram_similarity)
.order_by('-rank')
)
return qs
And now what we have to do is simply import our manager and add it to the model:
from myapp.managers import ArticleManager
class Article(models.Model):
....
objects = ArticleManager()
A search method that is tolerant to typographical errors is trigram similarity.
Compare the number of trigrams, or three consecutive characters, shared between
search terms and target text. Unlike other features, we need to make sure that
an extension called pg_trgm
is activated first in PostgreSQL, you can create
this extension using SQL with CREATE EXTENSION pg_trgm;
or create a data
migration in Django:
from django.contrib.postgres.operations import TrigramExtension
class Migration(migrations.Migration):
...
operations = [
TrigramExtension(),
...
]
And now that we have everything completed, we can only use the search
method
that we added in our manager. Here's an example function-based view:
import json
from django.http import Http404, HttpResponse
from myapp.models import Article
def search_articles(request):
search_term = request.GET.get('q', None)
if not search_term:
raise Http404('Send a search term')
articles = Articles.objects.search(search_term)
response_data = [
{
'rank': art.rank,
'headline': art.headline,
'url': art.get_absolute_url(),
} for art in articles
]
return HttpResponse(
json.dumps(response_data),
content_type='application/json',
)
And done, with that you would get search results, ranked by order of relevance,
without the need to configure other services. If you you want to investigate
more on this topic, I recommend that you check the TrigramDistance
function
to see the difference between the terms to search and SearchHeadline
in case
you want to highlight the terms that match your search.
The full-text search functions in PostgreSQL are very powerful and fast. And while setting up a search engine will take some work, remember that this is a fairly advanced feature and not too long ago it used to require a full team of programmers and an extensive code base. PostgreSQL already did the heavy lifting, now you just need to tweak minor aspects to tailor it as needed.
Hope this gave you a basic introduction to the myriad of full-text search features that PostgreSQL and Django have to offer.
Happy searching!
References:
Start a discussion on the mailing list by sending an email to
~netlandish/blog-discussion@lists.code.netlandish.com.