Full-Text Search in Django with PostgreSQL
Article based on my talk about Full-Text Search in Django with PostgreSQL.
I’ve given this talk in: PyCon IT 2017, EuroPython 2017, PGDay.IT 2017, PyRoma 2017
- Goal
- Motivation
- Contents
- Full-Text Search
- FTS Features
- Tested Products
- Elasticsearch
- Apache Solr
- Existing Products
- FTS in PostgreSQL
- What are documents
- Django Support
- Making queries
- Standard queries
- Unaccented query
- Trigram similarity
- Search lookup
- SearchVector
- SearchQuery
- SearchRank
- Search configuration
- Queries weighting
- SearchVectorField
- www.concertiaroma.com
- Version 2.0
- Version 3.0
- Band models
- Band Manager
- Band Test Setup
- Band Test Method
- What’s next
- Conclusions
- Acknowledgements
- Resources
- Thank you
Goal¶
To show how I have used Django Full-Text Search and PostgreSQL in a real-world project.
Motivation¶
To implement Full-Text Search using only Django and PostgreSQL, without resorting to external products.
Contents¶
These are the main topics of this article:
- Full-Text Search in general
- external products for Full-Text Search
- Full-Text Search support in PostgreSQL
- Django Support for Full-Text Search
- the concertiaroma.com real-world project
- upcoming innovation in Full-Text Search
- some personal conclusions
- useful resources
Full-Text Search¶
Full-Text Search derives from the need to do some searches in computer-stored documents. For example, to find all documents that contain specific words and their variations. If a document contains “house” or “houses” it will be the same for the search.
“… Full-Text Search* refers to techniques for searching a single computer-stored document or a collection in a Full-Text Database …”
* FTS = Full-Text Search
FTS Features¶
This is a list of some features that we can find in a quite advanced Full-Text Search, to be used in a real-world website.
- Stemming
- Ranking
- Stop-words removal
- Multiple language support
- Accent support
- Indexing
- Phrase search
Tested Products¶
Elastic and Solr are two software programs for the Full-Text Search that are popular today. There are others, but these are the only ones that I have used in my professional projects. They are Lucene based and written in Java.
Elasticsearch¶
Snap Market
Snap market was a startup where I worked which produced a mobile phone application for buying and selling used items with about half a million mobile users.
Issues
Management problems
In this project we used Elasticsearch which had already been set up on a FreeBSD system. We had some difficulty managing and synchronizing it.
Patching a Java plug-in
We had to apply some patches to a Java plugin that we used for the “decompound” of the words in German.
Java
@@ -52,7 +52,8 @@ public class DecompoundTokenFilter … {
- posIncAtt.setPositionIncrement(0);
+ if (!subwordsonly)
+ posIncAtt.setPositionIncrement(0);
return true;
}
Apache Solr¶
GoalScout
GoalScout.com is website dedicated to showing sport videos uploaded by public users with about 30,000 videos.
Issues
Synchronization problems
The use of Solr for Full-Text Search in this project was a customer choice.
We have always had some problems synchronizing the data between PostgreSQL and Solr.
One way
After these problems we had to start doing all writes to PostgreSQL and all reads from Apache SOLR.
Existing Products¶
PROS
- Full featured products
- Resources (documentations, articles, …)
The products that I have write about are full-featured and advanced, there are many online resources regarding it: documentations, articles and frequently answered questions.
CONS
- Synchronization
- Mandatory use of driver (haystack, bungiesearch, …)
I have found some problems in synchronization. I have always had to use a driver that is a bottleneck between the Django and the search engine. In some cases I have had to fix the code.
Ops Oriented
The focus is on system integrations. I am more a #dev than an #ops so I don’t like to be forced to integrate various systems. I prefer developing and solving problems by writing python code.
FTS in PostgreSQL¶
PostgreSQL has been supporting Full-Text Search since 2008. Internally it uses “tsvecor” and “tsquery” data type to process the data to search. It has some indexes that can be used to speed up the search: GIN and GiST. PostgreSQL added support for Phrase Search in 2016. Support for JSON[B] Full-Text Search was added in version 10.
- FTS Support since version 8.3 (2008)
- TSVECTOR to represent text data
- TSQUERY to represent search predicates
- Special Indexes (GIN, GIST)
- Phrase Search since version 9.6 (2016)
- FTS for JSON[B] since version 10 (2017)
What are documents¶
The “document” is the general concept used in Full-Text Search and where the search is done. In a database a document can be a field on a table, the combination of many fields in a table or in different tables.
“… a document is the unit of searching in a Full-Text Search system; for example, a magazine article or email message …”
Django Support¶
Features
The module django.contrib.postgres contains the support to Full-Text Search in Django, since the version 1.10. BRIN and GIN indexes were added in the version 1.11. The GIN index is very useful to speed up Full-Text Search.
- Module django.contrib.postgres
- FTS Support since version 1.10 (2016)
- BRIN and GIN indexes since version 1.11 (2017)
Dev Oriented
The focus is on programming. The use of Postgres Full-Text Search in Django is more developer friendly.
Making queries¶
We can look at the functions of Full-Text Search in Django starting from the models present in the Django official documentation. We have a Blog and an Author class connected through an Entry class The Django ORM creates tables and constructs queries.
Python
from django.contrib.postgres.search import SearchVectorField
from django.db import models
class Blog(models.Model):
name = models.CharField(max_length=100)
tagline = models.TextField()
lang = models.CharField(max_length=100, default='english')
def __str__(self):
return self.name
class Author(models.Model):
name = models.CharField(max_length=200)
email = models.EmailField()
def __str__(self):
return self.name
class Entry(models.Model):
blog = models.ForeignKey(Blog, on_delete=models.CASCADE)
headline = models.CharField(max_length=255)
body_text = models.TextField()
pub_date = models.DateField(auto_now_add=True)
mod_date = models.DateField(auto_now=True)
authors = models.ManyToManyField(Author)
n_comments = models.IntegerField(default=0)
n_pingbacks = models.IntegerField(default=0)
rating = models.IntegerField(default=5)
search_vector = SearchVectorField(null=True)
def __str__(self):
return self.headline
Standard queries¶
These are the basic searches that we can use on models in Django using “filter”.
Python
from blog.models import Author, Blog, Entry
Author.objects.filter(
name__contains='Terry'
).values_list('name', flat=True)
SQL
SELECT "blog_author"."name"
FROM "blog_author"
WHERE "blog_author"."name"::text LIKE '%Terry%'
['Terry Gilliam', 'Terry Jones']
We can use “case-insensitive containment” in order to get more results.
Python
Author.objects.filter(
name__icontains='ERRY'
).values_list('name', flat=True)
SQL
SELECT "blog_author"."name"
FROM "blog_author"
WHERE UPPER("blog_author"."name"::text) LIKE UPPER('%ERRY%')
['Terry Gilliam', 'Terry Jones', 'Jerry Lewis']
Unaccented query¶
By activating the unaccent PostgreSQL module, we can use the “unaccent” extension.
Python
from django.contrib.postgres.operations import UnaccentExtension
class Migration(migrations.Migration):
...
operations = [
UnaccentExtension(),
...
]
SQL
CREATE EXTENSION unaccent;
We can search without worrying about accented characters, useful in different languages.
Python
Author.objects.filter(
name__unaccent='Helene Joy'
).values_list('name', flat=True)
SQL
SELECT "blog_author"."name"
FROM "blog_author"
WHERE UNACCENT("blog_author"."name") = (UNACCENT('Helene Joy'))
['Hélène Joy']
Warning
Queries using this filter will generally perform full table scans, which can be slow on large tables.
Trigram similarity¶
By activating the trigram PosgreSQL module, we can use the “trigram” extension.
Python
from django.contrib.postgres.operations import TrigramExtension
class Migration(migrations.Migration):
...
operations = [
TrigramExtension(),
...
]
SQL
CREATE EXTENSION pg_trgm;
A “trigram” is a group of three consecutive characters taken from a string. We can evaluate the similarity of two strings by the number of “trigrams” they share.
Python
Author.objects.filter(
name__trigram_similar='helena'
).values_list('name', flat=True)
SQL
SELECT "blog_author"."name"
FROM "blog_author"
WHERE "blog_author"."name" % 'helena'
['Helen Mirren', 'Helena Bonham Carter']
Search lookup¶
This is the base search lookup of Django and with this we can execute a real Full-Text Search call on a field.
Python
Entry.objects.filter(
body_text__search='Cheese'
).values_list('headline', flat=True)
SQL
SELECT "blog_entry"."headline"
FROM "blog_entry"
WHERE to_tsvector(COALESCE("blog_entry"."body_text", ''))
@@ (plainto_tsquery('Cheese')) = true
['Cheese on Toast recipes', 'Pizza Recipes']
To execute a more complex query we have to use three new Django objects: SearchVector, SearchQuery, SearchRank.
SearchVector¶
We can use a “SearchVector” to build our document in more fields of the same object or connected objects too. Then we can filter on the document with a string.
Python
from django.contrib.postgres.search import SearchVector
search_vector = SearchVector('body_text', 'blog__name')
Entry.objects.annotate(
search=search_vector
).filter(
search='Cheese'
).values_list('headline', flat=True)
SQL
SELECT "blog_entry"."headline"
FROM "blog_entry"
INNER JOIN "blog_blog"
ON ("blog_entry"."blog_id" = "blog_blog"."id")
WHERE to_tsvector(
COALESCE("blog_entry"."body_text", '') || ' ' ||
COALESCE("blog_blog"."name", '')
) @@ (plainto_tsquery('Cheese')) = true
['Cheese on Toast recipes', 'Pizza Recipes']
SearchQuery¶
When we insert text into a Full-Text Search by using a “search query” we can apply “stemming” and “stop-word removal” even to the user texts. and to these we can apply basic logical operations.
NOT
Python
from django.contrib.postgres.search import SearchQuery
search_query = ~SearchQuery('toast')
search_vector = SearchVector('body_text')
Entry.objects.annotate(
search=search_vector
).filter(
search=search_query
).values_list('headline', flat=True)
SQL
SELECT "blog_entry"."headline"
FROM "blog_entry"
WHERE to_tsvector(COALESCE("blog_entry"."body_text", ''))
@@ (!!(plainto_tsquery('toast'))) = true
['Pizza Recipes', 'Pain perdu']
OR
Python
search_query = SearchQuery('cheese') | SearchQuery('toast')
search_vector = SearchVector('body_text')
Entry.objects.annotate(
search=search_vector
).filter(
search=search_query
).values_list('headline', flat=True)
SQL
SELECT "blog_entry"."headline"
FROM "blog_entry"
WHERE to_tsvector(COALESCE("blog_entry"."body_text", ''))
@@ ((
plainto_tsquery('cheese') ||
plainto_tsquery('toast')
)) = true
['Cheese on Toast recipes', 'Pizza Recipes']
AND
Python
search_query = SearchQuery('cheese') & SearchQuery('toast')
search_vector = SearchVector('body_text')
Entry.objects.annotate(
search=search_vector
).filter(
search=search_query
).values_list('headline', flat=True)
SQL
SELECT "blog_entry"."headline"
FROM "blog_entry"
WHERE to_tsvector(COALESCE("blog_entry"."body_text", ''))
@@ ((
plainto_tsquery('cheese') &&
plainto_tsquery('toast')
)) = true
['Cheese on Toast recipes']
SearchRank¶
We can use the PostgreSQL “rank” to calculate the score of a document in relation to a searched text, and we can use it to filter and sort it.
Python
from django.contrib.postgres.search import SearchRank
search_vector = SearchVector('body_text')
search_query = SearchQuery('cheese')
search_rank = SearchRank(search_vector, search_query)
Entry.objects.annotate(
rank=search_rank
).order_by(
'-rank'
).values_list('headline', 'rank'))
SQL
SELECT "blog_entry"."headline",
ts_rank(
to_tsvector(COALESCE("blog_entry"."body_text", '')),
plainto_tsquery('cheese')
) AS "rank"
FROM "blog_entry"
ORDER BY "rank" DESC
[ ('Cheese on Toast recipes', 0.0889769), ('Pizza Recipes', 0.0607927), ('Pain perdu', 0.0) ]
Search configuration¶
We can setup the “search vector” or “search query” to execute “stemming” or “stop words removal” for a specific language.
Python
language = 'french'
search_vector = SearchVector('body_text', config=language)
search_query = SearchQuery('œuf', config=language)
Entry.objects.annotate(
search=search_vector
).filter(
search=search_query
).values_list('headline', flat=True)
SQL
SELECT "blog_entry"."headline"
FROM "blog_entry"
WHERE to_tsvector(
'french'::regconfig,
COALESCE("blog_entry"."body_text", ''))
@@ (
plainto_tsquery('french'::regconfig, 'œuf')
) = true
['Pain perdu']
We can get the language from a class field.
Python
from django.db.models import F
language = F('blog__lang')
search_vector = SearchVector('body_text', config=language)
search_query = SearchQuery('œuf', config=language)
Entry.objects.annotate(
search=search_vector
).filter(
search=search_query
).values_list('headline', flat=True)
SQL
SELECT "blog_entry"."headline"
FROM "blog_entry"
INNER JOIN "blog_blog"
ON ("blog_entry"."blog_id" = "blog_blog"."id")
WHERE to_tsvector(
"blog_blog"."lang"::regconfig,
COALESCE("blog_entry"."body_text", '')
)
@@ (
plainto_tsquery("blog_blog"."lang"::regconfig, 'œuf')
) = true
['Pain perdu']
Queries weighting¶
It’s possible to set up the search to give a different weight to various fields and then use these values in searches.
Python
search_vector = SearchVector('body_text', weight='A') +
SearchVector('headline', weight='B')
search_query = SearchQuery('cheese')
search_rank = SearchRank(search_vector, search_query)
Entry.objects.annotate(
rank=search_rank
).order_by(
'-rank'
).values_list('headline', 'rank')
SQL
SELECT "blog_entry"."headline",
ts_rank((
setweight(
to_tsvector(COALESCE("blog_entry"."body_text", '')),
'A'
)
||
setweight(
to_tsvector(COALESCE("blog_entry"."headline", '')),
'B'
)
), plainto_tsquery('cheese')) AS "rank"
FROM "blog_entry"
ORDER BY "rank" DESC
[ ('Cheese on Toast recipes', 0.896524), ('Pizza Recipes', 0.607927), ('Pain perdu', 0.0) ]
SearchVectorField¶
If we want to speed up and simplify the query execution we can add a “search vector field” to the model and then execute searches on this specific field.
Python
Entry.objects.filter(
search_vector='cheese'
).values_list('headline', flat=True)
SQL
SELECT "blog_entry"."headline"
FROM "blog_entry"
WHERE "blog_entry"."search_vector"
@@ (plainto_tsquery('cheese')) = true
['Cheese on Toast recipes', 'Pizza Recipes']
We have to update this field manually, for example executing a command periodically, using Django signals or with a PostgreSQL triggers.
Python
search_vector = SearchVector('body_text')
Entry.objects.update(search_vector=search_vector)
SQL
UPDATE "blog_entry"
SET "search_vector" = to_tsvector(
COALESCE("blog_entry"."body_text", ''))
www.concertiaroma.com¶
“… today’s shows in the Capital”
www.concertiaroma.com is a website used for searching for shows, festivals, bands and venues in the city of Rome and has been online since 2014.
The numbers of the project:
- about 1,000 venues
- about 15,000 bands
- more than 16,000 shows
- about 200 festivals
- about 30,000 user/month
Version 2.0¶
The old version of the website was developed some years ago with Django 1.7 and it runs on Python 2.7. The data was managed by PostgreSQL version 9.1 and the search was performed by using the SQL LIKE
syntax.
- Python 2.7
- Django 1.7
- PostgreSQL 9.1
- SQL LIKE
Version 3.0¶
The new version, recently released, was developed with Django 1.11 and it runs on Python 3.6. The data is managed by PostgreSQL 9.6 and the search uses its Full-Text Search engine.
- Python 3.6
- Django 1.11
- PostgreSQL 9.6
- PG FTS
Band models¶
We can look at the functions of Full-Text Search in www.concertiaroma.com starting from the same models present in the project. We have a Genre class connected to a Band class.
Python
from django.db import models
from .managers import BandManager
class Genre(models.Model):
name = models.CharField(max_length=255)
class Band(models.Model):
nickname = models.CharField(max_length=255)
description = models.TextField()
genres = models.ManyToManyField(Genre)
objects = BandManager()
Band Manager¶
This is an example of “Manager” for the Band class which defines a search method that contains all the Full-Text Search logic.
Python
from django.contrib.postgres.aggregates import StringAgg
from django.contrib.postgres.search import (
SearchQuery, SearchRank, SearchVector, TrigramSimilarity,
)
from django.db import models
search_vectors = (
SearchVector('nickname', weight='A', config='english') +
SearchVector(
StringAgg('genres__name', delimiter=' '),
weight='B', config='english') +
SearchVector('description', weight='D', config='english')
)
class BandManager(models.Manager):
def search(self, text):
search_query = SearchQuery(text, config='english')
search_rank = SearchRank(search_vectors, search_query)
trigram_similarity = TrigramSimilarity('nickname', text)
return self.get_queryset().annotate(
search=search_vectors
).filter(
search=search_query
).annotate(
rank=search_rank + trigram_similarity
).order_by('-rank')
Band Test Setup¶
To better understand the mechanism, we can take into consideration an example of a simplified test.
In the test setup we defined the example data that we will use afterwards to test our search: two bands and two musical genres that we assigned to the two bands.
Python
from collections import OrderedDict
from django.test import TestCase
from .models import Band, Genre
class BandTest(TestCase):
def setUp(self):
# Genres
blues, _ = Genre.objects.get_or_create(name='Blues')
jazz, _ = Genre.objects.get_or_create(name='Jazz')
swing, _ = Genre.objects.get_or_create(name='Swing')
# Bands
ella_fitzgerald, _ = Band.objects.get_or_create(
nickname='Ella Fitzgerald',
description=(
'Ella Jane Fitzgerald (25 Apr 1917-15 Jun 1996)'
' was an American jazz singer often referred to'
' as the First Lady of Song, Queen of Jazz and '
'Lady Ella. She was noted for her purity of '
'tone, impeccable diction, phrasing and '
'intonation, and a horn-like improvisational '
'ability, particularly in her scat singing.'))
django_reinhardt, _ = Band.objects.get_or_create(
nickname='Django Reinhardt',
description=(
'Jean Django Reinhardt (23 Jan 1910-16 May 1953)'
' was a Belgian-born, Romani French jazz '
'guitarist and composer, regarded as one of the '
'greatest musicians of the twentieth century. He'
' was the first jazz talent to emerge from '
'Europe and remains the most significant.'))
louis_armstrong, _ = Band.objects.get_or_create(
nickname='Louis Armstrong',
description=(
'Louis Armstrong (4 Aug 1901-6 Jul 1971), '
'nicknamed Satchmo, Satch and Pops, was an '
'American trumpeter, composer, singer and '
'occasional actor who was one of the most '
'influential figures in jazz. His career spanned'
' five decades, from the 1920s to the 1960s, '
'and different eras in the history of jazz.'))
# Bands and Genres
ella_fitzgerald.genres.add(blues)
django_reinhardt.genres.add(jazz)
louis_armstrong.genres.add(blues, swing)
def test_band_search(self):
# ...
Contents from “Wikipedia, The Free Encyclopedia”:
Band Test Method¶
In the search test on the bands we simply invoked the search method giving a search text and we got back the list of values for the fields ‘nickname’ and ‘rate’. ‘nickname’ is stored on the band table, while ‘rate’ is calculated by our search method at runtime.
Python
from collections import OrderedDict
from django.test import TestCase
from .models import Band, Genre
class BandTest(TestCase):
def setUp(self):
# ...
def test_band_search(self):
band_queryset = Band.objects.search(
'jazz').values_list('nickname', 'rank')
band_objects = list(
OrderedDict(band_queryset).items())
band_list = [
('Django Reinhardt', 0.265124),
('Ella Fitzgerald', 0.0759909),
('Louis Armstrong', 0.0759909)]
self.assertSequenceEqual(band_objects, band_list)
In this example we compared our search results with a list of lists where we defined the pair composed of the band’s nickname and the numerical value that is the search rate, or in other words, the numerical value that defines the importance of that term.
What’s next¶
We have seen a simplified use of the current features of Django and the PostgreSQL Full-Text Search.
Both of these software programs are getting better in these fields and these are some of the features that can be available in the new future.
- Advanced Misspelling support
- Multiple language configuration
- Search suggestions
- SearchVectorField with triggers
- JSON/JSONB Full-Text Search
- RUM indexing
Conclusions¶
In conclusion, the following are the conditions for evaluating the implementation of a Full-Text Search with PostgreSQL in Django:
- not having any extra dependencies
- not doing very complex searches
- managing all the components easily
- avoiding data synchronization between different systems
- having PostgreSQL already in the stack
- operating in a Python-only environment
Acknowledgements¶
20tab
For all the support (www.20tab.com )
Marc Tamlyn
For django.contrib.postgres (github.com/mjtamlyn)
Resources¶
These are the resources that I used to prepare this article and to develop the search function I have showed you.
- Wikipedia.org - Full-Text Search definition
- DjangoProject.com - Full-Text Search documentation
- PostgreSQL.org - Full-Text Search documentation
- StackOverflow.com - Django PostgreSQL Full-Text Search questions
- SimonWillison.net - Implementing faceted search with Django and PostgreSQL
- PostgreSQL source code
- Django source code
Thank you¶
CC BY-SA
This article and related presentation is released with Creative Commons Attribution ShareAlike license.
creativecommons.org/licenses/by-sa
Source Code
I published the source code used in this article on GitHub.
github.com/pauloxnet/django_queries
Slides
You can download the presentation from my SpeakerDeck account.
speakerdeck.com/pauloxnet
2017-12-22 11 min read 0 Comments Article #Conference 🗣️ #Database 🗃️ #Django 🦄 #Full-text search 🔍 #PostgreSQL 🐘 #Python 🐍 #SQL 🆔 #Talk 💬