Programming , Technology

Full-Text Search in Django with PostgreSQL

WW3

  4 years, 10 months ago

Full-Text Search in Django with PostgreSQL

Full-Text Search in Django with PostgreSQL

Article based on my talk about Full-Text Search in Django with PostgreSQL.

images/django-postgresql-fts.svg

I’ve given this talk in: PyCon IT 2017, EuroPython 2017, PGDay.IT 2017, PyRoma 2017

Goal

To show how I have used Django Full-Text Search and PostgreSQL in a real-world project.

Motivation

To implement Full-Text Search using only Django and PostgreSQL, without resorting to external products.

Contents

These are the main topics of this article:

Full-Text Search derives from the need to do some searches in computer-stored documents. For example, to find all documents that contain specific words and their variations. If a document contains “house” or “houses” it will be the same for the search.

“… Full-Text Search* refers to techniques for searching a single computer-stored document or a collection in a Full-Text Database …”

Wikipedia

* FTS = Full-Text Search

FTS Features

This is a list of some features that we can find in a quite advanced Full-Text Search, to be used in a real-world website.

  • Stemming
  • Ranking
  • Stop-words removal
  • Multiple language support
  • Accent support
  • Indexing
  • Phrase search

Tested Products

Elastic and Solr are two software programs for the Full-Text Search that are popular today. There are others, but these are the only ones that I have used in my professional projects. They are Lucene based and written in Java.

Elasticsearch

Snap Market

Snap market was a startup where I worked which produced a mobile phone application for buying and selling used items with about half a million mobile users.

Issues

Management problems

In this project we used Elasticsearch which had already been set up on a FreeBSD system. We had some difficulty managing and synchronizing it.

Patching a Java plug-in

We had to apply some patches to a Java plugin that we used for the “decompound” of the words in German.

Java

@@ -52,7 +52,8 @@ public class DecompoundTokenFilter … {
-            posIncAtt.setPositionIncrement(0);
+            if (!subwordsonly)
+                posIncAtt.setPositionIncrement(0);
             return true;
         }

Apache Solr

GoalScout

GoalScout.com is website dedicated to showing sport videos uploaded by public users with about 30,000 videos.

Issues

Synchronization problems

The use of Solr for Full-Text Search in this project was a customer choice.
We have always had some problems synchronizing the data between PostgreSQL and Solr.

One way

After these problems we had to start doing all writes to PostgreSQL and all reads from Apache SOLR.

Existing Products

PROS

  • Full featured products
  • Resources (documentations, articles, …)

The products that I have write about are full-featured and advanced, there are many online resources regarding it: documentations, articles and frequently answered questions.

CONS

  • Synchronization
  • Mandatory use of driver (haystack, bungiesearch, …)

I have found some problems in synchronization. I have always had to use a driver that is a bottleneck between the Django and the search engine. In some cases I have had to fix the code.

Ops Oriented

The focus is on system integrations. I am more a #dev than an #ops so I don’t like to be forced to integrate various systems. I prefer developing and solving problems by writing python code.

FTS in PostgreSQL

PostgreSQL has been supporting Full-Text Search since 2008. Internally it uses “tsvecor” and “tsquery” data type to process the data to search. It has some indexes that can be used to speed up the search: GIN and GiST. PostgreSQL added support for Phrase Search in 2016. Support for JSON[B] Full-Text Search was added in version 10.

  • FTS Support since version 8.3 (2008)
  • TSVECTOR to represent text data
  • TSQUERY to represent search predicates
  • Special Indexes (GIN, GIST)
  • Phrase Search since version 9.6 (2016)
  • FTS for JSON[B] since version 10 (2017)

What are documents

The “document” is the general concept used in Full-Text Search and where the search is done. In a database a document can be a field on a table, the combination of many fields in a table or in different tables.

“… a document is the unit of searching in a Full-Text Search system; for example, a magazine article or email message …”

PostgreSQL documentation

Django Support

Features

The module django.contrib.postgres contains the support to Full-Text Search in Django, since the version 1.10. BRIN and GIN indexes were added in the version 1.11. The GIN index is very useful to speed up Full-Text Search.

  • Module django.contrib.postgres
  • FTS Support since version 1.10 (2016)
  • BRIN and GIN indexes since version 1.11 (2017)

Dev Oriented

The focus is on programming. The use of Postgres Full-Text Search in Django is more developer friendly.

Making queries

We can look at the functions of Full-Text Search in Django starting from the models present in the Django official documentation. We have a Blog and an Author class connected through an Entry class The Django ORM creates tables and constructs queries.

Python

from django.contrib.postgres.search import SearchVectorField
from django.db import models


class Blog(models.Model):
    name = models.CharField(max_length=100)
    tagline = models.TextField()
    lang = models.CharField(max_length=100, default='english')

    def __str__(self):
        return self.name


class Author(models.Model):
    name = models.CharField(max_length=200)
    email = models.EmailField()

    def __str__(self):
        return self.name


class Entry(models.Model):
    blog = models.ForeignKey(Blog, on_delete=models.CASCADE)
    headline = models.CharField(max_length=255)
    body_text = models.TextField()
    pub_date = models.DateField(auto_now_add=True)
    mod_date = models.DateField(auto_now=True)
    authors = models.ManyToManyField(Author)
    n_comments = models.IntegerField(default=0)
    n_pingbacks = models.IntegerField(default=0)
    rating = models.IntegerField(default=5)
    search_vector = SearchVectorField(null=True)

    def __str__(self):
        return self.headline

Standard queries

These are the basic searches that we can use on models in Django using “filter”.

Python

from blog.models import Author, Blog, Entry

Author.objects.filter(
  name__contains='Terry'
).values_list('name', flat=True)

SQL

SELECT "blog_author"."name"
  FROM "blog_author"
 WHERE "blog_author"."name"::text LIKE '%Terry%'
['Terry Gilliam', 'Terry Jones']

We can use “case-insensitive containment” in order to get more results.

Python

Author.objects.filter(
  name__icontains='ERRY'
).values_list('name', flat=True)

SQL

SELECT "blog_author"."name"
  FROM "blog_author"
 WHERE UPPER("blog_author"."name"::text) LIKE UPPER('%ERRY%')
['Terry Gilliam', 'Terry Jones', 'Jerry Lewis']

Unaccented query

By activating the unaccent PostgreSQL module, we can use the “unaccent” extension.

Python

from django.contrib.postgres.operations import UnaccentExtension


class Migration(migrations.Migration):
    ...

    operations = [
        UnaccentExtension(),
        ...
    ]

SQL

CREATE EXTENSION unaccent;

We can search without worrying about accented characters, useful in different languages.

Python

Author.objects.filter(
  name__unaccent='Helene Joy'
).values_list('name', flat=True)

SQL

SELECT "blog_author"."name"
  FROM "blog_author"
 WHERE UNACCENT("blog_author"."name") = (UNACCENT('Helene Joy'))
['Hélène Joy']

Warning
Queries using this filter will generally perform full table scans, which can be slow on large tables.

Trigram similarity

By activating the trigram PosgreSQL module, we can use the “trigram” extension.

Python

from django.contrib.postgres.operations import TrigramExtension


class Migration(migrations.Migration):
    ...

    operations = [
        TrigramExtension(),
        ...
    ]

SQL

CREATE EXTENSION pg_trgm;

A “trigram” is a group of three consecutive characters taken from a string. We can evaluate the similarity of two strings by the number of “trigrams” they share.

Python

Author.objects.filter(
  name__trigram_similar='helena'
).values_list('name', flat=True)

SQL

SELECT "blog_author"."name"
  FROM "blog_author"
 WHERE "blog_author"."name" % 'helena'
['Helen Mirren', 'Helena Bonham Carter']

Search lookup

This is the base search lookup of Django and with this we can execute a real Full-Text Search call on a field.

Python

Entry.objects.filter(
  body_text__search='Cheese'
).values_list('headline', flat=True)

SQL

SELECT "blog_entry"."headline"
  FROM "blog_entry"
 WHERE to_tsvector(COALESCE("blog_entry"."body_text", ''))
       @@ (plainto_tsquery('Cheese')) = true
['Cheese on Toast recipes', 'Pizza Recipes']

To execute a more complex query we have to use three new Django objects: SearchVector, SearchQuery, SearchRank.

SearchVector

We can use a “SearchVector” to build our document in more fields of the same object or connected objects too. Then we can filter on the document with a string.

Python

from django.contrib.postgres.search import SearchVector

search_vector = SearchVector('body_text', 'blog__name')

Entry.objects.annotate(
  search=search_vector
).filter(
  search='Cheese'
).values_list('headline', flat=True)

SQL

SELECT "blog_entry"."headline"
  FROM "blog_entry"
 INNER JOIN "blog_blog"
    ON ("blog_entry"."blog_id" = "blog_blog"."id")
 WHERE to_tsvector(
         COALESCE("blog_entry"."body_text", '') || ' ' ||
         COALESCE("blog_blog"."name", '')
       ) @@ (plainto_tsquery('Cheese')) = true
['Cheese on Toast recipes', 'Pizza Recipes']

SearchQuery

When we insert text into a Full-Text Search by using a “search query” we can apply “stemming” and “stop-word removal” even to the user texts. and to these we can apply basic logical operations.

NOT

Python

from django.contrib.postgres.search import SearchQuery

search_query = ~SearchQuery('toast')
search_vector = SearchVector('body_text')

Entry.objects.annotate(
  search=search_vector
).filter(
  search=search_query
).values_list('headline', flat=True)

SQL

SELECT "blog_entry"."headline"
  FROM "blog_entry"
 WHERE to_tsvector(COALESCE("blog_entry"."body_text", ''))
       @@ (!!(plainto_tsquery('toast'))) = true
['Pizza Recipes', 'Pain perdu']

OR

Python

search_query = SearchQuery('cheese') | SearchQuery('toast')
search_vector = SearchVector('body_text')

Entry.objects.annotate(
  search=search_vector
).filter(
  search=search_query
).values_list('headline', flat=True)

SQL

SELECT "blog_entry"."headline"
  FROM "blog_entry"
 WHERE to_tsvector(COALESCE("blog_entry"."body_text", ''))
       @@ ((
         plainto_tsquery('cheese') ||
         plainto_tsquery('toast')
       )) = true
['Cheese on Toast recipes', 'Pizza Recipes']

AND

Python

search_query = SearchQuery('cheese') & SearchQuery('toast')
search_vector = SearchVector('body_text')

Entry.objects.annotate(
  search=search_vector
).filter(
  search=search_query
).values_list('headline', flat=True)

SQL

SELECT "blog_entry"."headline"
  FROM "blog_entry"
 WHERE to_tsvector(COALESCE("blog_entry"."body_text", ''))
       @@ ((
         plainto_tsquery('cheese') &&
         plainto_tsquery('toast')
       )) = true
['Cheese on Toast recipes']

SearchRank

We can use the PostgreSQL “rank” to calculate the score of a document in relation to a searched text, and we can use it to filter and sort it.

Python

from django.contrib.postgres.search import SearchRank

search_vector = SearchVector('body_text')
search_query = SearchQuery('cheese')
search_rank = SearchRank(search_vector, search_query)

Entry.objects.annotate(
  rank=search_rank
).order_by(
  '-rank'
).values_list('headline', 'rank'))

SQL

SELECT "blog_entry"."headline",
       ts_rank(
         to_tsvector(COALESCE("blog_entry"."body_text", '')),
         plainto_tsquery('cheese')
       ) AS "rank"
  FROM "blog_entry"
 ORDER BY "rank" DESC
[
  ('Cheese on Toast recipes', 0.0889769),
  ('Pizza Recipes', 0.0607927),
  ('Pain perdu', 0.0)
]

Search configuration

We can setup the “search vector” or “search query” to execute “stemming” or “stop words removal” for a specific language.

Python

language = 'french'

search_vector = SearchVector('body_text', config=language)
search_query = SearchQuery('œuf', config=language)

Entry.objects.annotate(
  search=search_vector
).filter(
  search=search_query
).values_list('headline', flat=True)

SQL

SELECT "blog_entry"."headline"
  FROM "blog_entry"
 WHERE to_tsvector(
         'french'::regconfig,
         COALESCE("blog_entry"."body_text", ''))
       @@ (
         plainto_tsquery('french'::regconfig, 'œuf')
       ) = true
['Pain perdu']

We can get the language from a class field.

Python

from django.db.models import F

language = F('blog__lang')

search_vector = SearchVector('body_text', config=language)
search_query = SearchQuery('œuf', config=language)

Entry.objects.annotate(
  search=search_vector
).filter(
  search=search_query
).values_list('headline', flat=True)

SQL

SELECT "blog_entry"."headline"
  FROM "blog_entry"
 INNER JOIN "blog_blog"
    ON ("blog_entry"."blog_id" = "blog_blog"."id")
 WHERE to_tsvector(
         "blog_blog"."lang"::regconfig,
         COALESCE("blog_entry"."body_text", '')
       )
       @@ (
         plainto_tsquery("blog_blog"."lang"::regconfig, 'œuf')
       ) = true
['Pain perdu']

Queries weighting

It’s possible to set up the search to give a different weight to various fields and then use these values in searches.

Python

search_vector = SearchVector('body_text', weight='A') +
                SearchVector('headline', weight='B')
search_query = SearchQuery('cheese')
search_rank = SearchRank(search_vector, search_query)

Entry.objects.annotate(
  rank=search_rank
).order_by(
  '-rank'
).values_list('headline', 'rank')

SQL

SELECT "blog_entry"."headline",
       ts_rank((
         setweight(
           to_tsvector(COALESCE("blog_entry"."body_text", '')),
           'A'
         )
         ||
         setweight(
           to_tsvector(COALESCE("blog_entry"."headline", '')),
           'B'
         )
       ), plainto_tsquery('cheese')) AS "rank"
  FROM "blog_entry"
 ORDER BY "rank" DESC
[
  ('Cheese on Toast recipes', 0.896524),
  ('Pizza Recipes', 0.607927),
  ('Pain perdu', 0.0)
]

SearchVectorField

If we want to speed up and simplify the query execution we can add a “search vector field” to the model and then execute searches on this specific field.

Python

Entry.objects.filter(
  search_vector='cheese'
).values_list('headline', flat=True)

SQL

SELECT "blog_entry"."headline"
  FROM "blog_entry"
 WHERE "blog_entry"."search_vector"
       @@ (plainto_tsquery('cheese')) = true
['Cheese on Toast recipes', 'Pizza Recipes']

We have to update this field manually, for example executing a command periodically, using Django signals or with a PostgreSQL triggers.

Python

search_vector = SearchVector('body_text')

Entry.objects.update(search_vector=search_vector)

SQL

UPDATE "blog_entry"
   SET "search_vector" = to_tsvector(
         COALESCE("blog_entry"."body_text", ''))

www.concertiaroma.com

“… today’s shows in the Capital”

www.concertiaroma.com is a website used for searching for shows, festivals, bands and venues in the city of Rome and has been online since 2014.

The numbers of the project:

  • about 1,000 venues
  • about 15,000 bands
  • more than 16,000 shows
  • about 200 festivals
  • about 30,000 user/month

Version 2.0

The old version of the website was developed some years ago with Django 1.7 and it runs on Python 2.7. The data was managed by PostgreSQL version 9.1 and the search was performed by using the SQL LIKE syntax.

  • Python 2.7
  • Django 1.7
  • PostgreSQL 9.1
  • SQL LIKE

Version 3.0

The new version, recently released, was developed with Django 1.11 and it runs on Python 3.6. The data is managed by PostgreSQL 9.6 and the search uses its Full-Text Search engine.

  • Python 3.6
  • Django 1.11
  • PostgreSQL 9.6
  • PG FTS

Band models

We can look at the functions of Full-Text Search in www.concertiaroma.com starting from the same models present in the project. We have a Genre class connected to a Band class.

Python

from django.db import models
from .managers import BandManager


class Genre(models.Model):
    name = models.CharField(max_length=255)


class Band(models.Model):
    nickname = models.CharField(max_length=255)
    description = models.TextField()
    genres = models.ManyToManyField(Genre)

    objects = BandManager()

Band Manager

This is an example of “Manager” for the Band class which defines a search method that contains all the Full-Text Search logic.

Python

from django.contrib.postgres.aggregates import StringAgg
from django.contrib.postgres.search import (
    SearchQuery, SearchRank, SearchVector, TrigramSimilarity,
)
from django.db import models

search_vectors = (
    SearchVector('nickname', weight='A', config='english') +
    SearchVector(
        StringAgg('genres__name', delimiter=' '),
        weight='B', config='english') +
    SearchVector('description', weight='D', config='english')
)


class BandManager(models.Manager):

    def search(self, text):
        search_query = SearchQuery(text, config='english')
        search_rank = SearchRank(search_vectors, search_query)
        trigram_similarity = TrigramSimilarity('nickname', text)
        return self.get_queryset().annotate(
            search=search_vectors
        ).filter(
            search=search_query
        ).annotate(
            rank=search_rank + trigram_similarity
        ).order_by('-rank')

Band Test Setup

To better understand the mechanism, we can take into consideration an example of a simplified test.

In the test setup we defined the example data that we will use afterwards to test our search: two bands and two musical genres that we assigned to the two bands.

Python

from collections import OrderedDict
from django.test import TestCase
from .models import Band, Genre


class BandTest(TestCase):

    def setUp(self):
        # Genres
        blues, _ = Genre.objects.get_or_create(name='Blues')
        jazz, _ = Genre.objects.get_or_create(name='Jazz')
        swing, _ = Genre.objects.get_or_create(name='Swing')
        # Bands
        ella_fitzgerald, _ = Band.objects.get_or_create(
            nickname='Ella Fitzgerald',
            description=(
                'Ella Jane Fitzgerald (25 Apr 1917-15 Jun 1996)'
                ' was an American jazz singer often referred to'
                ' as the First Lady of Song, Queen of Jazz and '
                'Lady Ella. She was noted for her purity of '
                'tone, impeccable diction, phrasing and '
                'intonation, and a horn-like improvisational '
                'ability, particularly in her scat singing.'))
        django_reinhardt, _ = Band.objects.get_or_create(
            nickname='Django Reinhardt',
            description=(
                'Jean Django Reinhardt (23 Jan 1910-16 May 1953)'
                ' was a Belgian-born, Romani French jazz '
                'guitarist and composer, regarded as one of the '
                'greatest musicians of the twentieth century. He'
                ' was the first jazz talent to emerge from '
                'Europe and remains the most significant.'))
        louis_armstrong, _ = Band.objects.get_or_create(
            nickname='Louis Armstrong',
            description=(
                'Louis Armstrong (4 Aug 1901-6 Jul 1971), '
                'nicknamed Satchmo, Satch and Pops, was an '
                'American trumpeter, composer, singer and '
                'occasional actor who was one of the most '
                'influential figures in jazz. His career spanned'
                ' five decades, from the 1920s to the 1960s, '
                'and different eras in the history of jazz.'))
        # Bands and Genres
        ella_fitzgerald.genres.add(blues)
        django_reinhardt.genres.add(jazz)
        louis_armstrong.genres.add(blues, swing)

    def test_band_search(self):
        # ...

Contents from “Wikipedia, The Free Encyclopedia”:

Band Test Method

In the search test on the bands we simply invoked the search method giving a search text and we got back the list of values for the fields ‘nickname’ and ‘rate’. ‘nickname’ is stored on the band table, while ‘rate’ is calculated by our search method at runtime.

Python

from collections import OrderedDict
from django.test import TestCase
from .models import Band, Genre


class BandTest(TestCase):

    def setUp(self):
        # ...

    def test_band_search(self):
        band_queryset = Band.objects.search(
            'jazz').values_list('nickname', 'rank')
        band_objects = list(
            OrderedDict(band_queryset).items())
        band_list = [
            ('Django Reinhardt', 0.265124),
            ('Ella Fitzgerald', 0.0759909),
            ('Louis Armstrong', 0.0759909)]
        self.assertSequenceEqual(band_objects, band_list)

In this example we compared our search results with a list of lists where we defined the pair composed of the band’s nickname and the numerical value that is the search rate, or in other words, the numerical value that defines the importance of that term.

What’s next

We have seen a simplified use of the current features of Django and the PostgreSQL Full-Text Search.

Both of these software programs are getting better in these fields and these are some of the features that can be available in the new future.

  • Advanced Misspelling support
  • Multiple language configuration
  • Search suggestions
  • SearchVectorField with triggers
  • JSON/JSONB Full-Text Search
  • RUM indexing

Conclusions

In conclusion, the following are the conditions for evaluating the implementation of a Full-Text Search with PostgreSQL in Django:

  • not having any extra dependencies
  • not doing very complex searches
  • managing all the components easily
  • avoiding data synchronization between different systems
  • having PostgreSQL already in the stack
  • operating in a Python-only environment

Acknowledgements

20tab

For all the support (www.20tab.com )

Marc Tamlyn

For django.contrib.postgres (github.com/mjtamlyn)

Resources

These are the resources that I used to prepare this article and to develop the search function I have showed you.

Thank you

CC BY-SA

This article and related presentation is released with Creative Commons Attribution ShareAlike license.
creativecommons.org/licenses/by-sa

Source Code

I published the source code used in this article on GitHub.
github.com/pauloxnet/django_queries

Slides

You can download the presentation from my SpeakerDeck account.
speakerdeck.com/pauloxnet

 2017-12-22   11 min read   0 Comments  Article   #Conference 🗣️ #Database 🗃️ #Django 🦄 #Full-text search 🔍 #PostgreSQL 🐘 #Python 🐍 #SQL 🆔 #Talk 💬

Read comments