Full-text search

Chapter 9.5.

In this chapter we’re going to make our movie title filter easier to use by adapting it to support partial matches, rather than requiring a match on the full title. So, for example, if a client wants to find The Breakfast Club they will be able to find it with just the query string title=breakfast.

There are a few different ways we could implement this feature in our codebase, but an effective and intuitive method (from a client point of view) is to leverage PostgreSQL’s full-text search functionality, which allows you to perform natural language searches on text fields in your database.

PostgreSQL full-text search is a powerful and highly configurable tool, and explaining how it works and the available options in full could easily take up a whole book in itself. So we’ll keep the explanations in this chapter high level, and focus on the practical implementation.

To implement a basic full-text search on our title field, we’re going to update our SQL query to look like this:

SELECT id, created_at, title, year, runtime, genres, version
FROM movies
WHERE (to_tsvector('simple', title) @@ plainto_tsquery('simple', $1) OR $1 = '') 
AND (genres @> $2 OR $2 = '{}')     
ORDER BY id

That looks pretty complicated at first glance, so let’s break it down and explain what’s going on.

The to_tsvector('simple', title) function takes a movie title and splits it into lexemes. We specify the simple configuration, which means that the lexemes are just lowercase versions of the words in the title^†. For example, the movie title "The Breakfast Club" would be split into the lexemes 'breakfast' 'club' 'the'.

The plainto_tsquery('simple', $1) function takes a search value and turns it into a formatted query term that PostgreSQL full-text search can understand. It normalizes the search value (again using the simple configuration), strips any special characters, and inserts the and operator & between the words. As an example, the search value "The Club" would result in the query term 'the' & 'club'.

The @@ operator is the matches operator. In our statement we are using it to check whether the generated query term matches the lexemes in the title. To continue the example, the query term 'the' & 'club' will match rows which contain both lexemes 'the' and 'club'.

There are a lot of specialist words in the paragraphs above, but if we illustrate it with a couple of examples it’s actually very intuitive:

// Return all movies where the title includes the case-insensitive word 'panther'.
/v1/movies?title=panther

// Return all movies where the title includes the case-insensitive words 'the' and 
// 'club'.
/v1/movies?title=the+club

Let’s go ahead and put this into action. Open up your internal/data/movies.go file and update the GetAll() method to use the new SQL query like so:

File: internal/data/movies.go

package data

...

func (m MovieModel) GetAll(title string, genres []string, filters Filters) ([]*Movie, error) {
    // Use full-text search for the title filter.
    query := `
        SELECT id, created_at, title, year, runtime, genres, version
        FROM movies
        WHERE (to_tsvector('simple', title) @@ plainto_tsquery('simple', $1) OR $1 = '') 
        AND (genres @> $2 OR $2 = '{}')     
        ORDER BY id`

    // Nothing else below needs to change.
    ...
}

If you’re following along, restart the application and try making some requests with different values for the movie title. You should find that partial searches now work as we described above.

For example:

$ curl "localhost:4000/v1/movies?title=panther"
{
    "movies": [
        {
            "id": 2,
            "title": "Black Panther",
            "year": 2018,
            "runtime": "134 mins",
            "genres": [
                "sci-fi",
                "action",
                "adventure"
            ],
            "version": 2
        }
    ]
}

$ curl "localhost:4000/v1/movies?title=the+club"
{
    "movies": [
        {
            "id": 4,
            "title": "The Breakfast Club",
            "year": 1985,
            "runtime": "97 mins",
            "genres": [
                "comedy"
            ],
            "version": 5
        }
    ]
}

Adding indexes

To keep our SQL query performing quickly as the dataset grows, it’s sensible to use indexes to help avoid full table scans and avoid generating the lexemes for the title field every time the query is run.

In our case it makes sense to create GIN indexes on both the genres field and the lexemes generated by to_tsvector(), both of which are used in the WHERE clause of our SQL query.

If you’re following along, go ahead and create a new pair of migration files:

$ migrate create -seq -ext .sql -dir ./migrations add_movies_indexes

Then add the following statements to the ‘up’ and ‘down’ migration files to create and drop the necessary indexes:

File: migrations/000003_add_movies_indexes.up.sql

CREATE INDEX IF NOT EXISTS movies_title_idx ON movies USING GIN (to_tsvector('simple', title));
CREATE INDEX IF NOT EXISTS movies_genres_idx ON movies USING GIN (genres);

File: migrations/000003_add_movies_indexes.down.sql

DROP INDEX IF EXISTS movies_title_idx;
DROP INDEX IF EXISTS movies_genres_idx;

Once that’s done, you should be able to execute the ‘up’ migration to add the indexes to your database:

$ migrate -path ./migrations -database $GREENLIGHT_DB_DSN up
3/u add_movies_indexes (38.638777ms)

Additional information

Non-simple configuration and more information

As mentioned above, you can also use a language-specific configuration for full-text searches instead of the simple configuration that we’re currently using. When you create lexemes or query terms using a language-specific configuration, it will strip out common words for the language and perform word stemming.

So, for example, if you use the english configuration, then the lexemes generated for "One Flew Over the Cuckoo's Nest" would be 'cuckoo' 'flew' 'nest' 'one'. Or with the spanish configuration, the lexemes for "Los lunes al sol" would be 'lun' 'sol'.

You can retrieve a list of all available configurations by running the \dF meta-command in PostgreSQL:

postgres=# \dF
              List of text search configurations
   Schema   |    Name    |              Description              
------------+------------+---------------------------------------
 pg_catalog | arabic     | configuration for arabic language
 pg_catalog | danish     | configuration for danish language
 pg_catalog | dutch      | configuration for dutch language
 pg_catalog | english    | configuration for english language
 ...

And if you wanted to use the english configuration to search our movies, you could update the SQL query like so:

SELECT id, created_at, title, year, runtime, genres, version
FROM movies
WHERE (to_tsvector('english', title) @@ plainto_tsquery('english', $1) OR $1 = '') 
AND (genres @> $2 OR $2 = '{}')     
ORDER BY id

If you’d like to learn more about PostgreSQL full-text search, the official documentation is an excellent source of information and examples.

Using STRPOS and ILIKE

If you don’t want to use full-text search for the partial movie title lookup, some alternatives are the PostgreSQL STRPOS() function and ILIKE operator.

The PostgreSQL STRPOS() function allows you to check for the existence of a substring in a particular database field. We could use it in our SQL query like this:

SELECT id, created_at, title, year, runtime, genres, version
FROM movies
WHERE (STRPOS(LOWER(title), LOWER($1)) > 0 OR $1 = '') 
AND (genres @> $2 OR $2 = '{}')     
ORDER BY id

From a client perspective, the downside of this is that it may return some unintuitive results. For example, searching for title=the would return both The Breakfast Club and Black Panther in our dataset.

From a server perspective it’s also not ideal for large datasets. Because there’s no effective way to index the title field to see if the STRPOS() condition is met, it means the query could potentially require a full-table scan each time it is run.

Another option is the ILIKE operator, which allows you to find rows which match a specific (case-insensitive) pattern. We could use it in our SQL query like so:

SELECT id, created_at, title, year, runtime, genres, version
FROM movies
WHERE (title ILIKE $1 OR $1 = '') 
AND (genres @> $2 OR $2 = '{}')     
ORDER BY id

This approach would be better from a server point of view because it’s possible to create an index on the title field using the pg_trgm extension and a GIN index (for details please see this post).

From the client side it’s arguably better that the STRPOS() approach too, as they can control the matching behavior by prefixing/suffixing the search term with a % wildcard character (which will need to be escaped to %25 in the URL query string). For example, to search for movies with a title that starts with “the”, a client could send the query string parameter title=the%25.