Faster SQLite LIKE Queries Using FTS5 Trigram Indexes

Originally Published: January 8, 2024

In most of my projects that have a low to moderate amount of read-only data (10MiB - 100GiB), I use SQLite as the database. It’s fast (if you have the right indexes), reliable, easy to use, and there’s no need to run and manage a separate database server.

Recently I had a requirement to do substring searches on a column with millions of small strings. A linear scan on this data took over 15 seconds. Searching online, I found that SQLite’s FTS5 extension supports a trigram tokenizer, which can speed up LIKE and GLOB queries. There’s not much documentation focused on this use case (the official documentation is mostly about full-text search), so I thought I’d document my findings here.

TL;DR

$ sqlite3 origins.db
# Schema
sqlite> .schema
CREATE TABLE origins (name text primary key) strict;
# There are 18.2 million rows.
sqlite> select count() from origins;
18265721
# LIKE Query Before Index (1.7s)
sqlite> .timer on
sqlite> select count() from origins where name like '%google%';
4692
Run Time: real 1.758 user 1.626613 sys 0.129966
# Create The Index
sqlite> create virtual table origins_trigram using fts5(name, tokenize='trigram', detail='none');
Run Time: real 0.002 user 0.000682 sys 0.001100
# Insert Data Into The Index (144s)
sqlite> insert into origins_trigram select name from origins;
Run Time: real 144.802 user 136.300882 sys 5.993971
# LIKE Query After Index (14ms, over 100x faster)
sqlite> select count() from origins_trigram where name like '%google%';
4692
Run Time: real 0.014 user 0.010803 sys 0.003055
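
One caveat worth knowing up front: the trigram tokenizer works on 3-character sequences, so (as far as I understand from the FTS5 documentation) a LIKE or GLOB pattern needs a run of at least three consecutive non-wildcard characters for the index to kick in; shorter patterns fall back to scanning the FTS5 table. A hypothetical example, not timed here:

# Only two literal characters between the wildcards, so the trigram
# index can't help and this query will be slow again (though still correct).
sqlite> select count() from origins_trigram where name like '%go%';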

Creating a Sample Database

We’ll start by creating a test database loaded with over 18 million origins from the CrUX (Chrome User Experience Report) dataset.

$ curl https://raw.githubusercontent.com/crissyfield/crux-dumps/main/2023/11/meta.json | \
jq -r '.files[].url' | \
xargs -n 1 curl -L | \
xzcat >! origins.txt
$ wc -l origins.txt
18265721 origins.txt

We get a total of 18.2 million origins.

Let’s insert this into a new SQLite database.

$ sqlite3 origins.db
SQLite version 3.44.2 2023-11-24 11:41:44
Enter ".help" for usage hints.
sqlite> .timer on
sqlite> create table origins (name text primary key) strict;
Run Time: real 0.002 user 0.000376 sys 0.000922
sqlite> .import origins.txt origins
sqlite> select count(*) from origins;
18265721
Run Time: real 0.225 user 0.074397 sys 0.150993
sqlite> vacuum;
Run Time: real 17.217 user 4.159183 sys 9.244823
sqlite> .quit
$ wc -c origins.db
1340198912

The whole import took just over a minute to run, and now we have a 1.3GiB database with 18.2 million rows.

LIKE Queries Without Any Index

Let’s time how long simple substring queries take to get a baseline.

$ sqlite3 origins.db
SQLite version 3.44.2 2023-11-24 11:41:44
Enter ".help" for usage hints.
sqlite> .timer on
sqlite> select count() from origins where name like '%google%';
4692
Run Time: real 1.758 user 1.626613 sys 0.129966
sqlite> select count() from origins where name like '%google%';
4692
Run Time: real 1.749 user 1.620124 sys 0.127537
sqlite> select count() from origins where name like '%yahoo%';
967
Run Time: real 1.804 user 1.653220 sys 0.143775
sqlite> select count() from origins where name like '%domaindoesnotexist%';
0
Run Time: real 1.838 user 1.701729 sys 0.134198

So a simple substring query on 18.2 million rows takes about 1.8 seconds, no matter how many rows match.
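
If you want to confirm what SQLite is doing here, prefix the query with explain query plan; since a leading wildcard rules out the primary key index, it should report a full scan over all 18.2 million rows (the exact output format varies between SQLite versions):

sqlite> explain query plan select count() from origins where name like '%google%';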

Creating an FTS Trigram Index With Default Settings

Let’s create an FTS5 trigram index for our table.

$ sqlite3 origins.db
SQLite version 3.44.2 2023-11-24 11:41:44
Enter ".help" for usage hints.
sqlite> .timer on
sqlite> create virtual table origins_trigram using fts5(name, tokenize='trigram');
Run Time: real 0.002 user 0.000594 sys 0.000883
sqlite> insert into origins_trigram select name from origins;
Run Time: real 191.576 user 175.199979 sys 12.122090
sqlite> vacuum;
Run Time: real 49.991 user 9.018166 sys 24.368622
sqlite> .quit
$ wc -c origins.db
3738050560

Took about 3 minutes to populate the index and then 50 seconds to vacuum. The database size has increased to 3.7GiB, nearly 3x the original size. (We’ll see how to reduce this later.)
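
If you’re curious where the extra space goes and your sqlite3 build includes the dbstat virtual table (it requires SQLITE_ENABLE_DBSTAT_VTAB), you can group page sizes by table. This is just a sketch I find handy, not part of the timings above; expect most of the growth to be in the FTS5 shadow tables, including origins_trigram_content, which stores a full copy of the indexed text.

# Approximate on-disk size per table/index, in MiB.
sqlite> select name, sum(pgsize)/1024/1024 as size_mib from dbstat group by name order by size_mib desc;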

Let’s run the same queries we ran before, this time on the FTS5 index table:

$ sqlite3 origins.db
SQLite version 3.44.2 2023-11-24 11:41:44
Enter ".help" for usage hints.
sqlite> .timer on
sqlite> select count() from origins_trigram where name like '%google%';
4692
Run Time: real 0.015 user 0.011322 sys 0.003880
sqlite> select count() from origins_trigram where name like '%google%';
4692
Run Time: real 0.012 user 0.008554 sys 0.002729
sqlite> select count() from origins_trigram where name like '%yahoo%';
967
Run Time: real 0.008 user 0.006093 sys 0.001908
sqlite> select count() from origins_trigram where name like '%domaindoesnotexist%';
0
Run Time: real 0.028 user 0.025801 sys 0.002801

The queries now run in about 10-30 milliseconds instead of roughly 1.8 seconds. That’s a speedup of roughly 60-200x, depending on the query!
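
The count() queries are just for benchmarking; since the FTS5 table stores its own copy of the name column, you can also select the matching rows from it directly:

sqlite> select name from origins_trigram where name like '%yahoo%' limit 5;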

Reducing The Index Size Using detail='none'

A full FTS5 index stores position and column information to support features like phrase and NEAR queries, and that extra information takes a lot of disk space. We don’t need any of it for LIKE queries, so we can shrink the index by creating it with detail='none'.

$ sqlite3 origins.db
SQLite version 3.44.2 2023-11-24 11:41:44
Enter ".help" for usage hints.
sqlite> drop table origins_trigram;
sqlite> vacuum;
sqlite> .timer on
sqlite> create virtual table origins_trigram using fts5(name, tokenize='trigram', detail='none');
Run Time: real 0.002 user 0.000682 sys 0.001100
sqlite> insert into origins_trigram select name from origins;
Run Time: real 144.802 user 136.300882 sys 5.993971
sqlite> vacuum;
Run Time: real 28.869 user 8.047915 sys 17.717558
sqlite> .quit
$ wc -c origins.db
2803511296

Creating the index with detail='none' is noticeably faster (145s vs 192s) and the index takes a bit over half the storage in our case (3.7 - 1.3 = 2.4GiB before vs 2.8 - 1.3 = 1.5GiB now).

Just to verify, our queries still run just as fast as with a full FTS5 index:

$ sqlite3 origins.db
SQLite version 3.44.2 2023-11-24 11:41:44
Enter ".help" for usage hints.
sqlite> .timer on
sqlite> select count() from origins_trigram where name like '%google%';
4692
Run Time: real 0.014 user 0.010803 sys 0.003055
sqlite> select count() from origins_trigram where name like '%google%';
4692
Run Time: real 0.014 user 0.010935 sys 0.003145
sqlite> select count() from origins_trigram where name like '%yahoo%';
967
Run Time: real 0.008 user 0.006360 sys 0.001771
sqlite> select count() from origins_trigram where name like '%domaindoesnotexist%';
0
Run Time: real 0.032 user 0.030455 sys 0.001660

So we’ve cut the index size from 2.4GiB down to 1.5GiB while keeping the same query performance.
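
One last FTS5 command that may be worth running after a bulk load like this: 'optimize' merges the internal b-tree segments of the full-text index into a single b-tree. I haven’t timed it on this dataset, but it can make the index a little smaller and queries a little faster, at the cost of another long-running write:

sqlite> insert into origins_trigram(origins_trigram) values('optimize');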