Full Text Search Explained
Search on the web has revolutionized the way we use the internet. Google can search the entire internet and provide a great experience to users who are trying to find content. Why then, is search missing or not optimal in so many of the applications we use today? Most applications have far less data to search through than Google does, but yet they commonly do a poor job. This article will introduce you to "Full Text Search" a powerful way of indexing data that can dramatically improve the user experience in applications.
A full text search engine is a mechanism for retrieving data that is optimized to efficiently find the data when the user knows only some portion of the data they are looking for.
Note
This article is intended for those just getting up to speed on full text search. My intent is not to perfectly describe every detail of full text search, but to give a simple enough explanation of the internals so that beginners can have a conceptual model of how search works rather than treating it as a black box.
Image Credit: brewbooks
If Your Application Was a Book
To gain an understanding of Full Text Search, it is helpful to make an analogy. Imagine that your application is a book. The database engine would be the table of contents. An ordered listing of what page some identifiable content is on. If you want to find chapter 7, for example, the table of contents is a very efficient way to find it. You simply look in the table of contents, jump straight to number 7 and find the content you wanted.
Now imagine that you want to recall an excellent quote that you have read in the book. You don't know which chapter it is in, you only remember that it contained the word "fox". The table of contents is of no use to you, because it only contains chapter numbers and titles. So you are stuck scanning the entire book to find the quote. Unless, your book has an index.
A book index is an ordered listing of the words that occur in the book. Each entry in the index will contain a list of page numbers where that word occurs. This allows the reader to quickly find any information in the book relevant to the word "fox" for instance.
Example
So, to explain full text search in terms of computing, rather than publishing. Imagine you have the following data in a document data store:
[
{
"id": 1,
"name": "Jon Smith",
"description": "Architect, Family"
},
{
"id": 2,
"name": "Frank Jones",
"description": "Family, Basketball"
},
{
"id": 3,
"name": "Jon Frank Whitaker",
"description": "Basketball, Architect"
},
]
If you want to find the information in record with id 2, then any database engine can efficiently retrieve that record for you because they will have an index that looks something like this:
1 => [Location On Disk: 987654]
2 => [Location On Disk: 549877]
3 => [Location On Disk: 654722]
If you need to find all the records that contain the word "Jon" however, the database engine will not be efficient at that query because it will have to scan through each record looking through the entire name
field and the entire description
field searching for the word "Jon".
A full text search index, on the other hand would look something like this:
"Architect"
1 => [Location On Disk: 987654]
3 => [Location On Disk: 654722]
"Basketball"
2 => [Location On Disk: 549877]
3 => [Location On Disk: 654722]
"Family"
1 => [Location On Disk: 987654]
2 => [Location On Disk: 549877]
"Frank"
2 => [Location On Disk: 549877]
3 => [Location On Disk: 654722]
"Jon"
1 => [Location On Disk: 987654]
3 => [Location On Disk: 654722]
"Jones"
2 => [Location On Disk: 549877]
"Smith"
1 => [Location On Disk: 987654]
"Whitaker"
3 => [Location On Disk: 654722]
Here, you can quickly see that records 1 and 3 contain the text "Jon" and you can efficiently load those records from disk.
Scoring
The engine has just searched for "Jon" and gotten two records. It will return both to the caller, and it will also attach a score to each record. The score is an indication of how close the record was to the search text compared to the other records in the result set. In our example record 1 "Jon Smith" might score higher than record 3 "Jon Frank Whitaker" because there is more text in record "Jon Frank Whitaker" that does not match "Jon" than there is in "Jon Smith".
Tokens
As you can see, at index time (the time at which the full text search index is constructed) the engine has broken down the string fields at word boundaries. The resulting substrings are called tokens. The tokens are the things that the engine will use to search for results. In our example, the strings were tokenized at word boundaries, but it is also possible to tokenize in other ways.
NGram tokenizers can be used to split the words up further. This is often useful when doing "search as you type" or auto-complete scenarios. An ngram tokenization example would like this:
"Frank" => [
"f", "fr", "fra", "fran", "frank",
"r", "ra", "ran", "rank",
"a", "an", "ank",
"n", "nk",
"k"
]
As you might guess, this will increase the size of the search index, but will allow the user to get "Frank" as a result by only typing "fr" for example.
Summary
Search can tremendously improve the user experience in your application. I hope this article has given you a decent enough view into how full text search works at a high level. My intent is not to bog you down with deep internal workings of full text search. I want to give you enough information about the internal workings so that terms like "index" and "tokens" have some meaning to you. Now get out there and add some search to your application!