The Absolute Basics of Indexing Data
Ever wondered how a search engine works? In this post I would like to show you a high-level view of the internal workings of a search engine and how it can be used to give fast access to your data. I won’t go into any technical details; what I am describing here holds true for any Lucene-based search engine, be it Lucene itself, Solr, or Elasticsearch.
Input
Normally a search engine is agnostic to the original source of the data it indexes. Most often you push data into it via an API, and that data already needs to be in the expected format, mostly strings and simple data types like integers. It doesn’t matter whether this data originally resides in a document on the filesystem, on a website, or in a database.
Search engines work with documents that consist of fields and values. Though not always used directly, you can think of documents as JSON documents. For this post imagine we are building a book database. In our simplified world a book just consists of a title and one or more authors. Here are two example documents:
```json
{
  "title": "Search Patterns",
  "authors": [ "Morville", "Callender" ]
}
{
  "title": "Apache Solr Enterprise Search Server",
  "authors": [ "Smiley", "Pugh" ]
}
```
Even though the structure of both documents is the same in our case, the format doesn’t need to be fixed. Both documents could have totally different attributes; nevertheless both could be stored in the same index. In reality you will try to keep the documents similar; after all, you need a way to handle them in your application.
Lucene itself doesn’t even have the concept of a key. But of course you need a key to identify your documents when updating them. Both Solr and Elasticsearch have document IDs that can either be chosen by the application or be autogenerated.
Analyzing
Every field that is indexed goes through a special process called analyzing. What it does can differ from field to field. In a simple case it might just split the text on whitespace and remove any punctuation, so Search Patterns would become two terms, Search and Patterns.
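As a purely illustrative sketch (this is not Lucene's actual implementation), such a simple analyzer could look like this:

```python
import re

def tokenize(text):
    """Split on whitespace and strip punctuation from each token."""
    tokens = []
    for raw in text.split():
        token = re.sub(r"[^\w]", "", raw)  # drop punctuation characters
        if token:
            tokens.append(token)
    return tokens

print(tokenize("Search Patterns"))  # ['Search', 'Patterns']
```

Real analyzers are considerably more sophisticated, but the idea is the same: a field value goes in, a stream of terms comes out.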
Index Structure
An inverted index, the structure search engines use, is similar to a map that contains search terms as keys and references to documents as values. This way the process of searching is just a lookup of the term in the index, a very fast operation. These might be the terms that are indexed for our example documents:
| Field  | Term       | Document Id |
|--------|------------|-------------|
| title  | Apache     | 2           |
|        | Enterprise | 2           |
|        | Patterns   | 1           |
|        | Search     | 1,2         |
|        | Server     | 2           |
|        | Solr       | 2           |
| author | Callender  | 1           |
|        | Morville   | 1           |
|        | Pugh       | 2           |
|        | Smiley     | 2           |
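A toy version of this structure for the title field of our two books can be sketched in a few lines of Python (again illustrative only, not how Lucene stores things on disk):

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            index[term].add(doc_id)
    return index

titles = {
    1: "Search Patterns",
    2: "Apache Solr Enterprise Search Server",
}
index = build_index(titles)
print(sorted(index["Search"]))  # [1, 2] -- searching is a single lookup
```

Note that the work of taking documents apart happens once, at indexing time; at query time only the cheap lookup remains.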
A real index contains more information, such as position information to enable phrase queries and term frequencies for calculating the relevance of a document for a certain search term.
As we can see, the index holds a reference to the document. This document, which is also stored with the search index, doesn’t necessarily have to be the same as our input document. For each field you can determine whether you would like to keep the original content, which is normally controlled via an attribute named stored. As a general rule, you should store all the fields that you would like to display with the search results. When indexing lots of complete books whose full text you don’t need to display on a results page, it might be better not to store it at all. You can still search it, as the terms are available in the index, but you can’t access the original content.
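In Solr, for example, this is set per field in the schema; the field name here is made up for illustration:

```xml
<!-- searchable via the index, but the original text is not kept -->
<field name="fulltext" type="text_general" indexed="true" stored="false"/>
```

Elasticsearch and raw Lucene offer the equivalent switch via the mapping and via Field options, respectively.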
More on Analyzing
Looking at the index structure above we can already imagine how the search process for a book might work. The user enters a term, e.g. Solr, and this term is then used to look up the documents that contain it. This works fine when the user types the term correctly. A search for solr won’t match in our current example.
To mitigate those difficulties we can use the analyzing process already mentioned above. Besides the tokenization that splits the field value into tokens, we can do further preprocessing like removing, adding, or modifying tokens (a TokenFilter in Lucene terms).
For our book case it might at first be enough to lowercase the incoming data, so a field value Solr will be stored as solr in the index. To enable the user to also search for Solr with an uppercase letter, we need to analyze the query as well. Often it is the same process that is used for indexing, but there are also cases where different analyzers make sense.
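Sketched in Python (purely illustrative), applying the same analyzer on both sides is what makes the match case-insensitive:

```python
def analyze(text):
    """Toy analyzer: tokenize on whitespace, then lowercase each token."""
    return [token.lower() for token in text.split()]

# Index time: field values are analyzed before the terms are stored.
indexed_terms = analyze("Apache Solr Enterprise Search Server")

# Query time: the same analyzer is applied to the user's input.
query_terms = analyze("Solr")

print(all(term in indexed_terms for term in query_terms))  # True
```

If only one side were lowercased, the query term and the indexed term would never be equal, and the lookup would fail.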
The analyzing process not only depends on the content of the documents (field types, language of text fields) but also on your application. Take one common scenario: adding synonyms for terms to the index. You might think that you can just take a huge synonym list like WordNet and add all of it to your application. In fact this might decrease the search experience of your users, as there will be too many false positives. Also, for certain terms from your users’ domain, WordNet might not contain the correct synonyms at all.
Duplication
When designing the index structure there are two competing forces: you usually optimize either for query speed or for index size. If you have a lot of data, you probably need to take care to store only the data you really need and to put only those terms in the index that are necessary for lookups. For smaller datasets the index size often doesn’t matter that much, and you can design your index for query performance.
Let’s look at an example that can make sense for both cases. In our book information system we would like to display an alphabetic navigation for the last name of the author. If the user clicks on A, all the books of authors starting with the letter A should be displayed. Using the Lucene query syntax you can do something like this with its wildcard support: just issue a query that contains the letter the user clicked and a trailing *, e.g. a*.
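Conceptually, such a wildcard query is a scan over the sorted term dictionary, starting at the prefix. A Python sketch of the idea (not Lucene's implementation, which uses far more efficient structures):

```python
import bisect

# Sorted term dictionary for the author field, as in the table above.
terms = ["Callender", "Morville", "Pugh", "Smiley"]

def prefix_terms(terms, prefix):
    """Return all terms starting with prefix, via binary search on the sorted list."""
    start = bisect.bisect_left(terms, prefix)
    matches = []
    for term in terms[start:]:
        if not term.startswith(prefix):
            break
        matches.append(term)
    return matches

print(prefix_terms(terms, "S"))  # ['Smiley']
```

The matching terms are then used to collect the document ids, just like for a normal term query.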
Wildcard queries have become very fast with recent Lucene versions, but they still have a query-time cost. You can also choose another approach: when indexing the data, you can add another field that stores just the first letter of the name. This is what the relevant configuration might look like in Elasticsearch; the concept is the same for Lucene and Solr:
```json
"author": {
  "type": "multi_field",
  "fields": {
    "author": { "type": "string" },
    "letter": { "type": "string", "analyzer": "single_char_analyzer" }
  }
}
```
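The single_char_analyzer is not a built-in analyzer. One possible definition (an assumption on my part, not taken from the original configuration) keeps the whole field value as a single token and truncates it to its first character:

```json
"analysis": {
  "analyzer": {
    "single_char_analyzer": {
      "type": "custom",
      "tokenizer": "keyword",
      "filter": [ "first_char" ]
    }
  },
  "filter": {
    "first_char": {
      "type": "truncate",
      "length": 1
    }
  }
}
```

Since each author name in the array is analyzed separately, each name contributes its own first letter.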
Under the hood, another term dictionary for the field author.letter will be created. For our example it will look like this:
| Field         | Term | Document Id |
|---------------|------|-------------|
| author.letter | C    | 1           |
|               | M    | 1           |
|               | P    | 2           |
|               | S    | 2           |
Now, instead of issuing a wildcard query on the author field, we can directly query the author.letter field with the letter. You can even build the navigation from all the terms in the index, using techniques like faceting that extract all the available terms for a field from the index.
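Building the navigation then amounts to collecting the distinct terms of the author.letter field, along with the number of documents for each, which is exactly what faceting does. Sketched in Python for our example:

```python
from collections import Counter

# author.letter terms per document, as in the table above.
doc_letters = {
    1: ["M", "C"],   # Morville, Callender
    2: ["S", "P"],   # Smiley, Pugh
}

# Facet counts: for each letter, how many documents contain it?
facets = Counter(letter for letters in doc_letters.values() for letter in letters)
print(sorted(facets.items()))  # [('C', 1), ('M', 1), ('P', 1), ('S', 1)]
```

The sorted letters, each with its document count, are all we need to render the alphabetic navigation.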
Conclusion
These are the basics of indexing data for a search engine. The inverted index structure makes searching really fast by moving some of the processing to the indexing phase. When we are not bound by index size concerns, we can design our index for query performance and add additional fields that duplicate some of the data. This design for queries is what makes search engines similar to how lots of NoSQL solutions are used.
If you’d like to go deeper into these topics, I recommend watching the talk What is in a Lucene index by Adrien Grand. He shows some of the concepts I have mentioned here (and a lot more), but also how they are implemented in Lucene.
Reference: The Absolute Basics of Indexing Data from our JCG partner Florian Hopf at the Dev Time blog.