Elasticsearch: Text analysis for content enrichment
Every text search solution is only as powerful as the text analysis capabilities it offers. Lucene is one such open source information retrieval library, offering many text analysis possibilities. In this post, we will cover some of the main text analysis features offered by ElasticSearch that you can use to enrich your search content.
Content Enrichment
Taking the example of a typical eCommerce site, serving the right content in search to the end customer is very important for the business. The text analysis strategy provided by the search solution plays a very big role in it. As a search user, I would expect typical search behavior for my query to automatically:
- look for synonyms matching my query text
- match singular and plural words, or words sounding similar to the entered query text
- not allow searching on protected words
- allow searching for words mixed with numeric or special characters
- not allow searching on HTML tags
- allow matching text based on the proximity of letters and the number of matching letters
Enriching the content here means adding the above search capabilities to your content while indexing and searching.
Lucene Text Analysis
Lucene is an information retrieval (IR) library providing full text indexing and searching capability. For a quick reference, check the post Text Analysis inside Lucene. In Lucene, a document contains fields of text. Analysis is the process of converting field text further into terms, and these terms are used to match a search query. There are three main classes involved in the whole analysis process:
- Analyzer: An Analyzer is responsible for building a TokenStream which can be consumed by the indexing and searching processes.
- Tokenizer: A Tokenizer is a TokenStream and is responsible for breaking up incoming text into Tokens. In most cases, an Analyzer will use a Tokenizer as the first step in the analysis process.
- TokenFilter: A TokenFilter is also a TokenStream and is responsible for modifying Tokens that have been created by the Tokenizer.
A common usage style of Tokenizers and TokenFilters inside an Analyzer is the chaining pattern, which lets you build complex analyzers from simple Tokenizer/TokenFilter building blocks. Tokenizers start the analysis process by demarcating the character input into tokens (mostly these correspond to words in the original text). TokenFilters then take over the remainder of the analysis, initially wrapping a Tokenizer and successively wrapping nested TokenFilters.
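As a rough sketch of this chaining pattern, the following assembles a custom Analyzer out of a Tokenizer and two TokenFilters. This is a minimal sketch, assuming Lucene 4.x-era classes; package locations and constructor signatures vary across Lucene releases.

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopAnalyzer;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.util.Version;

Analyzer analyzer = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        // Step 1: the Tokenizer demarcates the character input into tokens
        Tokenizer tokenizer = new WhitespaceTokenizer(Version.LUCENE_40, reader);
        // Step 2: a TokenFilter wraps the Tokenizer and lowercases each token
        TokenStream chain = new LowerCaseFilter(Version.LUCENE_40, tokenizer);
        // Step 3: a further TokenFilter wraps the chain and drops English stop words
        chain = new StopFilter(Version.LUCENE_40, chain, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
        return new TokenStreamComponents(tokenizer, chain);
    }
};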
ElasticSearch Text Analysis
ElasticSearch uses Lucene's built-in text analysis capabilities and allows you to enrich your search content. As stated above, text analysis is divided into filters, tokenizers and analyzers. ElasticSearch offers quite a few built-in analyzers with preconfigured tokenizers and filters. For a detailed list of the existing analyzers, check the complete Analysis documentation.
Update Analysis Settings
ElasticSearch allows you to dynamically update index settings and mappings. To set the analysis settings from the Java API client:
import static org.elasticsearch.common.settings.ImmutableSettings.settingsBuilder;
import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;

import org.elasticsearch.action.admin.indices.create.CreateIndexRequestBuilder;
import org.elasticsearch.common.settings.Settings;

Settings settings = settingsBuilder().loadFromSource(jsonBuilder()
    .startObject()
        // Add analyzer settings
        .startObject("analysis")
            .startObject("filter")
                // Stop word filter reading the word list from a file
                .startObject("test_filter_stopwords_en")
                    .field("type", "stop")
                    .field("stopwords_path", "stopwords/stop_en")
                .endObject()
                // Snowball stemmer for English
                .startObject("test_filter_snowball_en")
                    .field("type", "snowball")
                    .field("language", "English")
                .endObject()
                // Word delimiter with protected words and a custom type table
                .startObject("test_filter_worddelimiter_en")
                    .field("type", "word_delimiter")
                    .field("protected_words_path", "worddelimiters/protectedwords_en")
                    .field("type_table_path", "typetable")
                .endObject()
                // Synonym filter reading synonyms from a file
                .startObject("test_filter_synonyms_en")
                    .field("type", "synonym")
                    .field("synonyms_path", "synonyms/synonyms_en")
                    .field("ignore_case", true)
                    .field("expand", true)
                .endObject()
                // Edge n-gram filter for partial word matching
                .startObject("test_filter_ngram")
                    .field("type", "edgeNGram")
                    .field("min_gram", 2)
                    .field("max_gram", 30)
                .endObject()
            .endObject()
            .startObject("analyzer")
                // Custom analyzer chaining the filters defined above
                .startObject("test_analyzer")
                    .field("type", "custom")
                    .field("tokenizer", "whitespace")
                    .field("filter", new String[]{"lowercase",
                            "test_filter_worddelimiter_en",
                            "test_filter_stopwords_en",
                            "test_filter_synonyms_en",
                            "test_filter_snowball_en"})
                    .field("char_filter", "html_strip")
                .endObject()
            .endObject()
        .endObject()
    .endObject().string()).build();

CreateIndexRequestBuilder createIndexRequestBuilder =
        client.admin().indices().prepareCreate(indexName);
createIndexRequestBuilder.setSettings(settings);
You can also define your index settings in the configuration file. The paths mentioned in the above example are relative to the config directory of the installed ElasticSearch server. The above example creates custom filters and analyzers for your index; ElasticSearch ships with many existing filters and tokenizers, allowing you to select the right combination for your data.
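Once the index is created, you can verify the analysis chain through the Analyze API. The following is a minimal sketch, assuming the 0.90-era Java client and the index created above; the sample text is made up for illustration.

import org.elasticsearch.action.admin.indices.analyze.AnalyzeResponse;

// Run a sample text through the custom analyzer and print the resulting terms
AnalyzeResponse response = client.admin().indices()
        .prepareAnalyze(indexName, "<b>The iPod-nano is running</b>")
        .setAnalyzer("test_analyzer")
        .execute().actionGet();

for (AnalyzeResponse.AnalyzeToken token : response.getTokens()) {
    System.out.println(token.getTerm());
}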
Synonyms
Synonyms are words with the same or similar meaning. Synonym expansion is where we take variants of a word and supply them to the search engine at indexing and/or query time. To add a synonym filter to the settings for the index:
.startObject("test_filter_synonyms_en") .field("type", "synonym") .field("synonyms_path", "synonyms/synonyms_en") .field("ignore_case", true) .field("expand", true) .endObject()
Check the Synonym Filter documentation for the complete syntax. You can add synonyms in Solr or WordNet format. Have a look at the Solr synonym format for further examples:
# If expand==true, "ipod, i-pod, i pod" is equivalent to the explicit mapping:
ipod, i-pod, i pod => ipod, i-pod, i pod

# If expand==false, "ipod, i-pod, i pod" is equivalent to the explicit mapping:
ipod, i-pod, i pod => ipod
Check the wordlist for lists of words and synonyms matching your requirements.
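For illustration, a hypothetical synonyms/synonyms_en file in the Solr format could look like the following; the word groups are made up and not part of any shipped list.

# comma-separated groups of equivalent terms
universe, cosmos
sneaker, trainer
# explicit mapping: the left-hand side is replaced by the right-hand side
tv => television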
Stemming
Word stemming is the ability to match variations of a word by reducing each word to its stem, or root form; for example, "runs" and "running" both share the stem "run". Stemming applies quantified rules of grammar to derive word stems and rank them according to their degree of separation from the root word. To add a stemming filter to the settings for the index:
.startObject("test_filter_snowball_en") .field("type", "snowball") .field("language", "English") .endObject()
Check the Snowball Filter syntax for details. Stemming programs are commonly referred to as stemming algorithms or stemmers, and Lucene stemmers can be algorithmic or dictionary based. Snowball, based on Martin Porter’s Snowball stemming language, provides the stemming functionality used in the above example. Check the list of Snowball stemmers for the supported languages. Note that synonym and stemming filters can sometimes return strange results depending on the order of text processing; make sure to apply the two in the order matching your requirements.
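To get a feel for what the stemmer produces, you can run a few words through Lucene's SnowballFilter directly. A minimal sketch, again assuming Lucene 4.x-era classes:

import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.snowball.SnowballFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

// Tokenize a sample phrase and stem each token with the English Snowball stemmer
Tokenizer tokenizer = new WhitespaceTokenizer(Version.LUCENE_40,
        new StringReader("running runs happiness"));
TokenStream stream = new SnowballFilter(tokenizer, "English");
CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);

stream.reset();
while (stream.incrementToken()) {
    System.out.println(term.toString()); // prints stems such as "run", "run", "happi"
}
stream.end();
stream.close();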
Stop words
Stop words are the list of words that you do not want users to index or query on. To add a stop word filter to the settings:
.startObject("test_filter_stopwords_en") .field("type", "stop") .field("stopwords_path", "stopwords/stop_en") .endObject()
Check the complete syntax of the stop words filter. To derive your own list, check the Snowball stop word list for English, or the stop word list for English shared by Solr.
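For illustration, a hypothetical stopwords/stop_en file simply lists one stop word per line:

a
an
and
is
the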
Word Delimiter
The word delimiter filter allows you to split a word into sub words, for further processing on the sub words. To add a word delimiter filter to the settings:
.startObject("test_filter_worddelimiter_en") .field("type", "word_delimiter") .field("protected_words_path", "worddelimiters/protectedwords_en") .field("type_table_path", "typetable") .endObject()
Words are commonly split on non-alphanumeric characters, case transitions, intra-word delimiters and so on. Check the complete syntax and the different available options of the Word Delimiter Filter. The list of protected words allows you to protect business-relevant words from being delimited in the process.
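For example, with the default options the filter splits input tokens roughly as follows (indicative only; the exact output depends on the configured options and type table):

Wi-Fi     => Wi, Fi        (split on intra-word delimiter)
PowerShot => Power, Shot   (split on case transition)
SD500     => SD, 500       (split on letter-number transition)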
N-grams
An n-gram is a contiguous sequence of n letters from a given sequence of text. To add an edge n-gram filter to the settings:
.startObject("test_filter_ngram") .field("type", "edgeNGram") .field("min_gram", 2) .field("max_gram", 30) .endObject()
Based on this configuration, the input text is broken down at indexing time into multiple tokens of the lengths configured above. This allows results to be returned on matching n-gram tokens, also taking proximity into account. Check the detailed syntax of the Edge NGram Filter.
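For example, with min_gram set to 2 and max_gram set to 30 as configured above, the token "search" produces the following edge n-grams:

se, sea, sear, searc, search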
HTML Strip Char Filter
Most websites have HTML content that should be indexable, but allowing indexing and querying on the raw HTML tags is not desired for most sites. ElasticSearch allows you to filter out the HTML tags; they will not be indexed and will not be available for querying.
.startObject("analyzer") .startObject("test_analyzer") .field("type", "custom") .field("tokenizer", "whitespace") .field("filter", new String[]{"lowercase", "test_filter_worddelimiter_en", "test_filter_stopwords_en", "test_filter_synonyms_en", "test_filter_snowball_en"}) .field("char_filter", "html_strip") .endObject() .endObject()
Check the complete syntax of the HTML Strip Char Filter for details. In addition to the common filters mentioned above, there are many more filters available, allowing you to enrich your search content in the desired way based on end-user requirements and your business data.