Integrating Lucene Search into an Application
This article is part of our Academy Course titled Apache Lucene Fundamentals.
In this course, you will get an introduction to Lucene. You will see why a library like this is important and then learn how searching works in Lucene. Moreover, you will learn how to integrate Lucene Search into your own applications in order to provide robust searching capabilities.
1. Introduction
Apache Lucene provides a powerful query language for performing search operations over large amounts of data.
A query is broken up into terms and operators. There are three types of terms: Single Terms, Phrases, and Subqueries. A Single Term is a single word such as “test” or “hello”. A Phrase is a group of words surrounded by double quotes such as “hello dolly”. A Subquery is a query surrounded by parentheses such as “(hello dolly)”.
Lucene supports fielded data. When performing a search, you can either specify a field or use the default field. The field names depend on the indexed data, and the default field is defined by the application's settings.
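For example, assuming hypothetical title and contents fields, an expression can qualify each term with its field, falling back to the default field for unqualified terms:

title:"hello dolly" hello

Here the phrase "hello dolly" is matched against the title field, while the bare term hello is searched in the default field.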
2. Parsing a query string
The job of a query parser is to convert a query string submitted by a user into query objects. Here is an example of a query string query in JSON form:
{ "query_string" : { "default_field" : "content", "query" : "this AND that OR thus" } }
The query_string query's top-level parameters include the default field to search (default_field) and the query text itself (query). When a multi-term query is being generated, one can control how it gets rewritten using the rewrite parameter.
2.1. Rules of QueryParser
Suppose you are searching the Web for pages that contain both the words java and net but not the word dot. What if search engines made you type in something like the following for this simple query?
BooleanQuery query = new BooleanQuery();
query.add(new TermQuery(new Term("contents", "java")), true, false);
query.add(new TermQuery(new Term("contents", "net")), true, false);
query.add(new TermQuery(new Term("contents", "dot")), false, true);
That would be a real drag. Thankfully, Google, Nutch, and other search engines are friendlier than that, allowing you to enter something much more succinct: java AND net NOT dot. First we'll see what is involved in using QueryParser in an application.
2.2. Using QueryParser
Using a QueryParser is quite straightforward. Three things are needed: an expression, the default field name to use for unqualified fields in the expression, and an analyzer to tokenize pieces of the expression. Field-selection qualifiers are discussed in the expression syntax section below. Now, let's parse an expression:
String humanQuery = getHumanQuery();
Query query = QueryParser.parse(humanQuery, "contents", new StandardAnalyzer());
Once you've obtained a Query object, searching is done the same as if the query had been created directly through the API. Here is a full method to search an existing index with a user-entered query string and display the results to the console:
public static void search(File indexDir, String q) throws Exception {
    Directory fsDir = FSDirectory.getDirectory(indexDir, false);
    IndexSearcher is = new IndexSearcher(fsDir);
    Query query = QueryParser.parse(q, "contents", new StandardAnalyzer());
    Hits hits = is.search(query);
    System.out.println("Found " + hits.length() +
        " document(s) that matched query '" + q + "':");
    for (int i = 0; i < hits.length(); i++) {
        Document doc = hits.doc(i);
        System.out.println(doc.get("filename"));
    }
}
Expressions handed to the QueryParser are parsed according to a simple grammar. When an illegal expression is encountered, QueryParser throws a ParseException.
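Because user-entered expressions can be illegal, it is prudent to guard the parse call. A minimal sketch, using the same classic static parse method as the examples above (the fallback behavior here is just one possible choice):

public static Query parseUserQuery(String humanQuery) {
    try {
        // Parse against the default "contents" field used throughout this article
        return QueryParser.parse(humanQuery, "contents", new StandardAnalyzer());
    } catch (ParseException e) {
        // Report the illegal expression and let the caller decide what to do
        System.err.println("Could not parse '" + humanQuery + "': " + e.getMessage());
        return null;
    }
}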
2.3. QueryParser expression syntax
The following items in this section describe the syntax that QueryParser supports to create the various query types.
Single-term query
A query string of only a single word is converted to an underlying TermQuery.
Phrase query
To search for a group of words together in a field, surround the words with double quotes. The query "hello world" corresponds to an exact phrase match, requiring "hello" and "world" to be successive terms for a match. Lucene also supports sloppy phrase queries, where the terms between quotes do not necessarily have to be in the exact order. The slop factor measures how many moves it takes to rearrange the terms into the exact order; if the number of moves is no greater than the specified slop factor, it is a match. QueryParser parses the expression "hello world"~2 as a PhraseQuery with a slop factor of 2, allowing matches on the phrases "world hello", "hello world", "hello * world", and "hello * * world", where the asterisks represent irrelevant words in the index. Note that "world * hello" does not match with a slop factor of 2, because the number of moves needed to get it back to "hello world" is 3: hopping the word "world" to the asterisk position is one move, hopping it to the "hello" position is two, and the third hop makes the exact match.
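As a sketch (reusing the contents field from the earlier examples), the same sloppy phrase can be obtained either through the query language or directly through the API:

// Via the query language (may throw ParseException)
Query parsed = QueryParser.parse("\"hello world\"~2", "contents", new StandardAnalyzer());

// Via the API: build the PhraseQuery and set the slop directly
PhraseQuery phrase = new PhraseQuery();
phrase.add(new Term("contents", "hello"));
phrase.add(new Term("contents", "world"));
phrase.setSlop(2);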
Range query
Text or date range queries use bracketed syntax, with TO between the beginning term and ending term. The type of bracket determines whether the range is inclusive (square brackets) or exclusive (curly brackets).
NOTES: Non-date range queries use the start and end terms as the user entered them without modification. In the case of {Aardvark TO Zebra}, the terms are not lowercased. Start and end terms must not contain whitespace, or parsing fails; only single words are allowed. The analyzer is not run on the start and end terms.
Date range handling
When a range query (such as [1/1/03 TO 12/31/03]) is encountered, the parser code first attempts to convert the start and end terms to dates. If the terms are valid dates according to DateFormat.SHORT and lenient parsing, then the dates are converted to their internal textual representation (however, date field indexing is beyond the scope of this article). If either of the two terms fails to parse as a valid date, they are both used as-is for a textual range.
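For illustration, here are the two bracket styles side by side; the modified and title field names are hypothetical:

modified:[1/1/03 TO 12/31/03]   # inclusive date range
title:{Aardvark TO Zebra}       # exclusive textual range, terms used as entered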
Wildcard and prefix queries
If a term contains an asterisk or question mark, it is considered a WildcardQuery, except when the term contains only a trailing asterisk, in which case QueryParser optimizes it to a PrefixQuery instead. While the WildcardQuery API itself supports a leading wildcard character, QueryParser does not allow it. An example wildcard query is w*ldc?rd, whereas the query prefix* is optimized to a PrefixQuery.
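A small sketch of both cases through QueryParser, again using the contents default field from the earlier examples:

// An embedded wildcard produces a WildcardQuery
Query wild = QueryParser.parse("w*ldc?rd", "contents", new StandardAnalyzer());

// A lone trailing asterisk is optimized to a PrefixQuery
Query prefix = QueryParser.parse("prefix*", "contents", new StandardAnalyzer());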
Fuzzy query
Lucene's FuzzyQuery matches terms close to a specified term. The Levenshtein distance algorithm determines how close terms in the index are to a specified target term. "Edit distance" is another name for "Levenshtein distance": a measure of similarity between two strings, where distance is the number of character deletions, insertions, or substitutions required to transform one string into the other. For example, the edit distance between "three" and "tree" is one, as only one character deletion is needed. The edit distance is used in a threshold calculation, which is the ratio of distance to string length. QueryParser supports fuzzy-term queries using a trailing tilde on a term. For example, searching for wuzza~ will find documents that contain "fuzzy" and "wuzzy". Edit distance affects scoring, such that lower edit distances score higher.
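As a sketch, the fuzzy form can be reached through the query language or built directly with the FuzzyQuery class:

// Trailing tilde in the expression requests a fuzzy match
Query parsedFuzzy = QueryParser.parse("wuzza~", "contents", new StandardAnalyzer());

// Equivalent direct construction through the API
Query apiFuzzy = new FuzzyQuery(new Term("contents", "wuzza"));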
Boolean query
Constructing Boolean queries textually is done using the operators AND, OR, and NOT. Terms listed without an operator use an implicit operator, which by default is OR, so a query of abc xyz is interpreted as abc OR xyz. Placing NOT in front of a term excludes documents containing that term; a negated term must be combined with at least one non-negated term to return documents. Each of the uppercase word operators also has a shortcut syntax: && for AND, || for OR, and ! for NOT.
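For instance, the following two expressions (parsed against the contents field from the earlier examples) illustrate explicit operators and the default implicit OR:

// Explicit operators: java and net required, dot excluded
Query explicit = QueryParser.parse("java AND net NOT dot", "contents", new StandardAnalyzer());

// No operators given: interpreted as "abc OR xyz" by default
Query implicit = QueryParser.parse("abc xyz", "contents", new StandardAnalyzer());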
QueryParser is a quick and effortless way to give users powerful query construction, but it is not for everyone. QueryParser cannot create every type of query that can be constructed using the API; for instance, a PhrasePrefixQuery cannot be constructed. You must also keep in mind all of the possibilities available when exposing freeform query parsing to an end user: some queries have the potential for performance bottlenecks, and the syntax used by the built-in QueryParser may not be suitable for your needs. Some control is possible by subclassing QueryParser, though it is still limited.
3. Create an index with index searcher
In general, applications usually need only to call the inherited Searcher.search(org.apache.lucene.search.Query, int) or Searcher.search(org.apache.lucene.search.Query, org.apache.lucene.search.Filter, int) methods. For performance, we can open an IndexSearcher once and use it for all subsequent search operations. Here is a simple example of how to create an index in Lucene and search that index using an IndexSearcher.
// assertEquals comes from JUnit; this example is written as a test-style method
public void simpleLucene() throws Exception {
    Analyzer analyzer = new StandardAnalyzer();

    // Store the index in memory:
    Directory directory = new RAMDirectory();
    // To store an index on disk, use this instead (note that the
    // parameter true will overwrite the index in that directory
    // if one exists):
    // Directory directory = FSDirectory.getDirectory("/tmp/myfiles", true);

    IndexWriter iwriter = new IndexWriter(directory, analyzer, true);
    iwriter.setMaxFieldLength(25000);
    Document doc = new Document();
    String text = "This is the text to be indexed.";
    doc.add(new Field("fieldname", text, Field.Store.YES, Field.Index.TOKENIZED));
    iwriter.addDocument(doc);
    iwriter.close();

    // Now search the index:
    IndexSearcher isearcher = new IndexSearcher(directory);
    // Parse a simple query that searches for "text":
    QueryParser parser = new QueryParser("fieldname", analyzer);
    Query query = parser.parse("text");
    Hits hits = isearcher.search(query);
    assertEquals(1, hits.length());
    // Iterate through the results:
    for (int i = 0; i < hits.length(); i++) {
        Document hitDoc = hits.doc(i);
        assertEquals("This is the text to be indexed.", hitDoc.get("fieldname"));
    }
    isearcher.close();
    directory.close();
}
4. Different types of query
Lucene supports a variety of query types. Here are some of them:
- TermQuery
- BooleanQuery
- WildcardQuery
- PhraseQuery
- PrefixQuery
- MultiPhraseQuery
- FuzzyQuery
- RegexpQuery
- TermRangeQuery
- NumericRangeQuery
- ConstantScoreQuery
- DisjunctionMaxQuery
- MatchAllDocsQuery
4.1 TermQuery
Matches documents that have fields that contain a term (not analyzed). The term query maps to the Lucene TermQuery. The following matches documents where the user field contains the term kimchy:
{ "term" : { "user" : "kimchy" } }
A boost can also be associated with the query:
{ "term" : { "user" : { "value" : "kimchy", "boost" : 2.0 } } }
Or:
{ "term" : { "user" : { "term" : "kimchy", "boost" : 2.0 } } }
With Lucene, it's possible to search for a particular word that has been indexed using the TermQuery class. This tutorial compares TermQuery searches with QueryParser searches and shows some of the nuances involved with a term query.
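A minimal sketch of a direct TermQuery search, mirroring the JSON example above (the isearcher is assumed to be the one opened in section 3):

// The term is matched exactly as indexed; no analysis is applied
Query termQuery = new TermQuery(new Term("user", "kimchy"));
termQuery.setBoost(2.0f); // optional boost, as in the JSON example
Hits hits = isearcher.search(termQuery);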
4.2 BooleanQuery
We can run multifield searches in Lucene using either the BooleanQuery API or the MultiFieldQueryParser for parsing the query text. For example, if an index has two fields, FirstName and LastName, and you need to search for "John" in the FirstName field and "Travis" in the LastName field, you can use a BooleanQuery as such:
// (translated to Java from the original Lucene.Net snippet; the index path is hypothetical)
BooleanQuery bq = new BooleanQuery();
Query qf = new TermQuery(new Term("FirstName", "John"));
Query ql = new TermQuery(new Term("LastName", "Travis"));
// Both clauses are required: documents must match FirstName AND LastName
bq.add(qf, BooleanClause.Occur.MUST);
bq.add(ql, BooleanClause.Occur.MUST);
IndexSearcher searcher = new IndexSearcher("C:/indexDir");
Hits hits = searcher.search(bq);
4.3 WildcardQuery
Matches documents that have fields matching a wildcard expression (not analyzed). Supported wildcards are *, which matches any character sequence (including the empty one), and ?, which matches any single character. Note this query can be slow, as it needs to iterate over many terms. In order to prevent extremely slow wildcard queries, a wildcard term should not start with one of the wildcards * or ?. The wildcard query maps to Lucene WildcardQuery.
{ "wildcard" : { "user" : "ki*y" } }
A boost can also be associated with the query:
{ "wildcard" : { "user" : { "value" : "ki*y", "boost" : 2.0 } } }
Or:
{ "wildcard" : { "user" : { "wildcard" : "ki*y", "boost" : 2.0 } } }
This multi-term query allows you to control how it gets rewritten using the rewrite parameter.
4.4 PhraseQuery
With Lucene, a PhraseQuery can be used to query for a sequence of terms, where the terms do not necessarily have to be next to each other or in order. The PhraseQuery object's setSlop() method can be used to set how many words can be between the various words in the query phrase. We can use PhraseQuery like this:
Term term1 = new Term(FIELD_CONTENTS, string1);
Term term2 = new Term(FIELD_CONTENTS, string2);
PhraseQuery phraseQuery = new PhraseQuery();
phraseQuery.add(term1);
phraseQuery.add(term2);
phraseQuery.setSlop(slop);
4.5 PrefixQuery
Matches documents that have fields containing terms with a specified prefix (not analyzed). The prefix query maps to the Lucene PrefixQuery. The following matches documents where the user field contains a term that starts with ki:
{ "prefix" : { "user" : "ki" } }
A boost can also be associated with the query:
{ "prefix" : { "user" : { "value" : "ki", "boost" : 2.0 } } }
Or:
{ "prefix" : { "user" : { "prefix" : "ki", "boost" : 2.0 } } }
This multi-term query allows you to control how it gets rewritten using the rewrite parameter.
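On the API side, a minimal sketch of the equivalent PrefixQuery:

// Matches any term in the "user" field starting with "ki", e.g. "kimchy"
Query prefixQuery = new PrefixQuery(new Term("user", "ki"));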
4.6 MultiPhraseQuery
The built-in MultiPhraseQuery is definitely a niche query, but it's potentially useful. MultiPhraseQuery is just like PhraseQuery except that it allows multiple terms per position. You could achieve the same logical effect, albeit at a high performance cost, by enumerating all possible phrase combinations and using a BooleanQuery to "OR" them together.
For example, suppose we want to find all documents about speedy foxes, with quick or fast followed by fox. One approach is to do a "quick fox" OR "fast fox" query. Another option is to use MultiPhraseQuery.
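A minimal sketch of the MultiPhraseQuery approach, assuming a contents field:

// Either "quick" or "fast" may occupy the first position, followed by "fox"
MultiPhraseQuery query = new MultiPhraseQuery();
query.add(new Term[] {
    new Term("contents", "quick"),
    new Term("contents", "fast")
});
query.add(new Term("contents", "fox"));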
4.7 FuzzyQuery
FuzzyQuery can be categorized into two forms: (a) the fuzzy like this query and (b) the fuzzy like this field query.
a. Fuzzy like this query
The fuzzy like this query finds documents that are "like" the provided text by running it against one or more fields.
{ "fuzzy_like_this" : { "fields" : ["name.first", "name.last"], "like_text" : "text like this one", "max_query_terms" : 12 } }
fuzzy_like_this can be shortened to flt.
The fuzzy_like_this top level parameters include:
- fields -> A list of the fields to run the more like this query against. Defaults to the _all field.
- like_text -> The text to find documents like it, required.
- ignore_tf -> Should term frequency be ignored. Defaults to false.
- max_query_terms -> The maximum number of query terms that will be included in any generated query. Defaults to 25.
- fuzziness -> The minimum similarity of the term variants. Defaults to 0.5.
- prefix_length -> Length of required common prefix on variant terms. Defaults to 0.
- boost -> Sets the boost value of the query. Defaults to 1.0.
- analyzer -> The analyzer that will be used to analyze the text. Defaults to the analyzer associated with the field.
Fuzzifies all terms provided as strings and then picks the best n differentiating terms. In effect this mixes the behaviour of FuzzyQuery and MoreLikeThis, but with special consideration of fuzzy scoring factors. This generally produces good results for queries where users may provide details in a number of fields, have no knowledge of Boolean query syntax, and also want a degree of fuzzy matching and a fast query.
For each source term, the fuzzy variants are held in a BooleanQuery with no coord factor (because we are not looking for matches on multiple variants in any one doc). Additionally, a specialized TermQuery is used for variants; it does not use the variant term's IDF, because this would favor rarer terms such as misspellings. Instead, all variants use the same IDF ranking (the one for the source query term), and this is factored into the variant's boost. If the source query term does not exist in the index, the average IDF of the variants is used.
b. Fuzzy like this field query
The fuzzy_like_this_field query is the same as the fuzzy_like_this query, except that it runs against a single field. It provides a nicer query DSL over the generic fuzzy_like_this query and supports typed fields queries (it automatically wraps typed fields with a type filter to match only on the specific type).
{ "fuzzy_like_this_field" : { "name.first" : { "like_text" : "text like this one", "max_query_terms" : 12 } } }
fuzzy_like_this_field can be shortened to flt_field. The fuzzy_like_this_field top-level parameters include:
- like_text -> The text to find documents like it, required.
- ignore_tf -> Should term frequency be ignored. Defaults to false.
- max_query_terms -> The maximum number of query terms that will be included in any generated query. Defaults to 25.
- fuzziness -> The fuzziness of the term variants. Defaults to 0.5.
- prefix_length -> Length of required common prefix on variant terms. Defaults to 0.
- boost -> Sets the boost value of the query. Defaults to 1.0.
- analyzer -> The analyzer that will be used to analyze the text. Defaults to the analyzer associated with the field.
4.8 RegexpQuery
The regexp query allows you to use regular expression term queries. See Regular expression syntax for details of the supported regular expression language.
Note: The performance of a regexp query heavily depends on the regular expression chosen. Matching everything with a pattern like .* is very slow, as is using lookaround regular expressions. If possible, you should try to use a long prefix before your regular expression starts. Wildcard matchers like .*?+ will mostly lower performance.
{ "regexp":{ "name.first": "s.*y" } }
Boosting is also supported:
{ "regexp":{ "name.first":{ "value":"s.*y", "boost":1.2 } } }
You can also use special flags:
{ "regexp":{ "name.first": { "value": "s.*y", "flags" : "INTERSECTION|COMPLEMENT|EMPTY" } } }
Possible flags are ALL, ANYSTRING, AUTOMATON, COMPLEMENT, EMPTY, INTERSECTION, INTERVAL, or NONE. Regular expression queries are supported by the regexp and the query_string queries. The Lucene regular expression engine is not Perl-compatible but supports a smaller range of operators.
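On the Lucene side, a minimal sketch using the RegexpQuery class (available in Lucene 4.x and later; the field name mirrors the JSON examples above):

// The pattern is implicitly anchored: it must match the entire term
Query regexpQuery = new RegexpQuery(new Term("name.first", "s.*y"));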
Standard operators
Anchoring
Most regular expression engines allow you to match any part of a string. If you want the regexp pattern to start at the beginning of the string or finish at the end of the string, then you have to anchor it specifically, using ^ to indicate the beginning or $ to indicate the end. Lucene's patterns are always anchored: the pattern provided must match the entire string. For string "abcde":
ab.* # match
abcd # no match
Allowed characters
Any Unicode characters may be used in the pattern, but certain characters are reserved and must be escaped. The standard reserved characters are:
. ? + * | { } [ ] ( ) " \
If you enable optional features (see below) then these characters may also be reserved:
# @ & < > ~
Any reserved character can be escaped with a backslash, e.g. "\*", including a literal backslash character:
“\\”
Additionally, any characters (except double quotes) are interpreted literally when surrounded by double quotes:
john"@smith.com"
Match any character
The period “.” can be used to represent any character. For string “abcde”:
ab... # match
a.c.e # match
One-or-more
The plus sign “+” can be used to repeat the preceding shortest pattern once or more times. For string “aaabbb”:
a+b+ # match
aa+bb+ # match
a+.+ # match
aa+bbb+ # match
Zero-or-more
The asterisk “*” can be used to match the preceding shortest pattern zero-or-more times. For string “aaabbb”:
a*b* # match
a*b*c* # match
.*bbb.* # match
aaa*bbb* # match
Zero-or-one
The question mark “?” makes the preceding shortest pattern optional. It matches zero or one times. For string “aaabbb”:
aaa?bbb? # match
aaaa?bbbb? # match
.....?.? # match
aa?bb? # no match
Min-to-max
Curly brackets “{}” can be used to specify a minimum and (optionally) a maximum number of times the preceding shortest pattern can repeat. The allowed forms are:
{5} # repeat exactly 5 times
{2,5} # repeat at least twice and at most 5 times
{2,} # repeat at least twice
For string “aaabbb”:
a{3}b{3} # match
a{2,4}b{2,4} # match
a{2,}b{2,} # match
.{3}.{3} # match
a{4}b{4} # no match
a{4,6}b{4,6} # no match
a{4,}b{4,} # no match
Grouping
Parentheses “()” can be used to form sub-patterns. The quantity operators listed above operate on the shortest previous pattern, which can be a group. For string “ababab”:
(ab)+ # match
ab(ab)+ # match
(..)+ # match
(...)+ # match
(ab)* # match
abab(ab)? # match
ab(ab)? # no match
(ab){3} # match
(ab){1,2} # no match
Alternation
The pipe symbol “|” acts as an OR operator. The match will succeed if the pattern on either the left-hand side OR the right-hand side matches. The alternation applies to the longest pattern, not the shortest. For string “aabb”:
aabb|bbaa # match
aacc|bb # no match
aa(cc|bb) # match
a+|b+ # no match
a+b+|b+a+ # match
a+(b|c)+ # match
Character classes
Ranges of potential characters may be represented as character classes by enclosing them in square brackets “[]”. A leading ^ negates the character class. The allowed forms are:
[abc]   # 'a' or 'b' or 'c'
[a-c]   # 'a' or 'b' or 'c'
[-abc]  # '-' or 'a' or 'b' or 'c'
[abc\-] # '-' or 'a' or 'b' or 'c'
[^a-c]  # any character except 'a' or 'b' or 'c'
Note that the dash "-" indicates a range of characters, unless it is the first character or it is escaped with a backslash. For string "abcd":
ab[cd]+ # match
[a-d]+ # match
[^a-d]+ # no match
4.9 TermRangeQuery
A Query that matches documents within a range of terms. This query matches documents whose terms fall within the supplied range according to String#compareTo(String), unless a Collator is provided. It is not intended for numerical ranges. Here is an example of how to use TermRangeQuery in Lucene:
private Query createQuery(String field, DateOperator dop) throws UnsupportedSearchException {
    Date date = dop.getDate();
    DateResolution res = dop.getDateResultion();
    DateTools.Resolution dRes = toResolution(res);
    String value = DateTools.dateToString(date, dRes);
    switch (dop.getType()) {
    case ON:
        return new TermQuery(new Term(field, value));
    case BEFORE:
        return new TermRangeQuery(field, DateTools.dateToString(MIN_DATE, dRes), value, true, false);
    case AFTER:
        return new TermRangeQuery(field, value, DateTools.dateToString(MAX_DATE, dRes), false, true);
    default:
        throw new UnsupportedSearchException();
    }
}
4.10 NumericRangeQuery
A NumericRangeQuery matches numeric values within a specified range. To use it, you must first index the numeric values. We can combine a NumericRangeQuery with a TermQuery like this:
String termQueryString = "title:\"hello world\"";
Query termQuery = parser.parse(termQueryString);
Query pageQueryRange = NumericRangeQuery.newIntRange("page_count", 10, 20, true, true);
// Query.combine merges the clauses; alternatively, add both as MUST
// clauses to a BooleanQuery to require that both match
Query query = Query.combine(new Query[] { termQuery, pageQueryRange });
4.11 ConstantScoreQuery
A query that wraps another query or a filter and simply returns a constant score equal to the query boost for every document that matches the filter or query. For queries, it therefore simply strips off all scores and returns a constant one.
{ "constant_score" : { "filter" : { "term" : { "user" : "kimchy"} }, "boost" : 1.2 } }
The filter object can hold only filter elements, not queries. Filters can be much faster compared to queries since they don't perform any scoring, especially when they are cached. A query can also be wrapped in a constant_score query:
{ "constant_score" : { "query" : { "term" : { "user" : "kimchy"} }, "boost" : 1.2 } }
4.12 DisjunctionMaxQuery
A query that generates the union of documents produced by its subqueries, and that scores each document with the maximum score for that document as produced by any subquery, plus a tie breaking increment for any additional matching subqueries.
This is useful when searching for a word in multiple fields with different boost factors (so that the fields cannot be combined equivalently into a single search field). We want the primary score to be the one associated with the highest boost, not the sum of the field scores (as BooleanQuery would give). If the query is "albino elephant", this ensures that "albino" matching one field and "elephant" matching another gets a higher score than "albino" matching both fields. To get this result, use both BooleanQuery and DisjunctionMaxQuery: for each term, a DisjunctionMaxQuery searches for it in each field, while the set of these DisjunctionMaxQueries is combined into a BooleanQuery.
The tie breaker capability allows results that include the same term in multiple fields to be judged better than results that include this term in only the best of those multiple fields, without confusing this with the better case of two different terms in the multiple fields. The default tie_breaker is 0.0. This query maps to the Lucene DisjunctionMaxQuery.
{ "dis_max" : { "tie_breaker" : 0.7, "boost" : 1.2, "queries" : [ { "term" : { "age" : 34 } }, { "term" : { "age" : 35 } } ] } }
4.13 MatchAllDocsQuery
A query that matches all documents. Maps to the Lucene MatchAllDocsQuery.
{ "match_all" : { } }
It can also have a boost associated with it:
{ "match_all" : { "boost" : 1.2 } }