Lucene Components Overview

This article is part of our Academy Course titled Apache Lucene Fundamentals.

In this course, you will get an introduction to Lucene. You will see why a library like this is important and then learn how searching works in Lucene. Moreover, you will learn how to integrate Lucene Search into your own applications in order to provide robust searching capabilities.

1. Information Overload/explosion

Nowadays, search functionality is becoming an increasingly important feature of applications. After all, the Web is all about information: getting the right information at the right time and into the right hands.

The information explosion is the rapid increase in the amount of published digital information in our modern world, together with the effects of this abundance of raw, unstructured data. It has led to a constant state of information overload for all of us.

Information overload is now a common phenomenon in offices around the world. Some of the causes include:

  1. The widespread access to the Web
  2. The ease of sending e-mail messages to large numbers of people
  3. The free duplication of information: because extra copies carry no variable cost, people send reports and information to anyone who may need to know, rather than only to those who definitely need to know.
  4. Poorly created information sources (especially online), which:
    • are not simplified or filtered to make them shorter
    • are not written clearly, so people have to spend more time understanding them
    • contain factual errors or inconsistencies – requiring further research

Solution

Although there is no simple and single solution to the problem above, there are some ways that can be used to mitigate the problem.

These include:

  1. Spending less time on gaining information that is “nice to know” and more time on things that we “need to know now”.
  2. Focusing on quality of information, rather than quantity. A short concise e-mail is more valuable than a long e-mail.
  3. Learning how to create better information. Be direct in what we ask people, so that they can provide precise answers.
  4. Single-tasking, and keeping the mind focused on one issue at a time.

Apart from these habits, we can implement an information retrieval solution with Apache Lucene, an open source search library that can retrieve information from such unstructured content, as long as we can extract textual data from the content repository.

A short overview of a search application

  1. Store the files in the file system.
  2. When a file is stored, add it as a document to the Lucene index.
  3. When a file is removed, remove its entry from the corresponding Lucene index.
  4. Analyze each document with Lucene’s StandardAnalyzer (several other analyzers can be plugged into Lucene).
  5. Store an extra field, such as the file path, in each indexed document.
  6. Run searches using the same StandardAnalyzer.
  7. For millions of documents, such a search typically returns results much faster than an equivalent text search in a relational database.
  8. Because the index holds the path of each file in the file system, we can open a matching file straight from the search results, which may be one of our application’s goals.
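The steps above can be sketched without Lucene at all. The following is a minimal in-memory illustration, assuming a hypothetical FileSearchSketch class and made-up file paths; it only mirrors the add, remove and search flow and is not how Lucene stores its index.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical in-memory stand-in for the workflow above; not Lucene code.
final class FileSearchSketch {
    // word -> paths of the files that contain it
    private final Map<String, Set<String>> wordToPaths = new HashMap<>();

    // Steps 2, 4 and 5: split the text into lowercase words and record the path for each word.
    void addFile(String path, String text) {
        for (String word : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!word.isEmpty()) {
                wordToPaths.computeIfAbsent(word, k -> new TreeSet<>()).add(path);
            }
        }
    }

    // Step 3: drop the file from every word's path set.
    void removeFile(String path) {
        for (Set<String> paths : wordToPaths.values()) {
            paths.remove(path);
        }
    }

    // Steps 6 to 8: look the word up and return the paths we could then browse.
    Set<String> search(String word) {
        return wordToPaths.getOrDefault(word.toLowerCase(Locale.ROOT), Collections.emptySet());
    }

    public static void main(String[] args) {
        FileSearchSketch index = new FileSearchSketch();
        index.addFile("/docs/a.txt", "Lucene is a search library");
        index.addFile("/docs/b.txt", "Search the index, not the text");
        System.out.println(index.search("search")); // [/docs/a.txt, /docs/b.txt]
        index.removeFile("/docs/a.txt");
        System.out.println(index.search("search")); // [/docs/b.txt]
    }
}
```

The point of the sketch is that the index has to be kept in sync with the file store: every add and remove on the file system needs a matching index update.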

Of course, this approach is not the only solution for text-based search and information retrieval over huge information repositories. Plain database search functionality can be enough in some cases, and data processing tools such as Apache Hadoop are also viable options.

2. Component for indexing of unstructured data

The indexing component maintains a directory of the files that are available to users for retrieval. It is an optional feature and should be installed on any server that users will access for file retrieval. The indexing component supports searches by file, file version and recent activity.

Let’s examine some terms related to Lucene Indexing:

Indexed entity

Files or pieces of information that are stored in a Lucene index repository are referred to as indexed entities.

Each Lucene index is managed by one index manager which is uniquely identified by name. In most cases there is also a one to one relationship between an indexed entity and a single IndexManager (manages indexing). The exceptions are the use cases of index sharding and index sharing. The former can be applied when the index for a single entity becomes too big and indexing operations are slowing down the application. In this case, a single entity is indexed into multiple indexes each with its own index manager. The latter, index sharing, is the ability to index multiple entities into the same Lucene index.

Sharding indexes

In some cases it can be useful to split (shard) the indexed data of a given entity into several Lucene indexes.

Possible use cases for sharding are:

  1. A single index is so huge that index update times are slowing the application down.
  2. A typical search will only hit a sub-set of the index, such as when data is naturally segmented by customer, region or application.
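As a rough illustration of the second use case, the sketch below keeps one small index per customer, so a customer-scoped search never touches other customers' data. The ShardedIndexSketch class, the customer names and the document ids are all hypothetical, and the per-shard "index" is just a map standing in for a real Lucene index.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: data segmented by customer, one shard per customer.
final class ShardedIndexSketch {
    // customer -> (document id -> text); each inner map plays the role of one shard
    private final Map<String, Map<String, String>> shards = new HashMap<>();

    // Route the document to its customer's shard, creating the shard on first use.
    void index(String customer, String docId, String text) {
        shards.computeIfAbsent(customer, k -> new TreeMap<>()).put(docId, text);
    }

    // Search only the shard of the given customer; other shards are never scanned.
    List<String> search(String customer, String word) {
        List<String> hits = new ArrayList<>();
        Map<String, String> shard = shards.getOrDefault(customer, Collections.emptyMap());
        for (Map.Entry<String, String> doc : shard.entrySet()) {
            if (doc.getValue().contains(word)) {
                hits.add(doc.getKey());
            }
        }
        return hits;
    }

    int shardCount() {
        return shards.size();
    }

    public static void main(String[] args) {
        ShardedIndexSketch index = new ShardedIndexSketch();
        index.index("acme", "a1", "invoice for widgets");
        index.index("globex", "g1", "invoice for gadgets");
        System.out.println(index.search("acme", "invoice")); // [a1]
    }
}
```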

Sharing indexes

It is technically possible to store the information of more than one entity in a single Lucene index. There are two ways to accomplish this (the indexName setting and the @Indexed annotation used here come from Hibernate Search, which builds on Lucene):

  1. Configuring the underlying directory providers to point to the same physical index directory. For example, to use the same index (directory) for the Furniture and Animal entities, we set indexName for both entities to, say, “Animal”. Both entities will then be stored in the Animal directory.
  2. Setting the index attribute of the @Indexed annotation to the same value on the entities we want to merge. If we again wanted all Furniture instances to be indexed in the Animal index along with all instances of Animal, we would specify @Indexed(index="Animal") on both the Animal and Furniture classes.

3. Components of Searching the Data

Core indexing classes

Lucene is able to achieve fast search responses because it searches an index rather than the text itself. This is the equivalent of finding the pages of a book related to a keyword by looking the keyword up in the index at the back of the book, as opposed to scanning the words on every page.

This type of index is called an inverted index, because it inverts a page-centric data structure (page->words) to a keyword-centric data structure (word->pages).
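The inversion itself is easy to show in a few lines. The sketch below, assuming a hypothetical InvertIndexSketch class and toy page data, turns a page->words map into a word->pages map; Lucene's real index structures are far more elaborate, but the direction of the lookup is the same.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch of inverting a page-centric structure into a keyword-centric one.
final class InvertIndexSketch {
    static Map<String, SortedSet<Integer>> invert(Map<Integer, List<String>> pageToWords) {
        Map<String, SortedSet<Integer>> wordToPages = new TreeMap<>();
        for (Map.Entry<Integer, List<String>> page : pageToWords.entrySet()) {
            for (String word : page.getValue()) {
                // Record that this word occurs on this page.
                wordToPages.computeIfAbsent(word, k -> new TreeSet<>()).add(page.getKey());
            }
        }
        return wordToPages;
    }

    public static void main(String[] args) {
        Map<Integer, List<String>> pages = new HashMap<>();
        pages.put(1, Arrays.asList("lucene", "index"));
        pages.put(2, Arrays.asList("index", "search"));
        System.out.println(invert(pages)); // {index=[1, 2], lucene=[1], search=[2]}
    }
}
```

With the inverted form, answering "which pages mention index?" is a single map lookup instead of a scan over every page.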

The core indexing classes are:

  1. IndexWriter
  2. Directory
  3. Analyzer
  4. Document
  5. Field

To create an index, the first thing we need to do is create an IndexWriter object. The IndexWriter is used to create the index and to add new index entries (i.e., Documents) to it. We can create an IndexWriter as follows:

IndexWriter indexWriter = new IndexWriter("index-directory", new StandardAnalyzer(), true);

The first parameter specifies the directory in which the Lucene index will be created, which is index-directory in this case. The second parameter specifies the “document parser” or “document analyzer” that will be used when Lucene indexes our data. Here, we are using the StandardAnalyzer for this purpose. More details on Lucene analyzers follow shortly. The third parameter, when true, tells Lucene to create a new index, replacing any index that already exists in the directory. Note that this path-based constructor belongs to older Lucene releases; the Lucene 4.6 sample in section 4 builds its IndexWriter from a Directory and an IndexWriterConfig instead.

Document is the unit of the indexing and searching process.

Fields are the actual content holders in Lucene. Each field is essentially a name/value pair, and a document holds a collection of such fields.
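To make the document and field model concrete, here is a small sketch, assuming a hypothetical DocumentSketch class with a stored flag per field. It is loosely modelled on the idea that a field can be searchable without its value being kept for retrieval; the class and method names are illustrative and are not Lucene's API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a document is a list of named fields, some of which are stored.
final class DocumentSketch {
    static final class Field {
        final String name;
        final String value;
        final boolean stored;

        Field(String name, String value, boolean stored) {
            this.name = name;
            this.value = value;
            this.stored = stored;
        }
    }

    private final List<Field> fields = new ArrayList<>();

    DocumentSketch add(String name, String value, boolean stored) {
        fields.add(new Field(name, value, stored));
        return this;
    }

    // What a search hit could hand back: only the fields flagged as stored.
    Map<String, String> storedFields() {
        Map<String, String> result = new LinkedHashMap<>();
        for (Field f : fields) {
            if (f.stored) {
                result.put(f.name, f.value);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        DocumentSketch doc = new DocumentSketch()
                .add("contents", "full text of the file", false)
                .add("path", "/docs/a.txt", true);
        System.out.println(doc.storedFields()); // {path=/docs/a.txt}
    }
}
```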

An IndexWriter creates and maintains an index.

The create argument to the constructor determines whether a new index is created, or whether an existing index is opened. We can open an index with ‘create=true’ even while readers are using the index. The old readers will continue to search the “point in time” snapshot they had opened, and won’t see the newly created index until they re-open. There are also constructors with no create argument which will create a new index if there is not already an index at the provided path and otherwise open the existing index.

Changes made through calls such as addDocument, deleteDocuments and updateDocument are buffered in memory, and a flush is triggered when there are enough buffered deletes or enough added documents since the last flush, whichever is sooner. A flush can also be forced explicitly. When a flush occurs, both pending deletes and added documents are flushed to the index. A flush may also trigger one or more segment merges.
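The buffering behaviour can be pictured with a toy writer that flushes once a document-count threshold is reached. This is only a sketch under simplifying assumptions: BufferedWriterSketch, its maxBufferedDocs parameter and the list-of-lists "segments" are hypothetical, and deletes and merges are left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: documents are buffered in memory and flushed in batches.
final class BufferedWriterSketch {
    private final int maxBufferedDocs;
    private final List<String> buffer = new ArrayList<>();
    private final List<List<String>> segments = new ArrayList<>();

    BufferedWriterSketch(int maxBufferedDocs) {
        this.maxBufferedDocs = maxBufferedDocs;
    }

    // Buffer the document; flush automatically once the threshold is reached.
    void addDocument(String doc) {
        buffer.add(doc);
        if (buffer.size() >= maxBufferedDocs) {
            flush();
        }
    }

    // Write the buffered documents out as one new segment; can also be called explicitly.
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        segments.add(new ArrayList<>(buffer));
        buffer.clear();
    }

    int segmentCount() {
        return segments.size();
    }

    int bufferedCount() {
        return buffer.size();
    }

    public static void main(String[] args) {
        BufferedWriterSketch writer = new BufferedWriterSketch(2);
        writer.addDocument("a");
        writer.addDocument("b"); // threshold reached, first segment written
        writer.addDocument("c"); // stays in the buffer
        System.out.println(writer.segmentCount() + " segment, " + writer.bufferedCount() + " buffered");
    }
}
```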

The optional autoCommit argument to the constructors controls visibility of the changes to IndexReader instances reading the same index. When this is false, changes are not visible until close() is called. Changes will still be flushed to the Directory as new files, but are not committed (no new segments_N file is written referencing the new files) until close() is called. If something goes terribly wrong (for example the JVM crashes) before close(), then the index will reflect none of the changes made (it will remain in its starting state). We can also call abort(), which closes the writer without committing any changes, and removes any index files that had been flushed but are now unreferenced. This mode is useful for preventing readers from refreshing at a bad time (for example after we have done all deletes but before we have done adds). It can also be used to implement simple single-writer transactional semantics (“all or none”).

When autoCommit is true, every flush is also a commit. When running in this mode, keep in mind that readers should not refresh while optimizes or segment merges are taking place, as this can tie up substantial disk space.

Regardless of autoCommit, an IndexReader or IndexSearcher will only see the index as of the “point in time” that it was opened. Any changes committed to the index after the reader was opened are not visible until the reader is re-opened.
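The "point in time" and "all or none" behaviour described above can be sketched with an immutable committed list that is swapped on commit. The SnapshotIndexSketch class and its commit, rollback and openReader methods are hypothetical names chosen for illustration; they model the visibility rules only, not Lucene's file handling.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of commit visibility; not Lucene's API.
final class SnapshotIndexSketch {
    private List<String> committed = Collections.emptyList();
    private final List<String> pending = new ArrayList<>();

    // Pending changes are invisible to readers until commit().
    void addDocument(String doc) {
        pending.add(doc);
    }

    // Publish all pending documents at once by swapping in a new immutable list.
    void commit() {
        List<String> next = new ArrayList<>(committed);
        next.addAll(pending);
        committed = Collections.unmodifiableList(next);
        pending.clear();
    }

    // Discard everything added since the last commit.
    void rollback() {
        pending.clear();
    }

    // A reader keeps the list that was committed when it was opened.
    List<String> openReader() {
        return committed;
    }

    public static void main(String[] args) {
        SnapshotIndexSketch index = new SnapshotIndexSketch();
        index.addDocument("a");
        List<String> oldReader = index.openReader();
        index.commit();
        System.out.println(oldReader);          // []
        System.out.println(index.openReader()); // [a]
    }
}
```

Because commit() replaces the committed list instead of mutating it, a reader opened earlier keeps seeing its original snapshot until it is re-opened.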

If an index will not have more documents added for a while and optimal search performance is desired, then the optimize method (renamed forceMerge in later Lucene releases, including the 4.6 release used in section 4) should be called before the index is closed.

Opening an IndexWriter creates a lock file for the directory in use. Trying to open another IndexWriter on the same directory will lead to a LockObtainFailedException. The LockObtainFailedException is also thrown if an IndexReader on the same directory is used to delete documents from the index.
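The single-writer rule can be illustrated with a plain lock file. The sketch below assumes a hypothetical WriteLockSketch class that relies on the atomic "create if absent" behaviour of Files.createFile and signals a conflict with IllegalStateException; Lucene has its own lock factories and its own exception type, so treat this as an analogy rather than its implementation.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch of a lock file guarding an index directory.
final class WriteLockSketch implements AutoCloseable {
    private final Path lockFile;

    private WriteLockSketch(Path lockFile) {
        this.lockFile = lockFile;
    }

    // Obtains the lock for an existing directory, or fails if another holder exists.
    static WriteLockSketch obtain(Path indexDir) {
        Path lockFile = indexDir.resolve("write.lock");
        try {
            Files.createFile(lockFile); // atomic: fails if the file is already there
        } catch (FileAlreadyExistsException e) {
            throw new IllegalStateException("Lock already held: " + lockFile, e);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new WriteLockSketch(lockFile);
    }

    // Releasing the lock removes the file so that another writer can be opened.
    @Override
    public void close() {
        try {
            Files.deleteIfExists(lockFile);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        File dir = new File(System.getProperty("java.io.tmpdir"), "lock-sketch-" + System.nanoTime());
        dir.mkdirs();
        try (WriteLockSketch first = WriteLockSketch.obtain(dir.toPath())) {
            WriteLockSketch.obtain(dir.toPath()); // second writer on the same directory
        } catch (IllegalStateException e) {
            System.out.println("Second writer rejected");
        }
    }
}
```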

Core searching classes

The core searching classes are:

  1. IndexSearcher
  2. Term
  3. Query
  4. TermQuery
  5. TopDocs

Lucene uses instances of the aptly named IndexReader to read data from an index.

Lucene supplies an IndexSearcher class that performs the actual search. Every index searcher wraps an index reader to get a handle on the indexed data. Once we have an index searcher, we can supply queries to it and enumerate results in order of their score. There is really nothing to configure in an index searcher other than its reader.

IndexSearcher instances are completely thread safe, meaning multiple threads can call any of their methods concurrently. If an application requires external synchronization, there is no need to synchronize on the IndexSearcher instance; we can use our own (non-Lucene) objects instead.

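Here is how an IndexSearcher is created (this path-based constructor comes from older Lucene releases; the sample in section 4 wraps an IndexReader instead):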

IndexSearcher is = new IndexSearcher(path);

A query is broken up into terms and operators. There are two types of terms: Single Terms and Phrases.

A Single Term is a single word such as “test” or “hello”.

A Phrase is a group of words surrounded by double quotes such as “hello User”.

Multiple terms can be combined together with Boolean operators to form a more complex query.

Lucene supports fielded data, which search modules built on Lucene often use in faceted searches. By default, a search targets the default field, which is the contents field in our examples. However, we can search data in a specific field by typing the field name followed by a colon “:” and the term we are looking for.

For example, to search for a document titled “The Right Way” that contains the text “go”, we can enter:

title:"The Right Way" AND contents:go

or

title:"The Right Way" AND go

Since contents is the default field, the field indicator is not required.

The field is only valid for the term that it directly precedes, so the query

title:Right Way

will only find “Right” in the title field. It will attempt to find “Way” in the default field (in this case the contents field).
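The field rule above can be checked with a tiny parser. This is a simplified sketch, assuming a hypothetical FieldedQuerySketch class that understands only field:term, quoted phrases and the bare operators AND and OR; Lucene's real QueryParser supports a much richer syntax.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: splits a query into "field=text" clauses using a default field.
final class FieldedQuerySketch {
    // Optional "field:" prefix, followed by either a quoted phrase or a single word.
    private static final Pattern CLAUSE =
            Pattern.compile("(?:(\\w+):)?(?:\"([^\"]*)\"|(\\S+))");

    static List<String> parse(String query, String defaultField) {
        List<String> clauses = new ArrayList<>();
        Matcher m = CLAUSE.matcher(query);
        while (m.find()) {
            String field = m.group(1) != null ? m.group(1) : defaultField;
            String text = m.group(2) != null ? m.group(2) : m.group(3);
            // Bare AND / OR are operators, not terms; this sketch simply skips them.
            if (m.group(1) == null && m.group(2) == null
                    && (text.equals("AND") || text.equals("OR"))) {
                continue;
            }
            clauses.add(field + "=" + text);
        }
        return clauses;
    }

    public static void main(String[] args) {
        System.out.println(parse("title:Right Way", "contents"));
        // [title=Right, contents=Way]
        System.out.println(parse("title:\"The Right Way\" AND go", "contents"));
        // [title=The Right Way, contents=go]
    }
}
```

The first call shows why title:Right Way searches only “Right” in the title field: the field prefix binds to the single term that directly follows it.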

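Early Lucene releases exposed the following field types (the Lucene 4.6 sample in section 4 uses the newer TextField and StringField classes instead):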

  1. UnStored
  2. Keyword
  3. UnIndexed
  4. Text
  5. Binary

TopDocs holds the top-ranked hits returned by a search for a query, ordered by descending score, so the best-matching documents come first.

For a search operation, an IndexSearcher is needed, which implements the main search methods. For each search, a new Query object is needed, and this can be obtained from a QueryParser instance. Note that the QueryParser has to be created using the same type of Analyzer that the index was created with, in our case a StandardAnalyzer. A Version is also used as a constructor argument; according to the JavaDocs, it is a class that is “Used by certain classes to match version compatibility across releases of Lucene”.

When the search is performed by the IndexSearcher, a TopDocs object is returned as a result of the execution. This class just represents search hits and allows us to retrieve ScoreDoc objects. Using the ScoreDocs we find the Documents that match our search criteria and from those Documents we retrieve the wanted information. Let’s see all of these in action.
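Before turning to the real API, the idea of "keep only the best n hits, best first" can be sketched with a bounded min-heap. The TopHitsSketch class, its Hit type and the score values are hypothetical; the sketch mirrors the role of TopDocs and ScoreDoc but is not Lucene's implementation.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical sketch of collecting the n highest-scoring hits.
final class TopHitsSketch {
    static final class Hit {
        final int doc;
        final float score;

        Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    static List<Hit> topN(Map<Integer, Float> scoresByDoc, int n) {
        Comparator<Hit> byScore = Comparator.comparingDouble((Hit h) -> h.score);
        // Min-heap: the weakest of the current top hits sits at the head.
        PriorityQueue<Hit> heap = new PriorityQueue<>(byScore);
        for (Map.Entry<Integer, Float> e : scoresByDoc.entrySet()) {
            heap.add(new Hit(e.getKey(), e.getValue()));
            if (heap.size() > n) {
                heap.poll(); // evict the lowest-scoring hit
            }
        }
        List<Hit> hits = new ArrayList<>(heap);
        hits.sort(byScore.reversed()); // best match first
        return hits;
    }

    public static void main(String[] args) {
        Map<Integer, Float> scores = new HashMap<>();
        scores.put(1, 0.2f);
        scores.put(2, 0.9f);
        scores.put(3, 0.5f);
        for (Hit hit : topN(scores, 2)) {
            System.out.println("doc=" + hit.doc + " score=" + hit.score);
        }
    }
}
```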

4. Simple Search Application Using Apache Lucene

Before we start our first search application, we have to download Lucene. This article uses the Lucene jar files of version 4.6.

Next, we create a project named ‘LuceneWink’ and add the jar files to the classpath of the project.

Before we begin running search queries, we need to build an index against which the queries will be executed. This is done with the help of a class named IndexWriter, which creates and maintains an index. The IndexWriter receives Documents as input, where a document is the unit of indexing and search. Each Document is a set of Fields, and each field has a name and a textual value. To create an IndexWriter, an Analyzer is required. This class is abstract, and the concrete implementation that we will use is StandardAnalyzer.

In the application below, we will find the files that contain a string that we provide as the query. To do so, we first build an index of the files and then run our search against that index.

Here is the sample program, with comments given inline:

package com.wf.lucene;

import java.io.File;
import java.io.FileReader;
import java.util.ArrayList;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopScoreDocCollector;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class LuceneOnFileSystemExample {
	static String DATA_FOLDER = "/home/piyas/Documents/Winkframe/sample_text_files/drugs/"; // Where the files are.
	static String INDEX_FOLDER = "/home/piyas/Documents/Winkframe/sample_text_files/drugindex/"; // Where the Index files are.
	private static StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_46);
        private static IndexWriter writer;
	private static ArrayList<File> queue = new ArrayList<File>();

	public static void indexFilesAndShowResults(String dataFilePath,String indexFilePath,String searchTerm) throws Exception {

		// Indexing part
		indexOnThisPath(indexFilePath); // Function for setting the Index Path
		indexFileOrDirectory(dataFilePath); // Indexing the files
		closeIndex(); // Close the index writer so that the changes are committed

		// Search Part
		searchInIndexAndShowResult(indexFilePath, searchTerm);

	}

	public static void searchInIndexAndShowResult(String indexFilePath,String searchString) throws Exception{
	    IndexReader reader = DirectoryReader.open(FSDirectory.open(new File(indexFilePath))); // The api call to read the index
	    IndexSearcher searcher = new IndexSearcher(reader); // The Index Searcher Component
	    TopScoreDocCollector collector = TopScoreDocCollector.create(5, true);

	    Query q = new QueryParser(Version.LUCENE_46, "contents", analyzer).parse(searchString);
	    searcher.search(q, collector);
	    ScoreDoc[] hits = collector.topDocs().scoreDocs;

	    // Display the results
	    System.out.println("Found " + hits.length + " hits.");
	    for (int i = 0; i < hits.length; ++i) {
	      int docId = hits[i].doc;
	      Document d = searcher.doc(docId); // Load the stored fields of the hit
	      System.out.println((i + 1) + ". " + d.get("path") + " score=" + hits[i].score);
	    }
	    reader.close(); // Release the index files held by the reader
	}

	 public static void closeIndex() throws Exception {
		    writer.close(); // Close the Index
		  }

	public static void indexOnThisPath(String indexDir) throws Exception {
	    // IndexWriterConfig defaults to OpenMode.CREATE_OR_APPEND: a new index is created
	    // if none exists, otherwise documents are appended to the existing index.
	    FSDirectory dir = FSDirectory.open(new File(indexDir));
	    IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_46, analyzer);
	    writer = new IndexWriter(dir, config);

	}

	/**
	   * Indexes a file or directory
	   * @param filePath the path of a text file or a folder we wish to add to the index
	   * @throws Exception if the index cannot be read or written
	   */
	public static void indexFileOrDirectory(String filePath) throws Exception {
            // Adding the files in lucene index
	    //===================================================
	    //gets the list of files in a folder (if user has submitted
	    //the name of a folder)
	    //===================================================
	    addFiles(new File(filePath));

	    int originalNumDocs = writer.numDocs();
	    for (File f : queue) {
	      FileReader fr = null;
	      try {
	        Document doc = new Document();

	        //===================================================
	        // add contents of file
	        //===================================================
	        fr = new FileReader(f);
	        doc.add(new TextField("contents", fr));
	        doc.add(new StringField("path", f.getPath(), Field.Store.YES));
	        doc.add(new StringField("filename", f.getName(), Field.Store.YES));

	        writer.addDocument(doc);
	        System.out.println("Added: " + f);
	      } catch (Exception e) {
	        System.out.println("Could not add: " + f);
	      } finally {
	        if (fr != null) {
	          fr.close(); // fr stays null if the file could not be opened
	        }
	      }
	    }

	    int newNumDocs = writer.numDocs();
	    System.out.println("");
	    System.out.println("************************");
	    System.out.println((newNumDocs - originalNumDocs) + " documents added.");
	    System.out.println("************************");

	    queue.clear();
	  }

	  private static void addFiles(File file) {

	    if (!file.exists()) {
	      System.out.println(file + " does not exist.");
	      return; // Nothing to index for a missing path
	    }
	    if (file.isDirectory()) {
	      for (File f : file.listFiles()) {
	        addFiles(f);
	      }
	    } else {
	      String filename = file.getName().toLowerCase();
	      //===================================================
	      // Only index text files
	      //===================================================
	      if (filename.endsWith(".htm") || filename.endsWith(".html") ||
	              filename.endsWith(".xml") || filename.endsWith(".txt")) {
	        queue.add(file);
	      } else {
	              System.out.println("Skipped " + filename);
	      }
	    }
	  }

	/**
	 * @param args
	 */
	public static void main(String[] args)  {
		try{
			indexFilesAndShowResults(DATA_FOLDER,INDEX_FOLDER,"HIV"); // Indexing files and Searching the word from files.
		}
		catch(Exception e)
		{
			e.printStackTrace();
		}
	}

}

Our sample application is attached with this article.

We pass the data folder, the index folder and the search term to indexFilesAndShowResults, which first builds the index and then calls searchInIndexAndShowResult. In that method, we create an IndexSearcher, a QueryParser and a Query object. Note that the QueryParser uses the name of the field that we used when creating the Documents with the IndexWriter (“contents”) and that the same type of Analyzer is used again (StandardAnalyzer). We perform the search, keeping at most five hits, and for each matching Document we print the value of the field that holds the path of the file (“path”) together with its score.

Here we have built a simple search application using Apache Lucene. In the next articles, we will cover more advanced queries and other advanced options of Lucene indexing and searching in detail.

5. Download the Source Code

You may download the source code here and the data archive here.

Piyas De

Piyas is a Sun Microsystems certified Enterprise Architect with 10+ years of professional IT experience in various areas such as Architecture Definition, Define Enterprise Application, Client-server/e-business solutions. Currently he is engaged in providing solutions for digital asset management in media companies. He is also the founder and main author of "Technical Blogs (Blog about small technical Know hows)" Hyperlink - http://www.phloxblog.in