Searching made easy with Apache Lucene 4.3

Lucene is a full-text search engine library written in Java that can add powerful search capabilities to any application. At the heart of Lucene lies a file-based full-text index. Lucene provides APIs to create this index, to add content to it and to delete content from it. It also allows information to be searched and retrieved from the index using powerful search algorithms. The indexed data can be pulled from disparate sources such as databases, filesystems and websites. Before we begin, let us go over a few terms.

Inverted Index

An inverted index is a data structure that maps content (such as a word) to the locations of the objects that contain it. Here are some examples to make this clearer:

  1. Book Index – The index of a book lists the important words and the pages that contain them, so it helps us navigate to the pages that contain a particular word.
  2. Listing of wines by price range – The price range is the content and the wine name is the object that has that price range.
  3. Web Index – Listing of website addresses by keyword. For example, a list of all web pages containing the keywords “Apache Lucene”.
  4. Shopping Cart – Listing of the items in a shopping cart by category.
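
To make the idea concrete, here is a minimal sketch of an inverted index in plain Java. It is not how Lucene stores its index; the class name InvertedIndexSketch and its methods are made up for this illustration. It maps each lowercased word to the ids of the documents that contain it, much like the book index in the first example.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustrative only: a toy inverted index, not Lucene's on-disk format.
class InvertedIndexSketch {

    // word -> sorted ids of the documents that contain the word
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Splits the text on non-alphanumeric characters and records the
    // document id under every lowercased word.
    void add(int docId, String text) {
        for (String word : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!word.isEmpty()) {
                postings.computeIfAbsent(word, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Returns the ids of all documents containing the word (empty if none).
    Set<Integer> lookup(String word) {
        Set<Integer> ids = postings.get(word.toLowerCase(Locale.ROOT));
        return ids == null ? Collections.<Integer>emptySet() : ids;
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "Apache Lucene in Action");
        index.add(2, "Searching with Apache Solr");
        System.out.println(index.lookup("apache")); // ids of both documents
        System.out.println(index.lookup("lucene")); // id of the first document only
    }
}
```

Looking up a word is a single map access, no matter how many documents were added. That is the property that makes an inverted index attractive for search.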

Faceted Search

Any object can have multiple properties, and each of these properties is a facet of that object. Faceted search allows us to search for a collection of objects based on multiple facets. It is also known as faceted navigation or faceted browsing, and it lets us search information that is organized according to a faceted classification.

Consider an item in a shopping cart. The item can have multiple facets such as category, title, price, color and weight. A faceted search would allow us to find all items that are in the garden category, are red, and are priced between Rs.30 and Rs.40.
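
The shopping cart example can be sketched in plain Java as a filter over several facets at once. This is only an illustration of the idea: the Item class, the sample data and the search method are hypothetical, and Lucene answers such queries from its index rather than by scanning a list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: filters an in-memory list by three facets.
class FacetFilterSketch {

    // A hypothetical cart item carrying the facets used in the example.
    static final class Item {
        final String title;
        final String category;
        final String color;
        final double price;

        Item(String title, String category, String color, double price) {
            this.title = title;
            this.category = category;
            this.color = color;
            this.price = price;
        }
    }

    // Keeps the items that match the category and color and whose price
    // lies in the inclusive range [minPrice, maxPrice].
    static List<Item> search(List<Item> items, String category, String color,
                             double minPrice, double maxPrice) {
        List<Item> result = new ArrayList<>();
        for (Item item : items) {
            if (item.category.equals(category)
                    && item.color.equals(color)
                    && item.price >= minPrice
                    && item.price <= maxPrice) {
                result.add(item);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Item> cart = Arrays.asList(
                new Item("Watering can", "garden", "red", 35.0),
                new Item("Garden hose", "garden", "green", 38.0),
                new Item("Mug", "home", "red", 32.0),
                new Item("Spade", "garden", "red", 55.0));
        for (Item item : search(cart, "garden", "red", 30.0, 40.0)) {
            System.out.println(item.title); // prints "Watering can"
        }
    }
}
```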

Lucene provides an API to:

  1. Create an inverted index.
  2. Store information according to a faceted classification.
  3. Retrieve information using faceted search.

Together, these make Lucene a fast search engine that returns highly relevant results.

Lucene Features

  1. Relevance-ranked search.
  2. Phrase, proximity and wildcard search.
  3. Pluggable analyzers.
  4. Faceted search.
  5. Field-based sorting.
  6. Range queries.
  7. Multiple-index searching.
  8. Fast indexing: 150GB/hour.
  9. Easy backup and restore.
  10. Small RAM requirement.
  11. Incremental addition and fast searches.

For the full list, see: http://lucene.apache.org/core/features.html

Lucene Concepts and Terminologies

  1. Indexing – Indexing means adding a document to the Lucene index with the help of a class called “IndexWriter”.
  2. Searching – Searching means retrieving documents from the Lucene index with the help of a class called “IndexSearcher”.
  3. Document – A Lucene Document is a single unit of search and index, for example an item in a shopping cart. A Lucene index can contain millions of documents.
  4. Fields – Fields are the properties of a document. In other words, fields are the facets of the object that the document represents, for example the category of an item in a shopping cart. Each document can have multiple fields.
  5. Queries – Lucene has its own query language, which allows us to search for documents based on multiple fields. We can assign a weight to a field and combine conditions with boolean operators such as AND and OR. For example: return all items in the cart that belong to the category garden or home, are red, and cost less than Rs.1000.
  6. Analyzers – Before field text is indexed, it needs to be converted into its most basic form. It is first tokenized, and the tokens are then normalized, for example lowercased, stripped of punctuation and, depending on the analyzer, singularized. These tasks are performed by Analyzers. Analyzers are complicated and using them well takes some study. When the built-in analyzers do not meet our requirements, we can create a new one. For this tutorial we will use StandardAnalyzer, as it contains most of the basic features we require.
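
To give a feel for what an analyzer does, the following plain-Java sketch tokenizes a string, strips punctuation and lowercases each token. It is an assumption-laden toy, not StandardAnalyzer and not a Lucene API; the class AnalysisSketch and its analyze method are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative only: mimics the kind of normalization an analyzer performs.
// This is not Lucene's StandardAnalyzer and uses no Lucene API.
class AnalysisSketch {

    // Tokenizes on whitespace, then strips punctuation and lowercases each token.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            // Keep only letters and digits, then lowercase the result.
            String token = raw.replaceAll("[^\\p{L}\\p{N}]", "").toLowerCase(Locale.ROOT);
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Index-time and query-time text go through the same steps, so both
        // end up as the same tokens and can be matched against each other.
        System.out.println(analyze("Lucene, in Action!")); // [lucene, in, action]
        System.out.println(analyze("LUCENE in action"));   // [lucene, in, action]
    }
}
```

Because both strings reduce to the same tokens, a query for one finds text indexed as the other. This is why the same analysis has to be applied at index time and at query time.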

Tutorial objective

  1. Create a Lucene index.
  2. Insert book records into it.
  3. Perform various kinds of searches on this index.

Each book item will have the following facets:

  1. Book Title (String)
  2. Book Author (String)
  3. Book Category (String)
  4. #Pages (int)
  5. Price (float)

The code for this tutorial has been committed to SVN. It can be checked out from: https://www.assembla.com/code/weblog4j/subversion/nodes/24/SpringDemos/trunk

This is an extended project with more tutorials. The Lucene classes are in the com.aranin.spring.lucene package:

  1. LuceneUtil – This class contains utility methods to create the index, the IndexWriter and the IndexSearcher.
  2. MySearcherManager – This class uses LuceneUtil and performs searches on the index.
  3. MyWriterManager – This class uses LuceneUtil and performs writes on the index.

Step by step walk-through

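1. Dependencies – The dependencies can be added via Maven. Here ${lucene-version} is a property defined in the POM; for this tutorial it should point to a 4.3.x release.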

<dependency>
        <artifactId>lucene-core</artifactId>
        <groupId>org.apache.lucene</groupId>
        <type>jar</type>
        <version>${lucene-version}</version>
      </dependency>

      <dependency>
        <artifactId>lucene-queries</artifactId>
        <groupId>org.apache.lucene</groupId>
        <type>jar</type>
        <version>${lucene-version}</version>
      </dependency>

      <dependency>
        <artifactId>lucene-queryparser</artifactId>
        <groupId>org.apache.lucene</groupId>
        <type>jar</type>
        <version>${lucene-version}</version>
      </dependency>

      <dependency>
        <artifactId>lucene-analyzers-common</artifactId>
        <groupId>org.apache.lucene</groupId>
        <type>jar</type>
        <version>${lucene-version}</version>
      </dependency>

      <dependency>
        <artifactId>lucene-facet</artifactId>
        <groupId>org.apache.lucene</groupId>
        <type>jar</type>
        <version>${lucene-version}</version>
      </dependency>

2. Creating the index – The index can be created by creating an IndexWriter in create mode.

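// Imports used by the index-creation and writer snippets (Lucene 4.3)
import java.io.File;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public void createIndex() throws Exception {

    // Force CREATE mode only when the index directory does not exist yet
    boolean create = true;
    File indexDirFile = new File(this.indexDir);
    if (indexDirFile.exists() && indexDirFile.isDirectory()) {
       create = false;
    }

    Directory dir = FSDirectory.open(indexDirFile);
    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_43);
    IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_43, analyzer);

    if (create) {
       // Create a new index in the directory, removing any
       // previously indexed documents:
       iwc.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
    }

    // Opening the writer and committing writes the (initially empty) index to disk
    IndexWriter writer = new IndexWriter(dir, iwc);
    writer.commit();
    writer.close(true); // true = wait for running merges to finish
}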
  • indexDir is the directory where you want to create your index.
  • Directory is a flat list of files used for storing the index. It can be a RAMDirectory, an FSDirectory or a DB-based directory.
  • FSDirectory is a Directory implementation that saves the index in files on the file system.
  • IndexWriterConfig.OpenMode opens the writer in CREATE, APPEND or CREATE_OR_APPEND mode. CREATE mode creates a new index if none exists and overwrites an existing one. Here it is set only when the index directory does not exist yet.
  • Calling the above method creates an empty index.

3. Writing to the index – Once the index is created, we can write documents to it. This can be done as follows.

public void createIndexWriter() throws Exception {

     File indexDirFile = new File(this.indexDir);

     Directory dir = FSDirectory.open(indexDirFile);
     Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_43);
     IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_43, analyzer);
     // Append to an existing index, or create one if none exists
     iwc.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
     // The writer stays open and is kept in a field so it can be reused
     this.writer = new IndexWriter(dir, iwc);

    }

The above method creates a writer in CREATE_OR_APPEND mode. In this mode an existing index is not overwritten. Note that this method does not close the writer; it creates it and stores it in an instance variable. Creating an IndexWriter is a costly operation, so we should not create a writer every time we have to write a document to the index. IndexWriter is thread-safe, and only one writer can be open on an index at a time, so rather than pooling writers we should create a single IndexWriter and share it among the threads that write to the index.

public void addBookToIndex(BookVO bookVO) throws Exception {
     Document document = new Document();
     document.add(new StringField("title", bookVO.getBook_name(), Field.Store.YES));
     document.add(new StringField("author", bookVO.getBook_author(), Field.Store.YES));
     document.add(new StringField("category", bookVO.getCategory(), Field.Store.YES));
     document.add(new IntField("numpage", bookVO.getNumpages(), Field.Store.YES));
     document.add(new FloatField("price", bookVO.getPrice(), Field.Store.YES));
     IndexWriter writer =  this.luceneUtil.getIndexWriter();
     writer.addDocument(document);
     writer.commit();
 }

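We don't create a writer while inserting. Instead we use a pre-created writer that is stored as an instance variable. Note that calling commit() after every document is simple but expensive; when adding many documents it is cheaper to commit once per batch.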

4. Searching the index – This is again done in two steps: 1. Creating an IndexSearcher. 2. Creating a query and running the search.

public void createIndexSearcher(){
    IndexReader indexReader = null;
    IndexSearcher indexSearcher = null;
    try{
         File indexDirFile = new File(this.indexDir);
         Directory dir = FSDirectory.open(indexDirFile);
         indexReader  = DirectoryReader.open(dir);
         indexSearcher = new IndexSearcher(indexReader);
    }catch(IOException ioe){
        ioe.printStackTrace();
    }

    this.indexSearcher = indexSearcher;
 }

Note – The Analyzer used to parse queries should be the same as the one used by the writer, because the analyzer determines how the data is stored in the index. Creating an IndexSearcher (and the IndexReader behind it) is also a costly operation, and IndexSearcher is thread-safe, so it makes sense to create it once and reuse it across searches rather than opening a new one for every query.

public List<BookVO> getBooksByField(String value, String field, IndexSearcher indexSearcher){
     List<BookVO> bookList = new ArrayList<BookVO>();

     try {
         // Exact-match lookup: the term must equal the indexed field value
         BooleanQuery query = new BooleanQuery();
         query.add(new TermQuery(new Term(field, value)), BooleanClause.Occur.MUST);

         // Alternative: parse the value with QueryParser, using the same
         // analyzer as the writer (parse() throws ParseException):
         // Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_43);
         // QueryParser parser = new QueryParser(Version.LUCENE_43, field, analyzer);
         // Query query = parser.parse(value);

         int numResults = 100; // maximum number of hits to return
         ScoreDoc[] hits = indexSearcher.search(query, numResults).scoreDocs;
         for (int i = 0; i < hits.length; i++) {
              // Load the stored fields of each hit and map them to a BookVO
              Document doc = indexSearcher.doc(hits[i].doc);
              bookList.add(getBookVO(doc));
         }

     } catch (IOException e) {
         e.printStackTrace();
     }

     return bookList;
}

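The IndexSearcher was pre-created and passed to the method. The main part of searching is forming the query. Because the text fields were added as StringField, each value is indexed verbatim as a single term, so the TermQuery above matches only the exact field value. Lucene supports many different kinds of queries: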

  1. TermQuery
  2. BooleanQuery
  3. WildcardQuery
  4. PhraseQuery
  5. PrefixQuery
  6. MultiPhraseQuery
  7. FuzzyQuery
  8. RegexpQuery
  9. TermRangeQuery
  10. NumericRangeQuery
  11. ConstantScoreQuery
  12. DisjunctionMaxQuery
  13. MatchAllDocsQuery

You can choose the appropriate queries for your searches. The query language syntax is described here: http://lucene.apache.org/core/old_versioned_docs/versions/2_9_1/queryparsersyntax.pdf
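
The search method earlier adds its TermQuery to a BooleanQuery with the MUST flag. As a rough sketch of how the three clause types (MUST, SHOULD, MUST_NOT) decide whether a document matches, here is a plain-Java model. It reflects my understanding of the default behaviour, ignores scoring entirely and uses no Lucene API; the class BooleanClauseSketch, its Clause type and the "field:value" term strings are invented for the illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: models how MUST, SHOULD and MUST_NOT clauses decide
// whether a document matches. Scoring is ignored; no Lucene API is used.
class BooleanClauseSketch {

    enum Occur { MUST, SHOULD, MUST_NOT }

    static final class Clause {
        final String term;
        final Occur occur;

        Clause(String term, Occur occur) {
            this.term = term;
            this.occur = occur;
        }
    }

    static boolean matches(Set<String> docTerms, List<Clause> clauses) {
        boolean hasMust = false;
        boolean shouldHit = false;
        for (Clause clause : clauses) {
            boolean present = docTerms.contains(clause.term);
            switch (clause.occur) {
                case MUST:
                    hasMust = true;
                    if (!present) return false; // a required term is missing
                    break;
                case MUST_NOT:
                    if (present) return false;  // a prohibited term is present
                    break;
                case SHOULD:
                    shouldHit |= present;
                    break;
            }
        }
        // With no MUST clause, at least one SHOULD clause has to match;
        // a query made only of MUST_NOT clauses matches nothing.
        return hasMust || shouldHit;
    }

    public static void main(String[] args) {
        Set<String> doc = new HashSet<>(Arrays.asList("category:garden", "color:red"));
        List<Clause> redNotToys = Arrays.asList(
                new Clause("color:red", Occur.MUST),
                new Clause("category:toys", Occur.MUST_NOT));
        List<Clause> gardenOrHome = Arrays.asList(
                new Clause("category:garden", Occur.SHOULD),
                new Clause("category:home", Occur.SHOULD));
        List<Clause> onlyNegative = Arrays.asList(
                new Clause("category:toys", Occur.MUST_NOT));
        System.out.println(matches(doc, redNotToys));   // true
        System.out.println(matches(doc, gardenOrHome)); // true
        System.out.println(matches(doc, onlyNegative)); // false
    }
}
```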

Resources

  1. http://lucene.apache.org/core/old_versioned_docs/versions/2_9_1/queryparsersyntax.pdf
  2. http://lucene.apache.org/core/old_versioned_docs/versions/3_1_0/api/all/org/apache/lucene/index/IndexWriterConfig.OpenMode.html
  3. http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/api/all/org/apache/lucene/store/FSDirectory.html
  4. https://today.java.net/pub/a/today/2003/07/30/LuceneIntro.html
  5. http://www.lucenetutorial.com/lucene-query-syntax.html
  6. http://lucene.apache.org/core/4_3_0/core/org/apache/lucene/search/Query.html

Summary

Search remains the backbone of any content-driven application. Traditional database-driven searches are not very powerful and leave a lot to be desired, so there is a need for a fast, accurate and powerful search solution that can be easily incorporated into application code. Lucene fills that gap well: it makes search a breeze and is backed by a powerful array of search features such as relevance ranking and phrase, wildcard, proximity and range search. It is also space and memory efficient. No wonder so many applications have been built on top of Lucene. This article is intended as a basic tutorial that gives readers the tools to get started with Lucene. There is a lot more to be said, but then don’t you want to explore some on your own?
 

Reference: Searching made easy with Apache Lucene 4.3 from our JCG partner Niraj Singh at the Weblog4j blog.
15 Comments
Majid Lotfi
11 years ago

Hi,
Thank you for this tutorial. I noticed that the source code you provided in the SVN is missing a lot: it does not have the lucene package and the pom is incomplete. Can you please add a readme file on how to set up or run this project?
Thanks a lot.

Niraj Singh
11 years ago

Hi Majid,

The SVN for this project has not been configured very well. But you can checkout the code from url below

https://www.assembla.com/code/weblog4j/subversion/nodes/29/SpringDemos/trunk

Please note that the revision version is 29 rather than 24.

It is a stand alone java project written in Intellij so migrating to IDE of your preference would just require copying the java files in lucene package and pom dependencies over to your POM.

Please let me know if you have issues.

Regards
Niraj

Teresa
11 years ago

Hi,

I find it is very helpful, do you have the schema for the database? so I can create the database to test the code?

Thanks in advance,
Teresa

Niraj Singh
11 years ago
Reply to  Teresa

Hi Teresa,

I am glad that you found it useful. For this project we are not dealing with database. We are creating a Lucene index and inserting/retrieving data from it. If you want to know the dimensions of book data we are dealing with have a look at addBookToIndex method.

You can download the whole project from https://www.assembla.com/code/weblog4j/subversion/nodes/29/SpringDemos/trunk

Please let me know if you have further queries.

Regards
Niraj

GG
10 years ago

I tried downloading and running it on my Netbeans (Ubuntu 12.10 OS) but it was unable to run. Some error is cropping up while accessing “D:/samayik”. Also, please tell me which is the main class…

Niraj Singh
10 years ago

Hi GG, The path “D:/samayik/mydemoindex” is a hardcoded path in the main methods of the classes com.aranin.spring.lucene.MyWriterManager and com.aranin.spring.lucene.MySearcherManager. Just to explain the project a bit: there are two main functions of Lucene, writing and reading. So as a first step we create the index in the com.aranin.spring.lucene.MyWriterManager main method. Then we go on to write something to the index. As the next step we read from the index using com.aranin.spring.lucene.MySearcherManager. Check out the main method of this class to see how we are reading. So to answer your question: 1. There are two main classes, MySearcherManager and MyWriterManager. 2. Modify the index path…

Veer
10 years ago

By default, facets are weighted as the number of documents present in them. Can we give our own weighing parameter (Eg. Relevance, Score)??

Niraj Singh
10 years ago

Hi Veer, As we know, the basic unit of information we store in an index is a Document. Each Document contains fields, which are really the facets. Each field has a score. When we search, the scores of all the fields are taken into consideration, and their sum is used to decide which search result scores higher. Now there are ways to set scoring, and after some searching on the net here are a few: 1. Boost scoring. There are ways to boost scoring at indexing time and at search time as well. This can be done using Field.setBoost()…

Veer
10 years ago
Reply to  Niraj Singh

I know we can modify the scoring system for returning search results.

But my question is about faceting (in LUCENE 4.4). If we use CountFacetRequest, it returns the number of documents present in that category (facet), but I want to change this count. I mean, is it possible to return scores instead? I tried using SumScoreFacetRequest. It's not working. It's still returning the count.

How can we change this parameter?

Pete Lyon
10 years ago

Hi! Thank you for this information!
I had to convert from Lucene 3 and it works perfectly now in 4!

Regards Pete

Niraj Singh
10 years ago
Reply to  Pete Lyon

Hi Pete,

Thanks for the comment. I am glad that you found the post useful.

Regards
Niraj

naresh
10 years ago

I need to add a web site for indexing. Which Lucene class should I use? Please help me.

Niraj Singh
10 years ago
Reply to  naresh

Hi Naresh,

You need to write a webcrawler to crawl the website and then create the index of content/website/link to do web indexing.

This can be quite a daunting task. There are lots of APIs that have been developed on top of lucene for this purpose. For example Nutch. You can use those instead of directly using Lucene.

Hope this helps.

Regards
Niraj

Prasad Bhat
9 years ago

Thanks for the tutorial.
How does Lucene work internally? What are the alternatives to Apache Lucene?

Sachin
8 years ago

Hi,

Nice tutorial, very useful.

I need to process regex queries with Lucene.

Please implement it in your example.
