Keyword extraction and similarity calculation among textual content
Background
Web applications are becoming smarter. Gone are the days when to avail a service from a website, user had to fill up a giant form. Let’s say, you have a website which is for book lovers. Before web 2.0, sites like these used to ask user all kind of questions in a form like age, books they read, types of book they like, language preference, author preference etc. Now days, it is a common practice to ask user to write a paragraph on themselves (profile). In this note, user express some details, but the challenge is, how we extract useful information from such free form text and more over how we find user who have similar interest?
This use case has become so common that every java developer should know some tricks about information retrieval from text. In this article, I shall walk you through one simple yet effective way to do it.
Processes to extract information from text
- Filter Words: Read textual content word by word and remove unwanted words. As part of this filtering state, remove all commonly used English words. One can also apply censor rules and remove sexually explicit words or hate speech etc.
- Perform Stemming: Words like ‘search’ or ‘searched’ or ‘searching’ which all mean ‘search’. This process of reducing word to its root is called stemming.
- Calculate Similarity: After first two steps, we now have a set of keywords that truly represent original text (user profile in this example). We can treat these keywords as set of unique words. To calculate similarity between two user profiles, it would be better if we represent similarity in terms of a number which represent how similar two contents are in a 0 (not similar) to 1 (completely similar) scale. One way to achieve that is to calculate Jaccard Index which is used to calculate similarity or diversity of sets.
Jaccard index J(A,B) = |A∩B|/| A⋃B|
where A and B are sets and J(A,B) lies between 0 to 1.
Implementation Details
Based on the points outlined above, one can develop a library to extract keywords and calculate similarity. However, Apache Lucene is a java library that has plenty of API to perform keyword extraction. Here is a brief description of different important areas of this API.
Tokenizer
Tokenizer splits your text into chunks. There are different tokenizers and depending upon the tokenizer you use, you can get different output token streams (sequences of chunks of text).
Stemmers
Stemmers are used to get the base of a word in question. It heavily depends on the language used. Words like ‘seaerch’, ’searched’, ’searching’ etc comes from the root word ‘search’. In information retrieval field, it is very useful if we get to the root words as that reduce noise and with fewer words we can still carry the intent of the document. One of the famous stemmer algorithm is Porter Stemmer algo.
TokenFilter
Tokenfilter can be applied on the tokenizer output to normalize or filter tokens. Like LowerCaseFilter which normalizes token text to lower case or stopfilter that suppress most frequent and almost useless words. Again, it heavily depends on language. For English these stop words are “a”, “the”, “I”, “be”, “have”, etc.
Analyzer
An analyzer is the higher level class that uses tokenizers to produce tokens from input, uses stemmers to reduce the token, uses filters to suppress/normalize the tokens. This is the class that glue the other three main components. Different Analyzers use different combinations of tokenizers and filters. For example, StandardAnalyzer uses StandardTokenizer to extract tokens from string, pass that through LowerCaseFilter to convert tokens into lower case and then pass the stream of tokens through StopFilter to remove most commonly used English words. It does not perform stemming by default. One can develop a custom analyzer by mixing and matching tokenizer and tokenfilters according the need.
Code walk through
Source code of this example can be accessed from https://github.com/shamikm/similarity . Below is a highlight of the steps:
- Create a custom analyzer that perform the following steps:
- Tokenize English words based on space, comma, period etc. Use StandardTokenizer for this task.
- Convert the tokens into lowercase using LowerCaseFilter
- Stop common English words using StopFilter
- Stem English words using Porter Stemmer
From StemmAnalyzer class:
@Override public TokenStream tokenStream(String fieldName, Reader reader) { (a).. final StandardTokenizer src = new StandardTokenizer(matchVersion, reader); TokenStream tok = new StandardFilter(matchVersion, src); (b).. tok = new LowerCaseFilter(matchVersion, tok); (c).. tok = new StopFilter(matchVersion, tok, getStopWords()); (d).. return new PorterStemFilter(tok); }
- Once we have set of words, it’s easy to calculate similarity between two sets.
From JaccardIndexBasedSimilarity class:
public double calculateSimilarity(String oneContent, String otherContet) { Set<String> keyWords1 = keywordGenerator.generateKeyWords(oneContent); Set<String> keyWords2 = keywordGenerator.generateKeyWords(otherContet); Set<String> denominator = Sets.union(keyWords1,keyWords2); Set<String> numerator = Sets.intersection(keyWords1,keyWords2); return denominator.size()>0? (double)numerator.size()/(double)denominator.size() : 0; }
Here is a sample test case to demonstrate how the code works:
@Test public void calculateSim(){ SimilarityCalculator calculator = new JaccardIndexBasedSimilarity(); Assert.assertEquals(calculator.calculateSimilarity("They Licked the platter clean","Jack Sprat could eat no fat"),0.0); //1(lamb) out of 6(littl,lamb,mari,had,go,sure) words are same Assert.assertEquals(calculator.calculateSimilarity("Mary had a little lamb", "The lamb was sure to go."), 0.16, 0.02); Assert.assertEquals(calculator.calculateSimilarity("Mary had a little lamb","Mary had a little lamb"),1.0); }
You can run this process offline and find out how one user profile is similar to any other users in your database and can start recommending users based on what similar users are reading.
Conclusion
Information retrieval from text is a common use case now-a-days. Having a basic knowledge on this critical field is helpful for any developer and in this article, we looked at how Apache Lucene API can be used effectively to extract keyword and calculate similarity among text.
Resources:
- http://en.wikipedia.org/wiki/Jaccard_index
- http://tartarus.org/martin/PorterStemmer/
- http://www.manning.com/ingersoll/
- http://www.amazon.com/Algorithms-Intelligent-Web-Haralambos-Marmanis/dp/1933988665