Indonesian Language in Lucene, Solr and Elasticsearch
Indonesian, or Bahasa Indonesia, is a very approachable language for Westerners. It uses Latin characters, has a clear structure, no tenses, no gender or plural forms, and it contains many foreign words (as a German I especially enjoy the Dutch-influenced terms like knalpot for exhaust pipe). If you grew up outside of Asia, Indonesia might be a rather distant country that you don’t hear a lot about. But because the country is so big, quite a lot of people speak the language, making it, together with its sibling Bahasa Melayu, one of the most widely spoken languages on earth. And if that is not enough, once you visit Indonesia you will see that the people are very positive-minded and happy, which may be another reason to be interested in the language.
As I’ve been learning a bit of Indonesian and got to spend quite some time in Indonesia for work and leisure I thought it might be a good idea to look into the Indonesian Analyzer for Lucene and see how it processes text. If you don’t know what an Analyzer does I can point you to one of my older posts on the absolute basics of indexing data.
The IndonesianAnalyzer in Lucene
If you want to use the IndonesianAnalyzer, it is available with lucene-analyzers-common, which you most likely have included already. You can just create an instance and use it in any way you like. This snippet will display the terms for the text in a String.
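    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.id.IndonesianAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    // Runs the text through the IndonesianAnalyzer and collects the resulting terms.
    private List<String> analyze(String text) throws IOException {
        List<String> terms = new ArrayList<>();
        try (Analyzer analyzer = new IndonesianAnalyzer();
             TokenStream tokenStream = analyzer.tokenStream(null, text)) {
            tokenStream.reset(); // must be called before the first incrementToken()
            while (tokenStream.incrementToken()) {
                terms.add(tokenStream.getAttribute(CharTermAttribute.class).toString());
            }
            tokenStream.end(); // finishes the stream before it is closed
        }
        return terms;
    }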
The IndonesianAnalyzer in elasticsearch
The IndonesianAnalyzer can be used with elasticsearch as well. In the mapping you can refer to it by the analyzer name indonesian.
    {
      "mappings": {
        "doc": {
          "properties": {
            "content": {
              "type": "text",
              "analyzer": "indonesian"
            }
          }
        }
      }
    }
The elasticsearch documentation also has a section on the analyzer explaining how to rebuild it using different filters.
The IndonesianAnalyzer in Solr
Most of the time you would create your own analyzer chain in Solr. The following example is taken from the reference guide.
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.IndonesianStemFilterFactory" stemDerivational="true"/>
    </analyzer>
Features of the Analyzer
Let’s look at a very simple example sentence first.
Saya mau makan mie ayam.
In English: I want to eat chicken noodles. Not only did you learn that I like Indonesian food, but you can also see that the Indonesian language uses Latin characters and separates words by whitespace. Let’s see what the IndonesianAnalyzer does with this text.
If you look at the terms produced by the Lucene example above you will get the following list.
[makan, mie, ayam]
So only three of the five words are left. Saya (I) and mau (want to) are dropped. This is caused by a default list of stopwords, words that are considered unimportant for searching. Those words are maintained in a text file that is shipped with the analyzer. If you want to use a different list for your content, you can use one of the constructors that accept a CharArraySet; for elasticsearch and Solr you can use a custom StopFilter.
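To make the effect of a stopword list concrete, here is a minimal plain-Java sketch of stopword removal. It is not Lucene’s StopFilter and it does not use the CharArraySet constructor; the class name StopwordSketch, the method analyze and the two-word stopword set are assumptions made up for this illustration, and the tokenization (split on non-letters, then lowercase) is only a rough simplification of what the analyzer does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Plain-Java illustration of stopword removal; not Lucene's StopFilter. */
class StopwordSketch {

    /** Splits on anything that is not a letter, lowercases, and drops stopwords. */
    static List<String> analyze(String text, Set<String> stopwords) {
        List<String> terms = new ArrayList<>();
        for (String token : text.split("[^\\p{L}]+")) {
            String term = token.toLowerCase(Locale.ROOT);
            // Skip empty fragments and every term found in the stopword set.
            if (!term.isEmpty() && !stopwords.contains(term)) {
                terms.add(term);
            }
        }
        return terms;
    }
}
```

Calling StopwordSketch.analyze("Saya mau makan mie ayam.", Set.of("saya", "mau")) leaves makan, mie and ayam; with an empty set all five words survive, which is the kind of difference a custom stopword list makes.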
Now, the rest of the words remained the same; there’s no stemming involved yet. Stemming is a common way to process natural language by reducing terms to their base form. Let’s look at another example.
Kami, bangsa Indonesia, dengan ini menjatakan kemerdekaan Indonesia.
This is the first sentence of the Indonesian declaration of independence, which was proclaimed in 1945. In English: We, the people of Indonesia, hereby declare the independence of Indonesia.
If you process this text using the Analyzer you will get the following list of terms.
[bangsa, indonesia, jata, merdeka, indonesia]
Again, words like kami, dengan and ini have been removed because they are in the list of stopwords. But something else has happened: menjatakan became jata and kemerdekaan became merdeka. The Indonesian language doesn’t have verb inflection, but there are many prefixes and suffixes that can change the meaning of words. In this case kemerdekaan (independence) is a variation of merdeka (independent). Some more examples: makan is to eat, makanan is food; minum is to drink, minuman is a drink; sama is same, bersama is together. The IndonesianAnalyzer will stem those examples correctly (even though sama and bersama are stopwords and would be removed with the default list).
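The idea of stripping affixes can be sketched in a few lines of plain Java. This is a deliberately tiny toy and not the algorithm used by Lucene’s IndonesianStemmer: the class ToyAffixStripper, its three rules (ke-...-an, ber-, -an) and the minimum stem length of four characters are all assumptions chosen only so that the example words from this post come out right.

```java
import java.util.Locale;

/** Toy affix stripper for illustration only; NOT Lucene's IndonesianStemmer. */
class ToyAffixStripper {

    // Assumed guard so that short words such as "makan" are left alone.
    private static final int MIN_STEM_LENGTH = 4;

    static String strip(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        // Prefix and suffix together: kemerdekaan -> merdeka
        if (w.startsWith("ke") && w.endsWith("an") && w.length() - 4 >= MIN_STEM_LENGTH) {
            return w.substring(2, w.length() - 2);
        }
        // Prefix ber-: bersama -> sama
        if (w.startsWith("ber") && w.length() - 3 >= MIN_STEM_LENGTH) {
            return w.substring(3);
        }
        // Suffix -an: makanan -> makan, minuman -> minum
        if (w.endsWith("an") && w.length() - 2 >= MIN_STEM_LENGTH) {
            return w.substring(0, w.length() - 2);
        }
        return w;
    }
}
```

Even this toy shows why rule-based stemming is fragile: it maps makanan to makan and leaves makan itself alone, but it would also wrongly shorten tangan (hand) to tang. A real stemmer needs many more rules and checks.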
Implementation
Like most analyzers, the IndonesianAnalyzer just combines a few other components, namely a Tokenizer and several TokenFilters.
- StandardTokenizer
- StandardFilter
- LowerCaseFilter
- StopFilter
- SetKeywordMarkerFilter
- IndonesianStemFilter
The IndonesianStemFilter is the interesting component that is responsible for the stemming. It uses the IndonesianStemmer that is based on the paper A Study of Stemming Effects on Information Retrieval in Bahasa Indonesia.
As with most other rule-based stemmers, some words might not be stemmed correctly. An example: menunggu means waiting; it is stemmed to unggu, but the correct base form would be tunggu. If you want to get rid of cases like this, you can either add the words to the stemExclusionSet that can be passed to the analyzer to protect them from stemming, or you can build your own analyzer that uses the StemmerOverrideFilter. Maybe that’s material for another blog post.
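The two options differ in what ends up in the index, which a small plain-Java sketch can show. It does not use Lucene; StemGuardSketch, its stem method and the stub stemmer in the usage note are hypothetical names for this illustration. The sketch assumes that an exclusion leaves the term untouched, while an override replaces it with a hand-picked base form before the rules get a chance.

```java
import java.util.Map;
import java.util.Set;
import java.util.function.UnaryOperator;

/** Plain-Java sketch of protecting or correcting terms around a rule-based stemmer. */
class StemGuardSketch {

    static String stem(String term,
                       Set<String> exclusions,
                       Map<String, String> overrides,
                       UnaryOperator<String> stemmer) {
        if (exclusions.contains(term)) {
            return term;              // protected: indexed exactly as written
        }
        String override = overrides.get(term);
        if (override != null) {
            return override;          // dictionary entry wins over the rules
        }
        return stemmer.apply(term);   // everything else goes through the rules
    }
}
```

With a stub stemmer that simply strips a leading men, menunggu becomes unggu by default, stays menunggu when it is in the exclusion set, and becomes tunggu when the override map contains menunggu mapped to tunggu. Only the override makes a search for tunggu match, which is why it can be worth the extra effort.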
Scoring
Bahasa Indonesia poses an interesting challenge when it comes to scoring search results. Scoring algorithms like TF/IDF and BM25 rely on the frequency of terms. But in Indonesian a plural is often formed by just repeating a word: mobil means car, mobil mobil means cars. Whether a text talks about a single car or multiple cars shouldn’t make a difference when it comes to scoring. Depending on the text you are searching, it might be necessary to ignore the frequencies, or to write a custom filter that skips words that are repeated immediately.
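The core of such a filter is small. The following plain-Java sketch works on a list of already analyzed terms rather than on a Lucene TokenStream; the class RepeatCollapseSketch and the method collapseRepeats are made-up names, and a real TokenFilter would additionally have to decide what to do with positions, which is ignored here.

```java
import java.util.ArrayList;
import java.util.List;

/** Drops a term when it is identical to the term directly before it. */
class RepeatCollapseSketch {

    static List<String> collapseRepeats(List<String> terms) {
        List<String> result = new ArrayList<>();
        String previous = null;
        for (String term : terms) {
            // Keep the term only if it differs from its immediate predecessor.
            if (!term.equals(previous)) {
                result.add(term);
            }
            previous = term;
        }
        return result;
    }
}
```

For the terms mobil, mobil, merah this keeps mobil and merah, so the plural counts once for term frequency. Repetitions that are not adjacent are left alone, since those are more likely to be genuine repeated mentions.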
Conclusion
Stemming doesn’t have a place in every search application. But it’s one of the techniques that can help make natural language more accessible without being too complex. It can make your search seem like magic.
Working with natural languages is one thing I enjoy a lot when working with search engines. And if, as in this case, I learn something about the language in the process, that is even better.
Published on Java Code Geeks with permission by Florian Hopf, partner at our JCG program. See the original article here: Indonesian Language in Lucene, Solr and Elasticsearch. Opinions expressed by Java Code Geeks contributors are their own.