Wordcounter, Counting Words in Java with Lambdas and Fork/Join
These days I released Wordcounter, a Java library and command-line utility for counting words in text files and performing analysis on the word counts that makes heavy use of functional programming constructs and parallel computing approaches. This is my fourth entry for the “Geeky Quickies” contest at SAP, after Feeder, Todor, and Hanoier.
The library uses JDK 8 lambdas, as well as new JDK 7 features such as Fork / Join and NIO.2. It is built and can only be used with the early access version of JDK 8 with lambda support.
With the introduction of lambdas and their supporting features in JDK 8, the way we build software in Java is going to change. If you would like to get an idea how your Java code might look like in a few years, you may take a look at Wordcounter. Unlike most resources available at the moment, this is not a simple tutorial, but a real working project.
The contest task requested to implement an algorithm using Fork / Join and lambdas that analyzes all files in a directory and finds the ten most used words in the files together with how many times they occur. Rather than simply sticking with Fork / Join, I tried to find the most suitable parallel approach for this particular task, which led me to choose Producer / Consumer for the core word counting logic.
You can explore the source code on github. There is also a rather comprehensive README which provides more detailed documentation.
The latest binary, javadoc, and sources packages can be found in the GitHub downloads section. If there is enough interest, I will publish the Javadoc online and make the library available also in central Maven repositories.
Feedback, comments, and contributions are welcome!
Overview
Library Features
- Count all words in a string, a single text file, or a directory tree containing text files.
- Analyze the word counts to find the top N most used words, the bottom N least used words, or the total word count.
- Specify whether a character is a word character via an external predicate.
- Specify an optional operation to be performed on words, for example converting to lower case, via an external operator.
- Choose between non-parallel and parallel implementations to compare their performance.
- Specify the parallelism level to be a value different from the number of cores, if you need to.
Programming Highlights
- Uses Producer / Consumer for reading files and counting the words in each file in parallel. The actual mechanism is encapsulated in a generic, reusable implementation.
- Uses Fork / Join for performing analysis on the word counts. Here again the actual mechanism is encapsulated in a generic, reusable implementation.
- Uses NIO.2 for traversing directory trees and reading files.
- Makes heavy use of functional interfaces and lambda expressions in order to pass functions rather than data where appropriate.
- There are comprehensive unit and performance tests for the two most important classes.
- As usual, the code is clean, well-structured, and easy to read. Formatting, naming, and comments are uniform and consistent. A lot of attention has been put to the appropriate use of both object-oriented and functional programming techniques.
Command Line Interface
To start the command line program, execute the following command:
java -jar wordcounter-1.0.4.jar <options>
All options have reasonable default values so none of them is mandatory. Using the defaults for all options results in finding the top 10 most used words in the current directory and its subdirectories. Specifying non-default values allows specifying different directories, analysis modes, word characters, number of words, and parallelism level, as well as ignoring the case or using a serial rather than parallel computation, for example:
Find the top 10 most used words in the directory “words”: -p words
Find the bottom 5 least used words in the directory “wordsx”, considering numbers as word characters, ignoring case, with info logging: -p wordsx -m bottom -d 1234567890 -i -n 5 -l info
For more information about the command line interface options, see Command Line Interface in the README.
Design
The design of the library partitions the problem into generic parallel processing utilities, classes that encapsulate the data structures used to represent raw and sorted word counts, and finally classes that perform the counting and analysis using the capabilities of the previous two groups. Practically all of these classes use instances of functional interfaces quite a lot in order to allow specific customizations to their generic behavior. This results in code which is heavily infused with lambda expressions and method references. Welcome to the world of functional programming in Java!
Generic Parallel Processing Utilities
The ForkJoinComputer Class
The ForkJoinComputer<T>
class is a generic Fork / Join computer. It divides the initial size by 2 until either reaching the specified parallelism level or falling below the specified threshold, computes each portion serially using the specified Computer<T>
, and then joins the results of all computations using the specified Merger<T>
. Here, Computer and Merger are functional interfaces that are defined as follows:
public interface Computer<T> { T compute(int lo, int hi); } public interface Merger<T> { T merge(T result1, T result2); }
This class can be used by simply instantiating it with the appropriate lambdas and then calling its compute
method.
new ForkJoinComputer<Integer>(n, 1000, (lo, hi) -> { int sum = 0; for (int i = lo + 1; i <= hi; i++) sum += i; return sum; }, (a, b) -> a + b).compute();
The ProducerConsumerExecutor Class
The ProducerConsumerExecutor<T1, T2>
class is a generic Producer / Consumer executor. It starts a single Producer<T1>
task and multiple Mediator<T1, T2>
and Consumer<T2>
tasks with their number equal to the specified parallelism level. The producer puts T1
instances in a BlockingQueue<T1>
. The mediators take these instances from there, convert them to T2
, and put them in another blocking queue of type BlockingQueue<T2>
. Finally, the consumers take the T2
instances from the second blocking queue and process them.
Here, Producer, Consumer
, and Mediator
are functional interfaces that are defined as follows:
public interface Producer<T> { void produce(Block<T> block); } public interface Consumer<T> { void consume(T t); } public interface Mediator<T1, T2> { void mediate(T1 t, Block<T2> block); }
In the above code, Block
is a standard function defined in java.util.functions
. The block passed to the Producer
and Mediator
methods puts the produced data in the corresponding blocking queue.
Similarly to ForkJoinComputer
, this class can be used by simply instantiating it with the appropriate lambdas and then calling its execute
method.
Data Structure Classes
These classes encapsulate the data structures used to represent raw and sorted word counts.
- The WordCounts class represents a list of words mapped to their usage counts.
- The
TopWordCounts
class represents a sorted list of word usage counts mapped to all words that have such counts.
Word Counting and Analysis Classes
The WordCounter Class
The WordCounter
class provides a method for counting words in a Path
representing a file or a directory tree, either serially or in parallel. It can be used by simply instantiating it with the appropriate lambdas and then calling its count
method:
// Count all words consisting of only alphabetic chars, ignoring case, using parallel processing WordCounts wc = new WordCounter(path, (c) -> Character.isAlphabetic(c), (s) -> s.toLowerCase(), true).count();
The parallel implementation uses ProducerConsumerExecutor<Path, String>
. The producer simply walks the directory tree and produces Path
instances. The mediators read the files into text pieces, and the consumers count the words in each text piece and collect them in a single WordCounts
instance. This is done with the following piece of code:
private WordCounts countPar() { final WordCounts wc = new WordCounts(parLevel); new ProducerConsumerExecutor<Path, String>( (block) -> collectPaths(block), (file, block) -> readFileToBlock(file, block), (text) -> wc.add(countWords(text, pred, op)), parLevel).execute(); return wc; }
The WordCountAnalyzer Class
The WordCountAnalyzer
class provides methods for performing analysis on the word counts produced by WordCounter
, such as finding the top N most used words. It also can be used by simply instantiating it and then calling one of its methods such as findTop
or total
:
// Find the top 10 most used words in wc TopWordCounts twc = new WordCountAnalyzer(wc, true).findTop(10, (x, y) -> (y - x));
The differnet analysis types implement the internal Analysis<T>
interface, which is defined as follows:
interface Analysis<T> { T compute(int lo, int hi); T merge(T r1, T r2); }
Since the signatures of the above two methods mimic the Computer
and Merger
functional interfaces used by ForkJoinComputer
, we can use fork / join for all analysis types in the following way:
public TopWordCounts findTop(int number, Comparator<Integer> comparator) { return analyse(new FindTopAnalysis(number, comparator)); } private <T> T analyse(Analysis<T> a) { if (par) { return new ForkJoinComputer<T>(wc.getSize(), THRESHOLD, a::compute, a::merge, parLevel).compute(); } else { return a.compute(0, wc.getSize()); } }
For more information about the library design, see Design in the README.
Performance
I found out that the parallel Producer / Consumer word counting implementation is adapting nicely to the different number of cores and I/O speeds. It is significantly faster than the serial implementation. Unlike it, the parallel Fork / Join analysis implementation is only faster than the serial one when testing with an unrealistically large number of unique words, and only by a mild degree. With small number of unique words, it is actually slower than the serial one.
The tables below compare the performance of word counting and find top analysis under the following conditions:
- CPU AMD Phenom II X4 965 3.4 GHz (4 cores), 4 GB RAM, Windows 7, JDK 8
- Default options: words consisting of alphabetic characters, case-sensitive
- Default parallelism level, equal to the number of cores
Word Counting Performance
Implementation | Files | Words | Size (MB) | Time (ms) |
---|---|---|---|---|
Serial | 1 | 10000000 | ~65 | 2200-2400 |
Parallel | 1 | 10000000 | ~65 | 500-600 |
Serial | 100 | 10000000 | ~65 | 1600-1800 |
Parallel | 100 | 10000000 | ~65 | 500-600 |
Find Top Analysis Performance
Implementation | Words | Max Count | Top | Time (ms) |
---|---|---|---|---|
Serial | 2000000 | 10000000 | 10 | 200-250 |
Parallel | 2000000 | 10000000 | 10 | 200-250 |
Playing with the Code
If you would like to play with the code, I would recommend using the latest NetBeans 7.3 beta, which as of time of writing is NetBeans IDE 7.3 Beta 2. Note that even in this version it is not possible to compile the lambdas in the IDE, so you there are red marks all around. However, starting the Maven build and running tests from the IDE still works fine. According to this blog post, it should be possible to use IntelliJ IDEA 12 EAP build 122.202 or later with lambdas, however I didn’t try this out myself. I did try Eclipse and found that it’s a lost game since Eclipse uses its own JDT compiler which is lambda-ignorant.
Conclusion
This was my first significant encounter with functional programming in Java. Although Java is still not Scala, the new functional programming constructs changed dramatically the way I designed and implemented Wordcounter, compared to my previous Java code. I find this new programming style to be powerfull and expressive, and I believe with the release of Java 8 it will quickly become dominant.
This was also the final “Geeky Quickies” task for me. The jury awarded my submissions generously enough, and I found myself the winner even before the contest is over.
If this post picked your interest, feel free to download and explore Wordcounter, use it to learn the new Java functional programming constructs, and let me know if I could help you in this journey.
Reference: Wordcounter, Counting Words in Java with Lambdas and Fork/Join from our JCG partner Stoyan Rachev at the Blog of Stoyan Rachev blog.
Lamba is bad for Java. Why do Lamba-lovers always cram multiple lines of code into the same line whenever they post examples? This isn’t the first time I’ve seen this.
You can also use the new java 8 classes in java.util.stream that make all the parallel processing for you under the hood, powerfull and way more simple. See the latest two Java Magzine where there is a nice presentation of Streams.
http://www.oraclejavamagazine-digital.com/javamagazine_open/20140304#pg51
http://www.oraclejavamagazine-digital.com/javamagazine_open/20140506#pg50
There is an example where they show how to count words. :)
You’re article is a nice introduction to parallel data processing anyway.