Selecting your Collections library
Is this really something you should bother? Is there something fundamentally wrong with java.util.ArrayList and java.util.HashMap? For most of the source code out there the answer is – no; those implementations are perfectly OK. But as always, the devil is in the details. And there exist situations when either the feature set built into the Collections API is not sufficient enough or you find the overhead provided by the standard collections to be way too high for your likings.
In the past years we have continuously stumbled upon the very same problems. And thought it would be nice to share the experience – hopefully it can save somebody a day or two. Either via avoiding yet another bidirectional Map implementation or understanding why his HashSet consumes 10x more memory than intended.
We have divided the review into two different groups of Collection libraries. First of them provides additional features to the standard Collections API. In this group we have players such as Guava and Apache Commons Collections. Another set of Collection libraries works with some aspect of performance. In this group we see libraries like Trove, fastutil and Huge Collections. We start our overview with feature-adding libraries and then move into performance-oriented landscape.
Collections API
One of the really basic APIs each and every Java developer should be familiar with. But at least once a month we run into code written by someone who is either truly creative or just has not bothered to familiarize himself with the Collections API.
So in case any of your coworkers is also into using Vectors when ArrayList is more suitable or does not understand the difference between TreeSet and TreeMap, then Janeve George has put together a nice overview on the whole API with explanation when to use what type of Collections.
It is also nice to refer to the origin of the Collections API as we know it today. Most of the abstractions, structure and functionality available within Collections API was originally designed by Doug Lea in his collections package.
Guava
Formerly known as Google Collections, Guava is a set of libraries used within several Google products. Significant part of this library is dedicated to Collections. You should look into Guava when you need for example:
- 100% thread-safe collections – check out the ImmutableCollections
- Easily count number of occurrences of a specific element in a set – MultiSet and SortedMultiSet are there to save your day.
- Looking to implement an unlabeled directed graph? Multimap is your best friend.
- Need to implement a map in which both keys and values are unique and both should be usable as keys to search values? Guava’s bidirectional map will help you out.
… and lot more interesting and useful implementations which you can check out from the project’s home page.
Apache Commons Collections
Used by several Apache projects, Commons Collections is a nice add-on if you need additional features, such as:
- FIFO/LIFO implementation – look into Queues and Buffers.
- A collection to keep multiple copies of the same element? Put them into a Bag implementation.
- A Map where elements are in particular order but not sorted according to the key’s compareTo() – the solution is available in form of OrderedMap.
In order to keep this article under 30,000 characters I am not going to cover all the goodies, but I can verify that there are a lot more nice utilities and tools available in the Apache Commons Collections project.
Trove
If all you keep in your collections are numbers, then Trove might help you increase performance and decrease memory overhead. Trove is a set of collection classes to hold exclusively primitives. Which, if you recall, consume a lot less memory than their Object-wrapping counterparts.
Quick tests show that on larger collections Trove implementations require at least three times less heap than standard Java Collection implementations. If you consider that to be an irrelevant overhead, then on 32-bit machines a Map containing 100,000 integers requires 6.3MB of heap using java.util.HashMap and 1.8MB when using Trove. Putting this now into perspective with collections containing tens or hundreds of millions of elements things already start making sense.
The would-be-equivalents to Trove, namely Apache Commons Primitives and Primitive Collections for Java both seem to be abandoned. Last releases in both cases originate from year 2003.
Huge Collections
If the pain point you are trying to remove is related to long GC pauses caused by large collections landed in old generation, then you should check out Huge Collections. They keep their content off the heap altogether, thus not affecting garbage collection almost at all. As you can see by yourself from the following dataset visualization where the author is publishing the numbers spent on GC with different sizes on data structures:
Highly Scalable Java
Need a solution where locking is not important? And need to use the data structures in environments with tens or hundreds of cores? Then Highly Scalable Java was created for you. Thank Cliff Click for that.
fastutil
Need to work with really large collections with more than 2^31 elements? Check out fastutil – it gives you data structures to work with those beasts. This problem has historically not been much of an issue. Why – because we have not had data structures of this size. And on the rare occasions we did have structures this size we faced RAM limitations before the 2^31 limitations kicked in from the API itself. So instead creating a single collection containing 100’s of millions of elements we switched to partitioning in those cases.
Other
Libraries ending up in this section are (most likely than not) anyhow worse than the ones getting more attention. It’s just that those are the Collections we have not have had a chance to try out in practice. At least until this article was published. But thanks to our readers we got hints about and added the section the already published article on 9th of January :
Javolution – main reason to use Javolution collections is that they are time-deterministic (the maximum execution time is very close to the minimum execution time) and they are RTSJ-Safe. So if you are writing either hard or soft real-time applications and have to deal with execution deadlines and CPU time budgets then check it out.
Goldman Sachs Collections. You might or might not like investment banking, but the concepts behind the GS Collections are interesting for sure. The library adds a lot of convenience methods for common operations otherwise requiring you to write Iterators and anonymous classes. There is also a claim that the GS collections also consume less CPU cycles and heap. The library has apparently been in use for Goldman’s internal applications since 2005 and we are eager to try it out by ourselves. Library also ships with good hands-on training materials packaged in GS Collections Kata project.
Summary
As often happens with of our articles – in 95% of your problems at hand the libraries introduced here will not add anything besides complicating your setup. But on the rare occasions when you need additional features or need to squeeze out the last bit of performance – it is good to be familiar with the landscape. And make an educated choice before writing a half-baked solution yourself.
Full disclaimer: after familiarizing ourselves with all the aforementioned solutions we still ended up building our own due to different performance-related reasons. But more often than not you are not so performance bound as a –javaagent tracing all object creations and destructions and can allow a bit more overhead. In case of which you most likely are better off by trying out existing solutions instead of building your own from the scratch.
Reference: Selecting your Collections library from our JCG partner Vladimir Sor at the Plumbr Blog blog.
This is indeed a very useful bunch of information. Many thanks.