How to use bloom filter to build a large in memory cache in Java
BackGround
Cache is an important concept to solve day to day software problems. Your application may perform CPU intensive operations, which you do not want to perform again and again, instead you derive the result once and cache it in memory. Sometimes the bottleneck is IO, like you do not want to hit the database repeatedly and would like to cache the results and update the cache only if underlying data changes.
Similarly there are other use cases where we need to perform a quick lookup to decide what to do with an incoming request. For example, consider this use case where you have to identify that one URL points to a malware site or not. There could be many URLs like that, to do it in an instance, if we cache all the malware URLs in memory, that would require a lot of space to hold them. Or another use case could be to identify if a user typed string has any reference to a place in USA. Like “museum in Washington” – in this string, Washington is a name of a place in USA. Should we keep all the places in USA in memory and then lookup? How big the cache size would be? Is it effective to do it without any database support?
This is where we need to move away from basic map data structure and look for answers in more advanced data structure like bloomfilter. You can consider bloomfilter, like any other java collection where you can put items in it and ask it whether an item already present in it or not (like a HashSet). If Bloomfilter mentions that it does not contain the item, then definitely that item is not present. But if it mentions that it has seen the item, then that may be wrong. If we are careful enough, we can design a bloomfilter such that the probability of the wrong is controlled.
Explanation
Bloomfilter is designed as an array(A) of m bits. Initially all these bits are set to 0.
To add item:
In order to add any item, it needs to be feed through k hash functions. Each hash function will generate a number which can be treated as a position of the bit array (hash modulo array length can give us the index of the array) and we shall set that value of that position to 1. For example – first hash function(hash1) on item I produce a bit position x, similarly second and third hash functions produce position y and z.
So we shall set:
A[x]=A[y]=A[z] = 1
To find item:
Similar process will be repeated, item will be hashed three times through three different hash functions. Each hash function will produce an integer which will be treated as a position of the array. We shall inspect those x,y, z positions of the bit array and see if they are set to 1 or not. If no, for sure no one ever tried to add this item into bloomfilter, but if all the bits are set, it could be a false positive.
Things to tune
From the above explanation, it becomes clear that to design a good bloomfilter we need to keep track of the following things
- Good hash functions that can generate wide range of hash values as quickly as possible
- The value of m (size of the bit array) is very important. If the size is too small, all the bits will be set to 1 quickly and false positives will grow largely.
- Number of hash functions(k) is also important so that the values get distributed evenly.
If we can estimate how many items we are planning to keep in the bloom filter, we can calculate the optimal values of k and m. Skipping the mathematical details, the formula to calculate k and m are enough for us to write a good bloomfilter.
Formula to determine m (number of bits for the bloom filter) is as bellow:
m = - nlogp / (log2)^2;
where p = desired false positive probability
Formula to determine k (number of hash functions) is as bellow:
k = m/n log(2) ;
where k = number of hash functions, m=number of bits and n= number of items in the filter
Hashing
Hashing is an area which affects the performance of bloomfilter. We need to choose a hash function that is effective yet not time consuming. In the paper, “Less Hashing, Same Performance: Building a Better Bloom Filter” it is discussed how we can use two hash functions to generate K number of hash functions. First we need to calculate two hash function h1(x) and h2(x). Next, we can use these two hash functions to simulate k hash functions of the nature
gi(x) = h1(x) + ih2(x);
where i can range from {1..k}
Google guava library uses this trick in their bloomfilter implementation, the hashing logic is outlined here :
long hash64 = …; //calculate a 64 bit hash function //split it in two halves of 32 bit hash values int hash1 = (int) hash64; int hash2 = (int) (hash64 >>> 32); //Generate k different hash functions with a simple loop for (int i = 1; i <= numHashFunctions; i++) { int nextHash = hash1 + i * hash2; }
Applications
It is clear from the mathematical formulas that to apply bloomfilter to solve a problem, we need to understand the domain very well. Like we can apply bloomfilter to hold all the cities name in usa. This number is deterministic and we have prior knowledge, so we can determine n (total number of elements to be added to the bloomfilter). Fix p(probability of false positive) according to business requirements. In that case, we have a perfect cache which is memory efficient and lookup time is very low.
Implementations
Google guava library has an implementation of Bloomfilter. Check how the constructor of this class, which asks for expected items and false positive rate.
import com.google.common.hash.BloomFilter; import com.google.common.hash.Funnels; //Create Bloomfilter int expectedInsertions = ….; double fpp = 0.03; // desired false positive probability BloomFilter<CharSequence> bloomFilter = BloomFilter.create(Funnels.stringFunnel(Charset.forName("UTF-8")), expectedInsertions,fpp)
Resources:
- http://en.wikipedia.org/wiki/Bloom_filter
- http://billmill.org/bloomfilter-tutorial/
- http://www.eecs.harvard.edu/~kirsch/pubs/bbbf/esa06.pdf