Garbage Collection: increasing the throughput
The inspiration for this post came after stumbling upon the “Pig in the Python” definition in the memory management glossary. Apparently, this term is used to describe the situation where the GC repeatedly promotes large objects from generation to generation. The effect of doing so is supposedly similar to that of a python swallowing its prey whole, only to become immobilised during digestion.
For the next 24 hours I just could not get the picture of choking pythons out of my head. As psychiatrists say, the best way to let go of your fears is to speak about them. So here we go. But instead of pythons, the rest of the story will be about garbage collection tuning. I promise.
Garbage Collection pauses are well known for their potential to become a performance bottleneck. Modern JVMs do ship with advanced garbage collectors, but, as I have experienced, finding an optimal configuration for a particular application is still darn difficult. To even stand a chance of tackling the issue manually, you need to understand the exact mechanics of the garbage collection algorithms. This post might help you in that regard, as I am going to use an example to demonstrate how small changes in JVM configuration can affect the throughput of your application.
Example
The application we use to demonstrate the GC impact on throughput is a simple one. It consists of just two threads:
- PigEater – simulating a situation where the python keeps eating one pig after another. The code achieves this by adding 32MB worth of bytes into a java.util.List and sleeping for 100ms after each attempt.
- PigDigester – simulating an asynchronous digestion process. The code implements digestion by simply nullifying that list of pigs. As this is a rather tiring process, this thread sleeps for 2000ms after each reference cleaning.
Both threads will run in a while loop, continuing to eat and digest until the snake is full. This happens at around 5,000 pigs eaten.
package eu.plumbr.demo;

import java.util.ArrayList;
import java.util.List;

public class PigInThePython {
    static volatile List<byte[]> pigs = new ArrayList<>();
    static volatile int pigsEaten = 0;
    static final int ENOUGH_PIGS = 5000;

    public static void main(String[] args) throws InterruptedException {
        new PigEater().start();
        new PigDigester().start();
    }

    static class PigEater extends Thread {
        @Override
        public void run() {
            while (true) {
                pigs.add(new byte[32 * 1024 * 1024]); // 32MB per pig
                if (pigsEaten > ENOUGH_PIGS) return;
                takeANap(100);
            }
        }
    }

    static class PigDigester extends Thread {
        @Override
        public void run() {
            long start = System.currentTimeMillis();
            while (true) {
                takeANap(2000);
                pigsEaten += pigs.size();
                pigs = new ArrayList<>();
                if (pigsEaten > ENOUGH_PIGS) {
                    System.out.format("Digested %d pigs in %d ms.%n", pigsEaten, System.currentTimeMillis() - start);
                    return;
                }
            }
        }
    }

    static void takeANap(int ms) {
        try {
            Thread.sleep(ms);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Now let's define the throughput of this system as the "number of pigs digested per second". Taking into account that a pig is stuffed into the python every 100ms, we see that the theoretical maximum throughput of this system is 10 pigs/second.
Configuring the GC example
Let's see how the system behaves under two different configurations. In both cases, the application was run on a dual-core Mac (OS X 10.9.3) with 8GB of physical memory.
First configuration:
- 4G of heap (-Xms4g -Xmx4g)
- Using CMS to clean the old generation (-XX:+UseConcMarkSweepGC) and the parallel ParNew collector to clean the young generation (-XX:+UseParNewGC)
- Has allocated 12.5% of the heap (-Xmn512m) to the young generation, with the sizes of Eden and the Survivor spaces further restricted to be equal. A possible full launch command is shown right after this list.
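For reference, the launch command for this configuration could look roughly like the following. The post does not spell out which flag was used to make Eden and the Survivor spaces equally sized; -XX:SurvivorRatio=1 is one assumption that would achieve it:

java -Xms4g -Xmx4g -Xmn512m -XX:SurvivorRatio=1 \
     -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
     eu.plumbr.demo.PigInThePython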
The second configuration is a bit different:
- 2G of heap (-Xms2g -Xmx2g)
- Using the Parallel GC to conduct garbage collection in both the young and tenured generations (-XX:+UseParallelGC)
- Has allocated 75% of the heap to the young generation (-Xmn1536m); again, a possible launch command follows this list.
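Correspondingly, the second configuration could be launched along these lines:

java -Xms2g -Xmx2g -Xmn1536m -XX:+UseParallelGC eu.plumbr.demo.PigInThePython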
Now it is time to place your bets: which of the two configurations performed better in terms of throughput (pigs digested per second, remember)? Those of you who put your money on the first configuration will be disappointed, as the results are exactly reversed:
- First configuration (large heap, large old space, CMS GC) is capable of eating 8.2 pigs/second
- Second configuration (2x smaller heap, large young space, Parallel GC) is capable of eating 9.2 pigs/second
Now, let me put the results in perspective. By allocating 2x less resources (memory-wise) we achieved 12% better throughput. This is so contrary to common knowledge that it might require some further clarification of what was actually happening.
Interpreting the GC results
The reason for what you are facing is not too complex, and the answer is staring right at you when you take a closer look at what the GC is doing during the test run. You can use the tool of your choice for this; I peeked under the hood with the help of jstat, sampling the GC counters of the process once a second (-t adds a timestamp column, -h20 repeats the header every 20 rows):
jstat -gc -t -h20 PID 1s
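If you prefer to gather the same kind of numbers from inside the JVM instead of attaching jstat, the standard GarbageCollectorMXBean API exposes the collection counts and accumulated collection times per collector. A minimal sketch, printing one line per collector, could look like this:

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcStats {
    public static void main(String[] args) {
        // One bean per collector, e.g. "ParNew" and "ConcurrentMarkSweep" for the first
        // configuration, or "PS Scavenge" and "PS MarkSweep" for the second one
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.format("%s: %d collections, %d ms total%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}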
Looking at the data, I noticed that the first configuration went through 1,129 garbage collection cycles (YGC + FGC), which in total took 63.723 seconds (GCT):
Timestamp S0C      S1C      S0U      S1U      EC       EU       OC        OU        PC      PU     YGC  YGCT   FGC FGCT  GCT
594.0     174720.0 174720.0 163844.1 0.0      174848.0 131074.1 3670016.0 2621693.5 21248.0 2580.9 1006 63.182 116 0.236 63.419
595.0     174720.0 174720.0 163842.1 0.0      174848.0 65538.0  3670016.0 3047677.9 21248.0 2580.9 1008 63.310 117 0.236 63.546
596.1     174720.0 174720.0 98308.0  163842.1 174848.0 163844.2 3670016.0 491772.9  21248.0 2580.9 1010 63.354 118 0.240 63.595
597.0     174720.0 174720.0 0.0      163840.1 174848.0 131074.1 3670016.0 688380.1  21248.0 2580.9 1011 63.482 118 0.240 63.723
The second configuration paused a total of 168 times (YGC + FGC) for just 11.409 seconds (GCT):
Timestamp S0C      S1C      S0U S1U EC        EU        OC       OU       PC      PU     YGC YGCT  FGC FGCT  GCT
539.3     164352.0 164352.0 0.0 0.0 1211904.0 98306.0   524288.0 164352.2 21504.0 2579.2 27  2.969 141 8.441 11.409
540.3     164352.0 164352.0 0.0 0.0 1211904.0 425986.2  524288.0 164352.2 21504.0 2579.2 27  2.969 141 8.441 11.409
541.4     164352.0 164352.0 0.0 0.0 1211904.0 720900.4  524288.0 164352.2 21504.0 2579.2 27  2.969 141 8.441 11.409
542.3     164352.0 164352.0 0.0 0.0 1211904.0 1015812.6 524288.0 164352.2 21504.0 2579.2 27  2.969 141 8.441 11.409
The work to be carried out was equivalent in both cases: with no long-living objects in sight, the duty of the GC in this pig-eating exercise is simply to get rid of everything as fast as possible. But with the first configuration, the GC was forced to run ~6.7x more often, resulting in ~5.6x longer total pause times.
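To spell out where those ratios come from, using the figures from the two jstat dumps above:

1,129 cycles / 168 cycles   ≈ 6.7x more garbage collection runs
63.723 s / 11.409 s         ≈ 5.6x more time spent paused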
So the story fulfilled two purposes. First and foremost, I hope I got the picture of a choking python out of my head. Another, more significant take-away is that tuning GC is a tricky exercise at best, requiring a deep understanding of several underlying concepts. Even with the truly trivial application used in this blog post, the results can have a significant impact on your throughput and capacity planning. In real-world applications the differences are even more staggering. So the choice is yours: you can either master the concepts, or focus on your day-to-day work and let Plumbr find a suitable GC configuration according to your needs.
Reference: Garbage Collection: increasing the throughput from our JCG partner Nikita Salnikov-Tarnovski at the Plumbr Blog.
Correct me if I'm wrong, but I think there is a race condition here:

pigsEaten += pigs.size();
pigs = new ArrayList<>();

If the other thread adds a pig in between these two lines, this thread would never die.
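For what it is worth, one way to close that window, purely as a sketch and not something from the original post, is to guard both the add and the count-and-reset step with a common lock, so that no pig can be added between the two statements quoted above:

import java.util.ArrayList;
import java.util.List;

class SynchronizedPigs {
    static final Object lock = new Object();
    static List<byte[]> pigs = new ArrayList<>();
    static volatile int pigsEaten = 0;

    // called by PigEater
    static void eat() {
        synchronized (lock) {
            pigs.add(new byte[32 * 1024 * 1024]);
        }
    }

    // called by PigDigester: counting and resetting happen atomically under the same lock
    static void digest() {
        synchronized (lock) {
            pigsEaten += pigs.size();
            pigs = new ArrayList<>();
        }
    }
}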