Java on Steroids: 5 Super Useful JIT Optimization Techniques
Java developer? Optimize your production monitoring. See the source code, call stack, and variable state behind all logged errors, warnings, and exceptions – Try Takipi.
What are some of the most useful JVM JIT optimizations and how to use them?
Even when you’re not actively planning for it, the JVM has quite a few tricks up its sleeve to help your code perform better. Some require literally nothing from you for them to work, and others could use a little help along the way.
For this post, we’ve had a chat with Moshe Tsur, Takipi’s R&D Team Lead, and got the opportunity to share some of his tips about the JVM’s Just In Time (JIT) compiler.
Let’s see what’s going on behind the scenes.
Write once, run anywhere, optimize just in time
Most people know that their Java source code is turned into bytecode by the javac compiler, and then run by the JVM, which compiles it to Assembly and feeds it to the CPU.
Fewer people know that the rabbit hole goes deeper. The generated bytecode is an accurate representation of the original Java source code, without any optimizations. It's when the JVM translates it to Assembly that things get trickier and the magic kicks in. The JVM has 2 different operation modes:
- Interpreted mode – Where the JVM reads and runs the bytecode itself, as is.
- Compiled mode (bytecode to assembly) – Where the JVM loosens its grip and there’s no bytecode involved.
The link between the two is the JIT compiler. As you might have already guessed, interpreted mode is much slower than running directly on the CPU without a middleman.
Interpreted mode gives the Java platform a precious opportunity to collect information about the code and how it actually behaves in practice – giving it a chance to learn how it can optimize the resulting Assembly code.
After enough running time, the JIT compiler is able to perform optimizations at the Assembly level, based on the actual behavior of the code. Mainly, this means minimizing unnecessary jumps, which add significant overhead to the application. Furthermore, this process doesn't stop after the application's "warmup phase" and can happen multiple times to optimize the Assembly even further.
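If you want to watch this happen, HotSpot's -XX:+PrintCompilation flag logs every method as the JIT compiles it. Here's a minimal sketch (the class and method names are placeholders for this example); run it and you'll typically see the hot method appear in the log once the loop has warmed it up, sometimes more than once as it gets recompiled at higher optimization levels:

// Run with: java -XX:+PrintCompilation WarmupDemo
public class WarmupDemo {

    // a deliberately simple method that becomes "hot" after enough calls
    private static long square(long x) {
        return x * x;
    }

    public static void main(String[] args) {
        long sum = 0;

        for (int i = 0; i < 1_000_000; i++) {
            sum += square(i); // repeated calls make the JIT compile and optimize this method
        }

        System.out.println(sum); // use the result so the loop isn't optimized away
    }
}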
At Takipi, where we’re building a Java agent that monitors servers in production, we take overhead very seriously. Every little bit of code is optimized and along the way we’ve had the chance to use and learn about some cool JIT features.
Here are 5 of the more useful examples:
1. Null Check Elimination
In many cases, adding certain conditional checks to the code makes a lot of sense during development, like the infamous check for null. It's no surprise that nulls get special treatment, since they're the cause of the most common exception in production.
Nonetheless, explicit null checks can be eliminated from optimized Assembly code in most cases, and the same holds true for any kind of condition that rarely happens. In the rare case that the condition does occur, the JIT uses a mechanism called an "uncommon trap" that lets it know it needs to go back and fix the optimization, pulling back in the set of instructions that has to be carried out in that case, without surfacing an actual NullPointerException.
The reason these checks are a cause for concern for optimized code is that they’re likely to introduce jumps into the assembly code, slowing down its execution.
Let’s take a look at a simple example:
private static void runSomeAlgorithm(Graph graph) {
    if (graph == null) {
        return;
    }

    // do something with graph
}
If the JIT sees that the method is never called with a null graph, the compiled version will look as if it reflects code that was written without the null check:
private static void runSomeAlgorithm(Graph graph) {
    // do something with graph
}
Bottom line: We don't need to do anything special to enjoy this optimization, and when a rare condition does kick in, the JVM will know to "deoptimize" and get the desired result.
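If you'd like to see the deoptimization happen, one way (a rough sketch with made-up class and method names, not code from this post) is to warm a method up with non-null arguments and then pass it null while running with -XX:+PrintCompilation. The previously compiled version is typically discarded, showing up in the log as "made not entrant", before the method gets recompiled:

public class NullCheckDemo {

    static class Graph {
        int nodes;
    }

    private static int runSomeAlgorithm(Graph graph) {
        if (graph == null) {
            return 0; // rarely taken - the JIT may compile this branch as an uncommon trap
        }

        return graph.nodes * 2;
    }

    public static void main(String[] args) {
        Graph graph = new Graph();

        for (int i = 0; i < 1_000_000; i++) {
            runSomeAlgorithm(graph); // warmup: graph is never null here
        }

        runSomeAlgorithm(null); // the rare case - triggers the uncommon trap and a deoptimization
    }
}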
2. Branch Prediction
Similarly to null check elimination, there's another JIT optimization technique that helps the compiler decide whether certain lines of code are "hotter" than others and happen more often.
We already know that if a certain condition rarely or never holds, it’s likely that the “uncommon trap” mechanism will kick in and eliminate it from the compiled Assembly.
If there's a different balance, where both branches of an IF condition hold but one happens more often than the other, the JIT compiler can reorder them according to the more common one and significantly reduce the number of Assembly jumps.
Here’s how it works in practice:
private static int isOpt(int x, int y) {
    int veryHardCalculation = 0;

    if (x >= y) {
        veryHardCalculation = x * 1000 + y;
    } else {
        veryHardCalculation = y * 1000 + x;
    }

    return veryHardCalculation;
}
Now, assuming that most of the time x < y, the condition will be flipped to statistically reduce the number of Assembly jumps:
private static int isOpt(int x, int y) {
    int veryHardCalculation = 0;

    if (x < y) {
        // this would not require a jump
        veryHardCalculation = y * 1000 + x;
        return veryHardCalculation;
    } else {
        veryHardCalculation = x * 1000 + y;
        return veryHardCalculation;
    }
}
In case you're not convinced, we've actually run it through a little test and pulled out the optimized Assembly code:
0x00007fd715062d0c: cmp %edx,%esi
0x00007fd715062d0e: jge 0x00007fd715062d24  ;*if_icmplt
                                            ; - Opt::isOpt@4 (line 117)
As you can see, the jump instruction is jge (jump if greater or equal), corresponding to the else branch of the condition.
Bottom line: Another out-of-the-box JIT optimization that we get to enjoy. We've also covered branch prediction in a recent post about some of the most interesting Stack Overflow Java answers.
In the example shown there, branch prediction explains why running operations on a sorted array is much faster: the sorted data creates the conditions that help branch prediction kick in.
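To get a feel for the effect yourself, here's a rough sketch along the lines of that example (not the exact code, and a naive timing loop rather than a proper benchmark): it sums only the large elements of an array before and after sorting it. Once the data is sorted, the branch becomes highly predictable and the loop usually runs noticeably faster:

import java.util.Arrays;
import java.util.Random;

public class BranchPredictionDemo {

    // each iteration contains a data-dependent branch
    private static long sumLargeElements(int[] data) {
        long sum = 0;

        for (int value : data) {
            if (value >= 128) { // predictable when the data is sorted, essentially random otherwise
                sum += value;
            }
        }

        return sum;
    }

    public static void main(String[] args) {
        int[] data = new Random(42).ints(1_000_000, 0, 256).toArray();

        long start = System.nanoTime();
        long unsortedSum = sumLargeElements(data);
        System.out.println("unsorted: " + (System.nanoTime() - start) / 1_000_000 + " ms (sum=" + unsortedSum + ")");

        Arrays.sort(data); // sorting makes the branch outcome highly predictable

        start = System.nanoTime();
        long sortedSum = sumLargeElements(data);
        System.out.println("sorted:   " + (System.nanoTime() - start) / 1_000_000 + " ms (sum=" + sortedSum + ")");
    }
}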
3. Loop Unrolling
As you might have noticed, the JIT compiler constantly tries to eliminate Assembly jumps from the compiled code.
This is why loops smell like trouble.
Each iteration is actually an Assembly jump back to the beginning of the instruction set. With loop unrolling, the JIT compiler opens up the loop and just repeats the corresponding Assembly instructions one after another.
Theoretically, javac could do something similar when translating Java to bytecode, but the JIT compiler has better knowledge of the actual CPU that the code will be running on and knows how to fine tune the code in a much more efficient way.
For example, let’s take a look at a method that multiplies a matrix by a vector:
private static double[] loopUnrolling(double[][] matrix1, double[] vector1) {
    double[] result = new double[vector1.length];

    for (int i = 0; i < matrix1.length; i++) {
        for (int j = 0; j < vector1.length; j++) {
            result[i] += matrix1[i][j] * vector1[j];
        }
    }

    return result;
}
The unrolled version would look like this:
private static double[] loopUnrolling2(double[][] matrix1, double[] vector1) {
    double[] result = new double[vector1.length];

    for (int i = 0; i < matrix1.length; i++) {
        result[i] += matrix1[i][0] * vector1[0];
        result[i] += matrix1[i][1] * vector1[1];
        result[i] += matrix1[i][2] * vector1[2];
        // and maybe it will expand even further - e.g. 4 iterations, thus
        // adding code to fix the indexing,
        // which we would waste more time doing correctly and efficiently
    }

    return result;
}
Repeating the same action over and over again without the jump overhead:
....
0x00007fd715060743: vmovsd 0x10(%r8,%rcx,8),%xmm0   ;*daload
                                                    ; - Opt::loopUnrolling@26 (line 179)
0x00007fd71506074a: vmovsd 0x10(%rbp),%xmm1         ;*daload
                                                    ; - Opt::loopUnrolling@36 (line 179)
0x00007fd71506074f: vmulsd 0x10(%r12,%r9,8),%xmm1,%xmm1
0x00007fd715060756: vaddsd %xmm0,%xmm1,%xmm0        ;*dadd
                                                    ; - Opt::loopUnrolling@38 (line 179)
0x00007fd71506075a: vmovsd %xmm0,0x10(%r8,%rcx,8)   ;*dastore
                                                    ; - Opt::loopUnrolling@39 (line 179)
....
Bottom line: The JIT compiler does a great job of unrolling loops, as long as you keep their contents simple and free of unnecessary complications.
It's also not recommended to try to optimize this process and unroll loops yourself; the outcome will probably cause more problems than it solves.
4. Inlining Methods
Next up, another jump-killing optimization. A huge source of Assembly jumps is, in fact, method calls. When possible, the JIT compiler will try to inline them, transferring the method's whole content to the calling method and eliminating the jump there and back, the need to pass arguments, and the returning of a value.
It's also possible to fine-tune the way the JIT compiler inlines methods with 2 JVM arguments (a short usage example follows the list):
- -XX:MaxInlineSize – The maximum bytecode size of a method that can be inlined (done for any method, even if it isn't executed frequently). The default is around 35 bytes.
- -XX:FreqInlineSize – The maximum size of the bytecode of a method that is considered a hot spot (executed frequently), that should be inlined. Default varies between platforms.
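Here's roughly how you might set those limits and inspect the JIT's actual decisions. The -XX:+PrintInlining flag is a diagnostic option (so it needs -XX:+UnlockDiagnosticVMOptions), and the class name Opt and the specific values below are only examples:

java -XX:MaxInlineSize=50 -XX:FreqInlineSize=400 \
     -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining Opt

The output lists each call site along with the inlining decision, and when a call isn't inlined, the reason for it (for example, that the callee is too large).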
For example, let's take a look at a method that calculates the coordinates of a simple line:
private static void calcLine(int a, int b, int from, int to) {
    Line l = new Line(a, b);

    for (int x = from; x <= to; x++) {
        int y = l.getY(x);
        System.err.println("(" + x + ", " + y + ")");
    }
}

static class Line {
    public final int a;
    public final int b;

    public Line(int a, int b) {
        this.a = a;
        this.b = b;
    }

    // Inlining
    public int getY(int x) {
        return (a * x + b);
    }
}
The optimized inlined version would eliminate the jump, the passing of the arguments l and x, and the returning of y:
private static void calcLine(int a, int b, int from, int to) {
    Line l = new Line(a, b);

    for (int x = from; x <= to; x++) {
        int y = (l.a * x + l.b);
        System.err.println("(" + x + ", " + y + ")");
    }
}
Bottom line: Inlining is a super useful optimization but it will only kick in if you keep your methods as simple as possible with a minimal number of lines. Complex methods are less likely to get inlined, so this is one point where you can help the JIT compiler do its job.
5. Thread fields and Thread Local Storage (TLS)
As it turns out, thread fields are much faster to access than regular variables. A reference to the current Thread object is kept in a CPU register, making its fields a very efficient storage space.
With thread local storage, you’re able to create variables stored on the Thread object.
Here’s how you can access the Thread Local Storage, with a simple example of a request counter:
private static final ThreadLocal<Integer> counter = new ThreadLocal<Integer>(); // per-thread request counter

private static void handleRequest() {
    if (counter.get() == null) {
        counter.set(0);
    }

    counter.set(counter.get() + 1);
}
In the corresponding Assembly code, we can see that the current thread is read straight from a register (%r15), allowing for faster data access than, say, a static class variable:
0x00007fd71508b1ec: mov 0x1b0(%r15),%r10    ;*invokestatic currentThread
                                            ; - java.lang.ThreadLocal::get@0 (line 143)
                                            ; - Opt::handleRequest@3 (line 70)
0x00007fd71508b1f3: mov 0x50(%r10),%r11d    ;*getfield threadLocals
                                            ; - java.lang.ThreadLocal::getMap@1 (line 213)
                                            ; - java.lang.ThreadLocal::get@6 (line 144)
                                            ; - Opt::handleRequest@3 (line 70)
Bottom line: Certain performance-sensitive pieces of data might be better stored in Thread Local Storage for faster access and retrieval.
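As a side note, on Java 8 and later the null check in the example above can be dropped by giving the thread-local an initial value with the standard ThreadLocal.withInitial factory method:

private static final ThreadLocal<Integer> counter = ThreadLocal.withInitial(() -> 0);

private static void handleRequest() {
    counter.set(counter.get() + 1); // the initial value of 0 is supplied automatically
}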
Final Thoughts
The JVM's JIT compiler is one of the most fascinating mechanisms on the Java platform. It optimizes your code for performance, without giving away its readability. Not only that, beyond "static" optimization methods like inlining, it also makes decisions based on the way the code performs in practice.
We hope you've enjoyed learning about some of the JVM's most useful optimization techniques, and we'd be happy to hear your thoughts in the comments section below.