Printing Generated Assembly Code From The Hotspot JIT Compiler
Sometimes when profiling a Java application it is necessary to understand the assembly code generated by the Hotspot JIT compiler. This can be useful in determining what optimisation decisions have been made and how our code changes affect the generated assembly code. It is also useful at times, when debugging a concurrent algorithm, to know what instructions are emitted so we can confirm that the visibility rules have been applied as expected. I have found quite a few bugs in various JVMs this way.
This blog illustrates how to install a Disassembler Plugin and provides command line options for targeting a particular method.
Installation
Previously, it was necessary to obtain a debug build of the Oracle/Sun JVM to print the assembly code generated by the Hotspot JIT. Since Java 7, the generated assembly code can be printed from a standard Oracle Hotspot JVM if a Disassembler Plugin is installed. To install the plugin on 64-bit Linux, follow the steps below:
- Download the appropriate binary, or build from source, from https://kenai.com/projects/base-hsdis/downloads
- On Linux rename linux-hsdis-amd64.so to libhsdis-amd64.so
- Copy the shared library to $JAVA_HOME/jre/lib/amd64/server
You now have the plugin installed!
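A quick way to check the plugin has been picked up, assuming the JVM behaves as the build used here does, is to run any trivial command with assembly printing enabled. If the shared library cannot be loaded, Hotspot should print a warning that PrintAssembly has been disabled rather than printing any assembly code.

java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly -version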
Test Program
To test the plugin we need some code that is both interesting to a programmer and runs hot enough to be optimised by the JIT. Some details of when the JIT will optimise can be found here. The code below can be used to measure the average latency between two threads by reading and writing volatile fields. These volatile fields are interesting because they require associated hardware fences to honour the Java Memory Model.
import static java.lang.System.out;

public class InterThreadLatency
{
    private static final int REPETITIONS = 100 * 1000 * 1000;

    private static volatile int ping = -1;
    private static volatile int pong = -1;

    public static void main(final String[] args) throws Exception
    {
        for (int i = 0; i < 5; i++)
        {
            final long duration = runTest();

            out.printf("%d - %dns avg latency - ping=%d pong=%d\n",
                       i, duration / (REPETITIONS * 2), ping, pong);
        }
    }

    private static long runTest() throws InterruptedException
    {
        final Thread pongThread = new Thread(new PongRunner());
        final Thread pingThread = new Thread(new PingRunner());

        pongThread.start();
        pingThread.start();

        final long start = System.nanoTime();

        pongThread.join();

        return System.nanoTime() - start;
    }

    public static class PingRunner implements Runnable
    {
        public void run()
        {
            for (int i = 0; i < REPETITIONS; i++)
            {
                ping = i;

                while (i != pong)
                {
                    // busy spin
                }
            }
        }
    }

    public static class PongRunner implements Runnable
    {
        public void run()
        {
            for (int i = 0; i < REPETITIONS; i++)
            {
                while (i != ping)
                {
                    // busy spin
                }

                pong = i;
            }
        }
    }
}
Printing Assembly Code
It is possible to print all generated assembly code with the following command:
java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly InterThreadLatency
However, this can put you in the situation of not being able to see the forest for the trees. It is generally much more useful to target a particular method. For this test, the run() method will be optimised and generated twice by Hotspot: once as the OSR (on-stack replacement) version, and then again as the standard JIT version. The standard JIT version follows.
java -XX:+UnlockDiagnosticVMOptions '-XX:CompileCommand=print,*PongRunner.run' InterThreadLatency

Compiled method (c2)   10531    5   InterThreadLatency$PongRunner::run (30 bytes)
 total in heap  [0x00007fed81060850,0x00007fed81060b30] = 736
 relocation     [0x00007fed81060970,0x00007fed81060980] = 16
 main code      [0x00007fed81060980,0x00007fed81060a00] = 128
 stub code      [0x00007fed81060a00,0x00007fed81060a18] = 24
 oops           [0x00007fed81060a18,0x00007fed81060a30] = 24
 scopes data    [0x00007fed81060a30,0x00007fed81060a78] = 72
 scopes pcs     [0x00007fed81060a78,0x00007fed81060b28] = 176
 dependencies   [0x00007fed81060b28,0x00007fed81060b30] = 8
Decoding compiled method 0x00007fed81060850:
Code:
[Entry Point]
[Constants]
  # {method} 'run' '()V' in 'InterThreadLatency$PongRunner'
  #           [sp+0x20]  (sp of caller)
  0x00007fed81060980: mov    0x8(%rsi),%r10d
  0x00007fed81060984: shl    $0x3,%r10
  0x00007fed81060988: cmp    %r10,%rax
  0x00007fed8106098b: jne    0x00007fed81037a60  ;   {runtime_call}
  0x00007fed81060991: xchg   %ax,%ax
  0x00007fed81060994: nopl   0x0(%rax,%rax,1)
  0x00007fed8106099c: xchg   %ax,%ax
[Verified Entry Point]
  0x00007fed810609a0: sub    $0x18,%rsp
  0x00007fed810609a7: mov    %rbp,0x10(%rsp)     ;*synchronization entry
                                                 ; - InterThreadLatency$PongRunner::run@-1 (line 58)
  0x00007fed810609ac: xor    %r11d,%r11d
  0x00007fed810609af: mov    $0x7ad0fcbf0,%r10   ;   {oop(a 'java/lang/Class' = 'InterThreadLatency')}
  0x00007fed810609b9: jmp    0x00007fed810609d0
  0x00007fed810609bb: nopl   0x0(%rax,%rax,1)    ;   OopMap{r10=Oop off=64}
                                                 ;*goto
                                                 ; - InterThreadLatency$PongRunner::run@15 (line 60)
  0x00007fed810609c0: test   %eax,0xaa1663a(%rip)  # 0x00007fed8ba77000
                                                 ;*goto
                                                 ; - InterThreadLatency$PongRunner::run@15 (line 60)
                                                 ;   {poll}
  0x00007fed810609c6: nopw   0x0(%rax,%rax,1)    ;*iload_1
                                                 ; - InterThreadLatency$PongRunner::run@8 (line 60)
  0x00007fed810609d0: mov    0x74(%r10),%r9d     ;*getstatic ping
                                                 ; - InterThreadLatency::access$000@0 (line 3)
                                                 ; - InterThreadLatency$PongRunner::run@9 (line 60)
  0x00007fed810609d4: cmp    %r9d,%r11d
  0x00007fed810609d7: jne    0x00007fed810609c0
  0x00007fed810609d9: mov    %r11d,0x78(%r10)
  0x00007fed810609dd: lock addl $0x0,(%rsp)      ;*putstatic pong
                                                 ; - InterThreadLatency::access$102@2 (line 3)
                                                 ; - InterThreadLatency$PongRunner::run@19 (line 65)
  0x00007fed810609e2: inc    %r11d               ;*iinc
                                                 ; - InterThreadLatency$PongRunner::run@23 (line 58)
  0x00007fed810609e5: cmp    $0x5f5e100,%r11d
  0x00007fed810609ec: jl     0x00007fed810609d0  ;*if_icmpeq
                                                 ; - InterThreadLatency$PongRunner::run@12 (line 60)
  0x00007fed810609ee: add    $0x10,%rsp
  0x00007fed810609f2: pop    %rbp
  0x00007fed810609f3: test   %eax,0xaa16607(%rip)  # 0x00007fed8ba77000
                                                 ;   {poll_return}
  0x00007fed810609f9: retq                       ;*iload_1
                                                 ; - InterThreadLatency$PongRunner::run@8 (line 60)
  0x00007fed810609fa: hlt
  0x00007fed810609fb: hlt
  0x00007fed810609fc: hlt
  0x00007fed810609fd: hlt
  0x00007fed810609fe: hlt
  0x00007fed810609ff: hlt
[Exception Handler]
[Stub Code]
  0x00007fed81060a00: jmpq   0x00007fed8105eaa0  ;   {no_reloc}
[Deopt Handler Code]
  0x00007fed81060a05: callq  0x00007fed81060a0a
  0x00007fed81060a0a: subq   $0x5,(%rsp)
  0x00007fed81060a0f: jmpq   0x00007fed81038c00  ;   {runtime_call}
  0x00007fed81060a14: hlt
  0x00007fed81060a15: hlt
  0x00007fed81060a16: hlt
  0x00007fed81060a17: hlt
OopMapSet contains 1 OopMaps

#0
OopMap{r10=Oop off=64}
An Interesting Observation
The lines of assembly code for the putstatic pong bytecode above, that is the MOV of %r11d to 0x78(%r10) followed by the LOCK ADDL to the stack pointer, are very interesting. When a volatile field is written, under the Java Memory Model the write must be sequentially consistent, i.e. it must not appear to be reordered by optimisations that are normally applied, such as staging the write in the store buffer. This can be achieved by inserting the appropriate memory barriers. In the case above, Hotspot has chosen to enforce the ordering by issuing a MOV instruction (register to memory address, i.e. the write) followed by a LOCK ADD instruction (a no-op addition to the stack pointer) that acts as a fence with the required ordering semantics. This is less than ideal on an x86 processor. The same action could have been performed more efficiently and correctly with a single LOCK XCHG instruction for the write. This makes me wonder if there are some significant compromises in the JVM to make it portable across many architectures, rather than be the best it can on x86.
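To look at this write barrier in isolation, the same technique can be applied to a stripped-down class containing a single hot volatile store. The sketch below is not part of the test above; the class and method names are my own, and the comments describe what I would expect to see on the Hotspot build used here.

import static java.lang.System.out;

public class VolatileWrite
{
    private static final int REPETITIONS = 100 * 1000 * 1000;

    // The volatile field whose store we want to inspect in the generated assembly.
    private static volatile long value = 0;

    public static void main(final String[] args)
    {
        // Enough back branches to trigger JIT compilation of writeLoop().
        for (int i = 0; i < 5; i++)
        {
            writeLoop(REPETITIONS);
        }

        out.println(value);
    }

    private static void writeLoop(final int repetitions)
    {
        for (int i = 0; i < repetitions; i++)
        {
            // On the Hotspot build shown above, this volatile store should appear as a
            // MOV to the field address followed by a LOCK ADD of zero to the stack pointer.
            value = i;
        }
    }
}

It can then be targeted in the same way as the PongRunner example:

java -XX:+UnlockDiagnosticVMOptions '-XX:CompileCommand=print,*VolatileWrite.writeLoop' VolatileWrite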