Logging stop-the-world pauses in JVM
Different events can cause the JVM to pause all the application threads. Such pauses are called Stop-The-World (STW) pauses. The most common cause for an STW pause to be triggered is garbage collection (example in github) , but different JIT actions (example), biased lock revocation (example), certain JVMTI operations , and many more also require the application to be stopped.
The points at which the application threads may be safely stopped are called, surprise, safepoints. This term is also often used to refer to all the STW pauses.
It is more or less common that GC logs are enabled. However, this does not capture information on all the safepoints. To get it all, use these JVM options:
-XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime
If you are wondering about the naming explicitly referring to GC, don’t be alarmed – turning on these options logs all of the safepoints, not just garbage collection pauses. If you run a following example (source in github) with the flags specified above
public class FullGc { private static final Collection<Object> leak = new ArrayList<>(); private static volatile Object sink; public static void main(String[] args) { while (true) { try { leak.add(new byte[1024 * 1024]); sink = new byte[1024 * 1024]; } catch (OutOfMemoryError e) { leak.clear(); } } } }
you should see entries similar to the following in your standard output:
Application time: 0.3440086 seconds Total time for which application threads were stopped: 0.0620105 seconds Application time: 0.2100691 seconds Total time for which application threads were stopped: 0.0890223 seconds
Reading it is easy (especially comparing to GC logs) – from the above you can see that the application managed to do useful work for the first 344 milliseconds, paused all threads for 62 milliseconds, following with 210ms of useful work and another 89ms pause.
You can use these flags in combination with GC flags, for example when running the same program again with -XX:+PrintGCDetails added, the output would look similar to the following:
[Full GC (Ergonomics) [PSYoungGen: 1375253K->0K(1387008K)] [ParOldGen: 2796146K->2049K(1784832K)] 4171400K->2049K(3171840K), [Metaspace: 3134K->3134K(1056768K)], 0.0571841 secs] [Times: user=0.02 sys=0.04, real=0.06 secs] Total time for which application threads were stopped: 0.0572646 seconds, Stopping threads took: 0.0000088 seconds
Based on the above, we see that the application threads were forced to stop for 57 ms to do a garbage collection. Out of which 8 microseconds were spent waiting for all the application threads to arrive to a safepoint. But if we take another example (source in github) and run it with the same options, we may see the following output:
Total time for which application threads were stopped: 0.0001273 seconds, Stopping threads took: 0.0000196 seconds Total time for which application threads were stopped: 0.0000648 seconds, Stopping threads took: 0.0000174 seconds
From this, we cannot actually infer what caused the pause, because there was no garbage collecting activity. If you wish to understand the additional safepoints in more details, here is a couple of JVM arguments for you:
-XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1
Enabling them makes the JVM log additional information to the standard output, looking similar to the following:
5.141: RevokeBias [ 13 0 2 ] [ 0 0 0 0 0 ] 0 Total time for which application threads were stopped: 0.0000782 seconds, Stopping threads took: 0.0000269 seconds
The information about the safepoints is displayed according to the following sequence :
- The timestamp in milliseconds since the start of the JVM (5.141 in the example above)
- The name of the operation that triggered the pause (RevokeBias).
- If you see a “no vm operation”, it means that this is a so-called “guaranteed safepoint”. By default, the JVM triggers a safepoint each second to process all queued-up operations that are not urgent. The flag GuaranteedSafepointInterval can be used to change this behavior (set it to 0 to disable it altogether)
- The number of threads that were stopped at safepoint (13)
- The number of threads running at the beginning of the safepoint (0)
- The number of threads that were blocked when the vm operation began its execution (2)
- Timings for various stages of arriving at safepoint and executing the operation (all zeroes)
So now we see that the use of biased locks may result in a lot of STW pauses happening, even though they only take tens of microseconds. It is not uncommon to disable them right away in these massively parallel days.
All in all, you can really save yourself some trouble by having this data recorded at all times. To do so, specify these JVM options:
-XX:+LogVMOutput -XX:LogFile=vm.log
And all the JVM’s output will be stored to the vm.log file. Interpreting and acting based on these logs, however, is a very broad topic quite outside the scope of this article, but I may write a post or two about it in the future, so if interested, stay tuned – either via our Twitter subscription or through the RSS feed.
If you want some further reading right away, there’s a great blog post by Nitsan Wakart: and, of course, the OpenJDK source code.
Reference: | Logging stop-the-world pauses in JVM from our JCG partner Gleb Smirnov at the Plumbr Blog blog. |