Local variables inside a loop and performance
Overview
Sometimes a question comes up about how much work allocating a new local variable takes. My feeling has always been that the code is optimised to the point where this cost is static, i.e. paid once, not each time the code is run.
Recently Ishwor Gurung suggested considering moving some local variables outside a loop. I suspected it wouldn’t make a difference but I had never tested to see if this was the case.
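To make the comparison concrete, here is a minimal sketch of the two declaration styles side by side. The class and method names (LocalScopeDemo, insideLoop, outsideLoop) are hypothetical and only there for illustration. In JVM bytecode, local variable slots belong to the method's stack frame and their number is fixed when the class is compiled, so neither form should allocate anything per iteration; disassembling the class with javap -c is one way to check this on your own JDK.

```java
import java.util.Arrays;

final class LocalScopeDemo {
    // Locals declared inside the loop body.
    static int[] insideLoop(int runs) {
        int[] counters = new int[144];
        for (int i = 0; i < runs; i++) {
            int x = i % 12;
            int y = i / 12 % 12;
            int times = x * y;
            counters[times]++;
        }
        return counters;
    }

    // The same locals declared once, before the loop.
    static int[] outsideLoop(int runs) {
        int[] counters = new int[144];
        int x, y, times;
        for (int i = 0; i < runs; i++) {
            x = i % 12;
            y = i / 12 % 12;
            times = x * y;
            counters[times]++;
        }
        return counters;
    }

    public static void main(String... args) {
        int runs = 200 * 1000;
        // Both variants do the same work, so the counts must match.
        System.out.println(Arrays.equals(insideLoop(runs), outsideLoop(runs))); // prints true
    }
}
```

This only shows that the two forms compute the same thing; whether they also run at the same speed is what the timing test measures.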
The test
This is the test I ran:
    public static void main(String... args) {
        for (int i = 0; i < 10; i++) {
            testInsideLoop();
            testOutsideLoop();
        }
    }

    private static void testInsideLoop() {
        long start = System.nanoTime();
        int[] counters = new int[144];
        int runs = 200 * 1000;
        for (int i = 0; i < runs; i++) {
            int x = i % 12;
            int y = i / 12 % 12;
            int times = x * y;
            counters[times]++;
        }
        long time = System.nanoTime() - start;
        System.out.printf("Inside: Average loop time %.1f ns%n", (double) time / runs);
    }

    private static void testOutsideLoop() {
        long start = System.nanoTime();
        int[] counters = new int[144];
        int runs = 200 * 1000, x, y, times;
        for (int i = 0; i < runs; i++) {
            x = i % 12;
            y = i / 12 % 12;
            times = x * y;
            counters[times]++;
        }
        long time = System.nanoTime() - start;
        System.out.printf("Outside: Average loop time %.1f ns%n", (double) time / runs);
    }
and the output ended with:
Inside: Average loop time 3.6 ns
Outside: Average loop time 3.6 ns
Inside: Average loop time 3.6 ns
Outside: Average loop time 3.6 ns
Increasing the test to 100 million iterations made little difference to the results:
Inside: Average loop time 3.8 ns
Outside: Average loop time 3.8 ns
Inside: Average loop time 3.8 ns
Outside: Average loop time 3.8 ns
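The timing code in the two test methods is identical, so it could be factored out. The following is a sketch only: LoopTimer and averageNanos are hypothetical names, and the measurement is the same System.nanoTime() bracketing used in the test, repeated so that later runs are timed after the JIT has had a chance to compile the loop.

```java
import java.util.function.IntConsumer;

final class LoopTimer {
    // Times one call of the body and returns the average nanoseconds per iteration.
    static double averageNanos(IntConsumer body, int runs) {
        long start = System.nanoTime();
        body.accept(runs);
        long time = System.nanoTime() - start;
        return (double) time / runs;
    }

    public static void main(String... args) {
        // The loop under test, with its locals declared inside the loop.
        IntConsumer inside = runs -> {
            int[] counters = new int[144];
            for (int i = 0; i < runs; i++) {
                int x = i % 12;
                int y = i / 12 % 12;
                int times = x * y;
                counters[times]++;
            }
        };
        // Repeat the measurement; early runs include interpretation and JIT compilation.
        for (int i = 0; i < 10; i++) {
            System.out.printf("Inside: Average loop time %.1f ns%n",
                    averageNanos(inside, 200 * 1000));
        }
    }
}
```

Going through a lambda adds an indirection the original test does not have, so the numbers from this sketch are not directly comparable with the ones quoted here.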
Replacing the modulus and multiplication with >>, & and +, so that the loop body becomes

    int x = i & 15;
    int y = (i >> 4) & 15;
    int times = x + y;

prints
Inside: Average loop time 1.2 ns
Outside: Average loop time 1.2 ns
Inside: Average loop time 1.2 ns
Outside: Average loop time 1.2 ns
While modulus is relatively expensive, the results are printed to a resolution of 0.1 ns, which is less than 1/3 of a clock cycle. Any difference between the two tests larger than that would show up in the output.
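The relation between nanoseconds and clock cycles can be worked through in a few lines. This is a sketch under an assumption: the clock speed of the test machine is not stated, so the 3.0 GHz figure and the names CycleMath and nanosToCycles are hypothetical.

```java
final class CycleMath {
    // ASSUMPTION: a 3.0 GHz clock; the CPU speed of the test machine is not given.
    static final double ASSUMED_GHZ = 3.0;

    // 1 GHz is one cycle per nanosecond, so cycles = nanoseconds * GHz.
    static double nanosToCycles(double nanos, double ghz) {
        return nanos * ghz;
    }

    public static void main(String... args) {
        // The print resolution and the two measured loop times from the test.
        double[] timesNs = {0.1, 1.2, 3.6};
        for (double ns : timesNs) {
            System.out.printf("%.1f ns = %.1f cycles at %.1f GHz%n",
                    ns, nanosToCycles(ns, ASSUMED_GHZ), ASSUMED_GHZ);
        }
    }
}
```

At the assumed 3 GHz, the 0.1 ns resolution is about 0.3 of a cycle, so even a one-cycle per-iteration cost (roughly 0.33 ns) would change the printed averages.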
Using Caliper
As @maaartinus comments, Caliper is a micro-benchmarking library, so I was interested in how much slower it might be than writing the timing code by hand.
    public static void main(String... args) {
        Runner.main(LoopBenchmark.class, args);
    }

    public static class LoopBenchmark extends SimpleBenchmark {
        public void timeInsideLoop(int reps) {
            int[] counters = new int[144];
            for (int i = 0; i < reps; i++) {
                int x = i % 12;
                int y = i / 12 % 12;
                int times = x * y;
                counters[times]++;
            }
        }

        public void timeOutsideLoop(int reps) {
            int[] counters = new int[144];
            int x, y, times;
            for (int i = 0; i < reps; i++) {
                x = i % 12;
                y = i / 12 % 12;
                times = x * y;
                counters[times]++;
            }
        }
    }
The first thing to note is that the code is shorter, as it doesn’t include the timing and printing boilerplate. Running this on the same machine as the first test, I get:
     0% Scenario{vm=java, trial=0, benchmark=InsideLoop} 4.23 ns; σ=0.01 ns @ 3 trials
    50% Scenario{vm=java, trial=0, benchmark=OutsideLoop} 4.23 ns; σ=0.01 ns @ 3 trials

      benchmark   ns linear runtime
     InsideLoop 4.23 ==============================
    OutsideLoop 4.23 =============================

    vm: java
    trial: 0
Replacing the modulus with shift and AND gives:
     0% Scenario{vm=java, trial=0, benchmark=InsideLoop} 1.27 ns; σ=0.01 ns @ 3 trials
    50% Scenario{vm=java, trial=0, benchmark=OutsideLoop} 1.27 ns; σ=0.00 ns @ 3 trials

      benchmark   ns linear runtime
     InsideLoop 1.27 =============================
    OutsideLoop 1.27 ==============================

    vm: java
    trial: 0
This is consistent with the first result: only about 0.4 – 0.6 ns (about two clock cycles) slower for the modulus test, and next to no difference for the shift, and, plus test. This may be due to the way Caliper samples the data, but it doesn’t change the outcome.
It is worth noting that when running real programs you typically get longer times than in a micro-benchmark, as the program will be doing more things, so caching and branch prediction are not as ideal. A small overestimate of the time taken may be closer to what you can expect to see in a real program.
Conclusion
This indicates to me that in this case it made no difference. I still suspect the cost of allocating local variables is paid once, when the code is compiled by the JIT, and there is no per-iteration cost to consider.
Reference: Can synchronization be optimised away? from our JCG partner Peter Lawrey at the Vanilla Java blog.
The declaration just tells the compiler that the variable exists and what scope it has. The “executable code” is only on the right side of the equals sign. I would be **very** surprised if the declaration location made a difference at all. There could be a difference if it changed the number of allocations and deallocations. Since you are using only primitives and doing the same work (i.e. the right side of the equals sign) in both tests, the results are always identical. (Of course, all of this depends on the language you are using: while in C…
I bet there will be much more difference with memory allocation inside / outside the loop. For example, use an object, a kind of “Point”, to store your data, i.e.:

Outside:

    {
        int[] counters = new int[144];
        Point p = new Point();
        int times;
        for (int i = 0; i < reps; i++) {
            p.x = i % 12;
            p.y = i / 12 % 12;
            times = p.x * p.y;
            counters[times]++;
        }
    }

Inside:

    {
        int[] counters = new int[144];
        int times;
        for (int i = 0; i < reps; i++) {
            Point p = new Point();
            p.x =…
Don’t prematurely optimise at the expense of correct scoping. If you don’t need these variables outside the loop then they should be declared inside it. Leave optimisation to the compiler / JIT until you have identified a bottleneck through profiling!