Memory Access Patterns Are Important
- Temporal: Memory accessed recently will likely be required again soon.
- Spatial: Adjacent memory is likely to be required soon.
- Striding: Memory access is likely to follow a predictable pattern.
- Walk through memory in a linear fashion being completely predictable.
- Pseudo randomly walk round memory within a restricted area then move on. This restricted area is what is commonly known as an operating system page of memory.
- Pseudo randomly walk around a large area of the heap.
public class TestMemoryAccessPatterns { private static final int LONG_SIZE = 8; private static final int PAGE_SIZE = 2 * 1024 * 1024; private static final int ONE_GIG = 1024 * 1024 * 1024; private static final long TWO_GIG = 2L * ONE_GIG; private static final int ARRAY_SIZE = (int)(TWO_GIG / LONG_SIZE); private static final int WORDS_PER_PAGE = PAGE_SIZE / LONG_SIZE; private static final int ARRAY_MASK = ARRAY_SIZE - 1; private static final int PAGE_MASK = WORDS_PER_PAGE - 1; private static final int PRIME_INC = 514229; private static final long[] memory = new long[ARRAY_SIZE]; static { for (int i = 0; i < ARRAY_SIZE; i++) { memory[i] = 777; } } public enum StrideType { LINEAR_WALK { public int next(final int pageOffset, final int wordOffset, final int pos) { return (pos + 1) & ARRAY_MASK; } }, RANDOM_PAGE_WALK { public int next(final int pageOffset, final int wordOffset, final int pos) { return pageOffset + ((pos + PRIME_INC) & PAGE_MASK); } }, RANDOM_HEAP_WALK { public int next(final int pageOffset, final int wordOffset, final int pos) { return (pos + PRIME_INC) & ARRAY_MASK; } }; public abstract int next(int pageOffset, int wordOffset, int pos); } public static void main(final String[] args) { final StrideType strideType; switch (Integer.parseInt(args[0])) { case 1: strideType = StrideType.LINEAR_WALK; break; case 2: strideType = StrideType.RANDOM_PAGE_WALK; break; case 3: strideType = StrideType.RANDOM_HEAP_WALK; break; default: throw new IllegalArgumentException("Unknown StrideType"); } for (int i = 0; i < 5; i++) { perfTest(i, strideType); } } private static void perfTest(final int runNumber, final StrideType strideType) { final long start = System.nanoTime(); int pos = -1; long result = 0; for (int pageOffset = 0; pageOffset < ARRAY_SIZE; pageOffset += WORDS_PER_PAGE) { for (int wordOffset = pageOffset, limit = pageOffset + WORDS_PER_PAGE; wordOffset < limit; wordOffset++) { pos = strideType.next(pageOffset, wordOffset, pos); result += memory[pos]; } } final long duration = System.nanoTime() - start; final double nsOp = duration / (double)ARRAY_SIZE; if (208574349312L != result) { throw new IllegalStateException(); } System.out.format("%d - %.2fns %s\n", Integer.valueOf(runNumber), Double.valueOf(nsOp), strideType); } }
Results
Intel U4100 @ 1.3GHz, 4GB RAM DDR2 800MHz, Windows 7 64-bit, Java 1.7.0_05 =========================================== 0 - 2.38ns LINEAR_WALK 1 - 2.41ns LINEAR_WALK 2 - 2.35ns LINEAR_WALK 3 - 2.36ns LINEAR_WALK 4 - 2.39ns LINEAR_WALK 0 - 12.45ns RANDOM_PAGE_WALK 1 - 12.27ns RANDOM_PAGE_WALK 2 - 12.17ns RANDOM_PAGE_WALK 3 - 12.22ns RANDOM_PAGE_WALK 4 - 12.18ns RANDOM_PAGE_WALK 0 - 152.86ns RANDOM_HEAP_WALK 1 - 151.80ns RANDOM_HEAP_WALK 2 - 151.72ns RANDOM_HEAP_WALK 3 - 151.91ns RANDOM_HEAP_WALK 4 - 151.36ns RANDOM_HEAP_WALK Intel i7-860 @ 2.8GHz, 8GB RAM DDR3 1333MHz, Windows 7 64-bit, Java 1.7.0_05 ============================================= 0 - 1.06ns LINEAR_WALK 1 - 1.05ns LINEAR_WALK 2 - 0.98ns LINEAR_WALK 3 - 1.00ns LINEAR_WALK 4 - 1.00ns LINEAR_WALK 0 - 3.80ns RANDOM_PAGE_WALK 1 - 3.85ns RANDOM_PAGE_WALK 2 - 3.79ns RANDOM_PAGE_WALK 3 - 3.65ns RANDOM_PAGE_WALK 4 - 3.64ns RANDOM_PAGE_WALK 0 - 30.04ns RANDOM_HEAP_WALK 1 - 29.05ns RANDOM_HEAP_WALK 2 - 29.14ns RANDOM_HEAP_WALK 3 - 28.88ns RANDOM_HEAP_WALK 4 - 29.57ns RANDOM_HEAP_WALK Intel i7-2760QM @ 2.40GHz, 8GB RAM DDR3 1600MHz, Linux 3.4.6 kernel 64-bit, Java 1.7.0_05 ================================================= 0 - 0.91ns LINEAR_WALK 1 - 0.92ns LINEAR_WALK 2 - 0.88ns LINEAR_WALK 3 - 0.89ns LINEAR_WALK 4 - 0.89ns LINEAR_WALK 0 - 3.29ns RANDOM_PAGE_WALK 1 - 3.35ns RANDOM_PAGE_WALK 2 - 3.33ns RANDOM_PAGE_WALK 3 - 3.31ns RANDOM_PAGE_WALK 4 - 3.30ns RANDOM_PAGE_WALK 0 - 9.58ns RANDOM_HEAP_WALK 1 - 9.20ns RANDOM_HEAP_WALK 2 - 9.44ns RANDOM_HEAP_WALK 3 - 9.46ns RANDOM_HEAP_WALK 4 - 9.47ns RANDOM_HEAP_WALK
Analysis
perf stat -e L1-dcache-loads,L1-dcache-load-misses java -Xmx4g TestMemoryAccessPatterns $ Performance counter stats for 'java -Xmx4g TestMemoryAccessPatterns 1': 1,496,626,053 L1-dcache-loads 274,255,164 L1-dcache-misses # 18.32% of all L1-dcache hits Performance counter stats for 'java -Xmx4g TestMemoryAccessPatterns 2': 1,537,057,965 L1-dcache-loads 1,570,105,933 L1-dcache-misses # 102.15% of all L1-dcache hits Performance counter stats for 'java -Xmx4g TestMemoryAccessPatterns 3': 4,321,888,497 L1-dcache-loads 1,780,223,433 L1-dcache-misses # 41.19% of all L1-dcache hits likwid-perfctr -C 2 -g L2CACHE java -Xmx4g TestMemoryAccessPatterns $ java -Xmx4g TestMemoryAccessPatterns 1 +-----------------------+-------------+ | Event | core 2 | +-----------------------+-------------+ | INSTR_RETIRED_ANY | 5.94918e+09 | | CPU_CLK_UNHALTED_CORE | 5.15969e+09 | | L2_TRANS_ALL_REQUESTS | 1.07252e+09 | | L2_RQSTS_MISS | 3.25413e+08 | +-----------------------+-------------+ +-----------------+-----------+ | Metric | core 2 | +-----------------+-----------+ | Runtime [s] | 2.15481 | | CPI | 0.867293 | | L2 request rate | 0.18028 | | L2 miss rate | 0.0546988 | | L2 miss ratio | 0.303409 | +-----------------+-----------+ java -Xmx4g TestMemoryAccessPatterns 2 +-----------------------+-------------+ | Event | core 2 | +-----------------------+-------------+ | INSTR_RETIRED_ANY | 1.48772e+10 | | CPU_CLK_UNHALTED_CORE | 1.64712e+10 | | L2_TRANS_ALL_REQUESTS | 3.41061e+09 | | L2_RQSTS_MISS | 1.5547e+09 | +-----------------------+-------------+ +-----------------+----------+ | Metric | core 2 | +-----------------+----------+ | Runtime [s] | 6.87876 | | CPI | 1.10714 | | L2 request rate | 0.22925 | | L2 miss rate | 0.104502 | | L2 miss ratio | 0.455843 | +-----------------+----------+ java -Xmx4g TestMemoryAccessPatterns 3 +-----------------------+-------------+ | Event | core 2 | +-----------------------+-------------+ | INSTR_RETIRED_ANY | 6.49533e+09 | | CPU_CLK_UNHALTED_CORE | 4.18416e+10 | | L2_TRANS_ALL_REQUESTS | 4.67488e+09 | | L2_RQSTS_MISS | 1.43442e+09 | +-----------------------+-------------+ +-----------------+----------+ | Metric | core 2 | +-----------------+----------+ | Runtime [s] | 17.474 | | CPI | 6.4418 | | L2 request rate | 0.71973 | | L2 miss rate | 0.220838 | | L2 miss ratio | 0.306835 | +-----------------+----------+
perf stat -e dTLB-loads,dTLB-load-misses java -Xmx4g TestMemoryAccessPatterns $ Performance counter stats for 'java -Xmx4g TestMemoryAccessPatterns 1': 1,496,128,634 dTLB-loads 310,901 dTLB-misses # 0.02% of all dTLB cache hits Performance counter stats for 'java -Xmx4g TestMemoryAccessPatterns 2': 1,551,585,263 dTLB-loads 340,230 dTLB-misses # 0.02% of all dTLB cache hits Performance counter stats for 'java -Xmx4g TestMemoryAccessPatterns 3': 4,031,344,537 dTLB-loads 1,345,807,418 dTLB-misses # 33.38% of all dTLB cache hits
likwid-perfctr -C 2 -t intel -g LOAD_HIT_PRE_HW_PF:PMC0 java -Xmx4g TestMemoryAccessPatterns $ java -Xmx4g TestMemoryAccessPatterns 1 +--------------------+-------------+ | Event | core 2 | +--------------------+-------------+ | LOAD_HIT_PRE_HW_PF | 1.31613e+09 | +--------------------+-------------+ java -Xmx4g TestMemoryAccessPatterns 2 +--------------------+--------+ | Event | core 2 | +--------------------+--------+ | LOAD_HIT_PRE_HW_PF | 368930 | +--------------------+--------+ java -Xmx4g TestMemoryAccessPatterns 3 +--------------------+--------+ | Event | core 2 | +--------------------+--------+ | LOAD_HIT_PRE_HW_PF | 324373 | +--------------------+--------+
Research is advancing on algorithmic approaches that work in harmony with cache sub-systems. One area I find fascinating is Cache Oblivious Algorithms. The name is a bit misleading but there are some great concepts here for how to improve software performance and better execute in parallel. This article is a great illustration of the performance benefits that can be gained.
Conclusion
To achieve great performance it is important to have sympathy for the cache sub-systems. We have seen in this article what can be achieved by accessing memory in patterns which work with, rather than against, these caches. When designing algorithms and data structures, it has now much more important to consider cache-misses, probably even more so than counting steps in the algorithm. This is not what we were taught in algorithm theory when studying computer science. The last decade has seen some fundamental changes in technology. For me the two most significant are the rise of multi-core, and now big-memory systems with 64-bit address spaces.