Memory Access Patterns Are Important
- Temporal: Memory accessed recently will likely be required again soon.
- Spatial: Adjacent memory is likely to be required soon.
- Striding: Memory access is likely to follow a predictable pattern.
- Walk through memory in a linear fashion being completely predictable.
- Pseudo randomly walk round memory within a restricted area then move on. This restricted area is what is commonly known as an operating system page of memory.
- Pseudo randomly walk around a large area of the heap.
001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050 051 052 053 054 055 056 057 058 059 060 061 062 063 064 065 066 067 068 069 070 071 072 073 074 075 076 077 078 079 080 081 082 083 084 085 086 087 088 089 090 091 092 093 094 095 096 097 098 099 100 101 102 103 104 105 106 107 108 109 110 111 112 | public class TestMemoryAccessPatterns { private static final int LONG_SIZE = 8 ; private static final int PAGE_SIZE = 2 * 1024 * 1024 ; private static final int ONE_GIG = 1024 * 1024 * 1024 ; private static final long TWO_GIG = 2L * ONE_GIG; private static final int ARRAY_SIZE = ( int )(TWO_GIG / LONG_SIZE); private static final int WORDS_PER_PAGE = PAGE_SIZE / LONG_SIZE; private static final int ARRAY_MASK = ARRAY_SIZE - 1 ; private static final int PAGE_MASK = WORDS_PER_PAGE - 1 ; private static final int PRIME_INC = 514229 ; private static final long [] memory = new long [ARRAY_SIZE]; static { for ( int i = 0 ; i < ARRAY_SIZE; i++) { memory[i] = 777 ; } } public enum StrideType { LINEAR_WALK { public int next( final int pageOffset, final int wordOffset, final int pos) { return (pos + 1 ) & ARRAY_MASK; } }, RANDOM_PAGE_WALK { public int next( final int pageOffset, final int wordOffset, final int pos) { return pageOffset + ((pos + PRIME_INC) & PAGE_MASK); } }, RANDOM_HEAP_WALK { public int next( final int pageOffset, final int wordOffset, final int pos) { return (pos + PRIME_INC) & ARRAY_MASK; } }; public abstract int next( int pageOffset, int wordOffset, int pos); } public static void main( final String[] args) { final StrideType strideType; switch (Integer.parseInt(args[ 0 ])) { case 1 : strideType = StrideType.LINEAR_WALK; break ; case 2 : strideType = StrideType.RANDOM_PAGE_WALK; break ; case 3 : strideType = StrideType.RANDOM_HEAP_WALK; break ; default : throw new IllegalArgumentException( "Unknown StrideType" ); } for ( int i = 0 ; i < 5 ; i++) { perfTest(i, strideType); } } private static void perfTest( final int runNumber, final StrideType strideType) { final long start = System.nanoTime(); int pos = - 1 ; long result = 0 ; for ( int pageOffset = 0 ; pageOffset < ARRAY_SIZE; pageOffset += WORDS_PER_PAGE) { for ( int wordOffset = pageOffset, limit = pageOffset + WORDS_PER_PAGE; wordOffset < limit; wordOffset++) { pos = strideType.next(pageOffset, wordOffset, pos); result += memory[pos]; } } final long duration = System.nanoTime() - start; final double nsOp = duration / ( double )ARRAY_SIZE; if (208574349312L != result) { throw new IllegalStateException(); } System.out.format( "%d - %.2fns %s\n" , Integer.valueOf(runNumber), Double.valueOf(nsOp), strideType); } } |
Results
01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 | Intel U4100 @ 1.3GHz, 4GB RAM DDR2 800MHz, Windows 7 64-bit, Java 1.7.0_05 =========================================== 0 - 2.38ns LINEAR_WALK 1 - 2.41ns LINEAR_WALK 2 - 2.35ns LINEAR_WALK 3 - 2.36ns LINEAR_WALK 4 - 2.39ns LINEAR_WALK 0 - 12.45ns RANDOM_PAGE_WALK 1 - 12.27ns RANDOM_PAGE_WALK 2 - 12.17ns RANDOM_PAGE_WALK 3 - 12.22ns RANDOM_PAGE_WALK 4 - 12.18ns RANDOM_PAGE_WALK 0 - 152.86ns RANDOM_HEAP_WALK 1 - 151.80ns RANDOM_HEAP_WALK 2 - 151.72ns RANDOM_HEAP_WALK 3 - 151.91ns RANDOM_HEAP_WALK 4 - 151.36ns RANDOM_HEAP_WALK Intel i7-860 @ 2.8GHz, 8GB RAM DDR3 1333MHz, Windows 7 64-bit, Java 1.7.0_05 ============================================= 0 - 1.06ns LINEAR_WALK 1 - 1.05ns LINEAR_WALK 2 - 0.98ns LINEAR_WALK 3 - 1.00ns LINEAR_WALK 4 - 1.00ns LINEAR_WALK 0 - 3.80ns RANDOM_PAGE_WALK 1 - 3.85ns RANDOM_PAGE_WALK 2 - 3.79ns RANDOM_PAGE_WALK 3 - 3.65ns RANDOM_PAGE_WALK 4 - 3.64ns RANDOM_PAGE_WALK 0 - 30.04ns RANDOM_HEAP_WALK 1 - 29.05ns RANDOM_HEAP_WALK 2 - 29.14ns RANDOM_HEAP_WALK 3 - 28.88ns RANDOM_HEAP_WALK 4 - 29.57ns RANDOM_HEAP_WALK Intel i7-2760QM @ 2.40GHz, 8GB RAM DDR3 1600MHz, Linux 3.4.6 kernel 64-bit, Java 1.7.0_05 ================================================= 0 - 0.91ns LINEAR_WALK 1 - 0.92ns LINEAR_WALK 2 - 0.88ns LINEAR_WALK 3 - 0.89ns LINEAR_WALK 4 - 0.89ns LINEAR_WALK 0 - 3.29ns RANDOM_PAGE_WALK 1 - 3.35ns RANDOM_PAGE_WALK 2 - 3.33ns RANDOM_PAGE_WALK 3 - 3.31ns RANDOM_PAGE_WALK 4 - 3.30ns RANDOM_PAGE_WALK 0 - 9.58ns RANDOM_HEAP_WALK 1 - 9.20ns RANDOM_HEAP_WALK 2 - 9.44ns RANDOM_HEAP_WALK 3 - 9.46ns RANDOM_HEAP_WALK 4 - 9.47ns RANDOM_HEAP_WALK |
Analysis
01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 | perf stat -e L1-dcache-loads,L1-dcache-load-misses java -Xmx4g TestMemoryAccessPatterns $ Performance counter stats for 'java -Xmx4g TestMemoryAccessPatterns 1' : 1,496,626,053 L1-dcache-loads 274,255,164 L1-dcache-misses # 18.32% of all L1-dcache hits Performance counter stats for 'java -Xmx4g TestMemoryAccessPatterns 2' : 1,537,057,965 L1-dcache-loads 1,570,105,933 L1-dcache-misses # 102.15% of all L1-dcache hits Performance counter stats for 'java -Xmx4g TestMemoryAccessPatterns 3' : 4,321,888,497 L1-dcache-loads 1,780,223,433 L1-dcache-misses # 41.19% of all L1-dcache hits likwid-perfctr -C 2 -g L2CACHE java -Xmx4g TestMemoryAccessPatterns $ java -Xmx4g TestMemoryAccessPatterns 1 +-----------------------+-------------+ | Event | core 2 | +-----------------------+-------------+ | INSTR_RETIRED_ANY | 5.94918e+09 | | CPU_CLK_UNHALTED_CORE | 5.15969e+09 | | L2_TRANS_ALL_REQUESTS | 1.07252e+09 | | L2_RQSTS_MISS | 3.25413e+08 | +-----------------------+-------------+ +-----------------+-----------+ | Metric | core 2 | +-----------------+-----------+ | Runtime [s] | 2.15481 | | CPI | 0.867293 | | L2 request rate | 0.18028 | | L2 miss rate | 0.0546988 | | L2 miss ratio | 0.303409 | +-----------------+-----------+ java -Xmx4g TestMemoryAccessPatterns 2 +-----------------------+-------------+ | Event | core 2 | +-----------------------+-------------+ | INSTR_RETIRED_ANY | 1.48772e+10 | | CPU_CLK_UNHALTED_CORE | 1.64712e+10 | | L2_TRANS_ALL_REQUESTS | 3.41061e+09 | | L2_RQSTS_MISS | 1.5547e+09 | +-----------------------+-------------+ +-----------------+----------+ | Metric | core 2 | +-----------------+----------+ | Runtime [s] | 6.87876 | | CPI | 1.10714 | | L2 request rate | 0.22925 | | L2 miss rate | 0.104502 | | L2 miss ratio | 0.455843 | +-----------------+----------+ java -Xmx4g TestMemoryAccessPatterns 3 +-----------------------+-------------+ | Event | core 2 | +-----------------------+-------------+ | INSTR_RETIRED_ANY | 6.49533e+09 | | CPU_CLK_UNHALTED_CORE | 4.18416e+10 | | L2_TRANS_ALL_REQUESTS | 4.67488e+09 | | L2_RQSTS_MISS | 1.43442e+09 | +-----------------------+-------------+ +-----------------+----------+ | Metric | core 2 | +-----------------+----------+ | Runtime [s] | 17.474 | | CPI | 6.4418 | | L2 request rate | 0.71973 | | L2 miss rate | 0.220838 | | L2 miss ratio | 0.306835 | +-----------------+----------+ |
01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 | perf stat -e dTLB-loads,dTLB-load-misses java -Xmx4g TestMemoryAccessPatterns $ Performance counter stats for 'java -Xmx4g TestMemoryAccessPatterns 1' : 1,496,128,634 dTLB-loads 310,901 dTLB-misses # 0.02% of all dTLB cache hits Performance counter stats for 'java -Xmx4g TestMemoryAccessPatterns 2' : 1,551,585,263 dTLB-loads 340,230 dTLB-misses # 0.02% of all dTLB cache hits Performance counter stats for 'java -Xmx4g TestMemoryAccessPatterns 3' : 4,031,344,537 dTLB-loads 1,345,807,418 dTLB-misses # 33.38% of all dTLB cache hits |
01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 | likwid-perfctr -C 2 -t intel -g LOAD_HIT_PRE_HW_PF:PMC0 java -Xmx4g TestMemoryAccessPatterns $ java -Xmx4g TestMemoryAccessPatterns 1 +--------------------+-------------+ | Event | core 2 | +--------------------+-------------+ | LOAD_HIT_PRE_HW_PF | 1.31613e+09 | +--------------------+-------------+ java -Xmx4g TestMemoryAccessPatterns 2 +--------------------+--------+ | Event | core 2 | +--------------------+--------+ | LOAD_HIT_PRE_HW_PF | 368930 | +--------------------+--------+ java -Xmx4g TestMemoryAccessPatterns 3 +--------------------+--------+ | Event | core 2 | +--------------------+--------+ | LOAD_HIT_PRE_HW_PF | 324373 | +--------------------+--------+ |
Research is advancing on algorithmic approaches that work in harmony with cache sub-systems. One area I find fascinating is Cache Oblivious Algorithms. The name is a bit misleading but there are some great concepts here for how to improve software performance and better execute in parallel. This article is a great illustration of the performance benefits that can be gained.
Conclusion
To achieve great performance it is important to have sympathy for the cache sub-systems. We have seen in this article what can be achieved by accessing memory in patterns which work with, rather than against, these caches. When designing algorithms and data structures, it has now much more important to consider cache-misses, probably even more so than counting steps in the algorithm. This is not what we were taught in algorithm theory when studying computer science. The last decade has seen some fundamental changes in technology. For me the two most significant are the rise of multi-core, and now big-memory systems with 64-bit address spaces.