What every Programmer should know about the memory system
In the traditional view, memory was a flat, monolithic structure with a fixed access latency. Today’s computer systems use page-mode addressing and smart memory scheduling, which make this flat view look rather naive. In my opinion, there are four concepts every programmer must be familiar with: pages, banks, row-buffer conflicts, and FR-FCFS scheduling.
Pages
Data in the DRAM is stored at a page granularity. The size of a typical DRAM page is between 4KB and 16KB. In principle, this size is independent of the OS page size, which is typically 4KB. In practice, most DRAMs use pages larger than the OS pages in order to improve locality in the row buffers (described below).
Banks
To reduce access latency, memory is split into multiple equal-sized units called banks. Most DRAM chips today have 8 to 16 banks. Each bank stores tens of thousands of pages.
The following is a practical example of how a 32-bit physical address (4GB of physical memory) might be split by the DRAM (credit to my friend Khubaib). Bits 12-0 identify the byte within the 8KB page. Bits 16-13 select which of the 16 memory channels the system should use for this address. Bits 20-17 select which of the 16 banks the address maps to, and bits 31-21 select the row within that bank. The channel bits split banks across DRAM modules in order to increase the bandwidth to DRAM.
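As a sketch, here is how that example mapping could be decoded in C; the bit positions are specific to this hypothetical system and will differ between memory controllers:

```c
#include <stdint.h>
#include <stdio.h>

/* Decode the example 32-bit physical address mapping described above.
 * Bit positions are specific to this hypothetical system:
 * [12:0] byte-in-page, [16:13] channel, [20:17] bank, [31:21] row. */
static void decode(uint32_t paddr)
{
    uint32_t byte_in_page = paddr & 0x1FFF;        /* bits 12-0  */
    uint32_t channel      = (paddr >> 13) & 0xF;   /* bits 16-13 */
    uint32_t bank         = (paddr >> 17) & 0xF;   /* bits 20-17 */
    uint32_t row          = paddr >> 21;           /* bits 31-21 */
    printf("addr 0x%08x -> channel %u, bank %u, row %u, offset %u\n",
           paddr, channel, bank, row, byte_in_page);
}

int main(void)
{
    decode(0x12345678);  /* arbitrary example address */
    return 0;
}
```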
A memory bank can only service one request at a time; any other access to the same bank must wait for the previous access to complete. This is known as a bank conflict. In contrast, accesses to different banks can proceed in parallel, which is known as bank-level parallelism.
Row-Buffer
Each DRAM bank has one row buffer, a structure that provides access to the page currently open at that bank. Before a memory location can be read, the entire page containing that location must be opened and read into the row buffer. The page stays in the row buffer until it is explicitly closed. If an access to the open page arrives at the bank, it can be serviced immediately from the row buffer within a single memory cycle; this is called a row-buffer hit (typically less than ten processor cycles). However, if an access to a different row arrives, the current row must be closed and the new row opened before the request can be serviced; this is called a row-buffer conflict, and it incurs substantial delay in the DRAM (typically 70+ processor cycles).
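To make the hit/conflict distinction concrete, here is a toy model in C that tracks the open row per bank, reusing the example address mapping from above and the rough latencies just quoted (both are assumptions for illustration):

```c
#include <stdint.h>
#include <stdio.h>

#define NUM_BANKS       16
#define HIT_CYCLES      10   /* rough row-buffer hit latency from the text */
#define CONFLICT_CYCLES 70   /* rough row-buffer conflict latency */

static int32_t open_row[NUM_BANKS];   /* open row per bank, -1 = none */

/* Classify one access using the example mapping from above
 * (bank = bits 20-17, row = bits 31-21) and return its cost. */
static int access_cost(uint32_t paddr)
{
    uint32_t bank = (paddr >> 17) & 0xF;
    int32_t  row  = (int32_t)(paddr >> 21);

    if (open_row[bank] == row)
        return HIT_CYCLES;        /* row-buffer hit */
    open_row[bank] = row;         /* close the old row, open the new one */
    return CONFLICT_CYCLES;       /* row-buffer conflict (or first open) */
}

int main(void)
{
    for (int b = 0; b < NUM_BANKS; b++)
        open_row[b] = -1;

    /* Two accesses to the same row hit; a different row conflicts. */
    printf("%d\n", access_cost(0x00200000));  /* opens the row: 70 */
    printf("%d\n", access_cost(0x00200040));  /* same row:      10 */
    printf("%d\n", access_cost(0x00400000));  /* new row:       70 */
    return 0;
}
```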
FR-FCFS
Memory accesses are not sent to the DRAM in the order they are issued by the core. Instead, they are buffered in on-chip memory request buffers and dispatched to the DRAM in the order chosen by the memory scheduler. To maximize the throughput of the memory system, memory schedulers today employ a policy called First-Ready First-Come-First-Served (FR-FCFS). The idea is to prioritize accesses to the page which is currently open in order to minimize the number of row-buffer conflicts. While this helps memory throughput, it further increases the penalty of a row-buffer conflict. As a rule of thumb, a row-buffer conflict takes at least 3x the time of a row-buffer hit. It is important to note that all known memory scheduling algorithms schedule requests for each bank completely independently and, thus, have no impact on bank-level parallelism.
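A minimal sketch of the FR-FCFS decision for a single bank, in C; the queue structure and its fields are assumptions for illustration, and real schedulers also track DRAM timing constraints and request readiness, which this omits:

```c
#include <stdint.h>
#include <stdio.h>

struct request {
    uint32_t row;     /* row this request targets */
    uint32_t bank;    /* bank it maps to */
    int      age;     /* arrival order: smaller = older */
    int      valid;
};

/* Pick the next request for one bank under FR-FCFS:
 * 1. Prefer the oldest request that hits the currently open row.
 * 2. Otherwise fall back to the oldest request overall (FCFS). */
static int pick_next(struct request q[], int n, uint32_t bank, int32_t open_row)
{
    int best = -1, oldest = -1;
    for (int i = 0; i < n; i++) {
        if (!q[i].valid || q[i].bank != bank)
            continue;
        if (oldest < 0 || q[i].age < q[oldest].age)
            oldest = i;
        if ((int32_t)q[i].row == open_row &&
            (best < 0 || q[i].age < q[best].age))
            best = i;
    }
    return best >= 0 ? best : oldest;   /* row hit first, else oldest */
}

int main(void)
{
    struct request q[3] = {
        { .row = 5, .bank = 0, .age = 0, .valid = 1 },  /* oldest, row 5 */
        { .row = 9, .bank = 0, .age = 1, .valid = 1 },
        { .row = 9, .bank = 0, .age = 2, .valid = 1 },
    };
    /* With row 9 open, the scheduler skips the older row-5 request. */
    printf("next = %d\n", pick_next(q, 3, 0, 9));  /* prints 1 */
    return 0;
}
```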
What can programmers do?
Avoid memory accesses: I cannot stress this enough. Memory may seem free in terms of storage, but it costs a lot in performance once the working set grows larger than the caches or the available DRAM. It is almost always a good idea to reduce the memory footprint as much as reasonably possible, especially on embedded devices.
Bank conflicts: In general, application programmers work with virtual addresses while the DRAM operates on physical addresses. Thus, programmers usually cannot reduce hazards like bank conflicts because they have no control over the placement of data in physical memory. However, there are some special scenarios where programmers can help. For example, many ISAs now support virtual pages as large as 1GB. Within such a large page, programmers can do their own data layout, which gives them much more control over which data goes to which bank, and they can use this control to increase bank-level parallelism (see the sketch below).
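As a concrete illustration, Linux exposes huge pages through mmap’s MAP_HUGETLB flag. The following is a minimal sketch, assuming the administrator has reserved huge pages beforehand (e.g., via /proc/sys/vm/nr_hugepages); 1GB pages additionally require the MAP_HUGE_1GB flag plus hardware and kernel support:

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 2UL * 1024 * 1024;   /* one 2MB huge page */

    /* Ask the kernel for a huge-page-backed anonymous mapping.
     * Add MAP_HUGE_1GB to the flags for 1GB pages where supported. */
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (buf == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");  /* likely no huge pages configured */
        return EXIT_FAILURE;
    }

    /* Within this region, virtual offsets map to contiguous physical
     * addresses, so data placement relative to banks is predictable. */
    ((char *)buf)[0] = 1;

    munmap(buf, len);
    return 0;
}
```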
Row-Buffer Conflicts: Programmers can also help by increasing spatial locality in the row buffer, i.e., by making consecutive memory accesses as close to each other as possible. This often implies packing data which is likely to be accessed together into consecutive memory locations; e.g., a struct of arrays often performs better than an array of structs (see the sketch below). It also implies that linked data structures which promote random accesses should be avoided as much as possible, because each row-buffer conflict costs more than 100 processor cycles.
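To illustrate the layout difference in C: if a hot loop reads only one field, the struct-of-arrays version keeps the touched data contiguous, while the array-of-structs version strides over unused fields (the particle example here is hypothetical):

```c
#include <stddef.h>
#include <stdio.h>

#define N 100000

/* Array of structs: x values are interleaved with unused fields,
 * so a loop over x touches far more memory than it needs. */
struct particle { double x, y, z, mass; };
static struct particle aos[N];

/* Struct of arrays: all x values are packed in consecutive
 * memory locations, maximizing row-buffer and cache locality. */
static struct {
    double x[N], y[N], z[N], mass[N];
} soa;

static double sum_x_aos(void)
{
    double s = 0.0;
    for (size_t i = 0; i < N; i++)
        s += aos[i].x;      /* strided: one useful double per 32 bytes */
    return s;
}

static double sum_x_soa(void)
{
    double s = 0.0;
    for (size_t i = 0; i < N; i++)
        s += soa.x[i];      /* sequential: every byte fetched is used */
    return s;
}

int main(void)
{
    printf("%f %f\n", sum_x_aos(), sum_x_soa());
    return 0;
}
```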
Experimental Data
In an earlier post, I described an experiment on an ARM Cortex-A8 which showed that sequential accesses cost only about 20 processor cycles each, while random accesses cost close to 180 cycles. The same experiment on a Core i7 showed roughly a 2x speedup from sequential accesses.
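A minimal version of that kind of microbenchmark in C; the array size, stride, and timing method are my own choices, and the absolute numbers will differ by platform:

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (64 * 1024 * 1024 / sizeof(long))  /* 64MB, well beyond any cache */

static double seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    long *a = malloc(N * sizeof(long));
    for (size_t i = 0; i < N; i++) a[i] = i;

    /* Sequential walk: consecutive addresses, mostly row-buffer hits. */
    volatile long sum = 0;
    double t0 = seconds();
    for (size_t i = 0; i < N; i++) sum += a[i];
    double t_seq = seconds() - t0;

    /* Scattered walk: same number of reads, but a large stride that
     * approximates random access, causing misses and row conflicts. */
    t0 = seconds();
    size_t idx = 0;
    for (size_t i = 0; i < N; i++) {
        sum += a[idx];
        idx = (idx + 1234567) % N;
    }
    double t_rand = seconds() - t0;

    printf("sequential: %.3fs  random: %.3fs\n", t_seq, t_rand);
    free(a);
    return 0;
}
```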
Conclusion
Programmers can save battery and increase performance via DRAM-aware programming. Such optimizations may not be practical in all scenarios, but understanding the basic DRAM structure can help programmers identify existing or potential performance bottlenecks, especially on more constrained platforms like the iPhone.
Reference: What every Programmer should know about the memory system from our JCG partner Aater at the Future Chips blog.