Memory Subsystem
The memory system serves as a repository of information in a microprocessor system. The processing unit retrieves information stored in memory, operates on the information, and returns new information back to memory. The memory system is constructed from basic semiconductor DRAM units called modules or banks.
There are several properties of memory, including speed, capacity, and cost, that play an important role in the overall system performance. The speed of a memory system is the key performance parameter in the design of the microprocessor system. The latency (L) of the memory is defined as the time delay from when the processor first requests data from memory until the processor receives the data. Bandwidth is defined as the rate at which information can be transferred to and from the memory system. Memory bandwidth and latency are related to the number of outstanding requests (R) that the memory system can service:

Bandwidth = R / L
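To make the relationship concrete, the sketch below computes how many outstanding requests a memory system would need in flight to sustain a given bandwidth at a given latency. The latency, bandwidth, and line-size numbers are assumptions chosen for illustration, not values from the text.

```c
#include <stdio.h>

/* Illustrative numbers only (not from the text): how many outstanding
 * requests (R) are needed so that R = Bandwidth x Latency holds, i.e.
 * so the memory system stays busy at the target bandwidth.            */
int main(void)
{
    double latency_ns    = 100.0;  /* assumed memory latency in ns     */
    double bandwidth_gbs = 16.0;   /* assumed bandwidth in GB/s        */
    double line_bytes    = 64.0;   /* assumed size of each transfer    */

    /* Requests completed per nanosecond at full bandwidth. */
    double requests_per_ns = (bandwidth_gbs * 1e9 / line_bytes) * 1e-9;

    /* Outstanding requests the memory system must support. */
    double r = requests_per_ns * latency_ns;

    printf("outstanding requests R = %.1f\n", r);   /* prints 25.0 */
    return 0;
}
```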
Bandwidth plays an important role in keeping the processor busy with work. However, technology trade-offs to optimize latency and improve bandwidth often conflict with the need to increase the capacity and reduce the cost of the memory system.
Cache Memory
Cache memory, or simply cache, is a fast memory constructed using semiconductor SRAM. In modern computer systems, there is usually a hierarchy of cache memories. The top level (Level-1 or L1) cache is closest to the processor and the lowest level is closest to the main memory. Each lower-level cache is about 3 to 5 times slower than its predecessor level. The purpose of a cache hierarchy is to satisfy most of the processor memory accesses in one or a small number of clock cycles. The L1 cache is often split into an instruction cache and a data cache to allow the processor to perform simultaneous accesses for instructions and data. Cache memories were first used in the IBM mainframe computers in the 1960s. Since 1985, cache memories have become a standard feature for virtually all microprocessors.
Cache memories exploit the principle of locality of reference. This principle dictates that some memory locations are referenced more frequently than others, based on two program properties. Spatial locality is the property that an access to a memory location increases the probability that nearby memory locations will also be accessed. Spatial locality is predominantly based on sequential access to program code and structured data. To exploit spatial locality, memory data are placed into the cache in multiple-word units called cache lines. Temporal locality is the property that access to a memory location greatly increases the probability that the same location will be accessed in the near future. Together, the two properties ensure that most memory references will be satisfied by the cache memory.
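The C sketch below illustrates both properties on an assumed 1024 x 1024 array: the row-major loop touches consecutive addresses, so each fetched cache line serves several accesses (spatial locality), and the running sum is reused on every iteration (temporal locality); the column-major loop shows the access order a cache penalizes.

```c
#include <stdio.h>

#define N 1024

/* Sketch: two loops over the same data that a cache rewards differently. */
static int a[N][N];

int main(void)
{
    long sum = 0;

    /* Good spatial locality: the innermost index walks consecutive words,
     * so each cache line fetched supplies many subsequent accesses.      */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];

    /* Poor spatial locality: column-major order strides N words apart,
     * so each access may touch a different cache line.                   */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];

    printf("%ld\n", sum);
    return 0;
}
```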
Set associativity refers to the flexibility in placing memory data into the cache memory. If a cache design allows a given memory datum to be placed into any of N cache locations, it is referred to as an N-way set-associative cache. A one-way set-associative cache is also called a direct-mapped cache, as shown in Figure 66.4(a). A two-way set-associative cache, as shown in Figure 66.4(b), allows a memory datum to reside in one of two locations in the cache. Finally, an extreme design called a fully associative cache allows memory data to be placed anywhere in the cache.
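As a sketch of how set associativity shows up in address decoding, the code below splits an address into line offset, set index, and tag for an assumed 32 KB cache with 64-byte lines at several associativities; the sizes are illustrative, not from the text.

```c
#include <stdint.h>
#include <stdio.h>

/* Assumed parameters: a 32 KB cache with 64-byte lines.  An N-way cache
 * has (num_lines / N) sets; a direct-mapped cache has one way per set,
 * and a fully associative cache has a single set.                       */
#define CACHE_BYTES 32768u
#define LINE_BYTES  64u

static void decompose(uint32_t addr, unsigned ways)
{
    unsigned num_lines = CACHE_BYTES / LINE_BYTES;
    unsigned num_sets  = num_lines / ways;

    uint32_t offset = addr % LINE_BYTES;            /* byte within the line */
    uint32_t set    = (addr / LINE_BYTES) % num_sets;
    uint32_t tag    = (addr / LINE_BYTES) / num_sets;

    printf("addr=0x%08x ways=%u -> set=%u tag=0x%x offset=%u\n",
           addr, ways, set, tag, offset);
}

int main(void)
{
    decompose(0x12345678u, 1);    /* direct mapped                        */
    decompose(0x12345678u, 2);    /* two-way set associative              */
    decompose(0x12345678u, 512);  /* fully associative: one set, 512 ways */
    return 0;
}
```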
Cache misses occur when the requested data does not reside in any of the possible cache locations. Misses in caches can be classified into three categories: conflict, compulsory, and capacity. Conflict misses are misses that would not occur in a fully associative cache with least recently used (LRU) replacement. Compulsory misses are those incurred on the first reference to a memory location. Capacity misses occur when the cache size is not sufficient to keep data in the cache between references. Complete cache miss definitions are provided in Ref. [4].
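The toy simulation below illustrates conflict misses under assumed parameters (a 16-line direct-mapped cache with 64-byte lines): two addresses that map to the same line evict each other on every access, whereas a two-way or fully associative cache with LRU replacement would keep both lines resident.

```c
#include <stdint.h>
#include <stdio.h>

/* Assumed tiny direct-mapped cache: 16 lines of 64 bytes. */
#define NUM_LINES  16u
#define LINE_BYTES 64u

int main(void)
{
    uint32_t tags[NUM_LINES]  = {0};
    int      valid[NUM_LINES] = {0};
    unsigned misses = 0;

    /* Two addresses exactly NUM_LINES * LINE_BYTES apart share a set. */
    uint32_t pattern[2] = { 0x0000, NUM_LINES * LINE_BYTES };

    for (int i = 0; i < 100; i++) {
        uint32_t addr = pattern[i & 1];
        uint32_t line = addr / LINE_BYTES;
        uint32_t set  = line % NUM_LINES;
        uint32_t tag  = line / NUM_LINES;

        if (!valid[set] || tags[set] != tag) {   /* miss: fill the line */
            misses++;
            valid[set] = 1;
            tags[set]  = tag;
        }
    }
    printf("misses = %u out of 100 accesses\n", misses);   /* prints 100 */
    return 0;
}
```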
The latency in cache memories is not fixed and depends on the delay and frequency of cache misses. A performance metric that accounts for the penalty of cache misses is effective latency. Effective latency depends on the two possible latencies: hit latency (L_HIT), the latency experienced when accessing data residing in the cache, and miss latency (L_MISS), the latency experienced when accessing data not residing in the cache. It also depends on the hit rate (H), the percentage of memory accesses that hit in the cache, and the miss rate (M, or 1 − H), the percentage of memory accesses that miss in the cache. Effective latency in a cache system is calculated as

L_effective = H × L_HIT + (1 − H) × L_MISS

In addition to the base cache design and size issues, there are several other dimensions of cache design that affect the overall cache performance and miss rate in a system. The main memory update method dictates when main memory will be updated by store operations. In a write-through cache, each write is immediately reflected to main memory. In a write-back cache, writes are reflected to main memory only when the respective cached data is purged from the cache to make room for other memory data. Cache allocation designates whether cache locations are allocated on writes and/or reads. Lastly, cache replacement algorithms for associative structures can be designed in various ways to extract additional cache performance. These include LRU, least frequently used (LFU), random, and FIFO (first in, first out). These cache management strategies attempt to exploit the properties of locality. Traditionally, when caches serviced misses they would block all new requests. However, a nonblocking cache can be designed to service multiple miss requests simultaneously, thus alleviating delay in accessing memory data.
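As a worked instance of the effective-latency formula above, the snippet below plugs in assumed values (2-cycle hit latency, 50-cycle miss latency, 95% hit rate); the numbers are illustrative only.

```c
#include <stdio.h>

/* Worked example of L_eff = H * L_HIT + (1 - H) * L_MISS with assumed
 * values: 2-cycle hit, 50-cycle miss, 95% hit rate.                   */
int main(void)
{
    double l_hit  = 2.0;
    double l_miss = 50.0;
    double h      = 0.95;

    double l_eff = h * l_hit + (1.0 - h) * l_miss;
    printf("effective latency = %.2f cycles\n", l_eff);   /* 4.40 cycles */
    return 0;
}
```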
In addition to the multiple levels of cache hierarchy, additional memory buffers can be used to improve cache performance. Two such buffers are a streaming/prefetch buffer and a victim cache [2]. Figure 66.5 illustrates the relation of the streaming buffer and victim cache to the primary cache of a memory system. A streaming buffer is used as a prefetching mechanism for cache misses. When a cache miss occurs, the streaming buffer begins prefetching successive lines starting at the miss target. A victim cache is typically a small fully associative cache loaded only with cache lines that are removed from the primary cache. In the case of a miss in the primary cache, the victim cache may hold the missed data. The use of a victim cache can improve performance by reducing the number of conflict misses. Figure 66.5 illustrates how cache accesses are processed through the streaming buffer into the primary cache on cache requests and from the primary cache through the victim cache to the secondary level of memory on cache misses.
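The following sketch models only the victim-cache portion of this miss path, under assumed sizes (a 64-line direct-mapped primary cache and a 4-entry victim cache); the swap-on-victim-hit behavior and the trivial replacement choice are simplifications for illustration, not the design of any particular machine.

```c
#include <stdint.h>
#include <stdio.h>

/* Assumed sizes: direct-mapped primary cache of 64 lines backed by a
 * 4-entry fully associative victim cache holding evicted lines.       */
#define PRIMARY_LINES 64u
#define VICTIM_LINES  4u

typedef struct { int valid; uint32_t line; } entry_t;

static entry_t primary[PRIMARY_LINES];
static entry_t victim[VICTIM_LINES];

/* Returns 1 on a primary hit, 2 on a victim-cache hit, 0 on a true miss. */
static int access_line(uint32_t line)
{
    uint32_t set = line % PRIMARY_LINES;

    if (primary[set].valid && primary[set].line == line)
        return 1;                                /* primary hit          */

    for (unsigned v = 0; v < VICTIM_LINES; v++) {
        if (victim[v].valid && victim[v].line == line) {
            entry_t evicted = primary[set];      /* swap primary <-> victim */
            primary[set].valid = 1;
            primary[set].line  = line;
            victim[v] = evicted;
            return 2;                            /* conflict miss avoided */
        }
    }

    /* True miss: fetch from the next level (not modeled); the displaced
     * primary line moves into the victim cache (entry 0 stands in for a
     * real replacement policy).                                          */
    victim[0] = primary[set];
    primary[set].valid = 1;
    primary[set].line  = line;
    return 0;
}

int main(void)
{
    /* Two lines that conflict in the primary cache but coexist thanks
     * to the victim cache after the first round trip: prints 0 0 2 2.  */
    printf("%d %d %d %d\n",
           access_line(0), access_line(PRIMARY_LINES),
           access_line(0), access_line(PRIMARY_LINES));
    return 0;
}
```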
Overall, cache memory is constructed to hold the most important portions of memory. Techniques using either hardware or software can be used to select which portions of main memory to store in cache. However, cache performance is strongly influenced by program behavior and numerous hardware design alternatives.
Virtual Memory
Cache memory illustrated the principle that the memory address of data can be separate from a particular storage location. Similar address abstractions exist in the two-level memory hierarchy of main memory and disk storage. An address generated by a program is called a virtual address, which needs to be translated into a physical address, or location, in main memory. Virtual memory management is a mechanism that provides programmers with a simple, uniform method to access both main and secondary memories. With virtual memory management, programmers are given a virtual space to hold all the instructions and data. The virtual space is organized as a linear array of locations, each with an address for convenient access. Instructions and data have to be stored somewhere in the real system, so these virtual space locations must correspond to physical locations in the main and secondary memory. Virtual memory management assigns (or maps) the virtual space locations to main and secondary memory locations; the programmers are not concerned with the mapping.
The most popular memory management scheme today is demand paging virtual memory management where each virtual space is divided into pages indexed by the page number (PN). Each page consists of several consecutive locations in the virtual space indexed by the page index (PI). The number of locations in each page is an important system design parameter called page size. Page size is usually defined as a power of two so that the virtual space can be divided into an integer number of pages. Pages are the basic unit of virtual memory management. If any location in a page is assigned to the main memory, the other locations in that page are also assigned to main memory. This reduces the size of the mapping information.
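A minimal sketch of this split, assuming a 4 KB page size (a common choice, not one mandated by the text): because the page size is a power of two, the page number and page index are obtained with a shift and a mask.

```c
#include <stdint.h>
#include <stdio.h>

/* Assumed page size: 4 KB = 2^12 bytes. */
#define PAGE_SHIFT 12u
#define PAGE_SIZE  (1u << PAGE_SHIFT)

int main(void)
{
    uint32_t va = 0x00403A24u;              /* example virtual address */
    uint32_t pn = va >> PAGE_SHIFT;         /* page number (PN)        */
    uint32_t pi = va & (PAGE_SIZE - 1u);    /* page index (PI)         */

    printf("VA=0x%08x -> PN=0x%x PI=0x%03x\n", va, pn, pi);
    return 0;
}
```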
The part of the secondary memory to accommodate pages of the virtual space is called the swap space. Both the main memory and the swap space are divided into page frames. Each page frame can host a page of the virtual space. The mapping record in the virtual memory management keeps track of the association between pages and page frames.
When a virtual space location is requested, the virtual memory management looks up the mapping record. If the mapping record shows that the page containing the requested virtual space location is in main memory, the access is performed without any further complication. Otherwise, a secondary memory access has to be performed. Accessing the secondary memory is a complicated task and is usually performed as an operating system service. When a page is mapped into the secondary memory, the virtual memory management has to request an operating system service to transfer the requested page into main memory, update its mapping record, and then perform the access. The operating system service thus performed is called the page fault handler.
The core process of virtual memory management is a memory access algorithm. A one-level memory access algorithm is illustrated in Figure 66.6. At the start of the memory access, the algorithm receives a virtual address in a memory address register (MAR), looks up the mapping record, requests an operating system service to transfer the required page if necessary, and performs the main memory access. The mapping is recorded in a data structure called the page table located in main memory at a designated location marked by the page table base register (PTBR).
Each page is mapped by a page table entry (PTE), which occupies a fixed number of bytes in the page table. Thus one can simply multiply the page number by the size of each PTE to form a byte index into the page table. The byte index of the PTE is then added to the PTBR to form the physical address (PA_PTE) of the required PTE. Each PTE includes two fields: a hit/miss bit and a page frame number. If the hit/miss (H/M) bit is set (hit), the corresponding page is in main memory. In this case, the page frame hosting the requested page is pointed to by the page frame number (PFN). The final physical address (PA_D) of the requested data is then formed by concatenating the PFN and PI. The data is returned and placed in the memory buffer register (MBR), and the processor is informed of the completed memory access. Otherwise (miss), a secondary memory access has to be performed. In this case, the page frame number should be ignored, and the page fault handler has to be invoked to access the secondary memory. The hardware component that performs the address translation part of the memory access algorithm is called the memory management unit (MMU).
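The sketch below walks through this one-level access algorithm as an in-memory simulation; the 4 KB page size, 20-bit PFN field, tiny page table, and fault-handler stand-in are all assumptions for illustration rather than details from the text.

```c
#include <stdint.h>
#include <stdio.h>

/* Sketch of the one-level memory access algorithm (Figure 66.6) as an
 * in-memory simulation with assumed field widths.                      */
#define PAGE_SHIFT 12u
#define PAGE_MASK  ((1u << PAGE_SHIFT) - 1u)
#define NUM_PAGES  8u

typedef struct {
    unsigned hit : 1;      /* H/M bit: is the page in main memory?      */
    unsigned pfn : 20;     /* page frame number, meaningful only on hit */
} pte_t;

/* The page table; in hardware the PTE address would be computed as
 * PTBR + PN * sizeof(pte_t), and here the array index plays that role. */
static pte_t page_table[NUM_PAGES] = {
    [0] = { .hit = 1, .pfn = 0x003 },
    [1] = { .hit = 0 },                 /* faults on first access        */
    [2] = { .hit = 1, .pfn = 0x07A },
};

/* Stand-in for the OS page fault handler: finds a frame for the page
 * and updates the mapping record.                                      */
static void page_fault_handler(uint32_t pn)
{
    page_table[pn].hit = 1;
    page_table[pn].pfn = 0x100u + pn;   /* pretend a free frame was found */
}

static uint32_t translate(uint32_t va)
{
    uint32_t pn = va >> PAGE_SHIFT;     /* page number (PN)               */
    uint32_t pi = va & PAGE_MASK;       /* page index (PI)                */

    if (!page_table[pn].hit)            /* miss: invoke the fault handler */
        page_fault_handler(pn);

    /* Final physical address: PFN concatenated with PI. */
    return ((uint32_t)page_table[pn].pfn << PAGE_SHIFT) | pi;
}

int main(void)
{
    printf("0x%08x\n", translate(0x0000222Cu));   /* page 2, resident     */
    printf("0x%08x\n", translate(0x00001040u));   /* page 1, page fault   */
    return 0;
}
```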
The complexity of the algorithm depends on the mapping structure. A very simple mapping structure is used in this section to focus on the basic principles of the memory access algorithms. However, more complex two-level schemes are often used when the page table becomes a large portion of the main memory. There are also requirements for such designs in a multiprogramming system, where multiple processes are active at the same time. Each process has its own virtual space and therefore its own page table. As a result, these systems need to keep multiple page tables at the same time, and it usually takes too much main memory to accommodate all the active page tables. Again, the natural solution to this problem is to provide an additional level of mapping, where a second-level page table is used to map the main page table. In such designs, only the second-level page table is stored in a reserved region of main memory, while the first-level page table is mapped just like the code and data, between the secondary storage and the main memory.
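As a minimal illustration of the resulting address split, the snippet below assumes a common 32-bit layout of 4 KB pages with 10-bit indices at each level (an assumption, not a value from the text); only the outer (second-level) table would stay resident in main memory.

```c
#include <stdint.h>
#include <stdio.h>

/* Assumed two-level layout: 10-bit index into the second-level (outer)
 * table, 10-bit index into a first-level table, 12-bit page index.     */
int main(void)
{
    uint32_t va = 0xC0403A24u;

    uint32_t outer_index = va >> 22;             /* selects a first-level table   */
    uint32_t inner_index = (va >> 12) & 0x3FFu;  /* selects a PTE within it       */
    uint32_t page_index  = va & 0xFFFu;          /* offset within the page        */

    printf("outer=0x%x inner=0x%x offset=0x%x\n",
           outer_index, inner_index, page_index);
    return 0;
}
```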
Translation Lookaside Buffer
Hardware support for virtual memory management generally includes a translation lookaside buffer (TLB) to accelerate the translation of virtual addresses into physical addresses. A TLB is a cache structure that contains the frequently used page table entries for address translation. With a TLB, address translation can be performed in a single clock cycle when the TLB contains the required page table entries (a TLB hit). The full address translation algorithm is performed only when the required page table entries are missing from the TLB (a TLB miss).
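A sketch of a TLB in front of the page-table walk follows, with an assumed 8-entry fully associative TLB and FIFO replacement; walk_page_table is a stand-in for the full memory access algorithm, not a real interface.

```c
#include <stdint.h>
#include <stdio.h>

/* Assumed sizes: 4 KB pages, 8-entry fully associative TLB. */
#define PAGE_SHIFT 12u
#define TLB_SIZE   8u

typedef struct { int valid; uint32_t pn; uint32_t pfn; } tlb_entry_t;

static tlb_entry_t tlb[TLB_SIZE];
static unsigned    next_victim;       /* simple FIFO replacement        */

/* Stand-in for the full memory access algorithm (Figure 66.6). */
static uint32_t walk_page_table(uint32_t pn)
{
    return 0x100u + pn;               /* pretend mapping for illustration */
}

static uint32_t translate(uint32_t va)
{
    uint32_t pn = va >> PAGE_SHIFT;
    uint32_t pi = va & ((1u << PAGE_SHIFT) - 1u);

    for (unsigned i = 0; i < TLB_SIZE; i++)      /* TLB hit: one cycle    */
        if (tlb[i].valid && tlb[i].pn == pn)
            return (tlb[i].pfn << PAGE_SHIFT) | pi;

    uint32_t pfn = walk_page_table(pn);          /* TLB miss: full walk   */
    tlb[next_victim] = (tlb_entry_t){ 1, pn, pfn };
    next_victim = (next_victim + 1) % TLB_SIZE;
    return (pfn << PAGE_SHIFT) | pi;
}

int main(void)
{
    printf("0x%08x\n", translate(0x00002468u));  /* miss, then cached     */
    printf("0x%08x\n", translate(0x00002ABCu));  /* hit in the TLB        */
    return 0;
}
```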
Complexities arise when a system includes both virtual memory management and cache memory. The major issue is whether address translation is done before accessing the cache memory. In virtual cache systems, the virtual address directly accesses cache. In a physical cache system, the virtual address is translated into a physical address before cache access. Figure 66.7 illustrates both the virtual and physical cache approaches.
A physical cache system typically overlaps the cache memory access and the access to the TLB. The overlap is possible when the virtual memory page size is larger than the cache capacity divided by the degree of cache associativity. Essentially, since the virtual page index is the same as the physical address index, no translation for the lower indexes of the virtual address is necessary. These lower index bits can be used to access the cache storage while the page number (PN) bits go through the TLB. The PFN bits that come out of the TLB are compared with the tag bits of the cache storage output, as shown in Figure 66.6, to determine the hit/miss status of cache memory access. Thus the cache storage can be accessed in parallel with the TLB.
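The small calculation below checks this overlap condition for assumed parameters (4 KB pages, 32 KB cache): the cache index and line offset must fit within the untranslated page-index bits, which in this example requires at least 8-way associativity.

```c
#include <stdio.h>

/* Checks whether the set-index and offset bits fall entirely within the
 * untranslated page index, so the cache can be accessed in parallel
 * with the TLB.  The page and cache sizes are assumed for illustration. */
int main(void)
{
    unsigned page_size  = 4096;      /* 4 KB pages  */
    unsigned cache_size = 32768;     /* 32 KB cache */

    for (unsigned ways = 1; ways <= 16; ways *= 2) {
        unsigned index_span = cache_size / ways;   /* bytes covered by index + offset */
        printf("%2u-way: index spans %5u bytes -> overlap %s\n",
               ways, index_span,
               index_span <= page_size ? "possible" : "not possible");
    }
    return 0;
}
```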
Virtual caches have their pros and cons. Typically, with no TLB logic between the processor and the cache, access to the cache can be achieved at lower cost in virtual cache systems. This is particularly true in multiaccess-per-cycle cache systems, where a multiported TLB is needed for a physical cache design. However, the virtual cache alternative introduces virtual memory consistency problems. The same virtual address from two different processes means different physical memory locations. Solutions to this form of aliasing are to attach a process identifier to the virtual address or to flush the cache contents on context switches. Another potential alias problem is that different virtual addresses of the same process may be mapped to the same physical address. In general, there is no easy solution to this second form of aliasing; typical solutions involve reverse translation of physical addresses to virtual addresses.
Physical cache designs are not always limited by the delay of the TLB access followed by the cache access. In general, there are two solutions that allow large physical cache designs. The first solution, employed by companies with past commitments to a page size, is to increase the set associativity of the cache. This allows the lower index portion of the address to be used immediately by the cache, in parallel with virtual address translation. However, large set associativity is very difficult to implement in a cost-effective manner. The second solution, employed by companies without such past commitments, is to use a larger page size. The cache can then be accessed in parallel with the TLB, as in the first solution, but there are fewer address bits to translate through the TLB, potentially reducing the overall delay. With larger page sizes, virtual caches do not have an advantage over physical caches in terms of access time. With the advance of semiconductor fabrication processes, in particular the increasing number of metal levels, it has become increasingly inexpensive to build highly set-associative caches and multiported TLBs. As a result, physical caches have become the much more favored solution today.