Performance-Enhancing Hardware Techniques
Pipelining
In the 1970s, only supercomputers and mainframe computers were pipelined. Today, virtually all microprocessors are pipelined. In fact, pipelining has been a major reason why microprocessors today outperform supercomputer CPUs built less than 10 years ago. Pipelining is a technique to coordinate parallel processing of operations [2]. This technique has been used in assembly lines of major industries for more than a century. The idea is to have a line of workers specializing in different pieces of work required to finish a product. A conveying belt carries products through the line of workers. Each worker performs a small piece of work on each product. Each product is finished after it is processed by all the workers in the assembly line.
The obvious advantage of pipelining is that it allows one worker to start on a new product immediately after finishing the current one. The same methodology is applied to instruction processing in microprocessors. Figure 66.8(a) shows an example five-stage pipeline dividing instruction execution into fetch (F), decode (D), execute (E), memory (M), and write-back (W) operations, each requiring its own stage-specific logic. Between stages is a stage register (SR), or pipeline register, which holds the information needed to control the instruction as it moves down the pipeline. A basic principle of pipelining is that the work performed by each stage must take about the same amount of time; otherwise, efficiency is significantly reduced because one stage becomes the bottleneck of the entire pipeline. That is, the time required by the slowest pipeline stage determines the overall clock frequency of the processor. Owing to this constraint and the relatively slow speed of memory, some of the principal five stages are often divided into smaller stages. For instance, the memory stage may be divided into three stages, allowing memory accesses to be pipelined and the processor clock period to be a fraction of the memory access latency.
The time required to finish N instructions in a pipeline with K stages can be calculated. Assume a time of T for the overall completion of one instruction, and an equal T/K processing delay at each stage. With a pipeline scheme, the first instruction completes after T, and a new instruction exits the pipeline every stage delay T/K thereafter. Therefore the delays of executing N instructions without (1) and with (2) pipelining, respectively, are

    T_serial = N × T                        (1)
    T_pipelined = T + (N − 1) × T/K         (2)
There is an initial delay in the pipelined execution model before each stage has operations to execute. This initial delay is usually called the pipeline startup delay and is equal to the total execution time of one instruction (T). The speedup of a pipelined machine relative to a nonpipelined machine is calculated as

    Speedup = (N × T) / (T + (N − 1) × T/K) = (N × K) / (K + N − 1)
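The timing model just described (N serial completions of time T each, versus a startup delay of T followed by one completion per stage delay T/K) can be sanity-checked numerically. The sketch below is a direct transcription of those two expressions, not processor-specific code:

```python
def pipeline_speedup(n, k, t=1.0):
    """Ideal speedup of a k-stage pipeline over serial execution.

    Serial time:    n * t
    Pipelined time: t + (n - 1) * (t / k)  (startup delay, then one
                    instruction completes per stage delay t/k)
    """
    serial = n * t
    pipelined = t + (n - 1) * (t / k)
    return serial / pipelined

print(round(pipeline_speedup(10, 5), 2))      # 3.57: modest N, well below K
print(round(pipeline_speedup(100000, 5), 2))  # 5.0: as N >> K, speedup -> K
```

As the second call shows, the speedup saturates at K for long instruction streams, which is the intuition developed in the next paragraph.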
When N is much larger than the number of pipeline stages K, the ideal speedup approaches K. This is an intuitive result, since there are K parts of the machine working in parallel, allowing execution to proceed about K times faster under ideal conditions.

The overlap of sequential instructions in a processor pipeline is shown in Figure 66.8(b). The processing of each instruction is shown as a five-stage process that proceeds from left to right; each stage is marked by the initial of its name: F(etch), D(ecode), E(xecute), M(emory), W(rite-back). A new instruction is initiated every clock cycle, and the pipeline becomes full after the startup delay of K = 5 cycles. Although the pipeline executes operations in every stage concurrently, two important mechanisms are needed to ensure correct operation between dependent instructions in the presence of data hazards. Data hazards occur when an instruction in the pipeline generates a result that is needed by a later instruction already started in the pipeline. In the pipeline of Figure 66.8(a), register operands are read during the decode stage. However, an instruction in the execute or memory stage may hold the current correct value of a register yet be unable to update the register file until the later write-back stage. Forwarding (or register bypassing) is the action of routing the correct operand value to an executing instruction before any pending instruction's register file update takes place. Interlocking is the action of stalling an operation in the pipeline when conditions cause necessary register operand results to be delayed; the early stages of the machine must be stalled so that correct results are used and the machine does not proceed with incorrect source operand values. The primary causes of delay in pipeline execution are instruction fetch delay and memory latency.
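The staircase overlap of Figure 66.8(b) can be reproduced with a few lines of code. This is purely illustrative output formatting, assuming one instruction enters the pipeline per cycle and no stalls occur:

```python
STAGES = "FDEMW"  # fetch, decode, execute, memory, write-back

def pipeline_diagram(n):
    """One row per instruction; column c shows the stage occupied in cycle c."""
    return ["." * i + STAGES + "." * (n - 1 - i) for i in range(n)]

for row in pipeline_diagram(4):
    print(row)
# FDEMW...
# .FDEMW..
# ..FDEMW.
# ...FDEMW
```

Reading down any column shows every stage busy with a different instruction once the pipeline is full.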
Branch Handling
Branch instructions pose serious problems for pipelined processors by causing hardware to fetch and execute incorrect instructions until the branch instructions are completed. Executing incorrect instructions can result in severe performance degradation through the introduction of wasted cycles into the execution process.
There are several methods for dealing with pipeline stalls caused by branch instructions. The simplest scheme handles branches by statically marking each branch as either taken or not-taken: the compiler marks each branch by using a special branch opcode for each case. This designation allows the pipeline to fetch subsequent instructions according to the compiler's choice. However, the fetched instructions may need to be discarded and the instruction fetch restarted when the branch outcome is inconsistent with the compiler's designation.
Delayed branching is a scheme that treats the sequential instructions immediately following a branch as delay slots. The delay-slot instructions are executed whether or not the branch is taken. Delayed branching is limited by the compiler's ability, given typical program characteristics, to find enough instructions that execute independently of the branch direction to fill the delay slots. As an improvement, nullifying branches include a predicted direction for the branch; when the prediction is incorrect, the delay-slot instructions are nullified.
A more modern approach to reducing branch penalties uses hardware to dynamically predict the outcome of a branch. Branch prediction strategies reduce overall branch penalties by allowing the hardware to continue processing instructions along the predicted control path, thus eliminating wasted cycles. Efficient execution is maintained as long as branches are correctly predicted; however, a performance penalty is incurred whenever a branch is mispredicted. A branch target buffer is a cache structure that is accessed in parallel with the instruction fetch. It records the past history of branch instructions so that a prediction can be made when a branch is fetched again. This prediction method adapts to the run-time behavior of the program, yielding high prediction accuracy. The target address of each branch is also saved in the buffer so that the target instruction can be fetched immediately when a branch is predicted taken.
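A branch target buffer of the kind just described can be sketched as a small lookup table indexed by the branch address. The table size, the 4-byte instruction width, and the single taken bit below are illustrative assumptions; real designs add associativity and richer direction history:

```python
class BranchTargetBuffer:
    """Toy branch target buffer: maps a branch's address to its predicted
    direction and target so the next fetch address is known immediately."""

    def __init__(self, entries=1024):
        self.entries = entries
        self.table = {}   # index -> (pc, target, taken)

    def next_fetch(self, pc):
        """Looked up in parallel with the instruction fetch of `pc`."""
        e = self.table.get(pc % self.entries)
        if e is not None and e[0] == pc:          # known branch at this pc
            _, target, taken = e
            return target if taken else pc + 4    # predicted next address
        return pc + 4                             # default: fall through

    def update(self, pc, target, taken):
        """Record the resolved outcome of the branch at `pc`."""
        self.table[pc % self.entries] = (pc, target, taken)

btb = BranchTargetBuffer()
print(hex(btb.next_fetch(0x100)))     # 0x104: not yet seen, fall through
btb.update(0x100, 0x200, taken=True)
print(hex(btb.next_fetch(0x100)))     # 0x200: predicted-taken target
```

The key property is that the predicted fetch address is available in the same cycle as the fetch itself, before the branch is even decoded.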
Several branch prediction methodologies have been developed [3]. Figure 66.9 illustrates how several general branch prediction schemes relate to the processor pipeline. The table in Figure 66.9(a) retains history information for each branch, including previous branch directions, for making predictions on future branch directions. The simplest history is last-taken, which uses one bit to recall whether the branch was taken or not-taken during its most recent execution. A more effective branch predictor, shown in Figure 66.9(b), uses a 2-bit saturating history counter to determine the future branch outcome. Two bits rather than one allow each branch to be tagged as strongly or weakly taken or not-taken. Every correct prediction reinforces the prediction, while an incorrect prediction weakens it; it takes two consecutive mispredictions to reverse the predicted direction.
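The 2-bit saturating counter of Figure 66.9(b) is small enough to state exactly. In the sketch below, states 0 and 1 predict not-taken and states 2 and 3 predict taken; the initial weakly-taken state is an arbitrary assumption:

```python
class TwoBitCounter:
    def __init__(self, state=2):      # 0,1 = (strongly, weakly) not-taken
        self.state = state            # 2,3 = (weakly, strongly) taken

    def predict(self):
        return self.state >= 2        # True means "predict taken"

    def update(self, taken):
        # Saturating update: a correct outcome reinforces the counter, a
        # wrong one weakens it; two consecutive mispredictions are needed
        # to reverse the predicted direction.
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

p = TwoBitCounter()
correct = 0
for taken in [True, True, False, True]:   # a loop branch with one exit
    correct += p.predict() == taken
    p.update(taken)
print(correct)   # 3: the single not-taken exit does not flip the prediction
```

This hysteresis is exactly why 2-bit counters outperform last-taken on loop branches: the one exit per loop costs one misprediction instead of two.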
More recently, two-level adaptive branch prediction schemes have been built that use two levels of branch history to make predictions [6]. An example of such a predictor is shown in Figure 66.9(c). The first level is the outcome history of the most recently encountered branches. The second level records the branch behavior observed for each recent pattern of branch histories, and is commonly maintained as a table of 2-bit saturating counters. There are alternative ways of constructing both levels of an adaptive branch prediction scheme: each level can contain information based on individual branches, on groups of branches (set-based), or on all branches (global). Individual information contains the branch history of each branch instruction. Set-based information groups branches according to their instruction addresses, thereby forming sets of branch history. Global information uses a single global history containing all branch outcomes. The second level, containing branch behaviors, can likewise be constructed using any of the three types. In general, the first-level branch history pattern is used as an index into the second-level branch history table.
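In the global/global variant, a single history register of the last few branch outcomes (the first level) indexes a table of 2-bit saturating counters (the second level). The sketch below assumes an illustrative 4-bit history length:

```python
class TwoLevelPredictor:
    def __init__(self, history_bits=4):
        self.hbits = history_bits
        self.history = 0                            # 1st level: global outcome history
        self.counters = [1] * (1 << history_bits)   # 2nd level: 2-bit counters

    def predict(self):
        return self.counters[self.history] >= 2     # True means "predict taken"

    def update(self, taken):
        c = self.counters[self.history]
        self.counters[self.history] = min(3, c + 1) if taken else max(0, c - 1)
        # Shift the new outcome into the history register.
        self.history = ((self.history << 1) | int(taken)) & ((1 << self.hbits) - 1)

# A strictly alternating branch defeats a single 2-bit counter, but here
# each history pattern trains its own counter, so the pattern is learned
# after a short warm-up.
p = TwoLevelPredictor()
hits = []
for i in range(40):
    taken = (i % 2 == 0)          # T, N, T, N, ...
    hits.append(p.predict() == taken)
    p.update(taken)
print(all(hits[-20:]))   # True: perfect prediction once warmed up
```

The point of the example is the indexing structure: the history pattern selects which counter makes the prediction, so different recent behaviors of the same branch are predicted independently.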
Dynamic Instruction Execution
A major limitation of basic pipelining techniques is their use of in-order instruction execution. When an instruction in the pipeline stalls, no further instructions are allowed to proceed, in order to ensure correct execution of the in-flight instructions. This problem is especially serious for multiple-issue machines, where each stall cycle potentially costs the work of multiple instructions. However, in many cases an instruction could execute properly if no data dependence exists between the stalled instruction and the instruction waiting to execute. Dynamic instruction execution, also known as out-of-order execution, is an approach that uses hardware to rearrange instruction execution to reduce the effect of stalls: the hardware detects dependences in the in-order instruction stream and reorders the instruction sequence in the presence of detected dependences and stalls.
Today, most modern superscalar microprocessors use dynamic execution to increase the number of instructions executed per cycle. Such microprocessors follow basically the same concept: all instructions pass through an issue stage in order, are executed out of order, and are retired in order. Several functional elements of this common sequence have developed into fundamental computer architecture concepts. The first is scoreboarding, a technique for allowing instructions to execute out of order when resources are available and no data dependences exist. Scoreboarding originates from the issue logic of the CDC 6600, which was named the scoreboard. The overall goal of scoreboarding is to execute every instruction as early as possible.
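A heavily simplified model of the scoreboard's decision rule ("execute every instruction as early as possible") is sketched below. It tracks only read-after-write dependences and assumes unlimited functional units; the real CDC 6600 scoreboard also stalls on WAR/WAW hazards and structural conflicts:

```python
def execution_waves(instructions):
    """Group instructions into waves that could start in parallel, honoring
    only read-after-write dependences (a deliberate simplification).
    instructions: list of (dest_register, [source_registers])."""
    n = len(instructions)
    done = [False] * n
    waves = []
    while not all(done):
        wave = []
        for i, (dest, sources) in enumerate(instructions):
            if done[i]:
                continue
            # RAW hazard: an earlier, unfinished instruction writes a source of i
            raw = any(not done[j] and instructions[j][0] in sources
                      for j in range(i))
            if not raw:
                wave.append(i)
        for i in wave:            # the whole wave completes together
            done[i] = True
        waves.append(wave)
    return waves

prog = [("r1", ["r2"]),    # 0: r1 <- f(r2)
        ("r3", ["r1"]),    # 1: needs result of 0
        ("r4", ["r5"]),    # 2: independent, runs out of order with 1
        ("r6", ["r3"])]    # 3: needs result of 1
print(execution_waves(prog))   # [[0, 2], [1], [3]]
```

Instruction 2 starts alongside instruction 0 even though it appears later in program order, which is precisely the reordering a scoreboard permits.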
A more aggressive approach to dynamic execution is Tomasulo's algorithm, which was employed in the IBM 360/91 processor. Although there are many variations of this scheme, the key concept of avoiding write-after-read (WAR) and write-after-write (WAW) dependences during dynamic execution is attributed to Tomasulo. In Tomasulo's scheme, the functionality of the scoreboard is provided by reservation stations. Reservation stations buffer the operands of instructions waiting to issue, capturing each operand as soon as it becomes available. The idea is to issue new instructions immediately once all source operands become available, instead of always accessing operands through the register file; waiting instructions designate the reservation station entry that will provide their input operands. This removes WAW dependences caused by successive writes to the same register, by forcing instructions to be related by true dependences instead of by register specifiers. In general, the renaming of register specifiers for pending operands to reservation station entries is called register renaming. Overall, Tomasulo's scheme combines scoreboarding and register renaming. Tomasulo [7] provides complete details of his scheme in the context of a floating-point unit design; Patt et al. [8] first described the extensions required to implement complete modern CPUs with Tomasulo's algorithm.
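The renaming idea can be illustrated with a toy register status table in the spirit of Tomasulo's scheme: each register maps either to a value or to the tag of the reservation-station entry that will produce it. The entry names ("RS1", "RS2") and the table shape are illustrative assumptions, not the IBM 360/91 design:

```python
class RenameTable:
    def __init__(self, values):
        self.entry = dict(values)     # reg -> ("value", v) or ("tag", rs_name)

    def issue(self, rs_tag, dest, sources):
        """Issue an instruction to reservation station `rs_tag`."""
        # Each operand is either a ready value or the tag of its pending producer.
        operands = [self.entry[s] for s in sources]
        # Later readers of `dest` now wait on this entry, not on the register;
        # successive writes to the same register therefore no longer conflict.
        self.entry[dest] = ("tag", rs_tag)
        return operands

    def broadcast(self, rs_tag, value):
        """A completing entry broadcasts its result; waiters capture it."""
        for reg, e in self.entry.items():
            if e == ("tag", rs_tag):
                self.entry[reg] = ("value", value)

rt = RenameTable({"r1": ("value", 5), "r2": ("value", 7)})
print(rt.issue("RS1", "r3", ["r1", "r2"]))   # [('value', 5), ('value', 7)]
print(rt.issue("RS2", "r4", ["r3"]))         # [('tag', 'RS1')]: waits on RS1
rt.broadcast("RS1", 12)
print(rt.entry["r3"])                        # ('value', 12)
```

The second `issue` shows the essence of renaming: the dependent instruction records *which producer* it waits for, not which architectural register, so register reuse cannot create false conflicts.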