
Instruction Level Parallelism

EPIC processors are equipped with instruction set architecture mechanisms designed to facilitate the compiler's effort to arrange for the parallel execution of many operations in each clock cycle. These mechanisms, referred to as instruction level parallelism (ILP) features, are new instruction set architecture concepts for improving microprocessor performance.

Predicated Execution

Branch instructions are recognized as a major impediment to exploiting ILP. Branches force the compiler and hardware to make frequent predictions of branch directions in an attempt to find sufficient parallelism. Branch prediction strategies reduce this problem by allowing the compiler and hardware to continue processing instructions along the predicted control path. However, misprediction of these branches can result in severe performance degradation through the introduction of wasted cycles into the instruction stream.

Predicated execution provides an effective means to eliminate branches from an instruction stream. Predicated execution refers to the conditional execution of an instruction based on the value of a boolean source operand, referred to as the predicate of the instruction. This architectural support allows the compiler to use an if-conversion algorithm to convert conditional branches into predicate defining instructions, and instructions along alternative paths of each branch into predicated instructions [9]. Predicated instructions are fetched regardless of their predicate operand value. Instructions whose predicate operands are true are executed normally. Conversely, instructions whose predicate operands are false are nullified, and thus are prevented from modifying the processor state. Predicated execution allows the compiler to trade instruction fetch efficiency for the capability to expose ILP to the hardware along multiple execution paths.

Predicated execution offers the opportunity to improve branch handling in microprocessors. Eliminating frequently mispredicted branches may lead to a substantial reduction in branch prediction misses. As a result, the performance penalties associated with mispredicting these branches are removed. Eliminating branches also reduces the need to handle multiple branches per cycle for wide-issue processors. Finally, predicated execution provides an efficient interface for the compiler to expose multiple execution paths to the hardware. Without compiler support, the cost of maintaining multiple execution paths in hardware can be prohibitive.

The essence of predicated execution is the ability to suppress the modification of the processor state based upon some execution condition. There must be a way to express this condition and a way to express when the condition should affect execution. Full predication cleanly supports this through a combination of instruction set and micro-architecture extensions. These extensions can be classified as support for suppression of execution and expression of condition. The result of the condition which determines if an instruction should modify state is stored in a set of 1-bit registers. These registers are collectively referred to as the predicate register file. The values in the predicate register file are associated with each instruction in the extended instruction set through the use of an additional source operand. This operand specifies which predicate register will determine whether the operation should modify processor state. If the value in the specified predicate register is 1, or true, the instruction is executed normally; if the value is 0, or false, the instruction is suppressed.

Predicate register values may be set using predicate define instructions, as described in the HPL Playdoh architecture [10]. There is a predicate define instruction for each comparison opcode in the original instruction set. The major difference with conventional comparison instructions is that these predicate defines have up to two destination registers and that their destination registers are predicate registers. The instruction format of a predicate define is shown below.

pred_<cmp> Pout1<type>, Pout2<type>, src1, src2 (Pin)

This instruction assigns values to Pout1 and Pout2 according to a comparison of src1 and src2 specified by <cmp>. The comparison <cmp> can be: equal (eq), not equal (ne), greater than (gt), etc. A predicate <type> is specified for each destination predicate. Predicate defining instructions are also predicated, as specified by Pin.

The predicate <type> determines the value written to the destination predicate register given the result of the comparison and the value of the input predicate, Pin. For each combination of comparison result and Pin, one of three actions may be performed on the destination predicate: write 1, write 0, or leave it unchanged. There are six predicate types which are particularly useful: the unconditional (U), OR, and AND type predicates and their complements. Table 66.1 contains the truth table for these predicate definition types. Unconditional destination predicate registers are always defined, regardless of the value of Pin and the result of the comparison. If the value of Pin is 1, the result of the comparison (or its complement, for the complemented unconditional type) is placed in the predicate register. Otherwise, a 0 is written to the predicate register. Unconditional predicates are utilized for blocks which are executed along only one path in the if-converted code region.

The OR type predicates are useful when execution of a block can be enabled along multiple paths, such as the logical AND (&&) and OR (||) constructs in C. An OR type destination predicate register is set if Pin is 1 and the result of the comparison is 1 (0 for the OR complement type); otherwise the destination predicate register is unchanged. Note that OR type predicates must be explicitly initialized to 0 before they are defined and used. After initialization, however, multiple OR type predicate defines may be issued simultaneously and in any order on the same predicate register. This is possible because an OR type define either writes a 1 or leaves the register unchanged, which allows implementation as a wired logical OR condition. AND type predicates are analogous to the OR type. An AND type destination predicate register is cleared if Pin is 1 and the result of the comparison is 0 (1 for the AND complement type); otherwise the destination predicate register is unchanged.

Figure 66.11 contains a simple example illustrating the concept of predicated execution. Figure 66.11(a) shows a common programming if-then-else construct, and Figure 66.11(b) illustrates its control flow representation. Using if-conversion, the code in Figure 66.11(b) is then transformed into the code shown in Figure 66.11(c). The original conditional branch is translated into a pred_eq instruction. Predicate register p1 is set to indicate that the condition (A == B) is true, and p2 is set if the condition is false. The "then" part of the if statement is predicated on p1 and the "else" part is predicated on p2. The pred_eq simply decides whether the addition or the subtraction instruction is performed, and ensures that exactly one of the two parts executes.

[Table 66.1: Truth table for the predicate definition types.]
[Figure 66.11: (a) if-then-else source code; (b) its control flow representation; (c) the if-converted code.]

Speculative Execution

The amount of ILP available within basic blocks, defined as consecutive instruction sequences without branching in or out, is extremely limited in most programs. As such, processors must optimize and schedule instructions across basic block boundaries to achieve higher performance. In addition, future processors must contend with both long latency load operations and long latency cache misses. When load data is needed by subsequent dependent instructions, the processor execution must wait until the cache access is complete.

In these situations, out-of-order processors dynamically perform branch prediction and reorder the instruction stream to execute nondependent instructions. As a result, they are able to exploit parallelism between instructions before and after a correctly predicted branch instruction. However, this approach requires complex circuitry at the cost of chip die area as well as additional power consumption. Similar performance gains can be achieved using static compile-time speculation methods without complex, power-hungry out-of-order logic. Speculative execution, a technique for executing an instruction before knowing that its execution is required, is an important technique for exploiting ILP in programs. Speculative execution is best known for its use in hiding memory latency.

A compiler can utilize speculative code motion to achieve higher performance in several ways. First, in regions of code where insufficient ILP exists to fully utilize the processor resources, useful instructions from other regions may be executed. Second, instructions at the beginning of long dependence chains may be executed early to reduce the computation’s critical path. Finally, long latency instructions may be initiated early to overlap their execution with other useful operations. Figure 66.12 illustrates a simple example of code before and after a speculative compile-time transformation is performed to execute a load instruction above a conditional branch.

Figure 66.12(a) shows how the branch instruction and its implied control flow define a control dependence that restricts the load operation from being scheduled earlier in the code. Cache miss latencies would halt the processor unless out-of-order execution mechanisms were used. However, with speculation support, the code in Figure 66.12(b) can be used to hide the latency of the load operation.

The solution requires the load to be speculative, or nonexcepting. A speculative load will not signal exceptions such as address alignment errors or address space access violations; essentially, the load remains silent for these exceptions. The additional check instruction in Figure 66.12(b) enables these signals to be detected when execution does reach the original location of the load. When the other path of the branch's execution is taken, such silent signals are meaningless and can be ignored. Using this mechanism, the load can be placed above all existing control dependences, providing the compiler with the ability to hide load latency. Details of compiler speculation can be found in Ref. [11].

[Figure 66.12: Code (a) before and (b) after speculative hoisting of a load above a conditional branch.]
