Low Power Microarchitecture Techniques

Case Study 1: EC-Based Microarchitectures
As mentioned in the previous section, this technique reduces the power consumption of a superscalar, out-of-order microprocessor through dynamic work reuse. To achieve this, when an instruction is reexecuted, the goal is to reuse as much as possible from the computations performed during previous executions. Obviously, the work performed in the fetch and decode stages is identical each time a specific trace is executed. Using a special register file structure, the work performed by the rename and issue stages can also be reused.
To reuse the work and reduce the branch misprediction path, an EC is inserted deep in the pipeline, after the issue stage. To avoid the performance penalty incurred by an extra pipeline stage between the issue and the execution stages, instructions are issued in parallel to both the execution stage and the EC. The conceptual microarchitecture is illustrated in Figure 19.6.
Normally, instructions are fetched from the I-Cache through the fetch stage and then decoded. In the next stage, physical registers are assigned for each logical register, avoiding potential false dependencies. The resulting instructions are placed in the issue window for dependency checking. A number of independent instructions are issued to the execution stage and, in parallel, added to a fill buffer to create program traces. When enough instructions are placed in this fill buffer, the entire program sequence is stored in the EC in the issue order, for later potential reuse.
In this setting, the branch mispredict path can be significantly shortened by feeding the execution units directly from the EC whenever possible. Initially, when the EC is empty, instructions are launched from the Issue Window, while a trace is built in parallel in the EC (trace-segment build phase). Upon a mispredict (or a trace completion condition), a search is performed to identify a possible next trace starting at that point, and should a hit occur, the instructions continue to be executed from the EC, on the alternative execution path. When operating on this alternative execution path, the processor behaves essentially like a VLIW core with instructions being fetched from the execution cache and sent directly to the execution engine. If a miss is encountered on a trace search, the pipeline front-end must be launched again and a new trace is built.
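As a rough illustration, the decision between the two execution paths can be sketched as follows. This is a hypothetical, much-simplified Python model; the class and function names (`SimpleEC`, `next_source`) are illustrative, not from the original design:

```python
# Simplified sketch: on a mispredict (or trace end), the processor searches
# the EC for a trace starting at the redirect address. A hit feeds the
# execution units directly from the EC (VLIW-like mode); a miss restarts
# the pipeline front end so a new trace can be built.

class SimpleEC:
    def __init__(self):
        self.traces = {}          # trace start PC -> issue-ordered instructions

    def lookup(self, pc):
        return self.traces.get(pc)

    def store(self, pc, trace):
        self.traces[pc] = trace

def next_source(ec, redirect_pc):
    """Return which path feeds the execution units after a redirect."""
    trace = ec.lookup(redirect_pc)
    if trace is not None:
        return ("EC", trace)       # alternative execution path: feed from EC
    return ("FRONT_END", None)     # miss: front end relaunched, trace rebuilt

ec = SimpleEC()
ec.store(0x400, ["i0", "i1", "i2"])
print(next_source(ec, 0x400)[0])   # EC
print(next_source(ec, 0x500)[0])   # FRONT_END
```

Note that in the real design the look-up and the front-end restart overlap with execution; this sketch only captures the hit/miss decision.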
EC Internal Structure
Similar to conventional trace-cache implementations, this microarchitecture divides the program into traces of instructions that are stored in an order different from the one given by their original addresses.
This design implicitly encodes the reordering work done in the fetch and issue stages in the actual order in which instructions are stored. However, the cache chosen in the proposed architecture is structurally different from the trace cache typically used for increasing fetch bandwidth.
When stored in issue order, instructions lose their original, logical order and they can be retrieved only on a sequential basis. However, in order to allow for traces to be reused, the start address of each trace needs to correspond to a physical address in the memory space. Instructions from two consecutive traces cannot be interleaved, so at each change of trace the processor must restart trace execution in order. Specifically, at each trace end, a trace look-up step must be performed. While most of the time the performance penalty associated with this look-up can be hidden (the look-up being started in advance), there are certain conditions when this is not possible. Together with the need for an in-order start of each trace, this leads to some performance penalty associated with each trace change.
To minimize the overall performance impact of this design, the created traces must be as long as possible. While most trace-cache designs proposed in the literature limit traces to at most three basic blocks, it is desirable to include as many instructions as possible in a trace as long as no mispredict is encountered. However, as traces get longer, the number of traces that can be accommodated in a reasonably sized cache decreases. This lowers the hit rate at trace look-up and consequently increases the utilization of the front-end pipeline. While this does not significantly impact overall performance, it increases power consumption, since both the front end of the pipeline and the EC (in trace-build mode) are utilized simultaneously. The block architecture of the EC is presented in Figure 19.7.
The EC structure consists of a tag array (TA) and a corresponding data array (DA). The TA is an associative cache, addressed using the translated program counter. It is used for trace look-up and it should be as fast as possible to reduce the performance overhead associated with searching for a new trace. The SET_ID value obtained from the TA points to the set in the DA that contains the trace start. The DA is a multiway set associative cache composed of multiple memory banks. A comparison with the TRACE_ID is performed for each block in the set to identify the correct starting block for the next trace. The next chunk of instructions is located in one of the blocks of the following set, and so on (see Figure 19.7). A special end-of-trace marker identifies the end of the trace.
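The two-level look-up described above can be sketched as follows. This is a behavioral model under stated assumptions: the dictionary-based TA/DA layout, `NUM_SETS`, and the block field names are illustrative, not the actual hardware organization:

```python
# Sketch of the TA/DA look-up: the tag array (TA) maps a program counter to
# a SET_ID; each DA set is then searched for the block carrying the matching
# TRACE_ID, and subsequent chunks of the trace live in consecutive sets until
# an end-of-trace marker is reached.

NUM_SETS = 4   # illustrative; real DA geometry is not specified here

def read_trace(ta, da, pc, trace_id):
    set_id = ta.get(pc)                      # TA look-up on the translated PC
    if set_id is None:
        return None                          # miss: front end must rebuild
    insts = []
    while True:
        # compare TRACE_ID against every block in the set
        block = next((b for b in da[set_id] if b["trace_id"] == trace_id), None)
        if block is None:
            break
        insts.extend(block["insts"])
        if block["end_of_trace"]:            # special end-of-trace marker
            break
        set_id = (set_id + 1) % NUM_SETS     # next chunk: the following set
    return insts

ta = {0x100: 1}                              # PC -> SET_ID
da = {s: [] for s in range(NUM_SETS)}        # SET_ID -> list of blocks
da[1].append({"trace_id": 7, "insts": ["a", "b"], "end_of_trace": False})
da[2].append({"trace_id": 7, "insts": ["c"], "end_of_trace": True})
print(read_trace(ta, da, 0x100, 7))          # ['a', 'b', 'c']
```

Because the next set is always known in advance, only the first access needs a full TA search; this is exactly the property the bank-gating optimization below exploits.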
By knowing beforehand which set will be accessed next, a new look-up for each subsequent access can be avoided. Furthermore, knowing the next set allows the DA to be implemented with multiple memory banks. While one of the banks is used, the others can be turned off, resulting in further energy savings. Thus, the entire array is used only when accessing the first instructions in a trace; on all subsequent accesses, the energy consumed by the line decoder and by the unused banks can be saved. Depending on the application, the ratio of single-bank accesses to the total number of accesses can vary between 1:2 (e.g., gcc) and 3:4 (e.g., for floating-point benchmarks).
Inside each block, an arbitrary number of issue units is stored (Figure 19.8). An issue unit consists of independent instructions that can be issued in parallel to the functional units. Since issue units are recorded during the trace-building phase (when the front end of the pipeline is used) and then reused in all subsequent executions of the trace, the processor will take the same optimizing decisions each time it executes the code. All instructions coming from the issue window are first assembled into traces using the fill buffer and then recorded in the EC. The fill buffer can accommodate two DA blocks, and when enough instructions are available to fill a block, they are written to the EC. When reading from the EC, one issue unit is issued at a time using a similar mechanism. If enough space is available in the fill buffer, a block is read from the DA and added to the buffer.
Each block contains several issue units, and thus it needs more than one clock cycle to be sent to the execution core. This organization allows for hiding a large part of the latency incurred by accessing the internal cache because an access can be started before actually being forced to stall the pipeline. At the same time, using longer cache lines increases the power required for each access; however, it also helps in reducing the total number of accesses.
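The double-buffered fill behavior described above can be sketched as follows. The `FillBuffer` class is an assumption-laden model, not the actual hardware: it holds at most two DA blocks, hands out one issue unit per call, and frees a block's slot only once the block is fully drained:

```python
from collections import deque

# Sketch of the fill buffer on the read path: a DA block (several issue
# units) is fetched ahead of time, and one issue unit is dispatched per
# cycle, so the next block fetch can start before the buffer runs dry.

class FillBuffer:
    CAPACITY_BLOCKS = 2               # the fill buffer holds two DA blocks

    def __init__(self):
        self.blocks = deque()         # each entry: a queue of issue units

    def can_refill(self):
        return len(self.blocks) < self.CAPACITY_BLOCKS

    def refill(self, block):          # one DA block = a list of issue units
        assert self.can_refill()
        self.blocks.append(deque(block))

    def issue(self):                  # dispatch one issue unit per cycle
        if not self.blocks:
            return None               # empty buffer: the pipeline would stall
        unit = self.blocks[0].popleft()
        if not self.blocks[0]:
            self.blocks.popleft()     # block fully drained, its slot frees up
        return unit

fb = FillBuffer()
fb.refill([["add", "mul"], ["ld"]])   # block containing two issue units
fb.refill([["sub"]])                  # second block, one issue unit
print(fb.issue())                     # ['add', 'mul']
print(fb.can_refill())                # False (first block not yet drained)
```

The same structure is used in reverse in trace-build mode: issue units accumulate until a full block's worth is available, and the block is then written to the DA.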
A least recently used (LRU) policy is used for freeing up blocks in each DA set when a new trace-building phase is initiated. To terminate the creation of a trace, the trace-building algorithm takes into account several criteria, such as trace length, occurring mispredicts, jumps, and whether another existing trace starts at the current point.
When creating a new trace, instructions are added until the trace grows beyond a certain maximum length or when a branch mispredict occurs and the execution must resume from a different address. When fetching instructions from the EC, the execution is abandoned on trace end (detected when attempting to fetch more instructions from the DA) or on a mispredict (detected by the write back stage).
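The termination conditions above reduce to a simple predicate, sketched here under assumptions: the limit `MAX_TRACE_LEN` and the function name are illustrative, and the source does not specify a concrete maximum length:

```python
# Hedged sketch of the trace-termination test: building stops when the trace
# exceeds a maximum length, when a mispredict forces a redirect, or when an
# existing trace already starts at the current point (so execution can
# continue from the EC instead).

MAX_TRACE_LEN = 64   # illustrative limit, not taken from the source

def should_end_trace(trace_len, mispredict, ec_has_trace_at_pc):
    return (trace_len >= MAX_TRACE_LEN) or mispredict or ec_has_trace_at_pc

print(should_end_trace(10, False, False))   # False: keep building the trace
print(should_end_trace(64, False, False))   # True: length limit reached
print(should_end_trace(3, True, False))     # True: mispredict redirect
```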
A trace look-up is then performed and, if this search misses, the front-end of the pipeline is restarted. Should a hit occur, instructions are issued on the alternative execution path directly from the EC.
The Register File
Placing the above-described EC deep in the pipeline, after the issue stage, allows the work done by all the units in the front-end stages to be reused. This also implies that register renaming is not performed on instructions issued directly from the EC. Because they differ at each trace run, the values held by the register file cannot be stored in the EC and reused. However, the register renaming operation is only performed in trace-creation mode. This operating mode assumes that the physical-to-architected register mapping is the same at the beginning of each trace. Some architectural changes need to be made to the register pool and control unit to ensure that this condition can be satisfied. The logic structure needed for implementing each register is presented in Figure 19.9.
This structure employs a special pool of physical registers (organized as a circular buffer) for renaming every logical register of the ISA. Unlike in a typical register file, where an architected register can be renamed to any physical register, here an architected register can be renamed only to a physical register from its corresponding pool. Each subsequent “write”, however, goes to a different physical register of the pool, chosen by a deterministic algorithm. This approach solves the problem of potential false data dependencies and can be used for implementing register renaming. Because of its very predictable behavior, this mechanism is similar to what was proposed for implementing modulo scheduling [20].
When going through the rename stage, each instruction is allocated a physical register as its destination, other than the one holding the last known value for the corresponding architected register. Having different physical destinations, instructions can write their results as soon as they are available, setting the V (valid) bit. This bit is used to identify the register file locations that hold valid data. If all the source registers for an instruction have this bit set, the instruction can be issued to the execution unit. The bit is cleared when the physical register is allocated as a destination in the rename stage and set when the value is written in the write back stage. The actual value in the physical register (registers 0 to N − 1) is only accessed when the value of the architected register has to be read or written.
The speculative (S) bits mark values that have not yet been committed; they are cleared only after the corresponding instruction is retired. Whenever a rollback condition occurs, the index is reverted to the “last committed” value for the architected register. A physical register cannot be assigned as a destination for a new instruction (in the register renaming stage) if the associated S bit is set. In this case, there are not enough physical registers to perform renaming, and the rename stage stalls.
The N indices (values POS 0 – POS N-1) are initialized with consecutive values (0, 1, 2, . . . , N-1) and represent the logical order of the registers in the circular queue. IDX is a pointer in this queue, representing the most recent register used for writing. All accesses (to the value field or to the status bits V and S) are associative, comparing the renaming information against the POS tags. The physical registers with tag 0 (POS = 0) hold the actual value when a trace starts execution and the POS tags remain constant until the trace ends, representing the logical order (and the allocation order) of the circular queue.
Register Renaming
When the instruction reaches the renaming stage, physical registers must be assigned as its source and destination registers. The IDX value is the index of the physical register holding the last value written to that architected register. It is incremented (modulo N) for each “write” operation so successive writes to the same architected register will actually use different physical registers. For each “read”, the IDX value is read and assigned to the instruction as a physical register indicator.
The S bit is checked for the corresponding physical register and, if it is found set, the pipeline is stalled. Otherwise, S is set and the V bit is cleared to mark the value as not yet available. V will be set when the result is computed and written back to the register, while S will be cleared later, when the instruction is retired.
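The rename-stage behavior of a single pool can be modeled as follows. This is a deliberately simplified sketch: the pool size `N = 4` is illustrative, retirement and rollback are omitted, and the class is not part of the original design:

```python
# Simplified model of one per-architected-register circular pool: IDX points
# at the physical register holding the newest value; each "write" advances
# IDX modulo N, each "read" returns the current IDX. An S bit that is still
# set on the candidate destination stalls the rename stage.

N = 4   # pool size per architected register (illustrative)

class RegisterPool:
    def __init__(self):
        self.idx = 0                  # index of the most recent write
        self.S = [False] * N          # speculative: allocated, not retired
        self.V = [True] * N           # value valid (produced by write back)

    def rename_read(self):
        return self.idx               # source operand: last-written register

    def rename_write(self):
        dest = (self.idx + 1) % N     # successive writes use different regs
        if self.S[dest]:
            return None               # pool exhausted: rename stage stalls
        self.idx = dest
        self.S[dest] = True           # set at allocation, cleared at retire
        self.V[dest] = False          # result not yet available
        return dest

pool = RegisterPool()
print(pool.rename_write())   # 1: first write avoids register 0's old value
print(pool.rename_read())    # 1: a read sees the newest destination
print(pool.rename_write())   # 2: the next write advances around the pool
```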
Each trace generation starts with IDX = 0, meaning that the correct value for the register is stored in the location tagged POS = 0. If this condition is respected, all subsequent executions of a trace can proceed without renaming the registers again. The caveat is that a checkpoint must be performed when a trace execution ends: the POS values of the circular buffer must be recomputed so that each trace again starts with the latest value for the architected register in the location tagged POS = 0. This could be done by subtracting the IDX value from each POS, but that would require a separate adder for each physical register. However, since the physical order of the registers is not relevant and does not have to match the logical one, the same effect can be obtained by XOR-ing the IDX with each POS value. By doing so, all registers receive distinct tags ranging between 0 and N − 1, and the register holding the last value receives POS = 0. An example of register renaming is presented in Appendix A.
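The XOR checkpoint is easy to verify numerically. The sketch below assumes N is a power of two (which makes XOR by a constant a bijection on 0..N − 1; the source does not state this explicitly, but the property depends on it):

```python
# Checkpoint at trace end: instead of subtracting IDX from every POS tag
# (one adder per register), XOR IDX into every tag. The tags remain a
# permutation of 0..N-1, and the register that received the last write
# ends up with POS = 0, as required for the next trace run.

N = 8                                     # assumed power of two (illustrative)
idx = 5                                   # register holding the last write
pos = list(range(N))                      # POS tags before the checkpoint
new_pos = [p ^ idx for p in pos]          # per-register XOR, no adders needed

print(sorted(new_pos) == list(range(N)))  # True: still a permutation
print(new_pos[idx])                       # 0: last-written register gets tag 0
```

Subtraction modulo N would preserve the *cyclic* order of the tags, while XOR generally does not; as the text notes, this is acceptable because only the logical tags matter, not the physical placement.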
As can be seen, all renaming information is already present. Since execution starts with all valid values for the architected registers in physical register zero, and the first “write” to each pool uses register one as its destination, the old values are not destroyed. Of course, when reexecuting the trace, physical register zero in each pool may differ from that of the previous run. The associative scheme ensures that all traces see the same register configuration each time they are executed. Another valid option here is to physically copy the values toward the origin of each register queue when a trace ends. However, the associative approach avoids copying the values and the potential latency associated with such an operation.