Low Power Microarchitecture Techniques: Case Study 2: GALS Microarchitectures
Case Study 2: GALS Microarchitectures
In this section we will start with a fairly typical out-of-order, superscalar architecture and analyze the impact of various microarchitecture design decisions on the power–performance trade-offs available in a multiple clock processor. To this end, let us assume a 16-stage pipeline that implements a four-way superscalar, out-of-order processor.
The underlying microarchitecture organization is shown in Figure 19.10. Groups of up to four aligned instructions are brought from the Level 1 instruction cache in the fetch stages at the current PC address, while the next PC is predicted using a G-share branch predictor [21]. The instructions are then decoded in the next three pipeline stages (named here decode) while registers are renamed in the rename stages. After the dispatch stages, instructions are steered according to their type, toward the integer, floating point, or memory clusters of the pipeline. The ordering information that needs to be preserved for in-order retirement is also added here. In register read, the read operation completes and the source operand values are sent to the execution core together with the instruction opcode.
Instructions are placed in a distributed Issue Buffer (similar to the one used by Alpha 21264) and reordered according to their data dependencies. Independent instructions are sent in parallel to the out-of-order
execution core. The execution can take one or more clock cycles (depending on the type of functional unit that executes the instruction) and the results are written back to the register file in the write back stages. Finally, the instructions are reordered for in-order retirement, according to the tags received during dispatch.
Of extreme importance for our design exploration is the choice of various design knobs that impact the overall power–performance trade-offs in GALS processors. Since the primary focus is on the microarchitecture level, we chose to omit several lower-level issues in this study:
• Local clock generation. Each clock domain in a GALS system needs its own local clock generator; ring oscillators have been proposed as a viable clock generation scheme [22,23].
• Failure modeling. A system with multiple clock domains is prone to synchronization failures; we do not attempt to model these since their probabilities are rather small for the communication mechanisms considered (but nonzero) [23,24] and this type of microarchitecture does not target mission-critical systems.
Instead, we are focusing on the following microarchitecture design knobs:
• The choice of the communication scheme among frequency islands.
• The granularity chosen for the frequency islands.
• The dynamic control strategy for adjusting voltage/speed of clock domains so as to achieve better power efficiency.
The Choice of Clock Domain Granularity
To assess the impact of introducing a mixed-clock interface on the overall performance of the baseline pipeline, let us assume that the pipeline is broken into several synchronous blocks. The natural approach—minimize the communication over the synchronous blocks’ boundaries—does not necessarily
work here. An instruction must pass through all the pipeline stages to be completed. Thus, other criteria for determining these synchronous islands must be found.
One possible criterion is to minimize clock skew, thus allowing for faster local clocks. In Ref. [25], the authors propose a model for the skew of the on-chip clock signal. By applying the model to an Alpha 21264 microprocessor, one can evaluate the contribution of different microarchitectural and physical elements in increasing the skew and thus in limiting the clock frequency.
As can be seen in Figure 19.11, the main components affecting clock skew are system parameter variations (supply voltage Vdd, load capacitance CL, and local temperature T), especially the variations in CL. Since the microarchitecture described in Figure 19.10 exhibits a large variation in the number of pipeline registers clocked, a possible placement of the asynchronous interfaces is shown dotted. Figure 19.12 shows the "speed-up coefficient" for each of the main structures, that is, the overall speedup achieved when the domain's local clock frequency is increased by 1%. Across this set of benchmarks, the most significant speedup is obtained by raising the clock speed in the fetch or memory domains, followed by the integer and FP partitions. Thus, these modules should be placed in separate clock domains if possible, since each of them could provide a significant performance increase if sped up individually.
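As a rough illustration of how such a coefficient can be obtained from simulation runs, consider the minimal sketch below. The function and the numeric values are ours, not taken from the study: the coefficient is simply the observed relative change in total execution time, normalized to the 1% change in the domain's clock frequency.

    #include <stdio.h>

    /* Speed-up coefficient for one clock domain: relative change in
     * overall execution time when only that domain's clock runs 1%
     * faster, normalized to the 1% frequency change.
     * t_base    : execution time of the baseline run
     * t_perturb : execution time with the domain clocked 1% faster  */
    static double speedup_coefficient(double t_base, double t_perturb)
    {
        double overall_speedup = t_base / t_perturb - 1.0;
        return overall_speedup / 0.01;    /* per 1% of domain speed */
    }

    int main(void)
    {
        /* Hypothetical numbers: a 1% faster fetch domain shortens the
         * run from 1.000 s to 0.994 s, giving a coefficient of ~0.6  */
        printf("fetch coefficient = %.2f\n",
               speedup_coefficient(1.000, 0.994));
        return 0;
    }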
To break up the execution core, we can use the partitioning scheme proposed in Refs. [26,27]. Starting from a processor with separate clusters for integer, floating-point, and memory execution units (much like the Alpha 21264 design), we can naturally separate these clusters into three synchronous modules. The drawback of this scheme is that it significantly increases the latency of forwarding a result across an asynchronous interface toward another functional unit. This effect is seen mainly in load-use sequences whose instructions execute in separate clock domains, imposing a significant penalty on overall performance for some programs.
To limit the latency of reading or writing data from the registers, the register read and the write back stages must be placed together, in the same synchronous partition as the register file. Following the same rationale, the rename and retire stages both need to access the rename table, so they must be placed in the same partition. Following these design choices, we can now split the pipeline into at least four clock regions. The first one is composed of the fetch stage, together with all the branch prediction and instruction cache logic.
The two decode stages can be included either in the first clocking region or in the second one, since all the instructions that pass from fetch to decode continue down the pipeline to rename.
To limit the load capacitance variations and also considering the bitwidth increase of the pipeline after Decode, we can introduce an asynchronous boundary here. The second clocking region will be organized around the renaming mechanism and it will also contain the reorder buffer and the retire logic. Given the variation in the register width for the rest of the pipeline, an asynchronous boundary can also be introduced after dispatch. The third clocking region must be organized around the register file, including the register read and write back stages. Finally, the out-of-order part of the pipeline (the Issue logic and the execution units) is split into separate clusters that amount to three different clock regions. The forwarding paths can thus be internal, toward a unit of the same type placed in the same clock region, or external, toward other clock regions.
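To make the resulting partitioning concrete, a minimal sketch of the domain assignment described above might look as follows. The enum and function names are ours; the grouping follows the text.

    /* Clock domains obtained from the partitioning discussed above.   */
    enum clock_domain {
        DOM_FRONT_END,   /* fetch, branch prediction, I-cache, decode  */
        DOM_RENAME,      /* rename, dispatch, reorder buffer, retire   */
        DOM_REGFILE,     /* register read, register file, write back   */
        DOM_INT,         /* integer issue buffer and integer units     */
        DOM_FP,          /* floating-point issue buffer and FP units   */
        DOM_MEM,         /* memory issue buffer and load/store units   */
        NUM_DOMAINS
    };

    /* A forwarding path is "internal" if producer and consumer execute
     * in the same clock domain, and "external" otherwise.              */
    static int forwarding_is_internal(enum clock_domain src,
                                      enum clock_domain dst)
    {
        return src == dst;
    }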
Choice of Interdomain Communication Scheme
One of the most important aspects of implementing a GALS microprocessor is choosing an asynchronous communication protocol. For high-performance processors, both the bandwidth and the latency of the internal communication are important, and a good trade-off is harder to identify. Several mechanisms have been proposed for asynchronous data communication between synchronous modules in a larger design [23]. The conventional scheme to tackle such problems is the extensive use of synchronizers, a double-latching mechanism that conservatively delays a potential read, waiting for the data signals to stabilize, as shown in Figure 19.13(a). Even though data are produced before time step 2, the synchronizer enforces their availability at the consumer only at time step 4. This makes classical synchronizers rather unattractive: their use decreases performance, and the probability of failure for the whole system rises with the number of synchronized signals.
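The latency cost of double latching can be illustrated with a small edge-level timing sketch. This is our own simplified model, not taken from the reference: a value arriving from the producer is captured at the next consumer clock edge, and two further consumer edges pass before the consumer logic may safely use it (capture, metastability resolution, read).

    #include <math.h>

    /* Simplified model of a two-flop synchronizer. With a unit consumer
     * period, data produced just before edge 2 only becomes usable at
     * edge 4, matching the example in the text.                        */
    static double synchronizer_ready_time(double t_arrival,
                                          double t_consumer_period)
    {
        /* first consumer edge at or after the data arrival */
        double capture_edge =
            ceil(t_arrival / t_consumer_period) * t_consumer_period;
        /* two more consumer edges before the value is safely readable */
        return capture_edge + 2.0 * t_consumer_period;
    }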
Pausable clocks (Figure 19.13[b]) have been proposed as a scheme that relies on stretching the clock periods on the two communicating blocks until the data are available or the receiver is ready to accept them [28]. If T is greater than an arbitrary threshold, then the read can proceed, otherwise the active
edge 2 of the consumer clock is delayed. While the latency is better in this approach, it assumes that asynchronous communication is infrequent. Stretching the clock is reflected in the performance of each synchronous block and thus, it is most effective when the two blocks use a similar clock frequency. It can also be an effective mechanism when the whole block must wait anyway until data are available.
Another approach is to use arbiters for detecting any timing violation condition—Figure 19.13(c). In this case, data produced at time step 1 may be available at time step 2 if T is larger than a certain threshold. While the mechanism is conceptually similar to that of synchronizers, it offers a smaller latency.
Asynchronous FIFO queues have been proposed in Ref. [29], using either synchronizers or arbiters. Such an approach works well under the assumption that the FIFO is neither completely full nor completely empty. The scheme retains the extra latency introduced by the use of synchronizers, but improves the bandwidth through pipelining. In nominal operation (when the FIFO is neither empty nor full), a potential read is serviced by a cell different from the one handling the next write, so both can be performed without synchronization.
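A minimal software-level sketch of this idea is shown below; the names and structure are ours, and a real hardware implementation would additionally need synchronized (for example, Gray-coded) pointer comparison for the full and empty checks that cross clock domains.

    #define FIFO_CELLS 8   /* power of two for cheap index wrapping    */

    struct async_fifo {
        unsigned data[FIFO_CELLS];
        volatile unsigned head;   /* next cell the producer will write */
        volatile unsigned tail;   /* next cell the consumer will read  */
    };

    /* In nominal operation (neither full nor empty) head != tail, so the
     * write and the read below touch different cells and need no mutual
     * synchronization; only the full/empty tests cross clock domains.   */
    static int fifo_write(struct async_fifo *f, unsigned value)
    {
        unsigned next = (f->head + 1) & (FIFO_CELLS - 1);
        if (next == f->tail)          /* full: producer must stall/sync  */
            return 0;
        f->data[f->head] = value;
        f->head = next;
        return 1;
    }

    static int fifo_read(struct async_fifo *f, unsigned *value)
    {
        if (f->tail == f->head)       /* empty: consumer must stall/sync */
            return 0;
        *value = f->data[f->tail];
        f->tail = (f->tail + 1) & (FIFO_CELLS - 1);
        return 1;
    }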
All these mechanisms reduce the error probability to very low levels, but they cannot ensure that metastability will never occur. However, as Ginosar [24] showed recently, the error rate can be reduced as much as desired. Typically, the mean time to failure is on the order of hundreds of years, at least an order of magnitude longer than the time between soft error occurrences [30] or the expected life of the product.
Choice of Dynamic Control Strategy
One of the main advantages offered by the GALS approach is the ability to run each synchronous module at a different clock frequency. If the original pipeline stages are not perfectly balanced, the synchronous blocks that we obtain after the partitioning can naturally be clocked at different frequencies. For example, if the longest signal path belongs to the Register Renaming module, in the GALS approach, we could potentially run the Execution Core at a higher clock frequency than the fully synchronous design.
Furthermore, even if we start with a perfectly balanced design (or we resize transistors to speed up longer signal paths), we can slow down synchronous blocks that are off the critical path, while keeping the others running at nominal speed. The slower clock domains could also operate at a lower supply voltage, thus producing additional power savings. Since energy consumption is quadratically dependent on Vdd, reducing it can lead to significant energy benefits, while the latency (D) increases according to the alpha-power law:

D ∝ Vdd / (Vdd − Vt)^α

where α is a technology-dependent factor, which is 1.2–1.6 for current technologies [31], and Vt is the threshold voltage.
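A small numeric sketch of this energy–delay trade-off, using representative values of our own choosing rather than figures from the text, is shown below.

    #include <math.h>
    #include <stdio.h>

    /* Relative delay from the alpha-power law: D ~ Vdd / (Vdd - Vt)^alpha */
    static double relative_delay(double vdd, double vt, double alpha)
    {
        return vdd / pow(vdd - vt, alpha);
    }

    /* Dynamic energy per operation scales as Vdd^2 */
    static double relative_energy(double vdd)
    {
        return vdd * vdd;
    }

    int main(void)
    {
        const double vt = 0.3, alpha = 1.4;   /* representative values */
        double d0 = relative_delay(1.2, vt, alpha), e0 = relative_energy(1.2);
        double d1 = relative_delay(1.0, vt, alpha), e1 = relative_energy(1.0);
        /* Lowering Vdd from 1.2 V to 1.0 V cuts energy to ~69% of the
         * original, while the stage delay grows by the ratio d1/d0.    */
        printf("energy ratio = %.2f, delay ratio = %.2f\n", e1 / e0, d1 / d0);
        return 0;
    }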
To exploit nonuniform program profiles and noncriticality of various workloads, different schemes have been proposed for selecting the optimal frequency and voltage supply in a GALS processor. In Ref. [26], a simple threshold-based algorithm is used for selecting the best operating point for modules that have a normal and a low-power mode. The algorithm monitors the average occupancy of each issue window and can decide to switch the module to a low-power mode when this occupancy drops below a predefined threshold, or ramp the voltage up when a high threshold is exceeded. For each issue window (integer, floating point, and memory), the algorithm is:
    if ((occupancy > MODULE_UP_THRESHOLD) && (module_speed == LOW_SPEED))
        module_speed = HIGH_SPEED;
    if ((occupancy < MODULE_DOWN_THRESHOLD) && (module_speed == HIGH_SPEED))
        module_speed = LOW_SPEED;
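In a processor with separate integer, floating-point, and memory issue windows, this rule would be evaluated independently for each cluster once per control interval. The following is a minimal wrapping of the rule above; the threshold values and names are illustrative.

    enum speed { LOW_SPEED, HIGH_SPEED };

    struct cluster {
        unsigned occupancy;      /* sampled average issue-window occupancy */
        enum speed module_speed; /* current operating point of the domain  */
    };

    #define MODULE_UP_THRESHOLD   24   /* illustrative values */
    #define MODULE_DOWN_THRESHOLD  8

    /* Evaluated once per control interval for each of the integer,
     * floating-point, and memory clusters.                             */
    static void adjust_cluster(struct cluster *c)
    {
        if (c->occupancy > MODULE_UP_THRESHOLD &&
            c->module_speed == LOW_SPEED)
            c->module_speed = HIGH_SPEED;
        else if (c->occupancy < MODULE_DOWN_THRESHOLD &&
                 c->module_speed == HIGH_SPEED)
            c->module_speed = LOW_SPEED;
    }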
A more complex model is proposed in Ref. [27]. Here, an attack-decay algorithm is used for selecting the best operating point for processors that offer a wide range of frequencies and supply voltages. The algorithm monitors the instruction window occupancy and, based on its variation, decides whether the frequency should be increased or decreased. Any significant variation triggers a rapid change of the clock frequency to counter it (the attack phase). For small or no variations, the clock frequency is decayed continuously, while monitoring performance (Appendix B).
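A rough sketch of the attack-decay idea is given below. This is our interpretation of the scheme; the constants, step sizes, and names are illustrative and are not the values used in Ref. [27].

    /* Attack-decay control of one domain's relative clock frequency,
     * driven by the change in issue-window occupancy between intervals. */
    #define F_MIN          0.5    /* lowest allowed relative frequency   */
    #define F_MAX          1.0    /* nominal relative frequency          */
    #define ATTACK_DELTA   5      /* occupancy change treated as "large" */
    #define ATTACK_STEP    0.10   /* fast frequency change (attack)      */
    #define DECAY_STEP     0.01   /* slow continuous reduction (decay)   */

    static double clamp_freq(double f)
    {
        return f < F_MIN ? F_MIN : (f > F_MAX ? F_MAX : f);
    }

    static double attack_decay(double freq, int occupancy, int prev_occupancy)
    {
        int delta = occupancy - prev_occupancy;
        if (delta > ATTACK_DELTA)          /* work piling up: speed up fast */
            return clamp_freq(freq + ATTACK_STEP);
        if (delta < -ATTACK_DELTA)         /* window draining: slow down fast */
            return clamp_freq(freq - ATTACK_STEP);
        return clamp_freq(freq - DECAY_STEP);  /* otherwise decay slowly,
                                                  watching performance     */
    }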
The instruction window occupancy is not the only significant aspect that can be considered for deciding a switch. Even though an instruction window could have high occupancy, this could be due to a bottleneck in another cluster. If load operations are delayed, it is very likely that instructions will accumulate in the integer cluster as well. However, speeding up the clock in the integer domain will not improve the performance. In this case, taking decisions based only on local issue queue occupancy will not help and the number of interdomain data dependencies (that is, the number of pending dependencies to or from another clock domain) may be more significant than the issue window occupancy.
Furthermore, both Refs. [26,27] allow DVS only for the execution core, assuming that the clock speed of the front-end is critical for overall performance and thus should not be reduced. However, there are large sequences of code where the usable parallelism (defined here in terms of IPC, instructions committed per clock cycle) is significantly smaller than the theoretical pipeline throughput. In these cases, it makes sense to reduce the speed of the front-end, since it produces more instructions than can be processed by the back-end.
With these observations, we can modify the previously described methods to include both information about the number of interdomain dependencies and a DVS algorithm for the front end of the pipeline. Thus, we obtain a threshold scheme that is driven by a combined occupancy/dependency metric and that can also slow down the front end when it runs ahead of the back end.
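A hedged sketch of such a modified rule follows. It is our own reconstruction of the idea in the text, not the authors' listing; it reuses struct cluster and the thresholds from the earlier sketch, and DEP_WEIGHT and FRONT_END_IPC_LOW are illustrative constants.

    /* Combined metric: issue-window occupancy plus pending interdomain
     * dependencies (results still expected from, or owed to, another
     * clock domain).                                                    */
    #define DEP_WEIGHT          2
    #define FRONT_END_IPC_LOW   1.0   /* committed IPC below which the
                                         front end may be slowed down    */

    static void adjust_exec_cluster(struct cluster *c, unsigned interdomain_deps)
    {
        unsigned metric = c->occupancy + DEP_WEIGHT * interdomain_deps;
        if (metric > MODULE_UP_THRESHOLD && c->module_speed == LOW_SPEED)
            c->module_speed = HIGH_SPEED;
        else if (metric < MODULE_DOWN_THRESHOLD && c->module_speed == HIGH_SPEED)
            c->module_speed = LOW_SPEED;
    }

    static void adjust_front_end(struct cluster *fe, double committed_ipc)
    {
        /* Slow the front end when committed IPC stays well below the
         * fetch width: it fetches more instructions than the back end
         * can retire.                                                   */
        if (committed_ipc < FRONT_END_IPC_LOW && fe->module_speed == HIGH_SPEED)
            fe->module_speed = LOW_SPEED;
        else if (committed_ipc >= FRONT_END_IPC_LOW && fe->module_speed == LOW_SPEED)
            fe->module_speed = HIGH_SPEED;
    }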
A similarly modified algorithm can be derived from the attack-decay approach, using the same combined metric and allowing for variations in the front-end frequency.