Low-Power Microarchitecture Techniques: Techniques for Low-Power Processor Design
Traditionally, performance concerns have taken priority over energy costs or power consumption. Power efficiency has been addressed mainly at the technology level, through lower supply voltages, smaller transistors, silicon-on-insulator (SOI) technology, better packaging, etc. Nevertheless, power dissipation has become one of the primary design constraints for modern processors, and thus microarchitecture designers must now take it into consideration as well.
At the microarchitectural level, the methods for reducing dynamic power consumption fall under two fundamental categories: capacitance reduction and dynamic voltage/frequency scaling (DVS). Each addresses a specific factor in the dynamic power consumption equation.
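For reference, this equation is commonly written as

$$P_{dyn} = \alpha \, C_{eff} \, V_{dd}^{2} \, f$$

where alpha is the switching activity factor, C_eff the effective switched capacitance, V_dd the supply voltage, and f the clock frequency. Capacitance-reduction techniques target the alpha * C_eff term, while DVS targets V_dd and f together.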
Limiting the Switched Capacitance
The first class of methods aims at reducing the work performed by the processor during each clock cycle and, in consequence, the switched capacitance Ceff. Most of these techniques run somewhat counter to the current industry trend of adding more hardware and performing more speculative work. All this additional hardware is frequently blamed for the inefficiency of modern processors, since it consumes a lot of power while offering limited performance benefit.
The first and most obvious way to create a low-power processor is to return to basics and limit the on-chip logic to only useful program computation. Such processors do not include any hardware for scheduling instructions or for data/control speculation. Many of them are single-issue processors and, depending on the performance point targeted by these designs, they can be pipelined (e.g., various ARM implementations [2]) or nonpipelined (e.g., the i8051-compatible microcontrollers offered by various companies [3]).
A related category is composed of multiple-issue processors built under the very long instruction word (VLIW) paradigm. Such designs are capable of executing several instructions in parallel, but they do not offer any scheduling capability. Basic instructions are assembled into long instruction words that are capable of specifying actions for multiple functional units at the same time. The only action expected from the processor is to fetch and execute one such word on each cycle, without checking data or control dependencies. The compiler has full knowledge about the hardware capabilities and timings, and must make sure that all the instructions are scheduled such that no dependency is broken. A typical VLIW architecture is presented in Figure 19.2.
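The following toy sketch illustrates the VLIW execution model; the three-slot word format and the operations in it are invented for this example, not taken from any particular ISA:

long_word = ("add r1, r2, r3",   # slot 0: integer ALU
             "mul f0, f1, f2",   # slot 1: FP multiplier
             "ld  r4, 0(r5)")    # slot 2: load/store unit

def execute_long_word(word, units):
    # The hardware simply issues every slot each cycle, with no
    # dependency checks: the compiler has already guaranteed that
    # the operations packed into one word are independent.
    for op, unit in zip(word, units):
        unit(op)

execute_long_word(long_word, [print, print, print])

Note that all the complexity lives in the scheduler that packed the word, not in the issue logic, which is exactly where the power savings come from.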
Such processors promise high power efficiency, since they dedicate hardware resources only to actual instruction execution. At the same time, they can offer very high peak performance, since many small execution units can be placed on a single chip. The drawback is that they require extremely smart compilers to fill all available execution slots. Starting from traditional, sequential programs, it is very hard to automatically find enough independent operations to keep all the functional units occupied; thus, the peak performance of these devices is usually far above the performance actually achieved. Furthermore, the compiler must know everything about the microarchitecture to create decent instruction schedules, so the binary program becomes tied to a particular hardware implementation. Such designs therefore cannot offer binary compatibility across a larger processor family or across several microarchitecture generations. These two problems have severely limited the spread of VLIW processors, which are currently relegated mostly to DSP-style applications. However, owing to their inherently low power consumption and high potential performance, they can provide very good solutions for special-purpose devices.
The incredible success of legacy ISAs such as Intel x86 has proven the importance of binary compatibility, which has outweighed the benefits brought by RISC ISAs such as Alpha [4] or PA-RISC [5]. Processors using these RISC instruction sets do not require complicated hardware wrappers and decoders, and can dedicate more resources to actual program execution. For economic reasons, however, the additional logic blocks required by legacy ISAs are maintained at the expense of power and complexity.
An attempt to improve power efficiency by getting rid of these hardware wrappers has been made by Transmeta [6], which customized a low-power VLIW processor for general-purpose computation. Crusoe and Efficeon are essentially VLIW processors surrounded by a software layer that performs just-in-time compilation of x86 ISA programs. While these processors can run legacy x86 binaries and offer low power consumption, their performance varies greatly with application behavior: they take a steep penalty whenever translation has to be performed, and work well only if code locality is very good.
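The sketch below captures the essential caching idea behind this approach; the class and method names are hypothetical and do not reflect Transmeta's actual code-morphing interface:

class CodeMorphCache:
    def __init__(self, translate_x86):
        self.translate_x86 = translate_x86   # slow software translator
        self.native = {}                     # x86 block addr -> native VLIW code
    def run(self, block_addr):
        if block_addr not in self.native:
            # Miss: pay the steep one-time software translation penalty.
            self.native[block_addr] = self.translate_x86(block_addr)
        return self.native[block_addr]       # hit: reuse at full speed

Good code locality keeps the hit rate of such a translation cache high, which is precisely the condition under which these processors perform well.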
Even though statically scheduled (in-order) processors consume less power per clock cycle, their dynamically scheduled (out-of-order) counterparts are currently much more popular. This is mainly for performance reasons: out-of-order processors are typically faster, and they are also less demanding of compiler quality. They also tend to perform better as clock frequencies increase, since they can schedule around long-latency memory instructions. Thus, several techniques have been proposed for reducing the power consumption in large superscalar, out-of-order processors.
Traditionally, a superscalar processor contains a large number of resources of various types, so that it can accommodate different instruction mixes. For example, to achieve a sustained throughput of three instructions per clock cycle, the execution cores of the Pentium 4 and the Opteron include many more execution units (9 and 10, respectively). Even though all these resources are rarely used at the same time, they still consume power, and this inefficiency can be targeted by several methods.
Guarded evaluation [7] has been proposed as a static technique for reducing the power consumed by a combinational circuit during the time steps in which some of its computed values are not needed. The method works by stopping the propagation of signal transitions through the unused circuits, effectively limiting the dynamic power consumption (Figure 19.3). A slightly more general version of this technique is clock gating, proposed in Ref. [8] for saving the power wasted by units that are temporarily unused. Clock gating also targets the propagation of signal transitions, and works by shutting off the clock signal to the latches placed in front of the target circuit. In various forms, these techniques are now widely used to limit the power consumed in areas of the design that are not exercised during specific computations.
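A minimal behavioral sketch of the clock-gating idea follows; real clock gating is of course implemented with a latch and an AND gate on the clock net, not in software:

class GatedRegister:
    def __init__(self, value=0):
        self.value = value
    def clock_edge(self, enable, next_value):
        if enable:                  # the clock reaches the register
            self.value = next_value
        # else: the clock is gated off, so the stored value, and hence
        # the inputs of the circuit behind it, do not toggle this cycle,
        # saving the dynamic power of those transitions
        return self.value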
A different research direction targets the speculative nature of these superscalar, out-of-order engines. While speculation can dramatically increase overall performance, the end result depends on the accuracy of the predictions. A prediction that ultimately proves wrong is very costly in terms of power consumption and brings no performance benefit (depending on the actual implementation, it may even hurt performance). It has therefore been proposed to use confidence estimators and to allow speculation only in the cases when the probability of success is very high. These techniques decrease overall power consumption by reducing the amount of work that is thrown away when a prediction proves to be wrong. Such mechanisms have been proposed for both data and control speculation [9,10].
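One common confidence-estimation scheme is a resetting saturating counter; the counter width and threshold below are assumed values for illustration:

class ConfidenceEstimator:
    # A real design keeps a table of such counters indexed by the
    # branch (or load) address; a single counter is shown for clarity.
    def __init__(self, bits=2, threshold=3):
        self.max_count = (1 << bits) - 1
        self.threshold = threshold
        self.count = 0
    def allow_speculation(self):
        # Speculate only when recent predictions have been reliable.
        return self.count >= self.threshold
    def update(self, prediction_was_correct):
        if prediction_was_correct:
            self.count = min(self.count + 1, self.max_count)
        else:
            self.count = 0          # lose all confidence after a miss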
Another technique, which challenges the assumption that performance can always be increased by throwing more hardware at the problem, is resource scaling [11]. This method targets highly complex processor implementations, which use a large number of resources to extract a high level of instruction-level parallelism (ILP). While some applications (usually scientific or media codes) might exercise the entire set of resources, in other cases (control-bound applications) the inherent level of ILP is limited and most of the resources remain unused. Such applications would exhibit a relatively similar execution pattern on much simpler processors. Resource scaling takes advantage of this fact and turns off (through clock gating) some of the available units, saving power while maintaining comparable performance.
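A simple scaling policy might sample the achieved instructions per cycle (IPC) at fixed intervals and adjust the number of active units accordingly; the interval-based structure, thresholds, and unit counts below are assumptions, not values from Ref. [11]:

def scale_units(ipc_sample, active_units, min_units=2, max_units=8,
                low_ipc=1.0, high_ipc=2.5):
    # Called at the end of each sampling interval.
    if ipc_sample < low_ipc and active_units > min_units:
        return active_units - 1     # low ILP: clock-gate one more unit
    if ipc_sample > high_ipc and active_units < max_units:
        return active_units + 1     # high ILP: re-enable a gated unit
    return active_units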
All the ideas mentioned so far attempt to increase efficiency by reducing the amount of power spent on useless computation. However, even under the idealized assumption that everything works perfectly and no useless work is performed, a dynamically scheduled processor still requires more power per instruction than a simpler VLIW implementation. Such processors use long pipelines, performing operations such as branch prediction, register renaming, and reordering. The concept of work reuse [12-14] tries to bring the superscalar, out-of-order processor closer to the efficiency of a VLIW.
In the Pentium 4, a special trace-cache has been placed in the pipeline, after the x86 decoding stages. By storing decoded instructions (uops) in the trace-cache, the whole decode stage can be shut down for significant periods of time while the rest of the execution engine continues working. When a hit in the trace-cache occurs, instructions do not need to be decoded again and can be fed into the pipeline directly from the trace-cache.
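The following rough model shows the fetch-path decision involved; the structure is a sketch of the general mechanism, not of the Pentium 4's actual trace-cache organization:

class TraceCache:
    def __init__(self):
        self.traces = {}            # fetch address -> decoded uops
    def fetch(self, addr, decode_x86):
        if addr in self.traces:
            # Hit: the decode stage can stay clock-gated this cycle.
            return self.traces[addr]
        uops = decode_x86(addr)     # miss: wake the decoder and refill
        self.traces[addr] = uops
        return uops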
An extension of this technique [14] uses an execution cache (EC) placed deep in the pipeline to shorten the critical instruction path (Figure 19.4). Because the cache is placed after the issue stage, instructions that have been fetched and decoded, and have already had their registers renamed, can be stored in issue order (instead of program order) in the EC. Assuming the EC hit rate is very good, instructions execute out of this cache most of the time. Instructions are retrieved from the EC in issue order, and can be sent directly to the execution engine in a VLIW-like fashion. Using clock gating, the front-end of the pipeline is shut off, reducing the effective work that must be performed for each retired instruction. When instructions are not found in the EC, however, the front-end resumes its role as a scheduler, creating traces for further reuse. This technique is described in detail in Section 19.4.
Dynamic Voltage/Frequency Scaling (DVS)
The second class of methods relies on the observation that users need top performance only infrequently. Modern processors offer more than adequate performance on most applications, being overdesigned for the few situations when peak performance is required. This holds especially true for processors that go into personal devices such as PDAs, cell phones, and even laptop and desktop computers. Many of the applications run on these devices are bound by user input, so the performance of the underlying processor is largely irrelevant. Another class of applications that has become very popular is media processing. In this case, the required performance is predefined by the media type, and a faster processor will not improve the user experience.
A related, but slightly different, case is that of mobile devices. Even though more performance would be desirable at times, such devices are limited most of the time by battery capacity. Thus, a compromise can be struck by giving up some of the potential performance in exchange for longer battery life.
All processors currently intended for mobile applications support techniques such as frequency and voltage scaling [6,15,16] to reduce their power requirements on those applications where high performance is not required. DVS is a very effective technique for reducing power consumption, since significant reductions can be obtained at the expense of relatively small performance drops: while performance decreases only linearly with the clock frequency, power decreases linearly with the frequency and quadratically with the supply voltage. As can be seen in Table 19.1, power consumption decreases much faster than overall performance. Such processors offer several operating points, characterized by different frequencies and Vdd values. Depending on the application behavior, power supply availability, cooling capabilities, etc., one of these available points is selected by the operating system or by the firmware. Table 19.1 presents the SpeedStep capability implemented by the Pentium III family in the 0.18 µm process technology. This processor offers two operating levels: a high-performance mode, and a low-power mode used primarily for extending battery life in mobile devices.
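As a quick worked example (the frequency and voltage values below are assumed for illustration and are not the entries of Table 19.1), the combined benefit of scaling both knobs follows directly from the dynamic power equation:

def relative_power(f_ratio, v_ratio):
    # P ~ Ceff * Vdd^2 * f, so relative power = f_ratio * v_ratio^2
    return f_ratio * v_ratio ** 2

print(relative_power(700 / 1000, 1.35 / 1.70))   # ~0.44

Here a 30% frequency reduction, together with the corresponding supply voltage reduction, cuts dynamic power to roughly 44% of its original value, at a performance cost of at most 30%.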
A special case of DVS can be applied to globally asynchronous, locally synchronous (GALS) processors. Such an architecture is composed of synchronous blocks that communicate with each other only on demand, using an asynchronous or mixed-clock communication scheme. Through the use of a locally generated clock signal within each individual domain, such architectures make it possible to take advantage of industry-standard synchronous design methodology, while still offering some of the benefits of asynchronous circuits. Thus, they do not require a global clock distribution network and deskewing circuitry. They also allow the supply voltages and clock frequencies to be scaled per domain, offering significantly more flexibility. However, the overhead introduced by communicating data across clock-domain boundaries may become a fundamental drawback, limiting the performance of these systems. The granularity of the synchronous blocks, or islands, must therefore be chosen very carefully, to prevent interdomain communication from becoming a significant bottleneck. At the same time, the choice of the interdomain communication scheme, as well as of the on-the-fly mechanisms for per-domain DVS, becomes critical when analyzing overall power-performance trends. A processor microarchitecture using the GALS+DVS techniques is presented in Section 19.5.
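One per-domain control policy studied in the GALS literature adjusts each domain's local clock based on the occupancy of its input queue; the sketch below illustrates that general idea with assumed thresholds and step sizes, and is not the mechanism of the design in Section 19.5:

def adjust_domain_clock(queue_occupancy, freq, f_min=0.5, f_max=1.0):
    # queue_occupancy in [0, 1]; freq normalized to the maximum clock.
    if queue_occupancy > 0.75 and freq < f_max:
        return min(freq * 1.1, f_max)   # backlog: speed this domain up
    if queue_occupancy < 0.25 and freq > f_min:
        return max(freq * 0.9, f_min)   # starved: slow down, save power
    return freq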
Reducing the Static Power Consumption
A third class of methods targets a different aspect of power consumption in modern microprocessors: the static power. While dynamic power is consumed when a transistor switches between "on" and "off", static power leaks through the transistor junctions even when the transistor does not perform any useful work.
Traditionally, static power has been several orders of magnitude smaller than dynamic power consumption, and was thus largely ignored by processor designers. However, as transistors become smaller with each new process technology, the insulating layers shrink and leakage currents increase. Furthermore, the number of such leaky transistors placed on a chip increases with each microarchitecture generation. As a result, static power consumption has become a first-class concern, as important as dynamic power consumption in modern processor microarchitecture design.
All the techniques proposed for limiting static power target the leakage current. One such solution is to stack multiple "off" transistors on any path between Vdd and ground, which reduces the current leaking along the path exponentially. Such an example is presented in Figure 19.5.
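To see why stacking helps, recall the first-order subthreshold leakage model (a simplified textbook form, not taken from this chapter):

$$I_{leak} \propto e^{(V_{gs} - V_{th}) / (n V_T)}$$

where V_T is the thermal voltage and n a process-dependent factor. With two stacked "off" transistors, the intermediate node settles at a small positive voltage, so the upper transistor sees a negative V_gs (and a higher effective V_th through the body effect), which reduces the leakage of the stack exponentially rather than linearly.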
Power gating [17] relies on the presence of such sleep transistors, placed beside the target logic modules. Depending on whether the sleep transistors are placed toward the Vdd rail or the ground rail, the method uses a gated Vdd or a gated ground, respectively.
While this method works very well for combinational circuits, special care must be taken when it is applied to sequential logic. In this case, the modified voltage levels can interfere with the memory elements, destroying the state of the circuit when the sleep transistor is turned off. This problem is discussed in Ref. [18], where a special design called the DRI cache is proposed for the L1 caches of a microprocessor. The cache stores additional information that allows it to detect timeout conditions and turn off sections that are not exercised by the current program. Furthermore, the SRAM cells use sleep transistors and can be gated to reduce the static power consumption. The design uses the gated-ground methodology to retain the cache content while different sections are turned off.
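The timeout mechanism can be sketched as a per-line idle counter, in the spirit of the description above; the counter granularity and threshold are assumed values, not parameters from Ref. [18]:

class DecayingCacheLine:
    def __init__(self, timeout=4):   # timeout in coarse "decay ticks"
        self.timeout = timeout
        self.idle_ticks = 0
        self.asleep = False
    def access(self):
        self.idle_ticks = 0          # any use resets the idle counter
        self.asleep = False          # waking restores the ground rail
    def decay_tick(self):
        if not self.asleep:
            self.idle_ticks += 1
            if self.idle_ticks >= self.timeout:
                self.asleep = True   # assert sleep transistor: cut leakage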
A related method for reducing static power consumption is the drowsy cache [19]. The microarchitectural decisions are made in a fashion similar to the DRI-cache design, with the hardware deciding when cache lines can be turned off without impacting overall performance. The circuit implementation, however, is more complex: it relies on adaptive body biasing, a technique that dynamically modifies the effective Vt of the memory cells. The higher Vt translates into a lower leakage current, reducing the static power of the cache lines that are turned off.