Dynamic Voltage Scaling for Low-Power Hard Real-Time Systems: Background
Introduction
Energy consumption has become a dominant design constraint for modern VLSI systems, especially for mobile embedded systems that operate from a limited energy source such as a battery. For these systems, battery lifetime is often a key product-differentiating factor, making the reduction of energy consumption an important optimization goal. Even for nonportable VLSI systems such as high-performance microprocessors, energy consumption remains an important design constraint, because the large heat dissipation of high-end microprocessors often results in thermal degradation of the device, system malfunction, or, in some cases, a nonrecoverable crash. To solve these problems effectively, power-aware design techniques are necessary over a wide range of hardware and software design abstractions, including the circuit, logic, architecture, compiler, operating system, and application levels.
Dynamic voltage scaling (DVS) [1], which can be applied at both hardware and software design abstractions, is one of the most effective design techniques for minimizing the energy consumption of VLSI systems. Since the energy consumption E of complementary metal oxide semiconductor (CMOS) circuits has a quadratic dependency on the supply voltage, lowering the supply voltage reduces the energy consumption significantly. When a given application does not require the peak performance of a VLSI system, the clock speed (and its corresponding supply voltage) can be dynamically adjusted to the lowest level that still satisfies the performance requirement, saving energy without perceivable performance degradation. This is the key principle of a DVS technique.
For example, consider a task with a deadline of 25 ms, running on a processor with a 50 MHz clock speed and a 5.0 V supply voltage. If the task requires 5 × 10^5 cycles for its execution, the processor executes the task in 10 ms and then idles for the remaining 15 ms. (We call this type of idle interval the slack time.) However, if the clock speed and the supply voltage are lowered to 20 MHz and 2.0 V, the task finishes exactly at its deadline (= 25 ms), resulting in an 84% energy reduction.
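The arithmetic of this example can be checked with a short script. Per-cycle energy is modeled here as proportional to the square of the supply voltage, so the proportionality constant cancels when the two settings are compared (a sketch of the calculation, not a measurement):

```python
def task_energy(cycles, v_dd):
    # Dynamic energy per cycle is proportional to V_dd^2; the
    # proportionality constant cancels when comparing two settings.
    return cycles * v_dd ** 2

cycles = 5 * 10 ** 5
e_fast = task_energy(cycles, 5.0)  # 50 MHz: done in 10 ms, 15 ms slack
e_slow = task_energy(cycles, 2.0)  # 20 MHz: finishes at the 25 ms deadline
saving = 1 - e_slow / e_fast
print(f"energy reduction: {saving:.0%}")  # → energy reduction: 84%
```

Since (2.0/5.0)^2 = 0.16, the slow setting uses only 16% of the original energy, matching the 84% reduction quoted in the text.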
Since lowering the supply voltage also decreases the maximum achievable clock speed [2], various DVS algorithms for real-time systems aim to reduce the supply voltage dynamically to the lowest possible level while satisfying the tasks’ timing constraints. For hard real-time systems, where timing constraints must be strictly satisfied, this fundamental energy-delay tradeoff makes it challenging to adjust the supply voltage dynamically so that energy consumption is minimized without violating the timing requirements. In this paper, we focus on DVS algorithms for hard real-time systems.
For hard real-time systems, there are two types of voltage-scheduling approaches depending on the voltage scaling granularity: intratask DVS (IntraDVS) and intertask DVS (InterDVS). The intratask DVS algorithms [3,4] adjust the voltage within an individual task boundary, while the intertask DVS algorithms determine the voltage on a task-by-task basis at each scheduling point. The main difference between the two approaches is whether the slack times are used for the current task or for the tasks that follow. InterDVS algorithms distribute the slack times from the current task for the following tasks, while IntraDVS algorithms use the slack times from the current task for the current task itself.
The effectiveness of a DVS algorithm largely depends on two steps: slack identification and slack distribution. In the slack identification step, the goal is to identify as many idle intervals as possible, as far in advance as possible. Identified slack intervals make it possible to scale the supply voltage for energy minimization. Most existing techniques take advantage of a priori knowledge of programs or task sets (e.g., program structure, task set specification) as well as dynamic workload variations during run time. The goal of the slack distribution step is to assign the most appropriate amount of slack to the next code segment/task to be executed. Since the appropriate slack amount for the next segment/task depends on many factors, including future execution behavior, optimally distributing the identified slack is a challenging problem. Existing DVS algorithms differ mainly in these two steps.
The main purpose of this paper is to survey representative DVS techniques proposed for both IntraDVS and InterDVS in a unified fashion. We present taxonomies of IntraDVS and InterDVS algorithms, respectively. Within each category, we describe key techniques for the slack identification and slack distribution steps. This two-layered introduction of DVS algorithms should help readers grasp both an overview of DVS and its important details.
The rest of the paper is organized as follows. Before IntraDVS and InterDVS algorithms are explained, we briefly review power-related background concepts and describe the characteristics of variable-voltage processors in Section 18.2. We present intra- and intertask voltage-scheduling algorithms, respectively, in Sections 18.3 and 18.4. We conclude with a summary in Section 18.5.
Background
Energy-Delay Relationship
Modern microprocessors are implemented using CMOS circuits. To examine the tradeoff between energy and performance in variable voltage processors, we first describe the physical characteristics of CMOS circuits, especially in terms of energy consumption and circuit delay.
The power P_CMOS dissipated in a CMOS circuit can be decomposed into two components, static power P_static and dynamic power P_dynamic [5]. In the ideal case, CMOS circuits do not dissipate static power, since in steady state there is no open path from source to ground. In reality, however, there are always leakage currents and short-circuit currents, which lead to static power consumption. In the past, static power accounted for only a tiny fraction of the total power consumption. However, leakage power is expected to exceed dynamic power consumption in future technologies as the minimum feature size drops below 65 nm [5]. In this paper, we focus on dynamic power dissipation, which is still the major power consumer in current VLSI systems.
The dynamic power is given by P_dynamic = C_L · V_DD^2 · f, where C_L is the effective switched capacitance, V_DD the supply voltage, and f the clock frequency. The circuit delay, on the other hand, grows as the supply voltage is lowered: it is proportional to V_DD/(V_DD − V_T)^γ, where V_T is the threshold voltage and γ the velocity saturation index. Hence, the clock frequency should be scaled along with the supply voltage. As a consequence, reducing the supply voltage yields energy savings but leads to performance degradation. Real-time scheduling and energy minimization are therefore tightly coupled problems and should be tackled in conjunction for better results.
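As a rough illustration of this coupling (all constants below are illustrative placeholders, not values from the text), the delay model implies that the maximum achievable clock frequency falls as the supply voltage is lowered:

```python
def max_frequency(v_dd, v_t=0.5, gamma=2.0, k=1.0):
    # Circuit delay ~ V_dd / (V_dd - V_t)^gamma, so the maximum clock
    # frequency scales as its reciprocal. k, v_t, gamma are illustrative
    # constants, not measured device parameters.
    return k * (v_dd - v_t) ** gamma / v_dd

# Lowering the supply voltage lowers the maximum achievable clock speed
assert max_frequency(2.0) < max_frequency(3.3) < max_frequency(5.0)
```

This is why a DVS algorithm cannot lower the voltage in isolation: each voltage level bounds the clock frequency, and the frequency in turn determines whether deadlines can be met.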
Variable-Voltage Processor Models
The ideal model of a variable speed processor is able to run at a continuous range of clock frequencies and voltages. Moreover, since the goal is to consume as little energy as possible, for a given clock frequency there is a unique optimal supply voltage: the lowest voltage for which the circuit delay still permits the given clock frequency. The supply voltage and the energy consumption per cycle are therefore uniquely determined by the clock frequency. In the following, instead of using the absolute clock frequency to describe the processor clock and supply settings, we use the term processor speed. The processor speed is the clock frequency f relative to a reference clock frequency f_ref, which is usually also the maximum clock frequency: s_f = f/f_ref. A processor running at half speed thus has a clock frequency of half the reference frequency, with all the resultant consequences in terms of supply voltage, power, and energy consumption.
Using the equations introduced in Section 18.2.1, the voltage and power dissipation at frequency f can be written in terms of their reference values. The velocity saturation index γ is approximated by 2.0 in the classical MOSFET model. More accurate models [5] show that γ is closer to 1.3, yet this does not affect the fact that the power dissipation is a convex function of the processor speed. In fact, since tasks execute clock cycles, it makes more sense to talk about the energy consumption per clock cycle at a certain frequency, e_f, than about the power dissipation P_f. For γ = 2, e_f depends quadratically on the processor speed s_f. This is the model commonly used in voltage-scheduling research. Finally, the energy of a task that executes a certain number of cycles N_f at frequency f is E_ideal = N_f · e_f; we call it E_ideal since it does not consider the effects of speed switching on energy.

The ideal model of the variable speed processor can switch between clock frequencies and supply voltages without any time or energy overhead. A more realistic model of a variable speed processor has to address two real problems: first, the range of available processor speeds is limited and discrete; second, there are speed transition overheads, both in time and in energy. We now look at these problems in more detail.
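The ideal model above (γ = 2, so energy per cycle grows quadratically with speed) can be sketched in a few lines; e_ref is normalized to 1 and all values are illustrative:

```python
def energy_per_cycle(speed, e_ref=1.0, gamma=2.0):
    # e_f = e_ref * s_f^gamma; gamma = 2 gives the common quadratic model
    return e_ref * speed ** gamma

def ideal_task_energy(cycles, speed):
    # E_ideal = N_f * e_f -- no speed-switching overhead is accounted for
    return cycles * energy_per_cycle(speed)

# Halving the speed quarters the per-cycle energy under the quadratic model
assert ideal_task_energy(10 ** 6, 0.5) == ideal_task_energy(10 ** 6, 1.0) / 4
```

The convexity of e_f in s_f is what makes "run as slowly as the deadline permits" the energy-optimal policy in this ideal model: spreading the same number of cycles over more time always costs less energy.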
In practice, the range of available processor speeds is discrete. This comes from the fact that the core clock frequency is generated internally by a phase-locked loop (PLL) or delay-locked loop (DLL) from an external, fixed-frequency clock. The internally generated frequency is a multiple of the external frequency. The supply voltage then follows the steps imposed by the available clock frequency steps. However, even with a discrete range of speeds, one can simulate a continuous range of speeds. A virtual clock frequency can be obtained by running different parts of a given task at different real clock frequencies. To simulate a desired frequency f_v, it is enough to use two real frequencies, a higher frequency f_H > f_v and a lower frequency f_L < f_v. A task requiring N clock cycles will then run N_H clock cycles at f_H and N_L (= N − N_H) clock cycles at f_L. To determine N_H and N_L, it is enough to require that the time taken by running N clock cycles at f_v equals the time taken by running N_H cycles at f_H plus N_L cycles at f_L: N/f_v = N_H/f_H + (N − N_H)/f_L. Note that the virtual frequency obtained using this splitting may be slightly higher than the desired virtual frequency. This difference is negligible for tasks using a sufficiently large number of clock cycles. If switching between the two frequencies takes a non-negligible interval of time t_H→L, one may take this into account: N/f_v = N_H/f_H + (N − N_H)/f_L + t_H→L. From the viewpoint of energy consumption, it is optimal to choose f_H and f_L as the closest available frequencies bounding f_v [6,7]. Using this execution model, for a discrete range of speeds, the real energy function becomes a piecewise-linear convex function [7]: between any two adjacent real frequencies, the energy varies linearly.
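The cycle split can be computed directly from the timing equation above. A minimal sketch (N_H is rounded up to an integer, which is one reason the effective virtual frequency can come out slightly higher than requested):

```python
import math

def split_cycles(n, f_v, f_l, f_h):
    # Solve n/f_v = n_h/f_h + (n - n_h)/f_l for n_h, the number of
    # cycles run at the higher frequency; f_l < f_v < f_h must hold.
    assert f_l < f_v < f_h
    n_h = n * (1 / f_v - 1 / f_l) / (1 / f_h - 1 / f_l)
    n_h = math.ceil(n_h)  # cycle counts must be integral
    return n_h, n - n_h

# Emulate 150 MHz using 100 MHz and 200 MHz parts over 1,000,000 cycles
n_h, n_l = split_cycles(1_000_000, 150e6, 100e6, 200e6)
print(n_h, n_l)  # → 666667 333333
```

Checking the result: 666667 cycles at 200 MHz plus 333333 cycles at 100 MHz take just under 6.667 ms, the time budget that 10^6 cycles at 150 MHz would require, so the deadline implied by the virtual frequency is still met.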
In modern processors, the clock signal accounts for a large part of the energy consumption. To reduce jitter, noise, and energy consumption, the high-speed core clock signals are today generated on-chip, using a PLL or DLL. An external slow, and thus low-energy, clock signal is used by the on-chip PLL/DLL to generate the fast core clock. Changing the frequency of the PLL output signal has a certain latency, since the loop has to adjust to the new frequency. This means that while the PLL relocks, the processor has to stall, so there is a certain time overhead when switching between speeds. The voltage supply design may also contribute to the speed switching overhead. This happens in architectures where the processor must stall until the supply voltage stabilizes. Of course, if both the supply voltage and clock frequency change simultaneously, only the slower of the two operations will determine the switch latency. However, many processors are designed such that they can keep executing instructions at a constant rate while the voltage switches between two levels, with the working clock frequency determined by the lower voltage. Moreover, depending on the number of speed switches relative to the performed tasks, the time overhead may be small enough to be considered negligible.
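Whether a task still meets its deadline once such a stall is paid can be checked with simple bookkeeping. In this sketch the 200 µs figure matches the worst-case switch time quoted later for AMD's PowerNow!, but here it serves only as an illustrative placeholder:

```python
def meets_deadline(cycles, freq_hz, deadline_s, switch_stall_s=0.0):
    # The processor stalls for switch_stall_s (PLL relock and/or
    # voltage settling) before executing; zero models the ideal
    # overhead-free processor.
    return switch_stall_s + cycles / freq_hz <= deadline_s

# 5e5 cycles at 20 MHz consumes exactly the 25 ms deadline...
assert meets_deadline(5 * 10 ** 5, 20e6, 25e-3)
# ...so a 200 us switch stall would cause a deadline miss
assert not meets_deadline(5 * 10 ** 5, 20e6, 25e-3, switch_stall_s=200e-6)
```

This is why hard real-time DVS algorithms targeting real processors must budget for the switch latency before stretching a task to its deadline, rather than assuming the ideal overhead-free model.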
Variable Voltage Processor Examples
We briefly describe five examples of variable voltage processors. The first one is the result of an academic research project at UC Berkeley, and the rest are industry developments by Transmeta, AMD, and Intel. The Transmeta and AMD approaches include both hardware features and software managers for power efficiency. This makes them rather transparent to the software developer. The Intel and Berkeley solutions are focused on the hardware support, offering full control to the software developer.
UC Berkeley’s lpARM [1]. The lpARM processor, developed at UC Berkeley, is a low-power, ARM core-based architecture, capable of run-time voltage and clock frequency changes. The prototype described in Ref. [1] (0.6 µm technology) is reportedly able to run at clock frequencies in the 5 to 80 MHz range, with 5 MHz increments. The supply voltage is adjustable in the 1.2–3.8 V range.
Transmeta Crusoe’s LongRun [8]. Crusoe is a Transmeta processor family (TM5x00), with a VLIW (Very Long Instruction Word) core and x86 Code Morphing software that provides x86 compatibility. Besides four power management states, these processors support run-time voltage and clock frequency hopping. The frequency can change in steps of 33 MHz and the supply voltage in steps of 25 mV, within the hardware’s operating range. The number of available speeds thus depends on the model. The TM5600 model, for example, operates in normal mode between 300–667 MHz and 1.2–1.6 V, meaning 11 different speed settings. The corresponding power consumption varies between 1.5 and 5.5 W. The speed is decided using feedback from the Code Morphing algorithm, which reports the utilization. The LongRun manager employs this feedback to compute and control the optimal clock speed and voltage. Note that this is fine-grained control, transparent to the programmer. The algorithms we present in this paper require direct control over the processor speed, and would substitute or augment LongRun. Nevertheless, the Crusoe architecture is a successful example of a variable voltage processor, widely used in low-power systems. A comparison with a conventional mobile x86 processor using Intel SpeedStep, running a software DVD player, shows the TM5600 consuming almost one third of the power of the mobile x86 (6 W for the TM5600 versus 17 W for the mobile x86).
AMD’s PowerNow! [9]. AMD introduced PowerNow!, a technology for on-the-fly independent control of voltage and frequency. Their embedded processors from the AMD-K6-2E+ and AMD-K6-IIIE+ families all implement PowerNow!. AMD PowerNow! is able to support 32 different core voltage settings ranging from 0.925 to 2.00 V, with voltage steps of 25 or 50 mV. The clock frequency can change in steps of 33 or 50 MHz, from an absolute low of 133 or 200 MHz, respectively. The voltage and frequency changes are controlled through a special block, the enhanced power management (EPM) block. At a speed change, an EPM timer ensures a stable voltage and PLL frequency, an operation that can take up to 200 µs. During this time, instruction processing stops. A comparison with a Pentium III 600+ using Intel SpeedStep shows that AMD’s processor with PowerNow! consumes around 50% less power than the Pentium with SpeedStep (3 W for the AMD-K6-2E+ versus 7 W for the Pentium III 600+).
Intel’s SpeedStep [10]. Intel’s SpeedStep is probably the earliest of the solutions presented here, and consequently the weakest one. Besides normal operation, SpeedStep defines the following low-power states: Sleep, Deep Sleep, and Deeper Sleep. It specifies only two speeds, orthogonal to the power states: a Maximum Performance Mode (fast clock, high voltage, high power) and a Battery Optimized Mode (slower clock, lower voltage, power efficient). For instance, the Mobile Intel Pentium 4-M processor [10] uses 1.3 and 1.2 V for the two speeds, while the clock frequencies are 1.8 GHz (or as low as 1.4 GHz, depending on the model) and 1.2 GHz, respectively. The power consumption of the Mobile Pentium 4 is anywhere between 30 W (Maximum Performance Mode, 1.8 GHz) and 2.9 W (in Deeper Sleep, 1 V). Switching between speeds requires going into Deep Sleep, changing the voltage and frequency, and waking up again, a procedure that takes at least 40 µs.
Intel’s XScale [11]. Intel has recently come out with XScale, an ARM core-based architecture that supports on-the-fly clock frequency and supply voltage changes. The frequency can be changed directly, by writing values to a register, while the voltage has to be provided from, and controlled via, an off-chip source. The XScale core specification allows 16 different clock settings and four different power modes (one ACTIVE and three others). The actual meaning of these settings depends on the application-specific standard product (ASSP). For instance, the 80200 processor supports clock frequencies up to 733 MHz, adjustable in steps of 33–66 MHz. The core voltage can vary between 0.95 and 1.55 V. Switching between speeds takes around 30 µs, and the power consumption of the 80200 (core plus pin power) is anywhere between 1 W (at maximum speed) and a few µW (in sleep mode).
These examples show that variable speed processors are becoming more and more common. They usually have a discrete range of voltages and clock frequencies, and exhibit latency when switching between speeds. Voltage-scheduling algorithms targeting energy efficiency have to take these characteristics of real processors into account. The scheduling algorithms presented in this paper make good use of the hardware capabilities of such processors, especially in hard real-time environments.