Dynamic Voltage Scaling for Low-Power Hard Real-Time Systems

A Taxonomy of Intra DVS
The main feature of an IntraDVS algorithm is how it selects the program points where the voltage and clock will be scaled. Depending on the selection mechanism, we can classify IntraDVS algorithms into five categories, as shown in Figure 18.1.
Segment-based IntraDVS techniques partition a task into several segments [12,13]. After executing a segment, they adjust the clock speed and supply voltage, exploiting the slack times from the already-executed segments of a program. The segment-based IntraDVS approach was later extended into collaborative IntraDVS techniques [14]. In collaborative IntraDVS, the OS and the compiler work together to scale the supply voltage within a task. At the off-line stage, the compiler partitions an application into several segments and instruments the application with temporal hints. At run time, the operating system periodically changes the processor's clock frequency and voltage based on the actual execution behavior (combined with the inserted temporal hints).
Path-based IntraDVS techniques use the control flow information to find slack times [3,15,16]. They select all the program locations where the remaining workload is changed, and insert voltage scaling codes into the identified program locations at compile time. The inserted voltage scaling code modifies the supply voltage when executed, exploiting slack times coming from run-time variations of different execution paths.
In designing a path-based IntraDVS algorithm, two key issues exist. The first issue is how to predict the remaining execution cycles. Depending on the prediction method, several kinds of IntraDVS algorithms have been proposed: one using the remaining worst-case execution path (RWEP-IntraDVS) [3], one using the remaining average-case execution path (RAEP-IntraDVS) [15], and one using the remaining optimal-case execution path (ROEP-IntraDVS) [16]. The RAEP-IntraDVS and the ROEP-IntraDVS generate more energy-efficient schedules than the RWEP-IntraDVS because they exploit the profile information of a task's execution. While the remaining execution cycles are predicted based on the most frequent execution path in the RAEP-IntraDVS [15], the ROEP-IntraDVS uses the optimal predicted execution cycles based on the profile information [16].
The second issue of path-based IntraDVS is how to determine the voltage scaling points in the program code. The optimal points are the earliest points where we can detect the run-time slack. While the original
path-based IntraDVS techniques find the voltage scaling points using the control flow information of a program, the parametric IntraDVS and look-ahead IntraDVS techniques identify the voltage scaling points using the data flow information of the program as well as the control flow information to find earlier voltage scaling points [17,18].
Memory-aware IntraDVS utilizes the CPU idle times due to external memory stalls. While the compiler-driven IntraDVS [19] identifies the program regions where the CPU is mostly idle due to memory stalls at compile time, the event-driven IntraDVS [20,21] uses several performance-monitoring events to capture the CPU idle time at run time.
Stochastic IntraDVS uses stochastic information on the program's execution time [4,22]. This technique is motivated by the idea that, from the energy consumption perspective, it is usually better to "start at low speed and accelerate execution later when needed" than to "start at high speed and reduce the speed later when slack time is found" in the program execution. It finds a speed schedule that minimizes the expected energy consumption while still meeting the deadline. A task starts executing at a low speed and then gradually accelerates to meet the deadline. Since an execution of a task might not follow the worst-case execution path (WCEP), the high-speed regions near the end of the schedule may never be executed.
The main limitation of these IntraDVS techniques is that they have no global view of the task set in multitask environments. Based on the observation that cooperation between IntraDVS and InterDVS can result in more energy-efficient systems, the hybrid IntraDVS technique selects either the intra mode or the inter mode when slack times become available during the execution of a task [23]. In the inter mode, the slack time identified during the execution of a task is transferred to the following tasks; the speed of the current task is therefore not changed by the slack time it produces itself. In the intra mode, the slack time is used for the current task itself, reducing its own execution speed.
Segment-Based IntraDVS
The segment-based IntraDVS technique partitions a task into several segments and considers them as sequentially executed tasks [12]. The voltage-scheduling algorithm of segment-based IntraDVS can be summarized as follows:

1. Partition the task into segments and estimate the execution time of each segment under a reference clock frequency fref.
2. Insert speed-adjustment code at each segment boundary.
3. When the ith segment starts its execution at run time, compute the target clock frequency ftar so that the remaining segments exploit the slack time Tslack produced by the already-executed segments, where ACC is the accumulated execution time from the first segment to the (i-1)th segment:

    ftar = fref × Trem / (Trem + Tslack)

where Trem is the estimated remaining execution time of the ith through the last segments under fref, and Trem + Tslack equals the time left until the deadline, D - ACC.
Figure 18.2 illustrates the parameters defined above, assuming that the execution of the second segment has just been completed. Steps 1 and 2 are performed at compile time, while step 3 is performed at run time. Using this technique, each segment can utilize the slack times generated by the previously executed segments.
Here, Tavg(j) is the average-case execution time of the jth segment under the clock frequency fref. Although the last approach usually works better than the other approaches, it should be used only when the processor provides a clock frequency under which the remaining worst-case workload can still be completed before the deadline. Therefore, the following condition should be satisfied:

    fref × (Twc(i) + Twc(i+1) + ... + Twc(n)) / (D - ACC) ≤ fmax

where Twc(j) is the worst-case execution time of the jth segment under fref, and fmax is the maximum clock speed provided by the variable-voltage processor.
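As a concrete illustration, the run-time step can be sketched in C as follows. The segment table, the deadline D, and the set_speed() hook are illustrative assumptions rather than part of any particular platform; the sketch also applies the fmax feasibility guard discussed above.

    /* A minimal sketch of the run-time speed-setting step of segment-based
     * IntraDVS.  The per-segment worst-case times under F_REF (wcet_ref[]),
     * the deadline D, and set_speed() are illustrative assumptions. */
    #include <stddef.h>

    #define NSEG 8
    static const double wcet_ref[NSEG] =      /* worst-case time at F_REF (s) */
        { 0.02, 0.05, 0.01, 0.04, 0.03, 0.02, 0.05, 0.03 };
    static const double F_REF = 200e6;        /* reference clock frequency (Hz) */
    static const double F_MAX = 200e6;        /* maximum clock frequency (Hz)   */
    static const double D     = 0.25;         /* deadline (s)                   */

    extern void set_speed(double freq_hz);    /* hypothetical DVS hook */

    /* Called when segment i starts; acc is the accumulated execution time
     * of segments 0 .. i-1 measured at run time (ACC in the text). */
    void adjust_speed(size_t i, double acc)
    {
        double remaining_ref = 0.0;           /* remaining worst-case time at F_REF */
        for (size_t j = i; j < NSEG; j++)
            remaining_ref += wcet_ref[j];

        /* Stretch the remaining worst-case work over the time left until D. */
        double f_tar = F_REF * remaining_ref / (D - acc);
        if (f_tar > F_MAX)
            f_tar = F_MAX;                    /* feasibility guard from the text */
        set_speed(f_tar);
    }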
A key problem of the segment-based IntraDVS is how to divide an application into segments. Automatically partitioning an application code is not trivial. For example, consider the problem of determining the granularity of speed changes. Ideally, the more frequently the voltage is changed, the more efficiently the application can exploit dynamic slacks, saving more energy. However, there is energy and time overhead associated with each speed adjustment. Therefore, we should determine how far apart any two voltage scaling points should be. Since the distance between two consecutive voltage scaling points varies depending on the execution path taken at run time, it is difficult to determine the length of voltage scaling intervals statically.
One solution is to use both the compiler and the operating system to adapt the performance and reduce the energy consumption of the processor. The collaborative IntraDVS [14] uses such an approach and provides a systematic methodology to partition a program into segments, considering branches, loops, and procedure calls. The compiler does not insert voltage scaling code between segments but annotates the application program with so-called power management hints (PMHs), based on the program structure and the estimated worst-case performance. A PMH conveys path-specific run-time information about a program's progress to the operating system. It is a very low-cost instrumentation that collects path-specific information about how the program is behaving relative to its worst-case performance. The operating system periodically invokes a power management point (PMP) to change the processor's performance based on the timing information from the PMHs. This collaborative approach has the advantage that the lightweight hints can collect accurate timing information for the operating system without actually changing the performance. Further, the periodicity of performance/energy adaptation can be controlled independently of the PMHs to better balance the high overhead of adaptation.
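A rough sketch of this division of labor is shown below: the compiler-inserted hint only records progress, and the OS-invoked PMP periodically turns that progress into a frequency. All names (pmh(), pmp_timer_handler(), read_time(), set_speed()) are hypothetical stand-ins.

    #include <stdint.h>

    extern double  read_time(void);         /* hypothetical wall-clock read (s) */
    extern void    set_speed(double hz);    /* hypothetical DVS hook            */

    static volatile double wcet_remaining;  /* worst-case time left at F_MAX,
                                               updated by the hints             */
    static double deadline;                 /* absolute deadline (s)            */
    static const double F_MAX = 400e6;

    /* PMH: compiled into the application at segment boundaries.  It is cheap:
     * it only stores the statically computed remaining worst-case time. */
    static inline void pmh(double remaining_wc_at_fmax)
    {
        wcet_remaining = remaining_wc_at_fmax;
    }

    /* PMP: invoked periodically by the OS timer, independently of where the
     * PMHs happen to be placed. */
    void pmp_timer_handler(void)
    {
        double slack = deadline - read_time();
        if (slack <= 0.0) { set_speed(F_MAX); return; }
        /* Run just fast enough to finish the remaining worst case in time. */
        double f = F_MAX * (wcet_remaining / slack);
        set_speed(f > F_MAX ? F_MAX : f);
    }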
We can also partition a program based on the workload type. For example, the required decoding time for each frame in an MPEG decoder can be separated into two parts [25]: a frame-dependent (FD) part and a frame-independent (FI) part. The FD part varies greatly according to the type of the incoming frame, whereas the FI part remains constant regardless of the frame type. The computational workload of an incoming frame's FD part (WFD^P) can be predicted by using a frame-based history, i.e., by maintaining a moving average of the FD time for each frame type. The FI time (WFI) is not predicted, since it is constant for a given video sequence. Because the total predicted workload is (WFD^P + WFI), given a deadline D, the program starts with the clock speed fFD as follows:

    fFD = (WFD^P + WFI) / D
For the MPEG decoder, since the FD part (such as the IDCT and motion compensation steps) is executed before the FI part (such as the frame dithering and frame display steps), the FD time prediction error can be recovered inside the frame itself, i.e., during the FI part, so that the decoding time of each frame still satisfies the given frame rate. This is possible because the workload of the FI part is constant for a given video stream and is easily obtained after decoding the first frame. When a misprediction occurs (which can be detected by comparing the predicted FD time (WFD^P) with the actual FD time (WFD^A)), an appropriate action must be taken during the FI part to compensate for the misprediction.
If the actual FD time was smaller than the predicted value, there will be an idle interval before the deadline. Hence, we can scale down the voltage level during the FI part's processing. On the other hand, if the actual FD time was larger than the predicted value, a corrective action must be taken to meet the deadline. This is accomplished by scaling up the voltage and frequency during the FI part so as to make up for the lost time. Since the elapsed time consumed during the FD part is WFD^A/fFD, the FI part starts with the following clock speed fFI:

    fFI = WFI / (D - WFD^A/fFD)
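The two-step speed setting can be sketched as follows, assuming a per-frame-type moving average as the FD predictor; the frame types, the averaging weight, and set_speed() are illustrative assumptions.

    /* A sketch of the per-frame speed setting for the MPEG decoder. */
    enum frame_type { I_FRAME, P_FRAME, B_FRAME, NUM_TYPES };

    extern void set_speed(double hz);     /* hypothetical DVS hook */

    static double w_fd_avg[NUM_TYPES];    /* moving average of FD cycles per type */
    static double w_fi;                   /* FI cycles, constant per video stream */
    static const double ALPHA = 0.25;     /* moving-average weight */

    /* Called at the start of each frame, with the deadline D seconds away. */
    double start_frame(enum frame_type t, double D)
    {
        double w_fd_pred = w_fd_avg[t];
        double f_fd = (w_fd_pred + w_fi) / D;   /* fFD from the text */
        set_speed(f_fd);
        return f_fd;
    }

    /* Called when the FD part finishes; w_fd_actual is the measured FD cycles. */
    void start_fi_part(enum frame_type t, double D, double f_fd, double w_fd_actual)
    {
        double elapsed = w_fd_actual / f_fd;    /* time spent on the FD part */
        double f_fi = w_fi / (D - elapsed);     /* recover any misprediction */
        set_speed(f_fi);
        /* Update the per-type history for the next frame of this type. */
        w_fd_avg[t] = (1.0 - ALPHA) * w_fd_avg[t] + ALPHA * w_fd_actual;
    }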
Path-Based IntraDVS
At a specific program point, two kinds of slack times can be identified: backward slack and forward slack. While the backward slack is generated by the early completion of already-executed program segments, the forward slack is generated when a change in the remaining workload can be estimated in advance. The segment-based IntraDVS utilizes the backward slack times, whereas the path-based IntraDVS exploits the forward slack times based on the program's control flow.
Consider a hard real-time program P with a deadline of 2 s, shown in Figure 18.3(a). The control flow graph (CFG) GP of the program P is shown in Figure 18.3(b). In GP, each node represents a basic block of P and each edge indicates the control dependency between basic blocks. The number within each node indicates the number of execution cycles of the corresponding basic block. The back edge from b5 to bwh models the while loop of the program P. The WCEC of this program is 2 × 10^8 cycles.
Path-based IntraDVS [3] adjusts the clock speed within the task depending on the control flow. For example, when the program control flow follows the execution path p1 = (b1, b2, bif, b6, b8) of Figure 18.3(b), the clock speed is initially set to complete the WCEP before the deadline, i.e., 100 MHz. However, we can reduce the clock speed at the edge (b1, b2) because we know that this control flow does not follow the WCEP. The path-based IntraDVS algorithm identifies appropriate program locations where the clock speed should be adjusted and inserts clock and voltage scaling code at the selected program locations at compile time. The branching edges of the CFG, i.e., branch or loop statements, are the candidate locations for inserting voltage scaling code because the remaining execution cycles change at those locations. The path-based IntraDVS consists of two key steps: (1) predicting the execution path of the application program at compile time and (2) adjusting the clock speed depending on the real execution path taken at run time. In the first step, using the predicted execution path, we calculate the remaining predicted execution cycles (RPEC) d(bi) at a basic block bi, which is a branching node in GP, as follows:

    d(bi) = c(bi) + P(d(bj), d(bk))

where bj and bk are the immediate successor nodes of bi, c(bi) is the number of execution cycles of bi, and P is the prediction function that selects the reference path.
For a loop L, we use the following equation to predict the remaining execution cycles:

    d(L) = c(HL) × (Npred(L) + 1) + c(BL) × Npred(L) + d(postL)

where c(HL) and c(BL) denote the execution cycles of the header and the body of the loop L, respectively, Npred(L) is the predicted number of loop iterations, and postL denotes the successor node of the loop, which is executed just after the loop terminates.
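The RPEC computation can be sketched as a recursive walk over the acyclic portion of the CFG, parameterized by the prediction function P (passing max as P yields the RWEP policy described below). The CFG encoding is an illustrative assumption.

    #include <stdint.h>
    #include <stddef.h>

    struct block {
        uint64_t cycles;           /* c(b): execution cycles of the block  */
        struct block *succ[2];     /* up to two successors; NULL if absent */
    };

    typedef uint64_t (*pred_fn)(uint64_t, uint64_t);

    uint64_t pred_worst(uint64_t a, uint64_t b) { return a > b ? a : b; }

    /* d(b) = c(b) + P(d(succ0), d(succ1)) for a branching node; for
     * straight-line code it degenerates to c(b) + d(succ0).  Loops are
     * handled separately, as in the text, via the predicted iteration count. */
    uint64_t rpec(const struct block *b, pred_fn P)
    {
        if (b == NULL) return 0;
        if (b->succ[1] != NULL)    /* branching node */
            return b->cycles + P(rpec(b->succ[0], P), rpec(b->succ[1], P));
        return b->cycles + rpec(b->succ[0], P);
    }

    /* d(L) = c(HL)*(Npred+1) + c(BL)*Npred + d(postL). */
    uint64_t rpec_loop(uint64_t c_header, uint64_t c_body,
                       uint64_t n_pred, uint64_t d_post)
    {
        return c_header * (n_pred + 1) + c_body * n_pred + d_post;
    }

    /* RWEC: instantiate the prediction function with max(). */
    uint64_t rwec(const struct block *b) { return rpec(b, pred_worst); }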
At run time, if the actual execution deviates from the (predicted) reference path (say, by a branch instruction), the clock speed can be adjusted depending on the difference between the remaining execution cycles of the reference path and those of the newly taken execution path. If the new execution path takes significantly longer to complete than the reference path, the clock speed should be raised to meet the deadline constraint. In contrast, if the new execution path finishes earlier than the reference path, the clock speed can be lowered to save energy.
For run-time clock speed adjustment, voltage scaling code is inserted at the selected program locations at compile time. The branching edges of the CFG, i.e., branch or loop statements, are the candidate locations for inserting the voltage scaling code. They are called voltage scaling points (VSPs) because the clock speed and voltage are adjusted at these points. There are two types of VSPs: the B-type VSP corresponds to a branch statement, while the L-type VSP maps to a loop statement. VSPs can also be categorized into Up-VSPs and Down-VSPs, where the clock speed is raised or lowered, respectively. At each VSP (bi, bj), the clock speed is determined using d(bi) and d(bj) as follows:

    f(bj) = d(bj)/Tj = f(bi) × r(bi, bj),  where r(bi, bj) = d(bj)/(d(bi) - c(bi))

where f(bi) and f(bj) are the clock speeds at the basic blocks bi and bj, respectively, Tj is the remaining time until the deadline from the basic block bj, and r(bi, bj) is the speed update ratio of the edge (bi, bj).
For a loop, if the actual number of loop iterations is Nactual, the clock speed is changed after the loop as follows:

    f(postL) = f × d(postL) / (d(postL) + (Npred(L) - Nactual) × (c(HL) + c(BL)))

where f is the clock speed at the end of the loop; when Nactual is smaller than Npred(L), the speed is lowered in proportion to the cycles that were predicted but never executed.
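The two VSP types can be sketched as follows; the d() values and cycle counts are assumed to have been precomputed at compile time, and set_speed() is a hypothetical DVS hook.

    extern void set_speed(double hz);   /* hypothetical DVS hook */

    /* B-type VSP on edge (bi, bj):
     * f(bj) = f(bi) * r(bi, bj), with r(bi, bj) = d(bj) / (d(bi) - c(bi)). */
    double bvsp(double f_bi, double d_bi, double c_bi, double d_bj)
    {
        double f_bj = f_bi * d_bj / (d_bi - c_bi);
        set_speed(f_bj);
        return f_bj;
    }

    /* L-type VSP after loop L: correct the speed for the difference between
     * the predicted and the actual number of iterations. */
    double lvsp(double f, double d_post, double c_header, double c_body,
                long n_pred, long n_actual)
    {
        double mispredicted = (double)(n_pred - n_actual) * (c_header + c_body);
        double f_post = f * d_post / (d_post + mispredicted);
        set_speed(f_post);   /* fewer iterations than predicted => slow down */
        return f_post;
    }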
IntraDVS Using Worst-Case Timing Information
Among the prediction policies for the remaining execution cycles, the simplest and most conservative one is to use the remaining worst-case execution cycles (RWEC) as the RPEC. This technique is called RWEP-IntraDVS [3]. For the prediction function P, the following function is used to calculate the RWEC dw:

    dw(bi) = c(bi) + max(dw(bj), dw(bk))
For example, in Figure 18.3(b), d(bi) = c(bi) + max(d(bj), d(bk)) = 6 × 10^7 cycles. The predicted total execution cycles calculated by this equation are the same as the WCEC, and there are only Down-VSPs in the CFG. So the speed is lowered at all VSPs, and there is no possibility of a deadline miss.
For the L-type VSP, we should use the maximum loop iteration number Nw as follows:

    dw(L) = c(HL) × (Nw + 1) + c(BL) × Nw + dw(postL)
IntraDVS Using Profile Information
Although the RWEP-IntraDVS reduces the energy consumption significantly while guaranteeing the deadline, it is a pessimistic approach because it always predicts that the longest path will be executed. If we use the average-case execution path (ACEP) as the reference path, a more efficient voltage schedule can be generated. To find the ACEP, we should utilize profile information on the program execution.
RAEP-IntraDVS
The ACEP is the execution path that is most likely to be executed. The average-case remaining execution cycles da are defined by the following equation, where bj and bk are the immediate successor nodes of a branching node bi and p(bj) and p(bk) are their execution probabilities obtained from profiling:

    da(bi) = c(bi) + da(bj)  if p(bj) ≥ p(bk);  da(bi) = c(bi) + da(bk)  otherwise
It can be easily shown that using the ACEP instead of the WCEP is generally more energy-efficient. For a typical program, about 80% of the program execution occurs in only 20% of its code, the so-called hot paths [26]. To achieve high energy efficiency, an IntraDVS algorithm should be optimized so that these hot paths are energy-efficient. If we use one of the hot paths as the reference path, the speed change graph for the hot paths will be a nearly flat curve with few changes in the clock speed, which gives the best energy efficiency for a given amount of work.
However, it is not always energy-efficient to use the ACEP as the reference path, because only the probability of the execution path is considered when deciding the reference path. In particular, when the average-case execution cycles are significantly smaller than the WCEC, the energy consumption of RAEP-IntraDVS can be larger than that of RWEP-IntraDVS if the WCEP is actually executed at run time. If we modify the definition of the average-case remaining execution cycles as follows, we can consider both the probability and the execution cycles of the remaining execution paths:

    da(bi) = c(bi) + p(bj) × da(bj) + p(bk) × da(bk)

i.e., the expected number of remaining execution cycles.
Using the profile information, we can compute the optimal voltage schedule. Consider a simple CFG with only one branching node bi. Basic blocks bj and bk are the immediate successor nodes of bi. Following the execution of bi, either bj or bk is executed, with probability pj or pk, respectively. Then, with the energy per cycle growing quadratically with the clock speed, we can represent the average energy consumption as follows:

    Eavg = c(bi) × f(bi)^2 + pj × E(bj) + pk × E(bk)

where E(bx) is the average energy consumed after bx. Minimizing Eavg over the speed assignments yields

    d(bi) = c(bi) + (pj × d(bj)^3 + pk × d(bk)^3)^(1/3)    (18.18)

We call d(bi) of Eq. (18.18) the length of the ROEP starting from bi. If we set the speed of each basic block bi to d(bi)/Ti, where Ti is the remaining time to the deadline from bi, we obtain the optimal voltage schedule. (The detailed proof is found in Ref. [16].) Since this technique uses a virtual optimal execution path as the reference path, it is called ROEP-IntraDVS. One special feature of ROEP-IntraDVS is that none of the feasible execution paths is the reference path. Therefore, every branching edge becomes a voltage scaling point, and it is necessary to select the final VSPs from the candidate VSPs considering the voltage scaling overhead.
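A sketch of the ROEP length computation under the convex energy model behind Eq. (18.18) (energy proportional to C^3/T^2 for C cycles executed in time T) is given below; the encoding is illustrative.

    #include <math.h>

    /* Length of the ROEP at a branching node bi whose successors bj and bk
     * are taken with probabilities pj and pk = 1 - pj (Eq. (18.18)). */
    double roep_len(double c_bi, double pj, double d_bj, double d_bk)
    {
        double pk = 1.0 - pj;
        return c_bi + cbrt(pj * d_bj * d_bj * d_bj + pk * d_bk * d_bk * d_bk);
    }

    /* The block speed is then d(bi)/Ti, where Ti is the time to the deadline. */
    double roep_speed(double d_bi, double t_i) { return d_bi / t_i; }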
IntraDVS Using Data Flow Analysis
The original path-based IntraDVS techniques select the voltage scaling points using the control flow information (i.e., branches and loops) of a target program. For example, in Figure 18.4(a), the path-based IntraDVS algorithm places a B-type VSP where the RWEC decreases because the basic block b4 is not executed. However, we can infer the direction of the branch earlier than the execution of the basic block b2, because the values of x and y are not changed after the basic block b1. Figure 18.4(b) shows the modified program, which adjusts the clock speed and the supply voltage between b1 and b2. The program in Figure 18.4(b) consumes less energy than the one in Figure 18.4(a) because the clock speed is more uniform across multiple basic blocks when x + y ≤ 0. This example shows that we can further improve the energy efficiency of IntraDVS if we can move voltage scaling points to earlier parts of the program.
Parametric IntraDVS
The parametric IntraDVS technique [17] scales the voltage/frequency based on a parameterization of the RWEC of a task. The parametric RWEC formula of the task can be determined by a static analysis of the program. For example, in Figure 18.5(a), the original IntraDVS technique uses the variable l to keep track of the number of loop iterations. At the L-type VSP, the clock speed is reduced if l is smaller than the maximum number M of loop iterations. However, if we analyze the loop using a parametric WCET analysis technique such as that in Ref. [27], we find that the basic block b2 is executed ⌈c/k⌉ times. Therefore, if ⌈c/k⌉ < M, the voltage can be lowered. If we know the values of c and k in advance, we can reduce the clock speed before the loop is executed. The parametric IntraDVS technique can insert such a P-type VSP before the loop, as shown in Figure 18.5(b).
Look-ahead IntraDVS
It is not trivial to derive the number of loop iterations from a program, as the parametric IntraDVS requires. The look-ahead IntraDVS (LaIntraDVS) technique [18] uses data flow analysis to identify earlier voltage scaling points. Using data flow analysis, we can determine the program locations where each variable is defined and used. For LaIntraDVS, the following post-processing steps are needed after the voltage scaling points are selected by the original IntraDVS algorithm:
1. Given an original voltage scaling point s, we identify the branch condition C(s) which is the necessary condition for s to be executed at run time. Using the variables in the expression of C(s), we compose a set of condition variables V(s). For example, in Figure 18.4(a), the branch condition for the B-type voltage scaling point is C(s) = (x + y > 0). The variables in C(s) are x and y (i.e., V(s) = {x, y}).
2. The look-ahead point set IL(s, vi) is identified for each variable vi in V(s) using a data flow analysis technique. The look-ahead points are the earliest program points where we can get the value of vi that will not be changed until s (i.e., IL(s, x)= {b1} and IL(s, y) = {b1}).
3. Using the look-ahead point sets, the look-ahead voltage scaling points, after which the value of branch condition C(s) does not change, are identified.
4. We insert the voltage scaling code at the look-ahead voltage scaling points. In Figure 18.4(a), the clock speed would be adjusted only at the B-type VSP after the branch; LaIntraDVS instead inserts the scaling code at the look-ahead point between b1 and b2, as in Figure 18.4(b).
If the condition variables are expressed in terms of other variables at the identified look-ahead point, we can recursively examine still earlier look-ahead points using the multistep LaIntraDVS, which tries to find earlier look-ahead points by following the data flow of those variables. In this case, we need to consider the overhead of the compensation code that should be inserted at the final look-ahead VSP to reflect the instructions to be executed at the intermediate look-ahead points. (The look-ahead IntraDVS technique can be considered a more general version of Walsh's parametric IntraDVS technique [17], because LaIntraDVS handles both branch and loop structures, while the parametric IntraDVS technique considers only loop structures. In addition, LaIntraDVS uses the multistep approach.)
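The transformation can be illustrated with the following hypothetical C fragments, which mirror the structure of Figure 18.4; set_speed_down() stands for the inserted voltage scaling code, and the b*_work() bodies are placeholders.

    extern void set_speed_down(void);   /* inserted voltage scaling code */
    extern int  b2_work(void), b3_work(void), b4_work(void);

    /* (a) Original path-based placement: the Down-VSP fires only after the
     * branch on x + y is actually evaluated. */
    void original(int x, int y)
    {
        b2_work();                      /* x and y are not modified here */
        if (x + y > 0) { b4_work(); }
        else           { set_speed_down(); b3_work(); }
    }

    /* (b) Look-ahead placement: x and y are not modified after b1, so the
     * branch outcome is known before b2 and the scaling can be hoisted. */
    void look_ahead(int x, int y)
    {
        if (x + y <= 0) set_speed_down();   /* earlier, more uniform speed */
        b2_work();
        if (x + y > 0) { b4_work(); }
        else           { b3_work(); }
    }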
Memory-Aware IntraDVS
The memory-aware IntraDVS differs from the path-based IntraDVS in the type of CPU slack being exploited. While the path-based IntraDVS takes advantage of the difference between the predicted execution path and the real execution path of an application, the memory-aware IntraDVS exploits slack arising from memory stalls. The idea is to identify the program regions in which the CPU is mostly idle due to memory stalls, and to slow them down for energy reduction. If the system architecture supports the overlapped execution of CPU and memory operations, such a CPU slowdown will not result in serious performance degradation, since the slow CPU speed is hidden behind the memory hierarchy accesses that are on the critical path. There are two kinds of approaches to identifying the memory-bound regions: analyzing the program at compile time and monitoring hardware events at run time.
Compiler-Directed IntraDVS
We can divide the total program execution time T into three portions:

    T = Wc + Wm + Wb

where Wc is the time during which only the CPU is busy (on-chip computation), Wm the time during which only the external memory is busy (i.e., the CPU is stalled), and Wb the time during which both are busy. When the CPU clock speed is reduced, Wm does not change, while Wc and Wb increase. Assume that the program at the reduced CPU clock speed behaves exactly the same for every program step as the program at the normal speed, but is executed in "slow motion". Then, if we scale down the clock speed by a factor of η, the execution time will be increased as follows:

    T(η) = η × (Wc + Wb) + Wm
The compiler-directed IntraDVS partitions a program into multiple program regions. It assigns different slowdown factors to selected regions so as to maximize the overall energy savings without violating a global performance penalty constraint. The application program is partitioned so as not to introduce too much overhead due to switches between different voltages/frequencies; that is, the granularity of a region needs to be large enough to compensate for the overhead of voltage and frequency adjustments.
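Under the time model above, the largest admissible slowdown factor of a region follows directly from its share of the performance penalty budget, as the following sketch shows; the region statistics and the budget are assumptions.

    struct region {
        double wc;   /* CPU-only busy time at full speed (s)   */
        double wm;   /* memory-only busy time, CPU stalled (s) */
        double wb;   /* time with CPU and memory both busy (s) */
    };

    /* Largest slowdown eta such that the region's extra time
     * (eta - 1)*(Wc + Wb) stays within its penalty budget; penalty is,
     * e.g., 0.05 for an allowed 5% slowdown of this region. */
    double max_slowdown(const struct region *r, double penalty)
    {
        double t   = r->wc + r->wm + r->wb;   /* T at full speed  */
        double cpu = r->wc + r->wb;           /* clock-dependent part */
        if (cpu <= 0.0) return 1.0;           /* nothing to scale */
        return 1.0 + penalty * t / cpu;       /* from T(eta) <= (1+p)*T */
    }

    /* The region then runs at f = f_max / eta; memory-bound regions
     * (large Wm, small Wc + Wb) get the largest eta and hence the
     * biggest energy savings for the same penalty. */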
Event-Driven IntraDVS
It is very difficult to calculate the exact CPU idle time of a program statically, e.g., at compile time, because on-chip and off-chip latencies are strongly affected by dynamic behavior, such as cache statistics and the different access overheads of different external devices. These unpredictable dynamic behaviors should therefore be captured at run time.
Event-driven IntraDVS [21] makes use of run-time information about the external memory access statistics to perform CPU voltage and frequency scaling, with the goal of minimizing the energy consumption while controlling the performance penalty. The technique relies on dynamically constructed regression models that allow the CPU to calculate the expected workload and slack time for the next time slot. This is achieved by estimating and exploiting the ratio of the total off-chip access time to the total on-chip computation time. To capture the CPU idle time, several performance-monitoring events (such as those collected by the performance-monitoring unit [PMU] of the XScale processor) can be used. Using these events, we can count the number of instructions executed and the number of external memory accesses at run time.
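A minimal sketch of such a policy is shown below. The counter-reading functions are hypothetical stand-ins for a PMU driver, and the simple monotone mapping from the off-chip/on-chip ratio to a frequency abstracts the regression models described above.

    extern unsigned long pmu_instr_count(void);      /* hypothetical PMU reads */
    extern unsigned long pmu_mem_access_count(void);
    extern void set_speed(double hz);                /* hypothetical DVS hook  */

    static const double F_MAX = 400e6, F_MIN = 100e6;

    void on_timer_tick(void)
    {
        static unsigned long last_instr, last_mem;
        unsigned long instr = pmu_instr_count();
        unsigned long mem   = pmu_mem_access_count();
        double d_instr = (double)(instr - last_instr);
        double d_mem   = (double)(mem   - last_mem);
        last_instr = instr;
        last_mem   = mem;

        /* Ratio of off-chip work to on-chip work in the last slot: the more
         * memory-bound the slot, the less a lower CPU clock hurts. */
        double beta = (d_instr > 0.0) ? d_mem / d_instr : 0.0;
        double f = F_MAX / (1.0 + beta);   /* simple monotone mapping */
        set_speed(f < F_MIN ? F_MIN : f);
    }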
Stochastic IntraDVS

Stochastic IntraDVS can be implemented completely inside the operating system and does not require special compiler support. Consequently, it interferes less with the actual task than the compiler-based methods. This also means that tasks do not need to be recompiled when the architecture changes. In principle, this approach computes a voltage schedule only once, when the task starts executing. During task execution, no rescheduling is done, but the supply voltage is changed at well-established intervals.
Stochastic IntraDVS is based on the observation that tasks with variable execution times usually finish before their WCET. Therefore, it makes sense to execute at a low voltage first and accelerate the execution later, instead of executing at a high voltage first and decelerating. In this manner, if a task instance does not take the WCET, the high-voltage regions can be skipped altogether. This approach uses stochastic data to build a multiple-voltage schedule in which the processor speed increases toward the deadline. The purpose of using stochastic data is to minimize the average-case energy consumption.
The stochastic voltage schedule for a task is obtained using the probability distribution function of the task's execution time. This distribution can be obtained off-line, via simulations, or built and improved at run time. Let us denote by X the random variable associated with the number of clock cycles used by a task instance. We will use the cumulative distribution function, cdfx, of the random variable X (i.e., cdfx = P(X ≤ x)). This function reflects the probability that a task instance finishes before a certain number of clock cycles. If W is the worst-case number of clock cycles, cdfW = 1. Deciding a voltage schedule for the task means that for every clock cycle up to W, we decide a specific voltage level (and the corresponding processor speed). Each cycle y, depending on the voltage level used, will consume a specific energy, ey. But each of these cycles is executed only with a certain probability, so the average energy consumed by cycle y can be computed as (1 - cdfy)ey. To obtain the average energy for the whole task, we have to consider all the cycles up to W:

    E = Σ (1 - cdfy) × ey,  summed over y = 1, ..., W
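Given the cdf, the average energy of any candidate per-cycle speed schedule can be evaluated directly, as in the following sketch; the quadratic energy-per-cycle model is an assumption.

    /* Evaluate E = sum over y = 1..W of (1 - cdf[y]) * e_y for a candidate
     * schedule.  Both arrays have W + 1 entries, with index 0 unused, so
     * cdf[y] and freq[y] correspond to cycle y as in the text. */
    double expected_energy(const double *cdf,   /* cdf[y] = P(X <= y)       */
                           const double *freq,  /* planned speed of cycle y */
                           int W)
    {
        double e = 0.0;
        for (int y = 1; y <= W; y++) {
            double e_y = freq[y] * freq[y];     /* energy of cycle y (relative,
                                                   assuming e(f) ~ f^2)      */
            e += (1.0 - cdf[y]) * e_y;          /* weighted by P(cycle runs) */
        }
        return e;
    }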
Though we can obtain the optimal schedule by assigning an ideal clock cycle time to each clock cycle, such a schedule is impractical to implement because variable-voltage processors provide only a finite number of clock/voltage levels. Moreover, the OS must issue a voltage scaling command each time it wants to change the speed, even if the processor had continuous clock levels. Therefore, the ideal clock cycle times should be converted to available values by distributing each ideal clock cycle time between two existing clock frequencies. Using this speed dithering technique, the stochastic IntraDVS technique can be supported on real platforms.
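A sketch of the dithering computation is shown below: an ideal frequency f* that the processor does not support, with f1 < f* < f2 among the supported frequencies, is approximated by splitting an interval of C cycles between f1 and f2 so that the total execution time C/f* is preserved.

    /* Split c cycles between two supported frequencies f1 < f_star < f2
     * so that c1/f1 + c2/f2 = c/f_star with c1 + c2 = c. */
    struct dither { double f1, f2; double c1, c2; };

    struct dither dither_split(double f_star, double f1, double f2, double c)
    {
        struct dither d;
        d.f1 = f1;
        d.f2 = f2;
        d.c1 = c * f1 * (f2 - f_star) / (f_star * (f2 - f1));  /* cycles at f1 */
        d.c2 = c - d.c1;                                       /* cycles at f2 */
        return d;
    }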