Architecture and Design Flow Optimizations for Power-Aware FPGAs: CAD Techniques for Reducing Power
CAD Techniques for Reducing Power
A typical FPGA CAD flow begins with technology mapping, followed by clustering, placement, and routing, and ends with a step that generates the configuration bits for download to the FPGA (see Figure 20.5). Current commercial tools usually optimize for either area or speed, but each of these steps can be optimized for power as well. This section surveys some of those techniques.
Technology mapping converts a netlist composed of logic gates into a netlist of LUTs. A performance-driven mapper minimizes the depth of the combinational network; tools such as FlowMap [21] and CutMap [22] produce depth-optimal mappings. These tools model each logic gate in the design as a node in a graph and each connection as an edge. To map into K-input LUTs, nodes are merged or decomposed into groups that have at most K incoming edges. Producing a depth-optimal mapping requires duplicating some of the nodes in the graph. Duplication increases the number of nodes, and with it the number of connections in the mapped netlist, which raises the power consumed by the design. A simple way to make the mapping power-aware is therefore to discourage node duplication [3]. A second approach is to include the switching activities of nets in the mapping algorithm: the mapper tries to construct LUTs that absorb high-activity nets, thereby removing them from the netlist. The combination of these two techniques has been estimated to reduce energy by 7.6% compared to CutMap and by 16.9% compared to FlowMap [4]. Li et al. [23] proposed another algorithm, which uses an efficient network-flow computation to perform low-power technology mapping; they report a 14% power reduction compared to CutMap.
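The idea of absorbing high-activity nets can be illustrated with a toy cut-selection step. This is a minimal sketch, not the algorithm of [4] or [23]: it assumes candidate cuts for a node have already been enumerated, and it scores each cut by the total switching activity of the nets that remain exposed as LUT inputs. The function names, net names, and activity values are hypothetical.

```python
# Toy illustration only: choose, for one node, the K-feasible cut whose
# input nets have the lowest total switching activity. Nets that fall
# inside the chosen cut are absorbed into the LUT.

def cut_power_cost(cut_inputs, activity):
    """Sum of switching activities of the nets left exposed as LUT inputs."""
    return sum(activity[net] for net in cut_inputs)

def pick_low_power_cut(candidate_cuts, activity, k):
    """Return the K-feasible cut with the lowest activity cost (ties: fewer inputs)."""
    feasible = [c for c in candidate_cuts if len(c) <= k]
    if not feasible:
        raise ValueError("no K-feasible cut")
    return min(feasible, key=lambda c: (cut_power_cost(c, activity), len(c)))

# Hypothetical data: net names and activities are made up for illustration.
activity = {"a": 0.10, "b": 0.15, "c": 0.05, "d": 0.20, "x": 0.80}
candidate_cuts = [
    frozenset({"x", "c"}),                  # leaves high-activity net x exposed
    frozenset({"a", "b", "c"}),             # absorbs x inside the LUT
    frozenset({"a", "b", "c", "d", "x"}),   # too wide for K = 4
]
best = pick_low_power_cut(candidate_cuts, activity, k=4)
```

Here the second cut is preferred because it hides the net with activity 0.80 inside the LUT, even though it uses one more input than the first cut.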
The next step in the CAD flow clusters (packs) LUTs and FFs into logic blocks. Modern FPGAs use clustered logic blocks that consist of multiple LUTs, FFs, and other logic elements. Connections within a cluster are faster and consume less power than intercluster connections. Clustering can be viewed as a preplacement step in which logic elements are packed together to reduce the complexity of the final placement. The packer can target area, in which case it tries to fill every cluster completely; delay, in which case it tries to pack logic elements on the critical path together; or routability, in which case it minimizes the number of distinct nets connected to a cluster. Clustering can also be made power-driven by packing logic elements so that high-activity nets use intracluster connections. Because intracluster connections have lower capacitance than intercluster connections, total power consumption falls when high-activity nets use the former. This technique is estimated to reduce energy by 12.6% for clusters of size 4 [4]. Another technique, which uses Rent’s rule to minimize routing area during clustering, reduces FPGA power by approximately 13% because it decreases the active wire length [24].
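Power-driven packing can be sketched as a greedy loop that pulls into each cluster the block sharing the most switching activity with it. This is a simplified illustration under stated assumptions, not the packer used in [4]: the seed choice, the attraction function, and all names and values below are hypothetical.

```python
def attraction(candidate, cluster, nets, activity):
    """Total activity of nets shared by `candidate` and blocks already in `cluster`."""
    return sum(
        activity[name]
        for name, pins in nets.items()
        if candidate in pins and any(b in cluster for b in pins)
    )

def pack(blocks, nets, activity, cluster_size):
    """Greedy packing: seed a cluster, then keep adding the most attracted block."""
    unpacked = list(blocks)
    clusters = []
    while unpacked:
        cluster = [unpacked.pop(0)]  # seed: first unpacked block (an assumption)
        while unpacked and len(cluster) < cluster_size:
            best = max(unpacked, key=lambda b: attraction(b, cluster, nets, activity))
            unpacked.remove(best)
            cluster.append(best)
        clusters.append(cluster)
    return clusters

def intercluster_activity(clusters, nets, activity):
    """Sum of activities of nets whose pins span more than one cluster."""
    owner = {b: i for i, c in enumerate(clusters) for b in c}
    return sum(
        activity[name]
        for name, pins in nets.items()
        if len({owner[b] for b in pins}) > 1
    )

# Hypothetical four-block netlist; each net lists the blocks it connects.
blocks = ["A", "B", "C", "D"]
nets = {"n1": ("A", "C"), "n2": ("A", "B"), "n3": ("B", "D"), "n4": ("C", "D")}
activity = {"n1": 0.90, "n2": 0.10, "n3": 0.70, "n4": 0.05}

clusters = pack(blocks, nets, activity, cluster_size=2)
power_driven = intercluster_activity(clusters, nets, activity)
naive = intercluster_activity([["A", "B"], ["C", "D"]], nets, activity)
```

In this example the greedy packer keeps the two high-activity nets (n1 and n3) inside clusters, so only low-activity nets cross cluster boundaries, whereas packing the blocks in list order would expose both high-activity nets.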
Clustering is followed by placement, in which these packed clusters are assigned coordinates on the FPGA. VPR uses simulated annealing for placement, with the following cost functions:
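The expressions below are a hedged reconstruction rather than a quotation. They follow the form commonly published for VPR's timing-driven placer and for the power-aware extension in [4]. The symbols λ, γ, q(i), bb_x(i), bb_y(i), D_max, and CritExponent do not appear in the surrounding text and should be treated as assumptions.

```latex
% Timing-driven placement: cost change for a proposed swap
\Delta C = \lambda \cdot \frac{\Delta \mathit{TimingCost}}{\mathit{PreviousTimingCost}}
         + (1 - \lambda) \cdot \frac{\Delta \mathit{WiringCost}}{\mathit{PreviousWiringCost}}

% Timing cost built from connection delays and criticalities
\mathit{TimingCost} = \sum_{(i,j)} \mathit{Delay}(i,j) \cdot \mathit{Crit}(i,j)^{\mathit{CritExponent}},
\qquad
\mathit{Crit}(i,j) = 1 - \frac{\mathit{Slack}(i,j)}{D_{\max}}

% Power-aware extension: an additional activity-weighted wirelength term
\Delta C_{\mathit{power}} = \Delta C
         + \gamma \cdot \frac{\Delta \mathit{PowerCost}}{\mathit{PreviousPowerCost}},
\qquad
\mathit{PowerCost} = \sum_{i \in \mathit{nets}} q(i) \cdot \left[ bb_x(i) + bb_y(i) \right] \cdot \mathit{Activity}(i)
```

Under this form, γ sets how strongly the annealer trades timing and wirelength against the estimated switching power of each net's bounding box.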
Lamoureux and Wilton [4] showed that the above cost function reduces FPGA power by 6.7% but simultaneously increases the critical-path delay by 4.0%. The power-delay product (energy) therefore remains almost the same: 0.933 × 1.040 ≈ 0.970, a reduction of only about 3%.
The next CAD step routes the nets. Routing in FPGAs is constrained by the number of tracks and switches available in the fabric. The timing-driven router of VPR uses a PathFinder-based algorithm with the following cost function to evaluate a routing track n while forming a connection from source i to sink j.
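A form consistent with the term definitions that follow, offered as a reconstruction and not as a quotation, is:

```latex
\mathit{Cost}(n) = \mathit{Crit}(i,j) \cdot \mathit{delay}(n)
                 + \left[ 1 - \mathit{Crit}(i,j) \right] \cdot \mathit{congestion}(n)
```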
where Crit(i, j) is the same criticality used in placement, delay(n) is the Elmore delay of node n, and congestion(n) is a cost denoting the congestion at node n, updated as routing progresses. The router therefore favors a track with lower delay and congestion. It gives priority to delay for nets on the critical path and to congestion for the other nets.
To make the router power-aware, the cost function can be modified by adding a cost based on activity of the net as follows [4]:
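A form consistent with the term definitions that follow and with the description in [4], again offered as a reconstruction, is shown here. The symbols cap(n), taken to be the capacitance of routing node n, and ActCrit(i), the activity criticality of net i, are not defined in the surrounding text and are assumptions.

```latex
\mathit{Cost}(n) = \mathit{Crit}(i,j) \cdot \mathit{delay}(n)
                 + \left[ 1 - \mathit{Crit}(i,j) \right] \cdot
                   \left[ \mathit{ActCrit}(i) \cdot \mathit{cap}(n)
                        + \left( 1 - \mathit{ActCrit}(i) \right) \cdot \mathit{congestion}(n) \right]

\mathit{ActCrit}(i) = \min\!\left( \frac{\mathit{Activity}(i)}{\mathit{MaxActivity}},\ \mathit{MaxActCrit} \right)
```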
where Activity(i) is the switching activity of net i, MaxActivity is the maximum switching activity over all nets, and MaxActCrit is the maximum activity criticality that any net can have. Using this cost function reduced energy by 2.6% on average.
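The effect of an activity term can be seen in a small numeric sketch. It assumes the timing-driven cost is Crit·delay + (1 − Crit)·congestion and that the power-aware variant blends node capacitance into the non-critical part, weighted by an activity criticality capped at MaxActCrit. That form, the `cap` parameter, and the normalized track values are assumptions for illustration, not data from [4].

```python
def timing_cost(crit, delay, congestion):
    """Assumed timing-driven cost of using routing node n for connection (i, j)."""
    return crit * delay + (1.0 - crit) * congestion

def power_aware_cost(crit, delay, congestion, cap,
                     activity, max_activity, max_act_crit):
    """Assumed power-aware cost: capacitance competes with congestion,
    weighted by the net's activity criticality."""
    act_crit = min(activity / max_activity, max_act_crit)
    non_critical = act_crit * cap + (1.0 - act_crit) * congestion
    return crit * delay + (1.0 - crit) * non_critical

# Two hypothetical tracks with normalized delay, congestion, and capacitance.
track_a = {"delay": 0.5, "congestion": 0.2, "cap": 0.9}  # uncongested, high capacitance
track_b = {"delay": 0.5, "congestion": 0.4, "cap": 0.3}  # busier, low capacitance
crit = 0.1  # a non-critical connection

timing_a = timing_cost(crit, track_a["delay"], track_a["congestion"])
timing_b = timing_cost(crit, track_b["delay"], track_b["congestion"])

def pa(track, activity):
    """Power-aware cost of a track for a net with the given activity."""
    return power_aware_cost(crit, track["delay"], track["congestion"], track["cap"],
                            activity, max_activity=1.0, max_act_crit=0.99)

high_a, high_b = pa(track_a, 0.9), pa(track_b, 0.9)    # high-activity net
low_a, low_b = pa(track_a, 0.05), pa(track_b, 0.05)    # low-activity net
```

With these numbers the timing-driven cost prefers track A, the power-aware cost steers a high-activity net to the low-capacitance track B, and a low-activity net still goes to track A.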
The final CAD step generates a bit file that stores the FPGA’s configuration information. Some power optimizations are possible even at this stage. Anderson et al. [5] observed that a routing mux leaks less when its output is held at logic 1. Changing the configuration bits of the LUTs to maximize the probability that their outputs are at logic 1 can therefore reduce FPGA leakage. This technique reduces the active leakage (leakage in the used portion of the FPGA) by 25% on average [5]. Reconfiguring the LUTs also influences their dynamic power. However, finding the power-optimal configurations for all the LUTs in a design is a difficult problem, so a technique was proposed that processes clusters of LUTs and reduces the power within each local cluster [6]. It reduced the dynamic power by 20.6%.
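The polarity choice for a single LUT can be sketched as follows. This is a minimal illustration, not the procedure of [5]: it assumes statistically independent inputs with known probabilities of being 1, and it omits the step of reprogramming the fan-out LUTs to absorb the inversion, which is needed to preserve the circuit function. Function names and the example values are hypothetical.

```python
from itertools import product

def prob_output_one(truth_table, input_probs):
    """P(output = 1) for a LUT, assuming independent inputs.

    truth_table[i] is the output for the input pattern given by the binary
    digits of i, with input 0 as the most significant bit.
    """
    k = len(input_probs)
    total = 0.0
    for idx, bits in enumerate(product((0, 1), repeat=k)):
        p = 1.0
        for bit, p_one in zip(bits, input_probs):
            p *= p_one if bit else (1.0 - p_one)
        total += p * truth_table[idx]
    return total

def prefer_logic_one(truth_table, input_probs):
    """Complement the LUT contents if that raises P(output = 1).

    Returns (new_table, inverted). When inverted is True, downstream logic
    must compensate for the inversion (not modeled here).
    """
    if prob_output_one(truth_table, input_probs) < 0.5:
        return [1 - b for b in truth_table], True
    return list(truth_table), False

# Hypothetical example: a 2-input AND with equiprobable inputs.
and_table = [0, 0, 0, 1]
probs = [0.5, 0.5]
p_before = prob_output_one(and_table, probs)
new_table, inverted = prefer_logic_one(and_table, probs)
p_after = prob_output_one(new_table, probs)
```

For the AND example the output is 1 only a quarter of the time, so the sketch stores the complemented (NAND) table instead, which raises the probability of a logic-1 output to three quarters.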
An important point to remember is that these techniques are not completely independent, so their individual power savings do not simply add up when all of them are applied. For example, power optimization in the mapper reduces power by 7.6% and power-aware clustering reduces it by 12.6%, but when both are applied to a design, the total power reduction is only 17.6%, not the 20.2% it would be if the savings were perfectly cumulative [4].
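The arithmetic behind this comparison can be checked in a few lines. The sketch below uses only the percentages quoted above; the compounded figure, which assumes each saving applies to the power left over after the previous one, is an added point of comparison and is not taken from [4].

```python
def additive(savings):
    """Total saving if the individual savings simply add."""
    return sum(savings)

def compounded(savings):
    """Total saving if each saving applies to the power remaining after the last."""
    remaining = 1.0
    for s in savings:
        remaining *= (1.0 - s)
    return 1.0 - remaining

mapper, clustering = 0.076, 0.126   # individual savings quoted from [4]
measured = 0.176                    # combined saving quoted from [4]

additive_total = additive([mapper, clustering])      # 0.202
compound_total = compounded([mapper, clustering])    # about 0.192
```

Either way of combining the two figures exceeds the measured 17.6%, which is consistent with the two optimizations competing for some of the same high-activity nets.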