Logic Synthesis for Field Programmable Gate Array (FPGA) Technology:Chortle
Chortle
The Chortle algorithm is specifically designed for TLU-based FPGAs. The input to the Chortle algorithm is an optimized AND/OR/NOT Boolean network. Internally, the circuit is represented as a forest of DAGs, with the leaves representing the inputs and the root representing the output, as shown in Figure 67.7. The internal nodes represent the logic functions AND/OR. Edges represent inverting or noninverting signal paths.
The goal of the algorithm is to implement the circuit using the fewest number of K-input CLBs in minimal running time. Efficient running time is a key advantage of Chortle, as FPGA mapping is a computationally intensive operation in the FPGA synthesis procedure.
The terminology of the Chortle algorithm defines the mapping of a node, n, in a tree as the circuit of look-up tables rooted at that node that extends to the leaf nodes. The root look-up table of node n is the mapping of the Boolean function that has the node n as its single output. The utilization of a look-up table refers to the number of inputs, U, out of the K inputs actually used in the mapping. Finally, the
utilization division, µ, is a vector that denotes the distribution of the inputs to the root look-up table among subtrees. For example, a utilization vector of µ = {2,1} would refer to a table look-up function that has two of the K inputs from the left logic subtree, and one input from the right subtree.
Tree-Mapping Algorithm
The first step of the Chortle algorithm is to convert the input graph into forest of fan-out-free trees, where each logic function has exactly one output. As illustrated in Figure 67.8, node n has a fan-out degree of two; thus, two new nodes, n1 and n2, are created that implement the same Boolean equation of node n. Each subtree is then evaluated independently.
Chortle uses a postorder traversal of each DAG to determine the mapping of each node. The logic functions connecting the inputs (leaves) are processed first; the logic functions connecting those functions are processed next, and so on until reaching the output node (root).
Chortle’s tree-mapping algorithm is based on dynamic programming. Chortle computes and records the solution to all subproblems, proceeding from the smallest to the largest subproblem, avoiding recomputation of the smaller subproblems. The subproblem refers to computation of the minimum- cost mapping function of the node n in the tree. For each node ni, the subproblem, minMap(ni,U ), is solved for each value of U, ranging from 2, …, K (U = K refers to a look-up function that is fully utilized, while U = 2 refers to a TLU with only two inputs).
In general, for the same value of U, multiple utilization vectors, µ(u1, u2, …, uƒ), are possible, such that åf m =U . The utilization vector determines how many inputs are to be used from each of the previous optimal subsolutions. Chortle examines each possible mapping function to determine this node’s minimum-cost mapping function, cost(minMap(n,U)). For each value of U Î {2, …, K}, the utilization division of the minimum-cost mapping function is recorded [10].
Example
The Chortle mapping function is best illustrated by an example, as illustrated in Figure 67.9. For this example, we will assume that each CLB may have as many as four inputs (i.e., K = 4). The inputs, {A,B,C,D,E,F}, perform the logic function: A * B + (C * D) E + F.
In the postorder traversal n1 is visited first, followed by n2 … n5. For n1, there is only one possible mapping function, namely, U = 2, µ = {1,1}. The same is true for n2.
When n3 is evaluated, there are two possibilities, as illustrated in Figure 67.10. First, the function could be implemented as a new CLB with two inputs (U = 2), driven from the outputs of n2 and E. This subgraph would use two TLBs; thus, it would have a cost function of 2. For U = 3, only one utilization vector is possible, namely, µ = {2,1}. All three primary inputs C, D, and E are grouped into one CLB, thus producing a cost function of 1. We store only the utilization vectors and cost functions for minMax(n3,2) and minMax(n3,3).
When n4 is evaluated, there are many possibilities, as illustrated in Figure 67.11. With U = 2 (µ = {1,1}), a two-input CLB would combine the optimal result for n3 with the primary input F, producing a function with a cost of 2. For U = 3 (µ = {2,1}), a three-input CLB would combine the optimal result for n3: U = 2 with both inputs E and F, also at a cost of two CLBs. Finally, for U = 4, a single CLB would implement the function (C * D) * (E + F), at a cost of 1. We store the utilization vectors and cost functions for minMax(n4,2), minMax(n4,3), and minMax(n4,4).
Finally, we evaluate the output node, n5, as illustrated in Figure 67.12. We see that there are four possible mappings and, of those, two minimal mappings are possible. Chortle may return either of the mappings where two CLBs implement: n5 = (A * B) + n3 + F and n3 = (C * D) * E.
Chortle-crf
The Chortle-crf algorithm is an improvement of the original Chortle algorithm. The major innovation with Chortle-crf involves the method for choosing gate-level node decomposition. The other improvements involve the algorithm’s response to reconvergent and replicated logic. The name, Chortle-crf, is based on the new command line options (-crf) that may be given when running the program (-c for constructive bin-packing for decomposition, -r for reconvergent optimization, and -f for replication optimization) [11]. Each of the optimizations are detailed below.
Decomposition
Decomposition involves splitting a node and introducing intermediate nodes. Decomposition is required if the original circuit has a fan-in greater than K. In this case, no one CLB could implement the entire function. In general, the decomposition of a node may yield a circuit that uses fewer CLBs. Consider, for example, implementations with four-input CLBs (K = 4) of the circuit shown in Figure 67.13. Without decomposition, the output node forces the suboptimal use of the first two function generators (i.e., A * B and C * D are implemented as individual CLBs). With decomposition, however, the output node OR gate is decomposed to form a new node, which implements the function (A * B) + (C * D), which can be implemented in one CLB.
The original Chortle algorithm used an exhaustive search of all possible decompositions to find the optimal decomposition for the subcircuit, causing the running time at a node to increase exponentially as
the fan-in increased. As a heuristic within the original Chortle algorithm, nodes would be arbitrarily split if the fan-in to a node exceeded 10, allowing each subfunction to be computed in a reasonable amount of time. If a node was split, however, the solution was no longer guaranteed to be optimal.
The improved Chortle-crf algorithm uses first-fit-decreasing bin-packing algorithm to solve the decom- position problem. Large fan-in nodes are decomposed into smaller subnodes with smaller fan-in. Next, the look-up tables for the input functions are bin-packed into CLBs. A look-up table with k inputs is merged into the first CLB that has at least K – k unused inputs remaining. A new CLB is generated, if needed, to accommodate the k inputs.
Reconvergent Logic
Reconvergent logic occurs when a signal is split into multiple function generators, and then those output signals merge at another generator. An example of reconvergent logic is shown in Figure 67.14. When the XOR gate was converted into a SOP format by the technology-independent minimization phase, two AND gates and an OR gate were generated. Both AND gates share the same inputs. If the total number of distinct inputs is less than the size of the CLB, it is possible to map these functions into one CLB. The Chortle-crf algorithm finds all local reconvergent paths, and then examines the effect of merging those signals into one CLB.
Replicated Logic
For multioutput logic circuits, there are cases when logic duplication uses fewer CLBs than logic that uses subterms generated by a shared CLB. Figure 67.15 shows an example of a six-input circuit with two outputs. One product term is shared for both functions f and g. Without replication, the subfunction implemented by the middle AND gate would be implemented as one CLB, as well as the subfunctions Original Ckt.
for f and g. In this case, however, the middle AND gate can be replicated, and mapped into both function generators, thus allowing the entire circuit to be implemented using two CLBs, rather than three.
When a circuit has a fan-out greater than one, Chortle may implement the node explicitly or implicitly. For an explicit node, the subfunction is generated by a dedicated CLB, and this output signal is treated as an input to the rest of the logic. For an implicit node, the logic is replicated for each fan-out subcircuit. The algorithm computes the cost of the circuit, both with replication and without. Logic replication is chosen if this reduces the number of CLBs used to implement the circuit.
Chortle-d
The primary goal of Chortle-d is to reduce the depth of the logic (i.e., the largest number of CLBs for any signal path through combinational logic) [12]. By minimizing the longest paths, it is possible to increase the frequency at which the circuit can operate. Chortle-d is an enhancement of the Chortle-crf algorithm. Chortle-d, however, may use more look-up tables than Chortle-crf to implement a circuit with a shorter depth.
The Chortle-d algorithm separates logic into strata. Each stratum contains logic at the same depth. When nodes are decomposed, the outputs of the tables with the deepest stratum are connected to those at the next level. Chortle-d also employs logic replication, where possible. Replication often reduces the depth of the logic, as illustrated in Figure 67.15.
The depth optimization is only applied to the critical paths in the circuit. The algorithm first minimizes depth for the entire circuit to determine the maximum target depth. Next, the Chortle-crf algorithm is employed to find a circuit that has minimum area. For paths in the area-optimized circuit that exceed the target depth, depth-minimization decomposition is performed. This has the effect of equalizing the delay through the circuit.
It was found that for the 20 circuits in the MCNC logic synthesis benchmark, the chortle-d algorithm constructed circuits with 35% fewer logic levels, but at the expense of 59% more look-up tables.
Comments
Post a Comment