Logic Synthesis for Field Programmable Gate Array (FPGA) Technology: FPGA Structures
Introduction
Field programmable gate arrays (FPGAs) enable rapid implementation of complex digital circuits. FPGA devices have the added advantage that they can be reprogrammed and reused, so the same hardware can implement entirely new designs, and existing hardware systems can be updated with revised logic. Although many of the general techniques of traditional IC logic synthesis are used in the computer-aided design tools for FPGA hardware, FPGA circuits have unique characteristics that affect the synthesis process.
The FPGA device consists of a number of configurable logic blocks (CLBs) interconnected by a routing matrix. Pass transistors are used in the routing matrix to connect segments of metal lines. There are three major types of CLBs: those based on Programmable Logic Arrays (PLAs), those based on multiplexers, and those based on table look-up (TLU) functions.
Automated logic synthesis tools optimize the mapping of the Boolean network to the physical circuits within the FPGA device. FPGA synthesis extends the methods used to solve the general problem of multilevel logic synthesis. FPGA logic synthesis is usually solved in two phases. A technology-independent phase uses a general multilevel logic optimization tool (such as Berkeley’s MIS) to reduce the complexity of the Boolean network. Next, a technology-dependent optimization phase optimizes the logic for the particular type of device. In the case of the TLU-based FPGA, each CLB can implement an arbitrary logic function of a limited number of variables.
Different FPGA optimization algorithms aim to optimize different objectives that include the number of CLBs used, the logic depth, and the routing density. The Chortle algorithm, for example, is a direct method that performs dynamic programming to map the logic into TLU-based CLBs. It converts the Boolean network that describes the function of the circuit into a forest of directed acyclic graphs (DAGs); then it evaluates and records the optimal subsolutions to the logic mapping problem as it traverses the DAG. Two-step algorithms operate by first decomposing the nodes, and then performing a node elimination. Later sections of this chapter discuss and detail the Xmap, Hydra, and MIS-pga algorithms.
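The dynamic-programming idea behind a direct mapper can be illustrated with a much-simplified sketch. This is not the Chortle algorithm itself: it assumes the network is already a single fan-out-free tree, it does not decompose wide gates, and the names `min_luts` and `_root_costs` are hypothetical. For every node it records the fewest K-input look-up tables needed when the table holding that node uses exactly u inputs, and reuses those subsolutions at the parent.

```python
INF = float("inf")

def min_luts(node, k):
    """Fewest k-input LUTs covering a fan-out-free tree (illustrative only).

    A str is a primary input; a tuple is a gate whose items are its fan-ins.
    Returns INF if some gate cannot fit (no gate decomposition is attempted).
    """
    return min(_root_costs(node, k))

def _root_costs(node, k):
    # partial[u]: fewest LUTs strictly below this gate's own LUT when the
    # fan-ins processed so far occupy u of that LUT's inputs.
    partial = [0] + [INF] * k
    for child in node:
        if isinstance(child, str):
            options = [(1, 0)]                 # primary input: one pin, no LUT
        else:
            sub = _root_costs(child, k)
            options = [(1, min(sub))]          # child keeps its own LUT
            # ...or the child's root LUT is merged into ours (saves one LUT)
            options += [(u, sub[u] - 1) for u in range(1, k + 1) if sub[u] < INF]
        merged = [INF] * (k + 1)
        for used, luts in enumerate(partial):
            if luts == INF:
                continue
            for pins, extra in options:
                if used + pins <= k:
                    merged[used + pins] = min(merged[used + pins], luts + extra)
        partial = merged
    # Add the LUT that holds this gate; a LUT with zero inputs is not allowed.
    return [INF] + [cost + 1 for cost in partial[1:]]

# Balanced tree of seven two-input gates over eight primary inputs.
tree = ((("a", "b"), ("c", "d")), (("e", "f"), ("g", "h")))
```

With four-input tables each half of this tree fits in one table, so three tables suffice; with two-input tables every gate needs its own.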
FPGA devices are fabricated using the same submicron geometries as other silicon devices. As such, the devices benefit from the rapid advances in device technology. The overhead of the programming bits, general function generators, and general routing structures, however, reduces the total amount of logic available to the end user.
FPGA Structures
An FPGA consists of reconfigurable logic elements, flip-flops, and a reprogrammable interconnect structure. The logic elements are typically arranged in a matrix. The interconnect is arranged as a mesh of variable-length metal wires and pass transistors to interconnect the logic elements. The logic elements are programmed by downloading binary control information from an external ROM, a built-in EPROM, or a host processor. After the download, the control information is stored in the device and used to determine the function of the logic elements and the state of the pass transistors. Unlike a PLA, the FPGA can be configured to implement multilevel logic functions.
The granularity of an FPGA refers to the complexity of the individual logic elements. A fine-grain logic block appears to the user to be much like a standard mask-programmable gate array. Fine-grain logic blocks implement simple functions of a few variables. A coarse-grain logic block (such as those in devices from Xilinx, Altera, Actel, and Quicklogic) provides more general functions of a larger number of variables. A Xilinx logic block based on look-up tables (LUTs), for example, can implement any Boolean function of five variables, or two Boolean functions of four variables.
It has been found that coarse-grain logic blocks generally provide better performance than fine-grain logic blocks. Coarse-grained devices require less space for interconnect and routing because they combine multiple logic functions into one logic block. In particular, it has been shown that a four-input logic block uses the minimal chip area for a large variety of benchmark circuits [1]. The expense of a few extra underutilized logic blocks is outweighed by the area that would be required for the larger number of fine-grained logic blocks and their associated larger interconnect matrix and pass transistors. This chapter focuses on logic synthesis for coarse-grained logic elements.
A coarse-grained CLB can be implemented using PLA-based AND/OR elements, multiplexers, or SRAM-based LUT elements. These configurations are described in detail below.
Look-Up Table-Based CLB
The basic unit of LUT-based FPGAs is the CLB, implemented as an SRAM of size 2^n × 1. Each CLB can implement any arbitrary logic function of n variables, for a total of 2^(2^n) possible functions.
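A minimal sketch of this idea follows, assuming the n inputs are simply used as the address into the 2^n × 1 memory. The helper names `make_lut`, `lut_read`, and `count_functions` are hypothetical and chosen only for illustration.

```python
def make_lut(func, n):
    """Fill a 2**n x 1 memory with the truth table of an n-input function.

    Input i of the function is taken from bit i of the address.
    """
    return [func(*(((addr >> i) & 1) for i in range(n))) & 1
            for addr in range(2 ** n)]

def lut_read(table, inputs):
    """Evaluate the LUT: the input bits form the memory address."""
    addr = sum(bit << i for i, bit in enumerate(inputs))
    return table[addr]

def count_functions(n):
    """Each of the 2**n memory bits is free, so 2**(2**n) functions exist."""
    return 2 ** (2 ** n)

# Example: a three-input majority function stored in an 8 x 1 table.
majority = make_lut(lambda a, b, c: (a & b) | (b & c) | (a & c), 3)
```

Reprogramming the block amounts to rewriting the stored bits; the lookup path itself never changes, which is why any function of n variables costs the same area and delay.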
An example of an LUT-based FPGA is the Xilinx FPGA, as illustrated in Figure 67.1. Each CLB has three LUT function generators and two flip-flops [2]. The first two LUTs implement any function of four variables, while the third LUT implements any function of three variables. Separately, each CLB can implement two functions of four variables. Combined, each CLB can implement any one function of five variables, or some restricted functions of nine variables (such as AND, OR, XOR).
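One way to see why three such LUTs cover any five-variable function is Shannon expansion on the fifth variable. The sketch below assumes the third LUT receives the two four-input LUT outputs plus one extra input e, which is a common description of this kind of CLB but is not spelled out in the text; the names `split_five_input`, `H_MUX`, and `clb_eval` are hypothetical.

```python
from itertools import product

def split_five_input(func):
    """Return two 16-entry tables: the cofactors of func for e = 0 and e = 1."""
    f_table, g_table = [], []
    for addr in range(16):
        a, b, c, d = ((addr >> i) & 1 for i in range(4))
        f_table.append(func(a, b, c, d, 0) & 1)
        g_table.append(func(a, b, c, d, 1) & 1)
    return f_table, g_table

# Third LUT programmed as a 2:1 multiplexer: address = f | g << 1 | e << 2.
H_MUX = [(g if e else f) for e in (0, 1) for g in (0, 1) for f in (0, 1)]

def clb_eval(f_table, g_table, a, b, c, d, e):
    """Evaluate the three-LUT structure for one input combination."""
    addr = a | (b << 1) | (c << 2) | (d << 3)
    f_out, g_out = f_table[addr], g_table[addr]
    return H_MUX[f_out | (g_out << 1) | (e << 2)]

# Example: an arbitrary five-variable function, checked on all 32 inputs.
def example(a, b, c, d, e):
    return (a & b) | (c & (1 - d) & e) | (a ^ e)

F, G = split_five_input(example)
matches = all(clb_eval(F, G, *bits) == example(*bits)
              for bits in product((0, 1), repeat=5))
```

The same structure cannot cover every nine-variable function because the third LUT sees only the two intermediate results and one extra signal, which is consistent with the restriction to functions such as wide AND, OR, and XOR.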
PLA-Based CLB
PLA-based FPGA devices evolved from the traditional PLDs. In a PLD, each basic logic block is an AND–OR block consisting of wide fan-in AND gates feeding a few-input OR gate. The advantage of this structure is that many logic functions can be implemented using only a few levels of logic, because of the large number of literals that can be used at each block. It is, however, difficult to make efficient use of all inputs to all gates. Even so, the amount of wasted area is minimized by the high packing density of the wired-AND gates.
To further improve the density in a PLD, another type of logic block, called the logic expander, has been introduced. It is a wide-input NAND gate whose output can be connected to the input of the AND–OR block. While its delay is similar, the NAND block uses less area than the AND–OR block, and thus increases the effective number of product terms available to a logic block.
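The AND–OR block and the logic expander can be modeled in a few lines. This is only a behavioral sketch: the function names and the dictionary encoding of product terms are assumptions made for illustration, not a description of any particular device.

```python
def product_term(literals, signals):
    """Wide AND gate: literals maps a signal name to the value it must have
    (1 for the true literal, 0 for the complemented literal)."""
    return int(all(signals[name] == want for name, want in literals.items()))

def and_or_block(terms, signals):
    """A few-input OR of wide AND gates (a sum of products)."""
    return int(any(product_term(term, signals) for term in terms))

def expander(literals, signals):
    """Logic expander: a wide-input NAND whose output feeds the AND-OR block."""
    return 1 - product_term(literals, signals)

def example_block(a, b, c, d, e):
    """f = a.b + c'.x, where x = NAND(d, e) comes from an expander."""
    signals = {"a": a, "b": b, "c": c, "d": d, "e": e}
    signals["x"] = expander({"d": 1, "e": 1}, signals)
    return and_or_block([{"a": 1, "b": 1}, {"c": 0, "x": 1}], signals)
```

Here the expander output x stands for d' + e', so the second product term c'·x covers what would otherwise need two separate product terms (c'·d' and c'·e') in the AND–OR block.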
Multiplexer-Based CLB
Multiplexer-based FPGAs utilize a multiplexer to implement different logic functions by connecting each input to a constant or a signal [3]. The ACT-1 logic block, for example, has three multiplexers and one logic gate. Each block has eight inputs and one output, implementing:
Multiplexer-based FPGAs can provide a large degree of functionality for a relatively small number of transistors. Multiplexer-based CLBs, however, place high demands on routing resources due to the large number of inputs.
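As a hedged illustration of how tying inputs to constants or signals selects a function, the sketch below models a block with three 2:1 multiplexers and one OR gate. The pin names (a0, a1, sa, b0, b1, sb, s0, s1) and the exact wiring follow the commonly published description of the ACT-1 module and are assumptions here, not taken from the text.

```python
def mux2(sel, d0, d1):
    """2:1 multiplexer: d1 when sel is 1, otherwise d0."""
    return d1 if sel else d0

def act1_module(a0, a1, sa, b0, b1, sb, s0, s1):
    """Eight inputs, one output: two first-level muxes feed a third mux
    whose select line is the OR of s0 and s1 (assumed wiring)."""
    return mux2(s0 | s1, mux2(sa, a0, a1), mux2(sb, b0, b1))

def and2(a, b):
    # Output is b when a = 1 (upper mux passes b), else constant 0.
    return act1_module(0, 0, 0, 0, 1, b, a, 0)

def or2(a, b):
    # Select line s0 | s1 computes a OR b; the data muxes supply 0 and 1.
    return act1_module(0, 0, 0, 1, 1, 0, a, b)

def xor2(a, b):
    # Output is NOT b when a = 1, and b when a = 0.
    return act1_module(0, 1, b, 1, 0, b, a, 0)
```

Even these two-input functions consume all eight pins, several of them tied to constants, which illustrates the routing pressure mentioned above.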
Interconnect
In all structures, a reprogrammable routing matrix interconnects the configurable logic blocks. A portion of the routing matrix in a Xilinx 4000-series FPGA, for example, is illustrated in Figure 67.2. Local interconnects are used to join adjacent CLBs. Global routing modules are used to route signals across the chip.
The routing and placement issues for FPGAs are somewhat different from those of custom logic. For a large fan-out node, for example, an optimal placement of the fan-out elements would be along a single row or column, where the routing could be implemented using a long line. For custom logic, the optimal placement would be as a cluster, where the optimization attempts to minimize the distance between nodes. For the FPGA, the routing delay is influenced more by the number of pass transistors that the signal must cross than by the length of the signal line.
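A rough way to see why the pass-transistor count dominates is a first-order Elmore delay estimate of the route, treated as an RC ladder in which each pass transistor contributes a series resistance and each wire segment a capacitance. The component values below are invented for illustration and wire resistance is ignored; only the trend matters.

```python
def elmore_delay(stages):
    """Elmore delay of an RC ladder.

    stages is a list of (resistance, capacitance) pairs ordered from the
    driver to the load; each resistance charges all capacitance downstream.
    """
    downstream = sum(c for _, c in stages)
    total = 0.0
    for r, c in stages:
        total += r * downstream
        downstream -= c
    return total

# Assumed values: 1 kOhm per pass transistor, 0.1 pF per short segment.
R_SWITCH = 1.0      # kOhm (kOhm * pF gives ns)
C_SEGMENT = 0.1     # pF

# Same total wire capacitance, routed two different ways.
segmented = elmore_delay([(R_SWITCH, C_SEGMENT)] * 6)   # six switches in series
long_line = elmore_delay([(R_SWITCH, 6 * C_SEGMENT)])   # one switch onto a long line
```

Under these assumptions the six-switch route is several times slower than the long line even though both drive the same amount of wire, because the delay of the series chain grows roughly with the square of the number of switches.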
The power of the FPGA comes from the flexibility of the interconnect. A block diagram of a typical third-generation FPGA device is shown in Figure 67.3. The CLB matrix and the mesh of the interconnect occupy most of the chip area. Macro blocks, when present, implement functions such as high-density memory blocks, multipliers, digital signal processors, microprocessors, or Gigabit-rate SERializer/DESerializer (SERDES) cores. The I/O blocks surround the chip and provide connectivity to external devices.