An Exploration of Hardware Architectures for Face Detection
ASIC Implementation of a Neural Network-Based Face Detection
An ASIC implementation of the neural network-based face detection was first presented in our previous work in Ref. [25]. The implementation presented here is significantly more area- and power-efficient. The entire face detection process, with all three stages, was mapped onto a single chip. However, this section emphasizes the detection architecture, i.e., the neural network stage. The IPG and IE stages aim to enhance the detection process and are, in fact, decoupled from the neural network block. A brief description of these two stages is given below; for a more detailed analysis, the reader is referred to Refs. [25,26].
The IPG module is the interface of the system with the outside world. The image data source feeds data to the module via a 64-bit bus running at the system clock frequency. The incoming image is a 320 × 240-pixel grayscale frame. The image pyramid generator extracts a large number of 20 × 20 subwindows from this frame. The original frame is also scaled down to several levels of magnification, and each level yields additional 20 × 20 subwindows. In this fashion, both large and small faces in the original image are guaranteed to be analyzed by the system. However, this method generates a total of 8050 subwindows per 320 × 240 frame, making the IPG module memory-dominated: it requires 80 kB of memory, spread over 10 equally sized banks, two of which are dual-ported. Windows are generated in raster-scan order, and each window is individually and completely handed off to the IE module through a 32-bit interface and several handshaking control signals. The IPG module is throttled by the IE and neural network blocks to maintain a steady flow of data. The scaling of the original picture is performed using a subsampling technique rather than a more complex matrix-multiplication-based affine transformation. This method reduces system complexity significantly, with only a modest reduction in accuracy. Because of its memory-dominated nature, the IPG is by far the largest module of the system. Furthermore, the RAM modules used by the IPG also dictate the system clock frequency, which is 125 MHz.
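The pyramid generation can be sketched in software as follows. The scale step, window stride, and stopping condition below are illustrative assumptions (the text quotes only the 8050-window total per frame); only the subsampling idea itself is taken from the design.

```python
# Behavioral sketch of the IPG's subsampling image pyramid (illustrative).
# The 1.2 scale step and stride of 2 are assumptions, not chip parameters.

def subsample(img, factor):
    """Scale an image down by picking every k-th pixel (no filtering),
    mirroring the subsampling used instead of an affine transform."""
    h, w = len(img), len(img[0])
    ys = [int(y * factor) for y in range(int(h / factor))]
    xs = [int(x * factor) for x in range(int(w / factor))]
    return [[img[y][x] for x in xs] for y in ys]

def pyramid_windows(frame, win=20, stride=2, scale=1.2):
    """Yield 20x20 subwindows in raster-scan order at every pyramid level."""
    level = frame
    while len(level) >= win and len(level[0]) >= win:
        for y in range(0, len(level) - win + 1, stride):
            for x in range(0, len(level[0]) - win + 1, stride):
                yield [row[x:x + win] for row in level[y:y + win]]
        level = subsample(level, scale)
```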
The IE module is responsible for improving the overall image quality of the incoming 20 × 20 windows, thereby improving the chances of detection by the neural network block. Several enhancement techniques could be employed, such as lighting correction, edge sharpening, and histogram equalization. The aim is to minimize the impact of environmental variations within the picture on the detection process; the processed pictures should, ideally, have uniform image quality. This implementation assumes the use of modern-day cameras, which eliminates the need for dedicated lighting-correction and sharpening units. Therefore, the only technique implemented was histogram equalization, which improves the contrast and intensity distribution of each window. The module works in two distinct phases: construction of a cumulative distribution function (CDF) array (the main component of the histogram equalization technique) and output streaming to the neural network. The IE module performs histogram equalization on the complete 20 × 20 window in a single pass, which provides better image results than using smaller subwindows. The module requires 503 clock cycles to fully process a window and send it to the neural network block.
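As an illustration, the following sketch models the two phases in software using the standard histogram-equalization remapping formula; the RTL's pipelining and cycle-level behavior are, of course, not captured.

```python
# Software model of the IE module's two-phase histogram equalization on one
# 20x20, 8-bit window (400 pixels). A behavioral sketch, not the RTL.

def equalize_window(pixels):
    """pixels: flat list of 400 grayscale values in 0..255."""
    # Phase 1: build the histogram and its cumulative distribution (CDF).
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    cdf, running = [0] * 256, 0
    for i in range(256):
        running += hist[i]
        cdf[i] = running
    cdf_min = next(c for c in cdf if c > 0)
    n = len(pixels)
    if n == cdf_min:                 # flat window: nothing to equalize
        return list(pixels)
    # Phase 2: remap and stream each pixel out to the NN stage.
    return [round((cdf[p] - cdf_min) * 255 / (n - cdf_min)) for p in pixels]
```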
The neural network component of the face detection implementation is responsible for detecting the presence of a face within the 20 × 20 search windows generated by the IPG unit and subsequently enhanced in the preprocessing stage (see Figure 83.1). The neural network (NN) module receives input from the preprocessing unit (i.e., the IE module) at a rate of 1 pixel (8 bits) per clock cycle. The unit produces a single-bit output, which is asserted when a face is detected within a window. No handshaking is necessary between the IE and NN; control is left entirely to the IE unit, which initiates a new window transfer by asserting a "Frame Start" signal. A constant stream of pixels then follows for 400 consecutive cycles (recall that 1 pixel is transmitted per clock cycle), completing the transfer of the 20 × 20 (i.e., 400-pixel) window. The neural network unit requires 513 clock cycles to complete the processing of a single window. By overlapping computation among the three major units of the system, this time is completely masked by the IPG and IE modules.
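As a rough sanity check of this masking claim, one can compare the per-window cycle budget implied by the system's quoted figures (125 MHz clock, 8050 windows per frame, and the 24 fps reported at the end of this section) against the NN's 513-cycle latency. The short calculation below is our illustration, not part of the original analysis.

```python
# Cycle budget available per subwindow at the quoted operating point, versus
# the 513 cycles the NN needs; the surplus is consistent with the NN latency
# being fully hidden behind the IPG and IE stages.
CLOCK_HZ = 125e6            # system clock (from the text)
WINDOWS_PER_FRAME = 8050    # subwindows per 320 x 240 frame (from the text)
FPS = 24                    # achieved frame rate (from the text)

cycles_per_window = CLOCK_HZ / (FPS * WINDOWS_PER_FRAME)
print(round(cycles_per_window))  # ~647 cycles available > 513 required
```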
The NN module was architected to exploit parallelism while remaining as small as possible by fully utilizing existing resources in all stages of its operation. This approach ensures minimal area and resource budgets. However, despite the benefits that such a minimalist philosophy affords, it also creates a major design challenge: timing and scheduling complexity. For this reason, the neural network unit is control-intensive rather than data- or computation-intensive. The functionality of the module is shown in Figure 83.2. Each incoming pixel is multiplied by a predetermined weight value (obtained off-line through training) and accumulated over an entire subwindow, and the resulting sum is passed through an activation function, which determines the output value forwarded to the next layer of neurons. In this implementation, the neurons are distributed in three stages: the first stage operates directly on the incoming pixel values, the second stage operates on the outputs of the first stage, and the third stage on the outputs of the second stage. The third stage provides the final single-bit output indicating the presence or absence of a face within the 20 × 20 window. The accuracy of a hardware-implemented neural network depends heavily on the accuracy of the activation function implementation. The activation function used in this system is the hyperbolic tangent (tanh), implemented as a look-up table (LUT) in an SRAM. After investigating both fixed-point and floating-point arithmetic at several bit widths of precision, it was found that a 16-bit fixed-point implementation achieves accuracy to within 0.1% of double-precision floating-point arithmetic (as used in software such as MATLAB). This way, the extra hardware complexity and delay overhead of a floating-point architecture were avoided, with negligible impact on accuracy. Figure 83.3 illustrates the hyperbolic tangent function and the significant increase in accuracy obtained by moving from an 8-bit to a 16-bit implementation. Note that the stored LUT covers only domain values from 0 to 3, because tanh is odd and saturating: negative domain values are obtained by negating the corresponding positive ones, while values beyond 3 (−3) saturate to 1 (−1). These properties allow the LUT to be kept small, saving SRAM area and improving performance.
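A minimal software sketch of such a LUT-based tanh is shown below. The 256-entry depth and the 13-fractional-bit scaling are illustrative assumptions; the text specifies only that the stored table covers the [0, 3] domain with 16-bit fixed-point values.

```python
import math

# LUT-based tanh sketch: tanh is odd (tanh(-x) = -tanh(x)) and saturating
# (|x| >= 3 is effectively +/-1), so only the [0, 3] domain is stored.
# Table depth and fraction bits are assumptions made for illustration.

ENTRIES = 256
SCALE = 1 << 13                                   # fixed point: 1.0 -> 8192
LUT = [round(math.tanh(3.0 * i / (ENTRIES - 1)) * SCALE)
       for i in range(ENTRIES)]                   # tanh sampled over [0, 3]

def tanh_fixed(x):
    """Approximate tanh(x) for a real input, returning a scaled integer."""
    sign = -1 if x < 0 else 1
    mag = abs(x)
    if mag >= 3.0:                                # saturation region
        return sign * SCALE                       # +/-1.0 in fixed point
    idx = int(mag / 3.0 * (ENTRIES - 1))          # nearest-below table entry
    return sign * LUT[idx]                        # odd symmetry for x < 0
```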
The overall architecture of the neural network detection block is shown in Figure 83.4. Stage 1 of the architecture (shown on the left-hand side of Figure 83.4) is the most complex. It consists of a 4 kB single-ported SRAM (SPRAM 1), which holds the weight values, and four multiply-accumulate (MAC) blocks. The stage is divided into three "virtual" neuron groups: the first group divides the 20 × 20 window into four 10 × 10 subwindow neurons (see Figure 83.2) and employs a single MAC block. The second group divides the 20 × 20 window into sixteen 5 × 5 subwindow neurons and also employs a single MAC block. The third group divides the main window into six 5 × 20 overlapping subwindow neurons; the overlapping nature of these neurons necessitates the use of two MAC blocks in this group. The incoming pixels arrive in raster-scan order, which implies that the MAC operation constantly skips from neuron to neuron before completion. Therefore, the partial MAC results of each neuron group are stored in registers and swapped back and forth, allowing a single MAC block to serve all neurons within the group. Careful selection of word length and ordering of the read requests issued by the three neuron groups eliminates contention for SPRAM 1.
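The register-swapping scheme can be illustrated in software for the first neuron group (four 10 × 10 neurons sharing one MAC block). The weight layout and indexing below are our assumptions for illustration, not the actual SPRAM 1 memory map.

```python
# Sketch of stage 1's register swapping for the four 10x10 neurons sharing
# one MAC block: each raster-scan pixel is routed to the partial sum of
# whichever neuron (quadrant) it falls in, so one MAC serves all four.

def stage1_group1(pixels, weights):
    """pixels: 400 values in raster-scan order; weights: 4 lists of 100."""
    partial = [0, 0, 0, 0]                       # one accumulator per neuron
    wptr = [0, 0, 0, 0]                          # per-neuron weight pointers
    for i, p in enumerate(pixels):
        row, col = divmod(i, 20)
        neuron = (row // 10) * 2 + (col // 10)   # which 10x10 quadrant
        # "Swap in" this neuron's partial sum, use the shared MAC, swap out.
        partial[neuron] += p * weights[neuron][wptr[neuron]]
        wptr[neuron] += 1
    return partial                               # fed to the activation function
```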
Stages 2 and 3 are shown on the right-hand side of Figure 83.4. They share a single MAC block and a 4 kB single-ported SRAM (SPRAM 2). This RAM holds the weight values for stages 2 and 3, the results of stages 1 and 2, and the activation function LUT. SPRAM 2 is subject to contention, because of simultaneous requests for the LUT from all three stages, compounded by requests for the stage 2 and 3 weights and the stage 1 and 2 results. These issues are resolved by the Timing Control and LUT Logic unit, which coordinates the whole process with minimal delay overhead. Stage 1 runs in parallel with stages 2 and 3 to maximize throughput and resource utilization, consistent with our lightweight design methodology. To preserve accuracy, the bit widths of the internals of the MAC blocks were chosen to avoid overflow even in the worst case. This is imperative, since tricks such as saturating addition reduce the detection accuracy, which is of utmost importance in a neural network.
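To illustrate the overflow argument, the short calculation below bounds the accumulator width needed for a worst-case stage 1 MAC over a full window. The widths are our back-of-the-envelope figures under stated assumptions, not the chip's actual internal widths.

```python
import math

# Worst-case accumulator sizing for one full-window MAC: 400 products of an
# 8-bit pixel and a 16-bit signed fixed-point weight. Illustrative only.

PIXEL_MAX = 2**8 - 1     # 8-bit grayscale input
WEIGHT_MAX = 2**15       # magnitude bound of a 16-bit signed weight
TERMS = 400              # one 20x20 window

worst_sum = TERMS * PIXEL_MAX * WEIGHT_MAX
acc_bits = math.ceil(math.log2(worst_sum)) + 1   # +1 for the sign bit
print(acc_bits)  # 33 -> a 33-bit (or wider) accumulator can never overflow
```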
The design also includes RAM initialization logic, which provides an interface to an off-chip CPU; upon power-up, the CPU loads the weight values (obtained during the off-line training phase) and the activation function LUT into the on-chip RAMs.
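A simple software model of this power-up sequence is sketched below; the memory map, ordering, and interface are all assumptions made purely for illustration.

```python
# Illustrative model of power-up initialization: an off-chip CPU loads the
# trained weights and the tanh LUT into the on-chip SRAMs before the first
# frame is processed. Addresses and ordering are assumed, not documented.

def initialize_rams(spram1, spram2, stage1_weights, stage23_weights, lut):
    for addr, w in enumerate(stage1_weights):    # stage 1 weights -> SPRAM 1
        spram1[addr] = w
    for addr, w in enumerate(stage23_weights):   # stage 2/3 weights -> SPRAM 2
        spram2[addr] = w
    base = len(stage23_weights)                  # LUT also lives in SPRAM 2
    for i, v in enumerate(lut):
        spram2[base + i] = v
```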
The entire face detection system was implemented in Verilog HDL and simulated in ModelSim to ensure correctness. Following functional verification of the architecture, the system was synthesized in Synopsys Design Compiler and underwent postsynthesis verification in ModelSim. Finally, Cadence Encounter was used for placement and routing. The final layout is shown in Figure 83.5.
The final chip occupies 7.3 mm² and consumes a total of 165 mW; the original goal of a lightweight design was thus achieved. The chip operates at 125 MHz and performs face detection at 24 fps on 320 × 240-pixel frames. Increasing the number of computational units, i.e., the MAC blocks, would enable deeper parallelization and a consequent increase in the frame rate to well beyond real time. In sharp contrast, a software implementation of such a neural network requires ~1.5 s for a single 320 × 240 image frame on a Sun Blade 1000 [8]. While an ASIC implementation offers detection at high frame rates and with competitive accuracy, it is complicated and costly. As such, we examine the implementation of the detection stage on an alternative platform, an FPGA prototyping board; we describe the details of this implementation next.