An Exploration of Hardware Architectures for Face Detection: FPGA Implementation of a Neural Network-Based Face Detection
In mapping any traditionally software-based algorithm to hardware, understanding both the underlying algorithm and the target platform is critical to maximizing performance within the platform's constraints (power and area). In our implementation, we chose to prototype the neural network detection stage on the Xilinx XUP2V-Pro development board, which offers a USB 2.0 interface to the Xilinx XC2VP30 FPGA, as shown in Figure 83.6. This FPGA offers 136 embedded multipliers and block memories, making it ideal for the MAC-intensive neural network. It is the inclusion of these functional resources that makes modern FPGA architectures not merely prototyping platforms but ideal platforms for implementing real-time imaging algorithms such as face detection. In this section, we describe the neural network face detector implementation on the Xilinx Virtex-II Pro XC2VP30 FPGA.
As Figure 83.7 shows, the networks in the first, second, and third layers differ in structure only by the number of internal neurons. The following discussion is therefore limited to Network 1 of the first layer.
Referring to Figure 83.8, networks in layer 1 perform the multiply-accumulate (MAC) operation for a number of neurons. Because the input rate is one pixel per cycle, at most one neuron will be active within the network in a given cycle. The multiplier and adder functional units can therefore be time-shared, provided there is sufficient storage to maintain the intermediate results of in-progress accumulations. For each neuron assigned to a network, an accumulator register is allocated within the shared multiply-and-accumulate unit. The neurons in Network 1 each cover a 10-pixel × 10-pixel area and therefore accumulate 100 24-bit products. Since adding two M-bit numbers results in at most an (M+1)-bit number, it follows that the maximum number of bits needed to represent the sum of N M-bit numbers is $S = \lceil \log_2 N \rceil + M$. Consequently, the accumulators in Network 1 are $\lceil \log_2 100 \rceil + 24 = 31$ bits wide.
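The bit-width rule can be checked numerically. The following is a minimal sketch; the function name `accumulator_bits` is ours and does not come from the original design.

```python
def accumulator_bits(num_terms: int, term_bits: int) -> int:
    """Bits needed to hold the sum of num_terms values, each term_bits wide.

    Implements S = ceil(log2(N)) + M. For N >= 1, ceil(log2(N)) equals
    (N - 1).bit_length(), which avoids floating-point rounding.
    """
    growth_bits = (num_terms - 1).bit_length()
    return growth_bits + term_bits


# Network 1: 100 products (10 x 10 pixels), each 24 bits wide.
network1_bits = accumulator_bits(100, 24)
print(network1_bits)  # 31
```

The result agrees with the 31-bit accumulator registers referred to in the description of the activation function.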
The pixel steering module, as its name suggests, directs pixels to the appropriate accumulator register. Since the pixels in the input stream always arrive in raster scan order, the pixel steering module derives the index of the destination accumulator register for the current pixel from the current pixel count.
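The steering and time-shared accumulation described above can be modeled in a few lines of software. This is only a sketch under stated assumptions: we assume a 20 × 20 sub-window divided into four non-overlapping 10 × 10 receptive fields numbered in raster order, and a weight list indexed directly by pixel count. The actual layout is defined by Figure 83.8, and all names here are hypothetical.

```python
WINDOW = 20  # sub-window width and height in pixels
FIELD = 10   # receptive field width and height of a Network 1 neuron
FIELDS_PER_ROW = WINDOW // FIELD  # assumed: non-overlapping quadrants


def accumulator_index(pixel_count: int) -> int:
    """Map a raster-order pixel count to its destination accumulator."""
    row, col = divmod(pixel_count, WINDOW)
    return (row // FIELD) * FIELDS_PER_ROW + (col // FIELD)


def network1_sums(pixels, weights):
    """Model one shared MAC unit with one accumulator per neuron.

    One pixel arrives per step, so only one accumulator is updated per step.
    """
    accumulators = [0] * (FIELDS_PER_ROW * FIELDS_PER_ROW)
    for count, pixel in enumerate(pixels):
        accumulators[accumulator_index(count)] += pixel * weights[count]
    return accumulators
```

For example, `network1_sums([1] * 400, [1] * 400)` accumulates 100 products into each of the four registers.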
The Tanh Lookup Table module implements the hyperbolic tangent activation function. The hyperbolic tangent function is asymptotic to 1 and –1, odd-symmetric about zero, and, for all practical purposes, saturates to y = 1 and y = –1 at x = 8 and x = –8, respectively. The table consists of 2048 16-bit entries and utilizes two Block SelectRAM resources. The address port receives unsigned values between 0.0 and +8.0 in {0.3.9} fixed-point format and returns the hyperbolic tangent of that number in {0.1.14} fixed-point format. Because the hyperbolic tangent is an odd function, the sign bit is not used directly in the lookup but is instead used to sign the output correctly in {1.1.14} format. Given that the lower 14 bits of the 31-bit accumulator registers represent the fractional portion of the sum, the activation function takes as input bits [17.6] of the accumulator registers (11 bits).
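A software model of such a table might look like the sketch below. It assumes the 2048 entries are spaced uniformly over [0, 8), a step of 1/256, and that outputs carry 14 fractional bits. It takes a floating-point input for readability and does not model how the hardware slices the accumulator bits. All names are hypothetical.

```python
import math

ENTRIES = 2048            # table depth stated in the text
INPUT_RANGE = 8.0         # tanh is treated as saturated beyond |x| = 8
STEP = INPUT_RANGE / ENTRIES  # assumed uniform spacing (1/256)
OUT_FRAC_BITS = 14        # fractional bits of each stored entry

# Store tanh only for non-negative inputs, as scaled integers.
TABLE = [round(math.tanh(i * STEP) * (1 << OUT_FRAC_BITS)) for i in range(ENTRIES)]


def tanh_lookup(x: float) -> float:
    """Approximate tanh(x) using the magnitude for lookup and the sign afterwards."""
    magnitude = min(abs(x), INPUT_RANGE - STEP)  # saturate at the last entry
    address = int(magnitude / STEP)
    value = TABLE[address] / (1 << OUT_FRAC_BITS)
    return -value if x < 0 else value
```

Because the function is odd, only the positive half needs to be stored, which halves the memory the table would otherwise require.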
The weight table provides storage for the weight coefficients used in the network. Again, because of the deterministic ordering of the incoming pixels, the weight table index is derived from the current pixel count.
To test the performance of the face detector, we developed a driver application capable of parsing a database of face and non-face images. The driver application sequentially sends the images to the FPGA via the USB interface. The output from the face detector is captured and compared with the database annotations. Figure 83.9 shows a screenshot of the driver application.
Despite errors associated with fixed-point number representation and lookup-table-based function approximation, the face detection system achieved 94% detection accuracy. The face detector runs at a clock frequency of 100 MHz and requires 813 cycles to process a single 20-pixel × 20-pixel sub-window. The latency of the system is therefore 8.13 microseconds. A 320 × 240 image frame scaled four times, at reasonable scale factors, with five-pixel overlaps will generate approximately 3000 sub-windows. The face detector will process such a frame in about 24 milliseconds and can therefore process 41 frames per second.
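The timing figures follow directly from the clock rate and cycle count. The short calculation below reproduces them; the constant names are ours, and the sub-window count is the approximate value quoted in the text.

```python
CLOCK_HZ = 100_000_000     # 100 MHz clock
CYCLES_PER_WINDOW = 813    # cycles per 20 x 20 sub-window
WINDOWS_PER_FRAME = 3000   # approximate count quoted in the text

window_latency_s = CYCLES_PER_WINDOW / CLOCK_HZ
frame_time_s = WINDOWS_PER_FRAME * window_latency_s
frames_per_second = 1.0 / frame_time_s

print(f"{window_latency_s * 1e6:.2f} us per sub-window")  # 8.13 us
print(f"{frame_time_s * 1e3:.2f} ms per frame")           # 24.39 ms
print(f"{frames_per_second:.1f} frames per second")       # 41.0
```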