An Exploration of Hardware Architectures for Face Detection: Performance Evaluation

Performance Evaluation

One advantage of the AdaBoost algorithm is its ability to distinguish regions that cannot contain a face from those that might, and to stop early and proceed to the next frame when there is no strong evidence of a face in the current one. This leads to two interesting observations when the algorithm is implemented in hardware.

The first observation concerns the number of faces present in an image. When a single face is present, at least one of the search windows must be evaluated through all stages. Because the image is searched in parallel, however, the time taken to evaluate a single search window is the same as the time needed to evaluate all search windows; thus, unlike software, there is no increase in latency. Moreover, when multiple faces are present, the parallel implementation processes them all simultaneously and the delay remains constant, whereas in software the latency increases rapidly [7]. A related case is the number of faces of different sizes: when two or more faces of different sizes are present in the source image, detection must occur at each corresponding scale. The worst-case scenario for detection would therefore be at least one face at every detection scale. In practice, this is extremely unlikely within a single image frame: a large face will cover most smaller faces, so a typical image contains either one large face with a few smaller ones, or similarly sized faces spread throughout the frame [7,12]. Hence, it is reasonable to assume that the worst case will almost never occur.

The second observation concerns frames that contain no faces: all search windows at all scales will likely fail within the first few stages, allowing a new image frame to be processed early. In such cases, the frame rate increases.
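The early-rejection behavior described above can be sketched in a few lines. This is an illustrative software sketch, not the authors' hardware design; the `stages` structure and the brightness-based toy stage are assumptions made purely for demonstration.

```python
# Illustrative sketch of cascade early rejection: each stage compares a
# score against its threshold, and a window is discarded at the first
# stage it fails, skipping all remaining stages.

def evaluate_window(window, stages):
    """Return True only if the window passes every cascade stage."""
    for score_fn, threshold in stages:
        if score_fn(window) < threshold:
            return False  # early rejection: remaining stages are skipped
    return True

# Toy demo: a hypothetical "stage" that checks mean brightness.
def mean_brightness(window):
    return sum(window) / len(window)

stages = [(mean_brightness, 50), (mean_brightness, 100)]
print(evaluate_window([30] * 4, stages))   # rejected at stage 1 -> False
print(evaluate_window([120] * 4, stages))  # passes both stages -> True
```

In the hardware architecture, this same early-exit decision is what allows a frame with no faces to be abandoned after only a few stages, raising the frame rate.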
Sample frames containing a number of faces of different sizes were therefore chosen, and the architecture was evaluated with these observations in mind.

We use an 8-bit-per-pixel 320 × 240 grayscale image as our input frame. We must therefore provide storage for the case where every pixel is set to 255, an unlikely scenario, but one necessary for correct operation. Recall that the maximum value that can be stored in the integral image is 255 × 320 × 240, and the maximum value that can be stored in the squared integral image is 255² × 320 × 240; these require 25 and 33 bits, respectively. We design the architecture using 320 × 240 CDTUs, 120 MEUs, and 4 CUs. Each CDTU connects to its neighbors via a 33-bit bus.
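The quoted bit widths follow directly from the worst-case bounds and can be verified with a quick calculation:

```python
# Quick check of the storage bounds quoted above: with every pixel at
# 255, the largest integral-image entry is 255*320*240 and the largest
# squared-integral entry is 255**2 * 320 * 240.

max_integral = 255 * 320 * 240
max_sq_integral = 255**2 * 320 * 240

print(max_integral, max_integral.bit_length())        # 19584000 -> 25 bits
print(max_sq_integral, max_sq_integral.bit_length())  # 4993920000 -> 33 bits
```

The 33-bit neighbor bus is thus sized for the wider of the two quantities, the squared-integral value.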

To evaluate the performance of the proposed implementation, we designed and verified the architecture using Verilog HDL and the ModelSim® simulator. We then synthesized the architecture using a commercial 90 nm library, targeting a 500 MHz clock frequency. The synthesized design occupies an area of ~115 mm²; this area can be reduced for smaller image sizes. We used several 320 × 240 images [27] containing a number of faces. Given the large size of the array, simulating an entire 320 × 240 frame in ModelSim would be extremely time-consuming and would require an extensive amount of resources. We therefore designed a prototype 24 × 24 array of CDTUs, along with the corresponding 12 MEUs and a CU; the 24 × 24 size was chosen because it is the base feature size proposed in [7]. Each 320 × 240 image was partitioned into 24 × 24 subimages, each subimage was fed as input to the array, and the computation for that 24 × 24 portion of the image was carried out in ModelSim, measuring the total number of clock cycles until each 24 × 24 frame was completely processed. To account for faces of larger size (and thus the time to compute larger features), the corresponding software implementation was used: each frame was run through it, and for every search window in the source image the point at which computation stopped was recorded (indicating how far each search window progressed in terms of the features and stages computed). The endpoint of each search window was then used to compute the number of cycles that window would require on the hardware architecture, and this was combined with the computation times obtained through simulation to project the detection frame rate.
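The projection step above can be sketched as follows. The per-feature cycle count and fixed overhead used here are hypothetical placeholders (the chapter does not publish a per-feature cycle model); the key point is that, with all windows evaluated in parallel, the deepest-running window bounds the frame's cycle count.

```python
# Sketch of the cycle projection described above, under an assumed (not
# published) cycle model: each search window records how many features it
# evaluated before rejection, and the frame's cycle count is set by the
# deepest-running window, since all windows run in parallel.

CYCLES_PER_FEATURE = 4  # hypothetical per-feature latency

def projected_frame_cycles(features_evaluated_per_window, overhead=100):
    # Parallel hardware: the slowest (deepest) window bounds the frame time.
    return overhead + CYCLES_PER_FEATURE * max(features_evaluated_per_window)

# e.g. most windows reject after a handful of features, one runs deep:
print(projected_frame_cycles([3, 5, 2, 412]))  # 100 + 4*412 = 1748
```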
The computed number of cycles was projected onto an entire 320 × 240 frame, and the average number of cycles per test frame was then estimated, yielding a rough estimate of 52 fps. Notably, the frame rate depends on the number of distinct face sizes rather than on the number of faces in the picture, a major advantage over the software implementation, where both parameters affect the latency. Some experimental frames are shown in Figure 83.17. As the frames show, detection accuracy is affected by face orientation: profile faces are much harder to detect, and better training is desirable. It must be noted, however, that the hardware implementation does not affect the accuracy of the computation compared with the software implementation, as the same experimental frames yielded equal detection accuracy when modeled in software. One remaining issue concerns the number of search windows marked as faces: the hardware platform only reports whether each search window contains a face. A face detected by several windows (because the face is sufficiently small and the search windows overlap) must be identified as a single face rather than as two or more; this task, however, is left to the host application. Figure 83.17 shows such scenarios as well.
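As a back-of-envelope consistency check on the figures above (assuming the 500 MHz target clock and the projected 52 fps, both taken from the text):

```python
# At a 500 MHz clock, 52 fps corresponds to roughly 9.6 million
# cycles per 320x240 frame.

clock_hz = 500e6
fps = 52
cycles_per_frame = clock_hz / fps
print(cycles_per_frame)  # ~9.6e6 cycles per frame
```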
