An Exploration of Hardware Architectures for Face Detection: Proposed Architecture

Proposed Architecture

Data Storage Requirements

First, we need to determine the hardware requirements of the algorithm in terms of storage and data flow. There are two places of interest: the training-set storage and the computation stage. To evaluate the storage requirements for the training parameters, we used the Intel Open Computer Vision Library (CV) [27] for the training data. The CV library provides a state-of-the-art software implementation of the AdaBoost detector, utilizing a very accurate pool of features. The training set uses a starting feature size of 24 × 24 pixels and scales each feature by a factor of 1.2, resulting in 13 scaled feature sizes.
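As a quick illustration (our own sketch, not from the text), the ladder of scaled feature sizes can be reproduced by applying the 1.2 scale factor repeatedly; note that the reference implementation may round intermediate sizes slightly differently.

```python
# Sketch: 13 feature sizes, starting at 24 x 24 and scaled by 1.2 per step.
# Intermediate rounding is an assumption; first and last sizes match the text.
sizes = [round(24 * 1.2 ** i) for i in range(13)]

print(sizes[0], sizes[-1])  # smallest 24, largest 214
```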

The largest feature size is 214 × 214 pixels. The training set associated with the library provides the necessary information about the accuracy and precision required for the hardware computation. From the training set, we can derive that we need 8 bits for each rectangle weight, each threshold value, and each predetermined feature sum. These values are not integers; they are signed fixed-point numbers. Hence we need a sign bit and a fixed-point representation with 2 integer bits and 5 fractional bits. The dynamic range supported is ±3.96875, which covers the values given in the CV training set. We also need up to 8 bits to store each rectangle offset from the feature corner, as the largest feature size used in the CV set is 214 × 214. As such, each rectangle needs 4 × 16 bits to store its dx and dy values, and 8 bits for its associated weight. Each feature has two, three, or four rectangles, and each stage has a number of features, ranging from 9 to 211 features per stage. The total number of features in the reference training set is 2913, spread over 25 stages; the total number of rectangles is 6383. An important factor, however, is the frequency of each feature computation: due to the nature of the algorithm, roughly 80% of the computation occurs in the features of stages 1 and 2, which comprise only 25 features for a total of 50 rectangles. Hence, our emphasis falls on providing rapid access to the data needed to compute the first two stages, as thereafter only locations with a very high probability of containing a face will be evaluated.
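As a concrete illustration of the number format, here is a minimal sketch (our own, not from the text) of a signed Q2.5 fixed-point code: 1 sign bit, 2 integer bits, and 5 fractional bits, giving a step of 1/32 and the ±3.96875 range quoted above. The sign-magnitude range check is our assumption.

```python
SCALE = 1 << 5  # 5 fractional bits -> resolution of 1/32

def to_q2_5(x: float) -> int:
    """Quantize x to the signed Q2.5 format; reject out-of-range values."""
    q = round(x * SCALE)
    if abs(q) > 127:  # 2 integer + 5 fractional bits -> magnitude <= 3.96875
        raise ValueError(f"{x} is outside the +/-3.96875 Q2.5 range")
    return q

def from_q2_5(q: int) -> float:
    """Recover the real value represented by a Q2.5 code."""
    return q / SCALE

print(from_q2_5(to_q2_5(3.96875)))  # largest representable magnitude
```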

Next, we determine the necessary storage for the integral image and integral squared image, as well as the data-flow parameters. The input image is an 8-bit-per-pixel grayscale image with a 320 × 240 frame size, so the maximum pixel value is 255. We must provide storage for the case where every pixel is set to 255, an unlikely scenario, but necessary for correct operation. Thus the maximum integer value that can be stored in an integral image entry is 255 × 320 × 240, and the maximum integer value that can be stored in an integral squared image entry is 255² × 320 × 240.

This requires 25 bits per integral image entry and 33 bits per squared integral image entry. The rectangle sums need an accumulator that supports nonsaturating arithmetic, and the computed result needs at least 25 bits of storage. During the variance computation, the accumulator needs 33 bits to accommodate the squared integral image values. We describe the proposed architecture next.
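The bit widths above follow directly from the worst-case values; a quick check, using the frame parameters from the text:

```python
W, H, PMAX = 320, 240, 255  # 320 x 240 frame of 8-bit grayscale pixels

ii_max = PMAX * W * H        # worst case: every pixel at 255
sii_max = PMAX ** 2 * W * H  # same worst case for the squared image

# bit_length() gives the minimum number of bits needed to hold each value
print(ii_max.bit_length())   # 25 bits per integral image entry
print(sii_max.bit_length())  # 33 bits per squared integral image entry
```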

Architecture Description

The algorithm's operations are computed over an array grid processor, as described earlier. The array consists of units that hold the integral and integral squared image values, together with minimal hardware to propagate data in all directions across the array. Each unit is also equipped with hardware to perform additions and subtractions, so that the integral and integral squared images can be computed in a systolic manner and the rectangle sums can be computed within the array. The system consists of three major components: the collection and data transfer units (CDTUs), the multiplication and evaluation units (MEUs), and the control units (CUs). The units are organized in a grid, with an MEU located at the left side of each pair of rows and the CUs distributed evenly across the rows. The number of CUs and their distribution among the rows depend on the size of the array and on the delay/performance requirements, which determine the size of each control region; the number of CDTUs per control region can likewise vary with the design budget and performance requirements. Each CDTU communicates with each of its four neighbors via a 36-bit data bus. A floorplan of the system is shown in Figure 83.13, illustrating the location of each unit and the data movement across the system. Next, we discuss the architecture of each unit in detail.
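For reference, the core operation the array supports — recovering a rectangle sum from four corner reads of the integral image — can be sketched in software as follows (function names are ours; the hardware performs the equivalent additions and subtractions within the array):

```python
def integral_image(img):
    """Build the integral image of a 2-D list of pixel values,
    padded with a zero row and column for clean corner reads."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        for x in range(w):
            ii[y + 1][x + 1] = img[y][x] + ii[y][x + 1] + ii[y + 1][x] - ii[y][x]
    return ii

def rect_sum(ii, x, y, dx, dy):
    """Sum of pixels in the dx-by-dy rectangle whose top-left corner is (x, y):
    four corner reads, two subtractions, one addition."""
    return ii[y + dy][x + dx] - ii[y][x + dx] - ii[y + dy][x] + ii[y][x]

ii = integral_image([[1, 2], [3, 4]])
print(rect_sum(ii, 0, 0, 2, 2))  # sum of the whole 2x2 image -> 10
```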

[Figure 83.13: Floorplan of the proposed system.]
