An Exploration of Hardware Architectures for Face Detection: Computation Overview

Computation Overview

The operation is essentially partitioned into the following stages: computation of the integral and squared integral images, computation of the image standard deviation, computation of the rectangles per feature, feature computation, stage evaluation, and configuration (see Table 83.1). When the feature size increases (i.e., when the search window size increases), the computation is essentially treated as a new one. When the image has been searched at all search window sizes, the system is ready for the next image frame.

In each case, all three units collaborate to perform the computation. Incoming pixels stream into the processor in parallel along all of its rows and are shifted in row-wise every cycle. First, the integral image is computed. The computation consists of horizontal and vertical shifts and additions. Incoming pixels are shifted inside the array on each row. Depending on the current pixel column, each computation unit performs one of three operations: it either adds the incoming pixel value into its stored sum, or it propagates the incoming value to the next-in-row processing element while either shifting and adding the accumulated sum in the vertical dimension (downwards) or doing nothing in the vertical dimension. The computation and data movement are outlined in Figure 83.16. To compute the squared integral image, the same procedure is followed. The incoming pixel passes through the multiplier in the MEU, which computes the square of the pixel value, and that squared value then alternates with the original pixel value as input to the array. It must be noted that the multiplier present in the MEU allows the result of an 8 × 8 multiplication to be available in a single clock cycle, by using a duplex multiplier [28]. As such, the integral and squared integral images are computed in alternate cycles, with the entire computation taking 2 × [m + (m − 1) + (n − 1)] cycles for an input image of n rows and m columns.
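The shift-and-add computation above can be sketched in software. The following Python model is illustrative only (the names and loop structure are our assumptions, not the hardware design); it computes the integral and squared integral images together, squaring each pixel once, in the spirit of the single-cycle duplex multiplier:

```python
def integral_images(pixels):
    """Compute the integral image ii and squared integral image sii,
    where ii[y][x] is the sum of pixels[0..y][0..x] and sii[y][x] is
    the sum of their squares."""
    n = len(pixels)        # rows
    m = len(pixels[0])     # columns
    ii = [[0] * m for _ in range(n)]
    sii = [[0] * m for _ in range(n)]
    for y in range(n):
        row_sum = row_sq_sum = 0
        for x in range(m):
            p = pixels[y][x]
            row_sum += p             # horizontal shift-and-add
            row_sq_sum += p * p      # square computed once per pixel
            above = ii[y - 1][x] if y > 0 else 0
            above_sq = sii[y - 1][x] if y > 0 else 0
            ii[y][x] = row_sum + above        # vertical shift-and-add
            sii[y][x] = row_sq_sum + above_sq
    return ii, sii
```

In the hardware the two images are interleaved in alternate cycles; the software model simply computes both per pixel.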


The next step is to compute the image variance for each search location. The variance is computed by taking the mean of the squared pixels over the search window and subtracting from it the square of the mean pixel value. Given that the search window over the integral and squared integral images can be viewed as a rectangle itself, it makes sense to follow the same approach as with the feature rectangles, using the four corners of the search window. Also, given that the search window size is known, there is no need to divide the pixel sums by the size of the window; instead, that division is incorporated into a constant when the variance is computed. The sum of the pixels is then squared and subtracted from the value obtained from the squared integral image, to give us the variance. To compute the standard deviation, we need the square root of the variance. The square root, however, is a tedious operation when it comes to hardware, so a better solution must be found. Recall that to compute the compensated threshold (t) we need the original feature threshold (to), as shown in the paragraph following Eq. (83.1). The original feature threshold is part of the training set. The compensated threshold used in the feature evaluation is the product of the original threshold and the image standard deviation (σI), as shown in Eq. (83.2).
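As a rough software model of this step (the function names and argument layout are assumptions, not taken from the design), the window sum is read from an integral image with two additions and two subtractions on the four corner values, and the division by the known window size is folded in only when the variance is formed:

```python
def window_sum(ii, y, x, h, w):
    """Sum over the h x w window with top-left corner (y, x), read from
    an integral image with two additions and two subtractions."""
    at = lambda r, c: ii[r][c] if r >= 0 and c >= 0 else 0
    return (at(y + h - 1, x + w - 1) - at(y - 1, x + w - 1)
            - at(y + h - 1, x - 1) + at(y - 1, x - 1))

def window_variance(ii, sii, y, x, size):
    """Variance over a size x size search window: mean of the squared
    pixels minus the square of the mean pixel value. 1/N is a constant
    because the window size is known in advance."""
    n = size * size
    s1 = window_sum(ii, y, x, size, size)    # pixel sum
    s2 = window_sum(sii, y, x, size, size)   # squared-pixel sum
    return s2 / n - (s1 / n) ** 2
```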

If we therefore square both sides of the equation, the computation becomes a multiplication of the variance with the squared value of the original threshold (which can be precomputed during training and stored in the training set), yielding the squared value of the compensated threshold. In the original computation, the feature sum is compared with the compensated threshold, and depending on the outcome, a predetermined value is added to the accumulated stage sum. In the modified version, we need to compare the squared value of the feature sum instead; hence, instead of having to take a square root, we use the available multiplier in the MEU to compute the squared feature sum and compare it with the squared compensated threshold obtained from multiplying the variance by the squared value of the original threshold. It must be noted that the image variance is computed only once for each search window and is stored in each CDTU for the entire algorithm at each scale. The variance value changes only when the feature size changes or a new frame arrives.
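A small sketch of the square-root-free comparison follows. The signed-square handling of negative values is one possible implementation detail and is our assumption; the text only states that squared quantities are compared:

```python
def feature_passes(feature_sum, t_o_signed_sq, variance):
    """Decide feature_sum < t_o * sigma without taking a square root.

    t_o_signed_sq = t_o * abs(t_o) is precomputed during training.
    Since x -> x * |x| is monotonic, comparing signed squares preserves
    the ordering even for negative values (variance is nonnegative).
    """
    lhs = feature_sum * abs(feature_sum)   # squared feature sum, sign kept
    rhs = t_o_signed_sq * variance         # squared compensated threshold
    return lhs < rhs
```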


Next, the rectangle computation takes place. Each feature has two, three, or four rectangles that need to be computed. Recall from the algorithm that the sum of each rectangle is computed via two additions and two subtractions using the rectangle corner data. The computation happens in the following manner. Each CDTU acts as the starting corner of a search window (i.e., the top-left value). For each rectangle in each feature, each corner point is shifted toward the collection point. The points move one at a time, but in parallel for all rectangles/features in the array. At each collection point, the point is either added to or subtracted from an accumulated sum, with the rectangle value computed when all points of the rectangle have arrived at the collection point. As such, each point requires dx + dy cycles to reach the collection point, where dx and dy are the offset coordinates of the point with respect to the upper left corner of the search window. The CDTUs support computation of up to four rectangles per feature. When the rectangle sums for a single feature have been collected, they are stored in the CDTU that represents the starting corner of that feature. All the collected sums are then shifted leftwards toward the MEUs, one sum at a time per MEU, until eventually all sums arrive at the MEUs, where the rectangle weights are multiplied with the incoming sums to evaluate the feature. It must be noted that each feature contains up to four rectangles, and to compute the stage sum we also need the accumulated feature sum from the previous feature computation and the variance. Hence, each CDTU takes six shifts to transfer the (up to four) rectangle sums, the accumulated stage sum, and the variance value to its neighboring CDTU. When each rectangle value enters the MEU, the rectangle weight is multiplied with the rectangle sum and the products are accumulated to compute the feature sum.
The compensated threshold is then computed using the original threshold and the variance as described earlier. The feature sum is then squared using the multiplier and compared with the compensated threshold to set the feature result. The partial stage sum is accumulated with the feature result and shifted, together with the FB, in a toroidal fashion to the CDTU on the far right of the grid to continue the computation. Eventually, when all feature results are computed, they are stored back into the CDTUs in the grid and the computation resumes with the next feature.
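The rectangle arithmetic described above (two additions and two subtractions per rectangle, then a weight multiplied into each rectangle sum and accumulated into the feature sum) can be sketched as follows; the tuple layout and function names are assumptions made for illustration:

```python
def rect_sum(ii, y, x, h, w):
    """Sum of the h x w rectangle at (y, x): two additions and two
    subtractions on the four corner values of the integral image."""
    at = lambda r, c: ii[r][c] if r >= 0 and c >= 0 else 0
    return (at(y + h - 1, x + w - 1) - at(y - 1, x + w - 1)
            - at(y + h - 1, x - 1) + at(y - 1, x - 1))

def feature_sum(ii, wy, wx, rects):
    """Weighted sum over a feature's rectangles (up to four).

    rects: (dy, dx, h, w, weight) tuples, with offsets relative to the
    search-window origin (wy, wx)."""
    return sum(wgt * rect_sum(ii, wy + dy, wx + dx, h, w)
               for dy, dx, h, w, wgt in rects)
```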

At the end of a stage, the stage sum is compared against the stage threshold, and if the threshold is not met, the location is flagged as a nonface by resetting the FB that is shifted along with the stage sum. When an FB is reset, a counter at each controller unit is decremented to keep track of the number of face candidates in the image. If the counter reaches zero after the completion of a stage, the controller unit simply changes scale and proceeds to compute the next scale, as no search window has passed the stage. Note that since computation happens in parallel for all locations and data movement must proceed in lockstep, a location, although not containing a face, still has to move along with the rest to maintain the order of computation. However, when a location without a face arrives for computation at the MEUs, the MEU does not compute the feature sum; rather, it sends a request to the controller unit to simply shift the next location into the MEU and to rotate the current location around to the far-right CDTU to resume computation. Additionally, there are certain CDTUs that do not take part in the collection of data points.
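The per-scale control flow (face bits, the candidate counter, and the early change of scale when no window survives a stage) can be modeled roughly as follows; the data layout is an assumption made purely for illustration:

```python
def run_cascade(stage_sums_per_window, stage_thresholds):
    """Sketch of per-scale stage evaluation.

    stage_sums_per_window[w][s] is the stage sum of window w at stage s.
    Every window starts with its face bit (FB) set; a counter tracks the
    remaining candidates, and when it reaches zero the controller would
    move on to the next scale early."""
    fb = [True] * len(stage_sums_per_window)
    candidates = len(fb)
    for s, thr in enumerate(stage_thresholds):
        for w, sums in enumerate(stage_sums_per_window):
            if fb[w] and sums[s] < thr:
                fb[w] = False          # flag the location as a nonface
                candidates -= 1
        if candidates == 0:
            break                      # no window passed: change scale
    return fb
```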

For example, to scan the bottom of the integral image of a 240 × 320 frame, the search window starts at location (0, 216); locations below row 216 cannot be starting corners for a 24 × 24 search window, as the window would exceed the image size. Similarly, CDTUs at locations to the right of column 296 cannot hold a starting search window corner. As such, the respective CDTUs are treated as dummies and the MEU does not perform any computation on them; it instead resets their FB so that they are treated as nonface locations. Note that the face counter is initialized with the dummy CDTUs in mind; apart from this, the dummy CDTUs are identical to the rest of the system.
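A quick way to see where the dummy CDTUs fall is to enumerate the valid window origins. For a 240 × 320 frame and a 24 × 24 window, the last valid starting row is 216 and the last valid starting column is 296; this sketch assumes zero-based indexing:

```python
def valid_window_origins(rows, cols, win):
    """Origins (y, x) at which a win x win search window fits entirely
    inside a rows x cols frame. Every other CDTU acts as a dummy whose
    FB is reset so it is treated as a nonface location."""
    return [(y, x)
            for y in range(rows - win + 1)
            for x in range(cols - win + 1)]
```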

When all stages complete for a single scale, the flagged locations contain a face. If the scale computed is the last one, the computation ends, and each search window that contains a face still has its FB set inside its representing CDTU. Each location that contains a face is then shifted to the right and out of the grid array, to the output of the processor, for the host application to process.
