Case Study: A Three-Level Parallel High-Speed Low-Power Architecture for EBCOT-Tier-1

Since EBCOT tier-1 is a bottleneck of the entire JPEG 2000 system, its hardware design should consider both the interaction of EBCOT tier-1 with the other components and the performance of EBCOT tier-1 itself; that is, it should address memory design in the context of the entire JPEG 2000 system as well as the throughput and power consumption of the EBCOT component alone. There are also trade-offs between system throughput and power consumption. Reducing the supply voltage is a popular power-saving technique, but it degrades system performance by increasing delay. Moreover, in the submicron regime, the threshold voltage Vt is lowered to maintain noise margins and meet frequency demands, which increases drain leakage and leads to excessive, battery-draining standby power consumption. By increasing system throughput and finishing each subtask quickly, the system gains more time to sleep and saves standby power.

With all of these issues in mind, a ParaBCS-based three-level parallel high-speed low-power architecture [12] is discussed here. To reduce memory size, this architecture fits into the stripe-based pipeline scheme (mentioned in Section 81.3.2). To achieve high system throughput, the three levels of parallelism in Section 81.4.2 are adopted: (1) parallelism among bit-planes: all the bit-planes can be processed simultaneously, and the SigState memory can be removed by predicting the SigState of each bit; (2) parallelism among pass scanning: the three passes scan one bit-plane in parallel, so bits are coded immediately after they are evaluated and no processing time is wasted; and (3) parallelism among coding bits: the four bits in a column can be coded in the same clock cycle by adding extra primitive encoders.

Since this architecture is based on ParaBCS, the memory-access power consumed in SeriBCS is reduced. To reduce power consumption further, computation in the coding process is reduced without decreasing system throughput. In ParaBCS, all the bit-planes in a code-block are coded in parallel. For some bit-planes, however, no MSB of any DWT coefficient is located on or above them, so they do not need to be coded. In the stripe-based pipeline scheme, these redundant bit-planes cannot be identified in advance, because not all data in a code-block are ready when coding starts and the worst case has to be assumed. A detecting technique is therefore adopted to detect these redundancies and disable the coding components for them. To illustrate the idea, the scanning of 1-D DWT coefficients is shown in Figure 81.20, where the horizontal axis is the coding scan order and the vertical axis is the magnitude of the DWT coefficients. The original ParaBCS must assume the worst case and codes the entire coding area: area 1, area 2, and the real DWT coefficient area. The original SeriBCS knows that area 1 is not needed, so, compared with the original ParaBCS, it removes the coding activity in area 1. Note, however, that all coding activity for a DWT coefficient in areas 1 and 2 is the same and the coding results are identical fixed values, so bit-plane coding in areas 1 and 2 is unnecessary. The detecting technique detects area 2 as well as area 1 and removes the computation in both.
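The 1-D picture described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the architecture's actual logic: the helper names `msb_index` and `skippable_prefix` are hypothetical, coefficients are treated as signed integers whose magnitude bits are coded, and "areas 1 and 2" are read as the scan positions of a bit-plane that come before the first coefficient whose MSB lies on or above that plane.

```python
def msb_index(value: int) -> int:
    """Bit position of the most significant 1 in |value| (-1 if value is 0)."""
    return abs(value).bit_length() - 1


def skippable_prefix(coeffs: list[int], num_planes: int) -> list[int]:
    """For each bit-plane p (0 = LSB), count the leading scan positions that
    are visited before any coefficient has its MSB on or above plane p.
    Under the assumption above, these positions need no real coding."""
    prefix = []
    for p in range(num_planes):
        count = 0
        for c in coeffs:
            if msb_index(c) >= p:
                break  # real coding area for this plane starts here
            count += 1
        prefix.append(count)
    return prefix


# MSB positions of the coefficients below are -1, 0, 1, 3, 1.
print(skippable_prefix([0, 1, -3, 9, 2], 5))  # [1, 2, 3, 3, 5]
```

In this toy input, plane 4 lies above every MSB, so its whole scan (5 positions) is skippable, which corresponds to area 1; the shorter prefixes on planes 0 to 3 correspond to area 2.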

Since EBCOT tier-1 consists of bit-plane coding and AE coding connected by a FIFO, computation can be reduced in the bit-plane coding phase, the FIFO, and the AE coding phase. In the bit-plane coding and FIFO phases, for areas 1 and 2, every CX comes from the run-length coder in Pass 3 and D is 0. The coding can therefore be removed: a counter indicates how many CX and D pairs there are, and a disable flag indicates whether the counter is valid for the current bit-plane. The AE phase is a strictly serial process that carries historical information, so only the codings at the same scanning step in areas 1 and 2 produce the same code-word. Only the first AE coding is kept; all other AE codings in areas 1 and 2 are disabled. Before the scan of a bit-plane leaves area 2 and enters the real DWT coefficient area, its AE is initialized from the first AE coding and then works as in normal coding. In addition, a forwarding technique is used to further reduce power consumption in the AE. Certain pairs of consecutive context labels can be combined in the last two pipeline stages so that they behave like one context label. As a result, one clock cycle is gated out rather than two clock cycles being required, which reduces power consumption. Figure 81.21 shows how the forwarding technique reduces the switching activity of the last two pipeline stages: in cycle 2, the last two pipeline stages are removed after forwarding happens.

EBCOT Tier-1 Architecture

The VLSI architecture is shown in Figure 81.22. The bit-plane buffer banks contain all the DWT coefficients. The load logic module fetches four wavelet coefficients per clock cycle from the code block memory and uses the MSBs of these DWT coefficients to initialize the SigState of each bit. For each DWT coefficient, the bits at and above its MSB are initialized as insignificant and the others as significant. The initialized SigStates are fed into the column-based NC generator modules for evaluation of NC. Note that these initial SigStates are not the final values used to calculate the NC; more steps are needed to predict the final values (Section 81.4.2 discusses this in detail).
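The initialization rule just described can be sketched as follows. This is a minimal model under assumptions: the function names are hypothetical, coefficients are plain signed integers, plane 0 is the LSB plane, and only the initial SigStates are produced (the later prediction steps are not modeled).

```python
def initial_sigstates(coeff: int, num_planes: int) -> list[int]:
    """Initial SigState of one coefficient on every bit-plane (index 0 = LSB).
    Planes at or above the MSB start as insignificant (0);
    planes below the MSB start as significant (1)."""
    msb = abs(coeff).bit_length() - 1  # -1 when the coefficient is zero
    return [1 if p < msb else 0 for p in range(num_planes)]


def column_sigstates(column: list[int], num_planes: int) -> list[list[int]]:
    """Initial SigStates for a column of coefficients, grouped per bit-plane,
    as a stand-in for the four coefficients loaded in one clock cycle."""
    per_coeff = [initial_sigstates(c, num_planes) for c in column]
    return [[states[p] for states in per_coeff] for p in range(num_planes)]


print(initial_sigstates(5, 4))              # [1, 1, 0, 0]
print(column_sigstates([5, 0, -2, 1], 3))   # [[1, 0, 1, 0], [1, 0, 0, 0], [0, 0, 0, 0]]
```

Each inner list of `column_sigstates` is what one bit-plane's NC generator would receive for that column in this simplified view.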

The column-based NC generator for bit-plane i is associated with a column of bit-plane i, so n column-based NC generators can code all the bit-planes simultaneously. Here, the coding of bit-plane n–i and the coding of bit-plane i are combined in the FIFO module (for simplicity, only two combinations are shown in Figure 81.22). The number of context labels from bit-plane n–i is usually smaller than that from bit-plane i, since bit-plane n–i may have more bits coded by run-length coding: run-length coding may encode more than 1 bit but generates only one context label. This combination benefits the FIFO, because the amount of data in it does not vary widely.

As shown in Figure 81.12, each column-based NC generator can evaluate the NC for four bits simultaneously. The NC values are used to determine which pass a bit belongs to and what its context label is. The primitive encoder module contains four encoders for significance coding (pass 1 or pass 3), four encoders for MR (pass 2 only), and a run-length coder for pass 3, so it can code the four bits in a column concurrently. Note that in our architecture the run-length coder and sign coder work in parallel with the primitive coders above. As a result, the column-based NC generators and primitive encoders provide the parallelism among passes and among bits. The primitive encoders write their outputs (CX and D) into the FIFO, where they are consumed by the pipelined high-speed AE.
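As a rough illustration of how a pass is chosen for each of the four bits in a column, the sketch below applies the standard EBCOT membership rule: a coefficient that is already significant is refined in pass 2, an insignificant one with at least one significant neighbor goes to pass 1, and everything else is left for pass 3. The function names are hypothetical, and a simple count of significant neighbors stands in for the NC value; how the architecture predicts these inputs in parallel is not modeled here.

```python
def classify_pass(already_significant: bool, significant_neighbors: int) -> int:
    """Return the coding pass (1, 2, or 3) for one bit under the standard rule."""
    if already_significant:
        return 2  # magnitude refinement (MR)
    if significant_neighbors > 0:
        return 1  # significance propagation
    return 3      # cleanup (may be handled by the run-length coder)


def classify_column(significant: list[bool], neighbors: list[int]) -> list[int]:
    """Classify all bits of a column at once, mirroring the four-bit column."""
    return [classify_pass(s, n) for s, n in zip(significant, neighbors)]


print(classify_column([True, False, False, False], [0, 2, 0, 1]))  # [2, 1, 3, 1]
```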

The detecting mechanism is implemented with the detecting control shown in Figure 81.23. With the detecting mechanism, one of two processes takes place: (1) the disable process, in which BC, FIFO access, and AE coding are disabled; or (2) the normal process, in which all coding units work as in the original architecture. The detecting unit evaluates the initial SigStates of an input column from the load logic module. At the beginning of the coding of a code-block, the detecting units are reset to 0. All the initial SigStates of a column from the load logic module are ORed with the detecting unit output, and the result is fed back into the detecting unit. Thus, as soon as an input column has a nonzero SigState, the detecting unit output becomes 1 and stays at 1 until the current code-block has been coded. If the detecting unit output is 0, the disable process applies, and the BC process and FIFO access are disabled. If the detecting unit output is 1, the normal process applies: the BC process runs as in the original architecture and the FIFO stores its results. One counter indicates how many columns are coded in the disable process; since BC and AE coding work column by column, only one counter is needed. Each bit-plane is associated with one disable flag that indicates whether the current BC is disabled. If it is disabled, the AE fetches the fixed CX and D and decrements the counter. If it is in the normal process, the AE fetches CX and D from the FIFO.
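The behavior of one detecting unit can be modeled as a sticky OR latch plus a column counter. The sketch below is a behavioral simplification, not the circuit of Figure 81.23: the function name `run_detecting_unit` is hypothetical, it follows a single bit-plane, and it assumes that the column which first sets the latch is already handled by the normal process.

```python
def run_detecting_unit(columns: list[list[int]]) -> tuple[list[str], int]:
    """Feed the initial SigStates of successive columns of one bit-plane
    through a sticky OR latch. Returns the process chosen for each column
    and the number of columns handled by the disable process."""
    output = 0    # reset to 0 at the start of a code-block
    disabled = 0  # counter of columns coded in the disable process
    trace = []
    for column in columns:
        output |= int(any(column))  # OR the column's SigStates into the latch
        if output == 0:
            disabled += 1
            trace.append("disable")  # BC and FIFO access skipped
        else:
            trace.append("normal")   # BC runs, results go to the FIFO
    return trace, disabled


cols = [[0, 0, 0, 0], [0, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 0]]
print(run_detecting_unit(cols))
# (['disable', 'disable', 'normal', 'normal'], 2)
```

Once the latch is set it is never cleared inside the loop, which is why the last all-zero column is still coded normally.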

When the detecting mechanism is applied to the AE, one difference should be noted. The AE for bit-plane n is always active, even when its detecting unit output is 0. AE is a strictly serial process, and its output depends on historical information. Although columns in the same scanning step have the same AE states when the detecting outputs are 0, the state varies from one scanning step to the next, so one AE is used to track the changes of state. The AEs other than the one for bit-plane n are disabled when their detecting outputs are 0. Whenever a detecting unit output changes from 0 to 1, the current register A, register C, and byte-out are loaded from the AE for bit-plane n.

This architecture achieves higher system throughput and reduces power consumption. As shown in Table 81.4, it can encode a bit-plane with one code block of size N × N in only 0.35 to 0.46 × N × N clock cycles and is four times as fast as the other architectures. The detecting technique is adopted to reduce power consumption. Figure 81.24 shows how the detecting technique efficiently reduces the computation of bit coding in BC; similar trends are found in the FIFO and AE. Experimental results with standard test image benchmarks show that the forwarding technique in the AE and the detecting technique retain the same system throughput and achieve improvements in power consumption of about 48%, 16%, and 20% for BC, FIFO, and AE, respectively, compared with the architecture based on the original ParaBCS.
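To give a feel for the cycle figure quoted from Table 81.4, the short calculation below evaluates 0.35 × N × N and 0.46 × N × N for a given code-block size. The function name is hypothetical, and N = 64 is used only as an example size; rounding up to whole cycles is an assumption of this sketch.

```python
import math


def tier1_cycle_range(n: int, low: float = 0.35, high: float = 0.46) -> tuple[int, int]:
    """Lower and upper clock-cycle estimates for an n x n code block,
    using the per-sample factors quoted in the text, rounded up."""
    samples = n * n
    return math.ceil(low * samples), math.ceil(high * samples)


print(tier1_cycle_range(64))  # (1434, 1885)
```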

Conclusion

There are two typical VLSI architectures in the literature: ParaBCS, in which all bit-planes in a code-block are coded in parallel, and SeriBCS, in which the bit-planes in a code-block are coded one bit-plane at a time. ParaBCS has more parallelism and does not need the state memories required by SeriBCS. These two schemes are compared in terms of system throughput, PSNR performance, memory size, and power consumption. In both ParaBCS and SeriBCS, the architecture comprises BC, AE, and a FIFO that connects BC with AE and balances their different throughputs. Subsequently, VLSI architectures in the literature are evaluated, and the design techniques they use are discussed in more detail. Finally, two case studies (one based on SeriBCS and the other on ParaBCS) are presented.
