Design Techniques and Challenges: VLSI Architectures for EBCOT of JPEG 2000

VLSI Architectures for EBCOT of JPEG 2000

Because the EBCOT tier-1 algorithm supports both SeriPM and ParaPM, some VLSI architectures implement SeriPM and others implement ParaPM. From the structural view of the coding hardware, however, these architectures fall into two categories: SeriBCS and ParaBCS. In SeriBCS, the bit-planes are coded serially, so a code-block is coded bit-plane by bit-plane. In ParaBCS, all the bit-planes are coded simultaneously.

SeriBCS versus ParaBCS

Both SeriBCS- and ParaBCS-based architectures are possible solutions for entropy encoding in a JPEG 2000 system. To compare the two schemes, SeriBCS and ParaBCS are evaluated in terms of PSNR performance, power consumption, and memory requirements in the context of the entire JPEG 2000 system [12].

ParaBCS can support ParaPM only, whereas SeriBCS can support both ParaPM and SeriPM, because ParaBCS is built on ParaPM. The performance of ParaPM is slightly poorer than that of SeriPM, but ParaPM is more error-resilient, which is an important characteristic in wireless applications.

Figure 81.7 shows the difference between the two schemes from a hardware design view. SeriBCS contains one bit-plane coding engine, so a state memory must be used to store SigState. To obtain high system throughput, several SeriBCS engines should be used (two SeriBCS engines are shown in Figure 81.7), which requires multiple code-block memories. If N SeriBCS-based EBCOT tier-1 engines are used, the code-block memory size is commonly 0.5N kB (64 × 64 × N). In addition, since a bit-plane depends on all the bit-planes above it, several state memories are required to store states such as SigState, first MR, and visiting, in order to maintain this context information. Each state memory is commonly 64 × 64 bits, since SeriBCS-based EBCOT tier-1 engines can start coding only when all data in a code-block are ready. The statistics of the whole code-block are therefore available and can be used to decide whether a bit-plane needs to be coded. Bit-planes that lie above the MSB of every DWT coefficient in the code-block can then be removed from the coding process.
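The bit-plane skipping described above can be illustrated with a minimal Python sketch. It is not taken from any of the cited architectures: the function name `bit_planes_to_code`, the `total_planes` parameter, and the sample magnitudes are assumptions made for illustration. The only statistic it uses is the largest coefficient magnitude in the code-block, which fixes the highest bit-plane that contains an MSB.

```python
def bit_planes_to_code(magnitudes, total_planes):
    """Return the bit-plane indices (MSB first) that actually need coding.

    magnitudes   -- non-negative integer magnitudes of the DWT coefficients
                    in one code-block (hypothetical input format)
    total_planes -- number of magnitude bit-planes stored per coefficient
    """
    max_mag = max(magnitudes, default=0)
    # bit_length() is the index of the highest set bit plus one, i.e. the
    # number of bit-planes that contain at least one non-zero bit.
    used_planes = max_mag.bit_length()
    if used_planes > total_planes:
        raise ValueError("a magnitude does not fit in total_planes bits")
    # Planes above the highest MSB are all zero and are simply not listed.
    return list(range(used_planes - 1, -1, -1))


# A tiny stand-in for a code-block; the largest magnitude is 21 (binary 10101).
block = [0, 3, 21, 7, 1, 0, 12, 5]
planes = bit_planes_to_code(block, total_planes=8)
skipped = 8 - len(planes)
print(planes, skipped)
```

With eight stored bit-planes, only planes 4 down to 0 are coded here and the top three are skipped. This decision needs the whole code-block, which is why it is available to SeriBCS.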

ParaBCS contains as many bit-plane coding engines as there are bit-planes in a code-block, so the state memory can be removed by predicting SigState. With this prediction, ParaBCS can start coding as soon as all data in a column are available. After coding, the memory for that column can be released so that the DWT coding engine can store its outputs there. Taking advantage of this property, Refs. [8,13] proposed a stripe pipeline scheme in which the DWT and EBCOT tier-1 coding engines work on stripes. The buffers between the DWT and EBCOT tier-1 coding engines are similar in size to a stripe, and the two engines switch among these buffers during coding. The scheme optimizes the buffers between DWT coding and EBCOT tier-1 coding, dramatically reduces memory size, and yields a memory-efficient JPEG 2000 system whose memory requirements are only 8.5% of those of conventional architectures. However, since not all data in a code-block are known when the coding process starts, the statistics of the whole code-block cannot be used to decide which bit-planes need not be coded in the current code-block. To code correctly, the worst case must be assumed, i.e., all the bit-planes in a code-block are coded even though some of them probably do not need to be. This introduces considerable redundant computation, which wastes power.
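One simple way to picture the SigState prediction is sketched below in Python. It assumes that the significance state a coefficient carries into a bit-plane can be derived from the magnitude bits above that plane; this is an illustrative assumption, not necessarily the exact prediction logic of Ref. [11]. The function names and the sample column are hypothetical.

```python
def significant_before_plane(magnitude, plane):
    """True if the coefficient already has a 1 bit in some plane above `plane`."""
    return (magnitude >> (plane + 1)) != 0


def column_sig_state(column, num_planes):
    """Derive, for every bit-plane at once, the significance state that each
    coefficient of one column carries into that plane.

    No state memory is read: the result depends only on the column's own
    magnitude bits, which is what lets all planes proceed in parallel.
    """
    return {
        plane: [significant_before_plane(m, plane) for m in column]
        for plane in range(num_planes - 1, -1, -1)
    }


# One column of four coefficient magnitudes, four bit-planes each.
column = [5, 0, 2, 9]  # binary 0101, 0000, 0010, 1001
states = column_sig_state(column, num_planes=4)
for plane, flags in states.items():
    print(plane, flags)
```

The sketch covers only the state inherited from upper planes. Significance gained inside the current plane, and the neighbour context, still have to be resolved by the pass logic of each engine.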

Note that a ParaBCS-based architecture comprises N bit-plane coding engines and achieves a system throughput N times that of a SeriBCS-based architecture. To evaluate the hardware implementations fairly, the two architectures are compared at the same system throughput, i.e., in terms of the hardware required to code one bit-plane. On this basis, ParaBCS requires 0.6 kbits of memory, while SeriBCS requires 16 kbits.

Finally, SeriBCS and ParaBCS are compared with respect to power consumption. Since the main difference between them is that SeriBCS contains memories that ParaBCS does not, a SeriBCS-based architecture was implemented to evaluate the effect of the memories. Because Xilinx FPGA prototyping supports memory IP cores and provides rough power analysis through the XPower tool, the SeriBCS-based architecture was prototyped on a Xilinx Virtex-II Pro device. The prototyped architecture works at 50 MHz, and its device utilization is summarized in Table 81.2. Its power analysis is shown in Figure 81.8. The experimental results show that memory access accounts for a large share (~47%) of the power of the entire EBCOT tier-1 coding engine. A ParaBCS-based architecture does not spend power on memory access as a SeriBCS-based architecture does, but it introduces redundant computation that consumes more power.

Further Discussion on VLSI Architectures for EBCOT

VLSI architectures for EBCOT support either the SeriPM or the ParaPM of the EBCOT tier-1 algorithm. An architecture implementing SeriPM was introduced in Ref. [3], where the three passes scan a bit-plane serially. In this architecture, the fetching operation is column based, i.e., one column (4 bits) rather than one bit is fetched from memory in a clock cycle, which reduces the number of memory accesses. Pixel-skipping techniques were adopted to skip unnecessary bit evaluations (evaluations of bits that are not coded in the current pass). However, pixel-skipping techniques cannot remove unnecessary bit evaluations completely and still waste some clock cycles. Two architectures implementing ParaPM were therefore proposed in Refs. [10,14], where the three passes scan a bit-plane in parallel and a column-based fetching operation similar to that in Ref. [3] was adopted. Architectures implementing ParaPM evaluate exactly one bit per clock cycle, so no clock cycle is wasted, unlike in architectures implementing SeriPM. In Ref. [14], parallelism among the four coding bits was introduced, so that two bits from different passes may be coded in the same clock cycle, which improves system throughput.

All the architectures mentioned above are based on SeriBCS, i.e., the bit-planes are coded serially and a code-block is coded bit-plane by bit-plane. A state memory that associates SigState with the current bit-plane is therefore required to maintain the historical coding information. To remove the state memory, a ParaBCS architecture was proposed in Ref. [11]. This architecture does not need memory for SigState, and all the bit-planes of a code-block are coded together in parallel. However, the four bits of a column are coded bit by bit, and only one bit is coded per clock cycle. Observing that parallelism among the four bits of a column can significantly increase system throughput, Ref. [15] embedded this parallelism into ParaBCS and proposed a three-level parallel architecture that achieves high system throughput for real-time multimedia applications.
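The column-based fetching mentioned above follows the stripe-oriented scan of a bit-plane: rows are grouped into stripes of four, and each stripe is visited column by column. The following Python sketch only enumerates that order; the generator name `stripe_column_scan` and its parameters are assumptions for illustration and do not model any specific architecture.

```python
def stripe_column_scan(height, width, stripe_height=4):
    """Yield one column at a time as a list of (row, col) positions.

    Stripes are visited top to bottom; within a stripe, columns are visited
    left to right; within a column, bits are listed top to bottom.
    """
    for top in range(0, height, stripe_height):
        rows = range(top, min(top + stripe_height, height))
        for col in range(width):
            yield [(row, col) for row in rows]


# An 8 x 3 bit-plane: two stripes of three columns each.
columns = list(stripe_column_scan(8, 3))
print(columns[0])  # first column of the first stripe
print(columns[3])  # first column of the second stripe

# For a 64 x 64 code-block, fetching a whole column per access needs a
# quarter of the accesses that bit-by-bit fetching would need.
column_fetches = sum(1 for _ in stripe_column_scan(64, 64))
bit_fetches = 64 * 64
print(column_fetches, bit_fetches)
```

Each yielded list corresponds to the 4 bits fetched in one clock cycle. How many of those bits are then evaluated per cycle is what distinguishes the architectures discussed above.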

AE, as one part of EBCOT tier-1, is a strictly serial process that consumes the context information constructed in the bit-plane coding phase. A slow AE can therefore significantly limit the performance of high-throughput bit-plane coding and, in turn, the performance of EBCOT as a whole. A multicycle AE architecture inherently has low system throughput. To achieve high system throughput, several pipelined AE architectures were proposed in Refs. [3,16–20].

While all the above architectures focus on overcoming the computational complexity of EBCOT tier-1, many architectures for the entire JPEG 2000 system have also been proposed [6,7,21–25]. They can be classified into two categories: (1) those that process multiple code-blocks in parallel by using multiple SeriBCS-based EBCOT tier-1 engines [6,7,21,22]; and (2) those that process a single code-block by using a single ParaBCS-based EBCOT tier-1 engine [23–25]. Because a ParaBCS-based EBCOT tier-1 engine has higher system throughput than a SeriBCS-based one, both categories can meet the system throughput requirement.

For the entire JPEG 2000 system, memory issues are also key factors. In general, the larger the tile size used for JPEG 2000 compression, the higher the compression ratio, but the more memory is required. The tile memory occupies more than 50% of the area in a conventional JPEG 2000 architecture [8]. These bottlenecks are mainly caused by the different coding flows of the DWT and EBCOT tier-1 processes: the DWT process requires an entire tile memory to carry out the subband transformation [9], whereas EBCOT tier-1 divides each subband into several code-blocks and performs entropy coding on them. Optimizing the individual components alone may degrade the performance of the overall encoding system [6,7], because different components have different I/O bandwidths and buffers. A block-based scan for DWT [23,26] was proposed to eliminate the tile memory (commonly 96 kB) at the cost of increased memory bandwidth. Although the tile memory is eliminated, the memory requirements for a nonoptimized block scan order are still too high. In Ref. [8], by jointly considering the throughput and the dataflow of DWT and EBCOT tier-1, a stripe pipeline scheme for DWT and EBCOT tier-1 was proposed to solve the above problems. The main idea was to match the throughput and the dataflow of the two modules so that the size of the local buffers between them is minimized.
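The buffer switching between the two modules can be pictured with a small scheduling sketch in Python. It is only a toy model under stated assumptions: two stripe-sized buffers, and both modules taking the same time per stripe (matched throughput). Ref. [8] is not claimed to use exactly this arrangement, and the function name and dictionary keys are hypothetical.

```python
def ping_pong_schedule(num_stripes, num_buffers=2):
    """List, per time step, which (stripe, buffer) the DWT writes and which
    (stripe, buffer) the tier-1 coder reads.

    Assumes both modules need one time step per stripe, so the coder always
    works on the stripe that the DWT finished in the previous step.
    """
    schedule = []
    for step in range(num_stripes + 1):
        write = step if step < num_stripes else None
        read = step - 1 if step >= 1 else None
        schedule.append({
            "step": step,
            "dwt_writes": None if write is None else (write, write % num_buffers),
            "tier1_reads": None if read is None else (read, read % num_buffers),
        })
    return schedule


schedule = ping_pong_schedule(num_stripes=4)
for entry in schedule:
    print(entry)
```

In every step where both modules are active they touch different buffers, so local storage of about two stripes is enough in this model. If the two throughputs were not matched, the faster module would stall or more buffers would be needed.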
