Design Techniques for EBCOT of JPEG 2000
In this section, several design techniques used by the architectures mentioned above are discussed in more detail. Since EBCOT architectures consist of a BC, a FIFO, and an AE, the design techniques for each component are presented separately.
BC in SeriBCS
(1) Memory access pattern [3]. In SeriBCS, state memory is used to store the significance state. Memory access and arrangement is a nontrivial issue for the BC. In a bit-based design, nine significance state variables are required: those of the sample under coding and of its eight neighboring samples. In a column-based design, three columns, comprising the current column and its two neighbors, are needed at the same time. With six bits per column, 18 significance and sign variables are required in total. The column-based operation is shown in Figure 81.9. The rectangles stand for a shift register array holding significance state variables; black rectangles represent the samples currently being coded and white rectangles represent neighbor samples. In addition, adder trees provide the sums of significant neighbors of all samples in the current coding column simultaneously while sharing hardware cost. After a column is coded, the variables in the register array are shifted to the neighboring column on the left, the original variables in the left column are discarded, and new variables are loaded from memory into the right column.
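As a rough behavioral sketch of this column-based window (not the actual RTL; the class and method names are ours), the shift and neighbor-sum operations can be modeled in Python as follows:

    class SigWindow:
        def __init__(self):
            # Left, current, and right columns of significance bits:
            # 4 coding rows plus one row above and one below the stripe.
            self.cols = [[0] * 6, [0] * 6, [0] * 6]

        def neighbor_sums(self):
            # Sum of significant neighbors for each of the 4 coding rows
            # (rows 1..4 of the current column); partial sums are shared
            # across rows the way the adder trees share hardware.
            left, cur, right = self.cols
            return [sum(left[r - 1:r + 2]) + sum(right[r - 1:r + 2])
                    + cur[r - 1] + cur[r + 1]
                    for r in range(1, 5)]

        def shift_in(self, new_right_column):
            # Discard the left column, shift, load a new column from memory.
            self.cols = [self.cols[1], self.cols[2], list(new_right_column)]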
Since significance states from neighboring stripes are needed, arranging and accessing memory is quite a challenge. Three feasible memory arrangements (A, B, and C) are shown in Figure 81.10; they illustrate how the memory arrangement affects system performance. In memory arrangement A, the four significance states of a column are grouped into one word, and the words of three stripes are interleaved across three memories. Every time a new column is processed, three words (12 bits), covering the current coding column plus the columns above and below it, have to be loaded, one word per memory. Thus, although only six variables are needed to code a new column, 12 variables are read. During write access, if the significance state changes, only the current column has to be written back. There are therefore 6 redundant bits in a read cycle and none in a write cycle. In memory arrangement B, the variables are interleaved across two memories. Every time a new column is processed, two words (8 bits) are loaded from the two memories: eight variables are read although only six are needed. When data are written back to memory, there are four redundant bits. There are therefore two redundant variables in a read cycle and four in a write cycle. In summary, arrangement A reads six redundant bits in every 12 bits but writes none, while arrangement B reads two redundant bits and writes four.
In memory arrangement C, every four variables in two adjacent columns are grouped into one word, and the words are placed in three memories in an interleaved pattern (A, B, C, B, A, B, C, B, C, B, A). During read access, the variables of two adjacent columns are loaded into the registers together in one cycle, so the memory access frequency is halved. Although the memory bus width is still 12 bits, no redundant variables are loaded. During write access, there are 4 redundant bits for the write-back of two columns' data. Per read and write of one column, arrangement C therefore accesses two redundant bits, whereas arrangements A and B each access six.
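These redundancy counts can be summarized in a short bookkeeping sketch (the numbers follow the text above; the function itself is purely illustrative):

    def redundant_bits_per_column(arrangement):
        # (bits read, bits needed, redundant write bits) per coded column
        if arrangement == "A":
            read, needed, wr = 12, 6, 0   # three 4-bit words per column
        elif arrangement == "B":
            read, needed, wr = 8, 6, 4    # two 4-bit words per column
        else:                             # "C": 12 read bits cover two columns,
            read, needed, wr = 6, 6, 2    # so 6 read and 2 write bits per column
        return (read - needed) + wr

    for a in "ABC":
        print(a, redundant_bits_per_column(a))   # A: 6, B: 6, C: 2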
(2) Pixel skipping [3,4]. The four bits in a column may be coded in different passes, so if all four bits are evaluated in every pass, many clock cycles are wasted. The pixel skipping technique identifies only those coefficients in a column that need to be coded (NBC) in the current pass. An "NBC pixel" variable serves as the input to the pixel skipping circuitry. Its 4 bits indicate which bits in the column must be encoded during the current pass. For example, if the NBC pixel input equals (1010)2, the zeroth and second bits (reading from left to right) are to be encoded, and the NBC indexes produced should be (00)2 and (10)2, respectively. A diagram of the two-pixel skipping circuitry is presented in Figure 81.11.
The uppermost adder in the figure tallies the bits of the NBC pixel input to determine the total number of bits to be coded for the column (e.g., (1010)2 implies that two bits are to be encoded). A pixel accumulator is incremented by one each cycle to keep track of how many bits in the column have been coded so far. These two values feed a comparator that determines when coding of the entire column is complete; at that point a ChangeColumn signal is set, and the non-NBC bits are skipped.
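A behavioral sketch of this logic (the function name is ours; the mask is read MSB-first, as in the example above) is:

    def skip_pixels(nbc_mask):
        # nbc_mask: 4-bit NBC pixel value; a set bit means that row
        # needs coding in the current pass. Yields the 2-bit NBC
        # indexes in scan order, skipping non-NBC rows.
        total = bin(nbc_mask & 0xF).count("1")   # adder tallying NBC bits
        coded = 0                                # pixel accumulator
        for row in range(4):
            if (nbc_mask >> (3 - row)) & 1:      # read left to right
                yield row                        # e.g. (00)2, then (10)2
                coded += 1
            if coded == total:                   # comparator: ChangeColumn
                break

    print(list(skip_pixels(0b1010)))             # -> [0, 2]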
In the basic pixel skipping implementation shown in Figure 81.11(a), the inputs to the initial multiplexers are constant, so a multiplexer can be removed whenever its two inputs are identical. Furthermore, a multiplexer whose output is the opposite of its control signal (i.e., a control signal of 0 gives an output of 1, and vice versa) can be replaced by a single inverter driven by the original control signal. Figure 81.11(b) illustrates the simplified pixel skipping architecture. The new design eliminates three multiplexers and reduces two others to smaller inverters. Additionally, the original design's pixel accumulator can be replaced with a simple DFF.
Pixel skipping techniques are useful for the serial pass mode of EBCOT tier-1. In the parallel pass mode of EBCOT tier-1, bits in different passes can be coded simultaneously, so whenever a bit is accessed and evaluated, it is assigned to one pass and coded immediately.
BC in ParaBCS
Supporting three-level parallelism [15]. Since ParaBCS is based on ParaPM, it can exploit more parallelism. To support this multilevel parallelism, removing data dependencies is essential. The data dependency among the three passes can be removed if the parallel mode of EBCOT is adopted. The dependencies among bit-planes and among coding bits can be removed if SigState can be predicted before the three pass scannings. A change of SigState can occur only in pass 1 or pass 3; as a result, pass 1 is self-dependent, pass 2 depends on pass 1, and pass 3 depends on pass 1 and is also self-dependent.
To remove these dependencies, the two context windows shown in Figure 81.12 were adopted, where each column holds the fourth bit from the previous stripe and the four bits from the current stripe. Columns A, B, and C form context window 1; columns C, D, and E form context window 2. The two context windows are implemented as a 5 × 5 register array. Context window 1 indicates the pass 1 coding state, i.e., whether or not the bits are coded in pass 1. This can be predicted using the initial SigState and the bit values: if a bit is insignificant and any neighbor is significant, it is coded in pass 1. After a column finishes coding, all values are shifted right by one column, and the pass 1 coding states are shifted into context window 2. In context window 2, all the NC and pass coding states (i.e., which pass each bit belongs to) can be calculated from the initial SigState delivered by the load logic module, the bit values, and the pass 1 coding states. The pass 1 coding states indicate the bits coded in pass 1; a bit is coded in pass 2 if it is initialized as significant; all remaining bits are coded in pass 3. The neighbor contribution calculation algorithm is shown in Figure 81.13, where A, B, C, D, E and 1, 2, 3, 4, 5 identify the locations of bits as in Figure 81.12.
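Under these rules, the pass membership of each bit can be predicted from the initial SigState alone. A minimal Python sketch (sig is a 2-D array of initial significance states; both helper names are ours) is:

    def predict_pass(sig, r, c):
        # Returns 1, 2, or 3: the pass in which bit (r, c) is coded.
        if not sig[r][c] and any(neighbors(sig, r, c)):
            return 1            # insignificant with a significant neighbor
        if sig[r][c]:
            return 2            # initialized significant: refinement
        return 3                # everything else: cleanup pass

    def neighbors(sig, r, c):
        # Eight-neighbor significance states; outside the block counts as 0.
        h, w = len(sig), len(sig[0])
        return [sig[i][j]
                for i in range(r - 1, r + 2)
                for j in range(c - 1, c + 2)
                if (i, j) != (r, c) and 0 <= i < h and 0 <= j < w]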
Parallelism among the four bits in a column can reduce power consumption in two ways. First, it permits the four bits to be evaluated simultaneously, so optimizations can be made across all four bits together: memory accesses are reduced, the circuits that evaluate neighbor information are shared, and the movement of values among registers is reduced. Second, the increased throughput can be traded off against power consumption: the computation finishes sooner and the circuit stays in sleep mode longer.
FIFO
The required depth of the FIFO depends on the burst rate of its inputs and the reading rate of its outputs. For a fixed output reading rate, the higher the input burst rate, the deeper the FIFO must be. For a specific application, an optimal depth exists: if the FIFO is too deep, memory space is wasted; if it is too shallow, data overflow occurs.
(1) Combining bit-planes to reduce the depth of the FIFO [11,15]. Assume that a code-block has N bit-planes, the MSB plane has index N, and the LSB plane has index 1; the bit-plane scanning order is from MSB to LSB. In bit-planes with higher indices, pass 3 dominates the coding process, whereas in bit-planes with lower indices, pass 1 or pass 2 dominates. This behavior results from the BC process: at the beginning, the entire significance state is initialized as insignificant, so bits either have a high probability of being coded in pass 3 or do not need coding at all. As scanning proceeds from top to bottom, more significance states change from insignificant to significant, and bits are more likely to be coded in pass 1 or pass 2. Figure 81.14 shows the number of bits coded in the three passes. Note that pass 3 can use the run-length coder to generate CX–D pairs: in the best case, four bits are coded together into a single CX–D pair, whereas in pass 1 or pass 2 each bit produces at least one CX–D pair. So the higher the index of a bit-plane, the lower its burst rate. To save FIFO space, the outputs from two bit-planes (a lower-level bit-plane with index i and a higher-level bit-plane with index N - i, where N is the total number of bit-planes) are combined and stored in one FIFO. The data rates of the two bit-planes are thus averaged, the total burst rate is reduced, and a shallower FIFO suffices.
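The effect of the pairing can be illustrated with a toy calculation (the per-plane burst rates here are invented; only the averaging rule comes from the text):

    def shared_fifo_rates(burst_rate, N):
        # burst_rate[i]: peak CX-D pairs per cycle of bit-plane i
        # (index 1 = LSB, index N = MSB). Each pair (i, N - i) shares
        # one FIFO; the middle and MSB planes are left unpaired here.
        return {(i, N - i): (burst_rate[i] + burst_rate[N - i]) / 2
                for i in range(1, (N + 1) // 2)}

    N = 8
    rate = {i: 1.0 - i / (N + 1) for i in range(1, N + 1)}  # LSB bursts hardest
    print(shared_fifo_rates(rate, N))  # each shared FIFO sees the averaged rate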
Arithmetic Encoder
The AE is a strictly serial process. To increase system throughput, multicycle and pipelined architectures have been proposed in the literature. Since the multicycle architecture has an obvious throughput disadvantage, the pipelined architecture is usually adopted. In this section, the pass switching technique and the forwarding technique are discussed: the pass switching arithmetic encoder supports ParaPM efficiently, and the forwarding technique is combined with a pipelined AE to reduce power consumption.
(1) Pass switching arithmetic encoder [10]. In ParaPM, the three passes are coded simultaneously. To consume the simultaneous CX–D pairs of the three passes, it would seem that three AE coders are required. Since an AE coder is usually costly, three of them would make the pass-parallel architecture impractical. Fortunately, instead of three AE coders, the low-cost pass switching arithmetic encoder (PSAE) shown in Figure 81.15 can be used. Like a normal AE coder, the PSAE includes the functions of coding state transition, interval and code register calculation, renormalization, bit stuffing, and encoder termination. The PSAE differs from a normal AE coder in that it can code CX–D pairs from different coding passes without any conflicts. This is achieved with three sets of context registers and coding state registers, allocated to the three passes respectively; the pass signal delivered from the BC selects one of the three sets. The most expensive part (i.e., the arithmetic coding PE) is therefore shared across the three coding passes. Although the PSAE requires more registers, the number of context registers is not triple the 18 contexts: only 13 contexts can be generated in pass 1, 3 in pass 2, and 15 in pass 3.
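A behavioral sketch of this register organization (class and method names are ours; the shared coding engine is stubbed out) is:

    class PassSwitchingAE:
        # Contexts actually reachable per pass (13 + 3 + 15, not 3 x 18):
        CONTEXTS_PER_PASS = {1: 13, 2: 3, 3: 15}

        def __init__(self):
            # One private context-state bank per pass; the expensive
            # interval/code-register datapath below is shared by all three.
            self.ctx_state = {p: {c: [0, 0] for c in range(n)}
                              for p, n in self.CONTEXTS_PER_PASS.items()}

        def encode(self, pass_no, cx, d):
            state = self.ctx_state[pass_no][cx]  # pass signal selects the bank
            self._shared_pe(state, d)            # shared arithmetic coding PE

        def _shared_pe(self, state, d):
            pass  # interval update, renormalization, bit stuffing, byte-out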
(2) Combining the forwarding technique with a pipelined AE [12]. To reduce the power consumption of the AE, the forwarding technique combined with clock gating can be applied. One distinct characteristic of this scheme is that two symbols can be coded as one in the last two pipeline stages; as a result, two symbols require only two pipeline stages instead of four, and the switching activity of two pipeline stages is removed without decreasing system throughput. The AE encoding procedure is inherently serial with high dependency, but two continuous contexts can still be combined: if the second context is encoded in branch B2 of Figure 81.4, the codeword does not change and only a shift operation is needed. If this shift operation can be forwarded, the two continuous contexts can be processed as one context with an attached shift amount. In the AE, two continuous contexts occupy consecutive pipeline stages; suppose the first context is in the current stage and the second in the previous stage. When the second context satisfies the combination condition, the previous stage forwards the shift amount to the current stage and, at the same time, gates the last two pipeline stages off for the second context, using just an added AND gate and a delay unit.
The combination condition is essential for this forwarding technique. During AE coding, a CX–D pair is read into the FSM, which generates the MPS status and Qe. As shown in Figure 81.4, regardless of the inputs, registers A and C are updated in one of two ways: (1) C is updated to C + Qe and A to A - Qe if branch B1 or B3 is taken; (2) C is kept constant and A is set to Qe if branch B2 is taken. Renormalization is then executed: registers A and C are shifted until the MSB of A reaches the leftmost position. The counter (CT) tracks the shifting: each time one bit is shifted out, CT is decremented by 1. Once CT reaches 0, the byte-out procedure is called to output one byte into the bitstream buffer, and CT is reinitialized to 7 or 8 according to the bit-stuffing rule.
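A condensed Python sketch of this renormalization loop (following the standard MQ-coder shape with a 16-bit A register and 28-bit C register; the byte-out step is simplified and the names are ours) is:

    def renormalize(A, C, CT, out):
        # Shift A and C left until the MSB of A is set, decrementing CT
        # and emitting one byte whenever CT reaches 0.
        while (A & 0x8000) == 0:
            A = (A << 1) & 0xFFFF
            C = (C << 1) & 0xFFFFFFF
            CT -= 1
            if CT == 0:
                byte = (C >> 19) & 0xFF          # simplified byte-out
                out.append(byte)
                C &= 0x7FFFF
                CT = 7 if byte == 0xFF else 8    # bit stuffing after 0xFF
        return A, C, CT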
Note that if case (2) occurs, only a shift operation is needed for the codeword, and the shift amount is determined by Qe alone. Two continuous contexts can therefore be processed as one context, except for the increased shift amount.
To decrease circuit complexity, an efficient evaluation of this condition is needed. The evaluation is simplified by modifying the algorithm of Figure 81.4: A = A - Qe is removed, the test A & 0x8000 is replaced with A[14:0] < Qe[14:0], and the test A < Qe is replaced with A < Qe << 1. A and C are then updated as in Figure 81.16.
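These substitutions can be expressed directly in a sketch under the stated assumptions (A is 16 bits with its MSB set before coding, so (A - Qe) & 0x8000 == 0 is equivalent to A[14:0] < Qe, and A - Qe < Qe is equivalent to A < Qe << 1; we paraphrase one branch of Figure 81.4, and the function names are ours):

    def takes_branch_b2(A, Qe):
        # True when coding this context would leave C unchanged and set
        # A = Qe (branch B2), so it can be folded into the previous
        # context with only a forwarded shift amount.
        renorm_needed = (A & 0x7FFF) < Qe        # replaces (A - Qe) & 0x8000
        cond_exchange = A < (Qe << 1)            # replaces A - Qe < Qe
        return renorm_needed and cond_exchange

    def forwarded_shift(Qe):
        # Renormalizing A = Qe shifts until bit 15 is set, so the shift
        # amount depends on Qe alone (Qe is nonzero and below 0x8000).
        n = 0
        while ((Qe << n) & 0x8000) == 0:
            n += 1
        return n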
The algorithm is shown in Figure 81.17, and the four-stage pipelined architecture in Figure 81.18. In stage 1, the context information (CX and D) is fetched in one clock cycle; CX is fed into the FSM to generate the corresponding Qe and the possible CX update, i.e., the next state the FSM will transition to. In stage 2, the updated CX is fed back to stage 1 to update the FSM. If the CX is processed in branch B2 of Figure 81.4, the shift amount is forwarded to stage 3 and the clock cycles of stages 3 and 4 are gated off, since the context has already been combined with the previous one; otherwise, the CX is coded as in the original algorithm. The C register accumulation uses a 28-bit adder. To avoid a long delay, the 28-bit adder is divided into a 16-bit adder in stage 3 and a 12-bit adder in stage 4: the lower 16 bits of the C register are updated with the 16-bit adder and a shift operation in stage 3, and in stage 4, after the upper 12 bits of the C register are updated with the 12-bit adder and a shift operation, renormalization and the byte-out procedure are executed.
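The split C-register update can be checked with a small model (the names and test values are illustrative):

    def stage3_low16(C, Qe):
        s = (C & 0xFFFF) + (Qe & 0xFFFF)
        return s & 0xFFFF, s >> 16                        # low 16 bits, carry out

    def stage4_high12(C, Qe, carry):
        return ((C >> 16) + (Qe >> 16) + carry) & 0xFFF   # upper 12 bits

    # Recombining the two half-adds matches a single 28-bit addition:
    C, Qe = 0x0ABCDEF, 0x1234
    low, cy = stage3_low16(C, Qe)
    assert ((stage4_high12(C, Qe, cy) << 16) | low) == ((C + Qe) & 0xFFFFFFF)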