VLSI Architectures for Forward Error-Control Decoders: RS Decoder Architectures
Introduction
The publication of Shannon's groundbreaking paper [1,2] in 1948, identifying the fundamental limits of information transfer capacity of noisy communication channels, resulted in the birth of information theory and coding theory as two distinct areas of research. Coding theory involves the investigation of efficient error control codes (ECC), the study of their performance bounds, and the determination of computationally efficient decoding algorithms. In fact, the present moment is historic in another sense, as the tremendous advances in the theory and practice of coding converge with the semiconductor industry trend embodied by Moore's Law [3]. The latter has enabled the implementation of decoders that were considered too complex to implement just a few years ago. As a result, it is not uncommon to see codes such as low-density parity check (LDPC) codes [4] being "rediscovered" and incorporated in next generation communications standards.
A well-designed code can enable orders-of-magnitude reduction in the bit error-rate (BER) at a nominal cost in terms of power and latency. ECC can be said to have singularly enabled the tremendous growth in communication infrastructure that has transformed modern society. This is evident in the pervasiveness of the Internet, cellular phones, modems, and wireless systems today and the continuing emergence of newer generations of communications standards ranging from digital video broadcast (DVB-H, T-DMB [5,6]), wireless metropolitan area network (MAN) (IEEE 802.16 [7]), wireless local area network (LAN) (IEEE 802.11g [8]), wireless personal area network (Bluetooth [9]), 4G [10], fiber communications (OC-192 [11]), asymmetric digital subscriber lines (ADSL [12]), backplane/storage area networks (ANSI X3.230 [13]), and many others. The availability of a reliable, low-cost implementation substrate such as silicon has made integrated circuits (ICs) for communications an important area of research in industry as well as in academia.
A key component of each of the communications standards mentioned above is the ECC, and a key requirement for delivering low-cost silicon is a deep understanding of the fundamentals of ECC performance and of the very large-scale integrated circuit (VLSI) architectures that implement the decoders. This chapter presents the basics of the theory underlying some of the commonly employed ECCs, as well as some of the more advanced ones such as LDPC codes, along with a discussion of the issues involved in a VLSI implementation of ECC decoders. The chapter is organized as follows: Section 82.2 presents Reed–Solomon (RS) decoders, followed by convolutional codes, the Viterbi algorithm, and Viterbi decoder architectures in Section 82.3. Section 82.4 presents turbo decoders; turbo codes were discovered in the early 1990s, are considered a major breakthrough in coding theory, and have since been incorporated in many wireless communication standards. Finally, Section 82.5 presents LDPC decoders, which are regarded as competitors to turbo codes and are being evaluated for inclusion in next-generation communication standards.
RS Decoder Architectures
RS codes [14–16] are employed in numerous communications systems. These include deep space, digital subscriber loops, data storage, wireless, and more recently, optical systems. The drive toward higher data rates makes it necessary to devise very high-speed implementations of decoders for RS codes.
An (n, k) RS code is a mapping from k data symbols to n code symbols. The (255, 239) RS code is a good example. A data or code symbol is an m-bit word, where m = 8 (i.e., a byte) is typical. Such a symbol is said to lie in a finite field or Galois field of size 2^m, denoted GF(2^m); e.g., GF(256) is the set of all 8-bit numbers. Usually, n = 2^m − 1, and the construction of an (n, k) RS code guarantees that up to t = ⌊(n − k)/2⌋ symbol errors are correctable. For example, up to 8 symbol errors are correctable in a (255, 239) code.
Inclusion of an (n, k) RS code in a communication system results in code rate R = k/n and an increase in line rate by a factor of 1/R. For example, earlier in this decade, optical links at OC-192 data rates (9.953 Gb/s) needed to increase their data rates to 10.6 Gb/s to accommodate the overhead owing to the inclusion of the (255, 239) RS code. This coding overhead is considered quite acceptable because the code provides a few orders-of-magnitude reduction in the BER for the same signal-to-noise ratio (SNR).
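As a quick numerical check of these relationships, the following Python sketch computes t, the code rate, and the corresponding line rate for the (255, 239) example (the 9.953 Gb/s payload rate is the OC-192 figure quoted above):

    # Code parameters of an (n, k) RS code, illustrated with the (255, 239) example.
    n, k = 255, 239
    t = (n - k) // 2                    # correctable symbol errors: floor((n - k)/2) = 8
    rate = k / n                        # code rate R = k/n
    payload_gbps = 9.953                # OC-192 payload rate quoted above
    line_gbps = payload_gbps / rate     # coded line rate: approximately 10.6 Gb/s
    print(t, round(rate, 4), round(line_gbps, 2))   # prints: 8 0.9373 10.62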
Finite-Field Arithmetic
An RS encoder and decoder is composed of finite-field arithmetic units such as the finite-field adder, multiplier, and divider. Finite-field arithmetic units map operands from GF(2^m) back into GF(2^m) and thus differ from conventional integer arithmetic such as two's complement. Finite-field computations are elegantly described if we view elements of GF(2^m) as polynomials with degree at most m − 1; e.g., the element 10000101 in GF(256) can be represented as the polynomial 1 + x^5 + x^7. Note that the coefficients in the polynomial representation belong to the binary set {0, 1}. More importantly, these coefficients belong to GF(2), in which addition is simply an exclusive-OR (EXOR) operation and multiplication is an AND operation.
The addition of two GF(256) elements A(x) and B(x) is denoted by A(x) ⊕ B(x) and is simply the addition of the two polynomials. Thus, finite-field addition of two operands is a bit-wise EXOR of the two operands, and hence a GF(256) adder is an array of 8 EXOR gates.
Finite-field multiplication of A(x) and B(x) is given as

    A(x) ⊗ B(x) = A(x) B(x) mod P(x)

where P(x) is an irreducible polynomial of degree m (the field polynomial) that defines GF(2^m). The modulo reduction maps the degree-(2m − 2) product back to a polynomial of degree at most m − 1.
In this section, we will consider finite-field arithmetic units as being combinational logic circuits. Numerous techniques and architectures exist for efficiently implementing finite-field arithmetic units (see Ref. [17] for details).
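As a concrete illustration (not tied to any particular standard), the Python sketch below implements GF(256) addition and a shift-and-reduce multiplier. The field polynomial P(x) = x^8 + x^4 + x^3 + x^2 + 1 (0x11D) and the bit ordering are assumptions made purely for illustration; a given standard fixes its own field polynomial.

    # GF(2^8) arithmetic sketch. Elements are integers 0..255 whose bits are the
    # polynomial coefficients (bit i = coefficient of x^i). The field polynomial
    # 0x11D (x^8 + x^4 + x^3 + x^2 + 1) is an assumed, commonly used choice.
    FIELD_POLY = 0x11D

    def gf_add(a, b):
        # Addition (and subtraction) in GF(2^m) is a bit-wise XOR.
        return a ^ b

    def gf_mul(a, b, poly=FIELD_POLY):
        # Polynomial multiplication followed by reduction modulo the field polynomial.
        result = 0
        while b:
            if b & 1:
                result ^= a          # accumulate a * x^i for each set bit of b
            b >>= 1
            a <<= 1
            if a & 0x100:            # degree reached 8: reduce modulo P(x)
                a ^= poly
        return result

    # Example: x * x^7 = x^8, which reduces to x^4 + x^3 + x^2 + 1 = 0x1D.
    print(hex(gf_add(0x85, 0xA1)))   # 0x24: bit-wise XOR
    print(hex(gf_mul(0x02, 0x80)))   # 0x1d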
RS Encoder
In broadband communication systems, the RS encoder is typically followed by a convolutional encoder. In the absence of a convolutional encoder, the code symbols are input to a channel modulator, which maps the code symbols on to a signal constellation [19]. The choice of the signal constellation depends on the transmit power, the channel frequency response and SNR, the data rate, and also the required BER at the receiver. Typical signal constellations include BPSK, QPSK, and QAM [19].
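The encoding operation itself is typically systematic: the 2t parity symbols are the remainder of z^(2t)·D(z) divided by the generator polynomial G(z) = Π_{i=0}^{2t−1}(z − α^(m0+i)), where D(z) is the data polynomial. The sketch below is a behavioral model of this construction; the field polynomial 0x11D, the primitive element α = 0x02, and m0 = 1 are illustrative assumptions, not values fixed by this chapter.

    # Systematic RS encoding sketch: parity(z) = (z^(2t) * D(z)) mod G(z).
    def gf_mul(a, b, poly=0x11D):            # GF(256) multiply (see earlier sketch)
        result = 0
        while b:
            if b & 1:
                result ^= a
            b >>= 1
            a <<= 1
            if a & 0x100:
                a ^= poly
        return result

    def gf_pow(a, n):
        r = 1
        for _ in range(n):
            r = gf_mul(r, a)
        return r

    def rs_generator_poly(two_t, m0=1, alpha=0x02):
        # G(z) = product of (z + alpha^(m0+i)); '-' equals '+' in GF(2^m).
        # Coefficient lists are ordered lowest power first; G is monic of degree 2t.
        g = [1]
        for i in range(two_t):
            root = gf_pow(alpha, m0 + i)
            new = [0] * (len(g) + 1)
            for j in range(len(g)):
                new[j] ^= gf_mul(root, g[j])     # root * g(z)
                new[j + 1] ^= g[j]               # z * g(z)
            g = new
        return g

    def rs_encode(data, two_t, m0=1, alpha=0x02):
        g = rs_generator_poly(two_t, m0, alpha)
        rem = [0] * two_t + list(data)           # z^(2t) * D(z), lowest power first
        for i in range(len(rem) - 1, two_t - 1, -1):
            coef = rem[i]
            if coef:
                for j in range(len(g)):          # subtract coef * g(z) * z^(i - 2t)
                    rem[i - two_t + j] ^= gf_mul(coef, g[j])
        parity = rem[:two_t]
        return parity + list(data)               # systematic codeword, lowest power first

    # Toy example: 3 data symbols protected by 4 parity symbols.
    codeword = rs_encode([0x12, 0x34, 0x56], two_t=4)

In hardware this division is realized with a linear feedback shift register whose taps are the fixed coefficients of G(z).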
RS Decoder
RS decoders are classified into two categories: hard-decision RS decoders and soft-decision RS decoders. A hard-decision RS decoder receives one symbol per transmitted symbol and no other information. In contrast, a soft-decision decoder may receive one or more symbols for each transmitted code symbol in addition to information regarding the reliability of these symbols. We focus on hard-decision decoders in this section as they constitute the majority of decoders in current implementations. A brief introduction to soft-decision decoders is provided at the end of this section.
The input to the hard-decision decoder for each transmitted codeword is the set of n received symbols (r_{n−1}, r_{n−2}, …, r_0), termed the received word. Analogous to the transmitted codeword polynomial C(z), the received word can also be viewed as a polynomial R(z). Any of the received symbols in R(z) may be in error. Hence, the received word can be written as

    R(z) = C(z) + E(z)
where the number of nonzero coefficients in the error polynomial E(z) equals the number of erroneous symbols v within the received word R(z). Since the maximum number of decoder-correctable errors t in a received word is, for typical high-rate codes, much smaller than the code polynomial degree n − 1, E(z) can be compactly expressed as

    E(z) = Σ_{l=1}^{v} Y_l z^(i_l)

where Y_l is the value of the l-th error and i_l is its position within the received word. A hard-decision decoder recovers C(z) from R(z) in the following four steps:
1. Syndrome polynomial computation
2. Key equation solution
3. Chien search-based error location
4. Forney formula-based error value evaluation
The architectures for each of these steps will be discussed in the remainder of this section.
(1) Architectures for syndrome polynomial computation. Syndrome computation is the evaluation of the received word at the roots of the transmitted codeword polynomial or, equivalently, of the generator polynomial G(z). The syndromes s_i for 0 ≤ i < 2t are given by

    s_i = R(α^(m0+i)) = Σ_{j=0}^{n−1} r_j α^((m0+i)j),    0 ≤ i < 2t        (82.6)

where α^(m0), α^(m0+1), …, α^(m0+2t−1) are the 2t roots of G(z).
The n − 1 multiplications and n additions in Eq. (82.6) are recursively computed using a multiplier–accumulator (MAC). The architecture for computing 2t syndromes in parallel, using 2t multipliers and adders with Horner's rule, is shown in Figure 82.2(a).
This architecture requires n clock cycles to complete the computation. It should be noted that one of the operands to the multiplier is a constant. The complexity of constant input multipliers is roughly half that of a regular multiplier. One can further reduce area by pipelining the multipliers and folding more than one syndrome value computation onto the same MAC.
High-throughput syndrome computation architectures are obtained by using a two-stage implementation as shown in Figure 82.2(b). In the first stage, l input received symbols are passed through each MAC to compute a partial product of R(α^(m0+i)). In the second stage, another multiplier–adder pair multiplies and adds the partial products together. Both stages are based on Horner's rule.
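A behavioral model of the serial, Horner's-rule syndrome computation (one MAC per syndrome, n clock cycles) is sketched below; the field polynomial, α = 0x02, and m0 = 1 are the same illustrative assumptions used earlier.

    # Syndrome computation sketch: s_i = R(alpha^(m0 + i)) via Horner's rule.
    def gf_mul(a, b, poly=0x11D):            # GF(256) multiply (see earlier sketch)
        result = 0
        while b:
            if b & 1:
                result ^= a
            b >>= 1
            a <<= 1
            if a & 0x100:
                a ^= poly
        return result

    def gf_pow(a, n):
        r = 1
        for _ in range(n):
            r = gf_mul(r, a)
        return r

    def syndromes(received, two_t, m0=1, alpha=0x02):
        # 'received' lists the symbols highest power first (r_{n-1} arrives first),
        # so each MAC iteration computes s <- s * alpha^(m0+i) + r, exactly the
        # recursion implemented by the architecture of Figure 82.2(a).
        syns = []
        for i in range(two_t):
            x = gf_pow(alpha, m0 + i)
            s = 0
            for r in received:               # one received symbol per clock cycle
                s = gf_mul(s, x) ^ r
            syns.append(s)
        return syns                          # all-zero syndromes => valid codeword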
Given the syndrome values s_0, s_1, …, s_{2t−1}, the syndrome polynomial is defined as

    S(z) = Σ_{i=0}^{2t−1} s_i z^i
(2) Architectures for key equation solver. The key equation solver block contains a recursion and hence determines the throughput of the entire RS decoder. The outputs of the key equation solver are the error locator polynomial L(z) and the error evaluator polynomial W(z), defined as

    L(z) = Π_{l=1}^{v} (1 − X_l z)        (82.7)

    W(z) = Σ_{l=1}^{v} Y_l X_l^(m0) Π_{j≠l} (1 − X_j z)        (82.8)

where X_l = α^(i_l) is the location of the l-th error (i_l being its position within the received word, as in the expression for E(z) above) and Y_l is its value.
Here v ≤ t is the number of symbol errors. The roots of the error locator polynomial are the inverses of the error locations and are found via Chien search (described later in this section). The error evaluator polynomial carries the information needed to compute the error values via Forney's formula, also described later in this section.
The key equation relates the syndrome polynomial S(z), the error locator polynomial L(z), and the error evaluator polynomial W(z) as follows:

    L(z) S(z) ≡ W(z) (mod z^(2t))        (82.9)
The key equation solver determines the polynomials L(z) and W(z) using S(z). From Eq. (82.9), it is easily seen that once L(z) is determined, W(z) is obtained as the coefficients of the first 2t terms in the product L(z)S(z). In addition, we have from the linear complexity property [16] of the finite-field Fourier transform that the error locator satisfies the following linear recursion:

    Σ_{j=0}^{v} L_j s_{i−j} = 0,    v ≤ i ≤ 2t − 1        (82.10)

where L_j denotes the coefficient of z^j in L(z), with L_0 = 1. Equivalently, the key equation can be written as

    W(z) = L(z) S(z) + Q(z) z^(2t)        (82.11)

for some quotient polynomial Q(z).
Solving the key equation comprises two separate problems. The first is determining the least-degree polynomial L(z) (with degree v) such that the product L(z)S(z) has zero coefficients from degree v to degree 2t − 1; note that, since the maximum number of correctable errors is limited to t, all valid error locators have degree v ≤ t. The second is determining the product L(z)S(z) up to degree v − 1. Typically, the two problems of determining the error locator and the error evaluator are solved together in an iterative process.
The two main iterative algorithms available for solving the key equation are the Berlekamp–Massey algorithm (BMA) [16] and Euclid's algorithm [20]. Architectures for both algorithms are discussed next.
The BMA solves the modified problem of determining a minimum-degree polynomial L(z) that satisfies Eq. (82.10) under the relaxed condition deg(L(z)) ≤ 2t. This relaxation ensures that the BMA always has a solution. When the solution obtained has a degree v ≤ t, the BMA solution is in fact the error locator polynomial L(z) of Eq. (82.9).
The BMA is an iterative algorithm. In the r-th of its 2t iterations, it generates the minimum-degree polynomial L(r, z), of degree v_r, that satisfies the following linear recursion up to the current iteration index r:

    s_i = −Σ_{j=1}^{v_r} L_j(r) s_{i−j},    v_r ≤ i < r        (82.12)

where L_j(r) denotes the coefficient of z^j in L(r, z).
The value predicted by this recursion at index r, namely ŝ_r = −Σ_{j=1}^{v_r} L_j(r) s_{r−j}, is subtracted from the actual syndrome value s_r to obtain the discrepancy coefficient Δ_r = s_r − ŝ_r. If Δ_r ≠ 0, then the error locator is updated by the addition of a scaled version of the error locator from a previous iteration, called the scratch polynomial. The scaling constant is chosen to reduce the discrepancy coefficient to zero. The scratch polynomial is updated every iteration to ensure that the minimum-degree property of the error locator is maintained.
From an architectural perspective, the throughput bottleneck occurs in computing the discrepancy coefficient. This is because the critical path in computing the summation of Eq. (82.12) passes through at least one multiplier followed by a logarithmic number of adders (an adder tree). Updating the error locator and error evaluator polynomials has a fixed complexity and involves multiplication of a polynomial with a scalar constant followed by component-wise addition. These operations have a critical path passing through one multiplier and one adder.
A systolic architecture that reduces the critical path has been described in Ref. [21]. The main idea behind this architecture is to iterate the discrepancy coefficients in a manner similar to the computation of the error locator polynomial. With this modification, the architecture becomes extremely regular. The critical path within the iteration is then reduced to a multiplier followed by an adder. This in turn increases the achievable throughput of the RS decoder.
The pseudocode of the reformulated inversionless Berlekamp–Massey (RiBM) algorithm from Ref. [21] is given in Figure 82.3. The systolic architecture implementing the pseudocode is shown in Figure 82.4. In the pseudocode, d_i(0) (for 0 ≤ i < 2t) denotes the initial value of the discrepancy polynomial and q_i(0) (for 0 ≤ i < 2t) is the initial value of the corresponding scratch polynomial. Note that d_{3t}(0) corresponds to the initial error locator polynomial (set to unity), and q_{3t}(0) corresponds to its scratch polynomial (also initialized to unity). The architecture shown in Figure 82.4(a) consists of an array of 3t identical processing elements; the details of the processing elements are shown in Figure 82.4(b). As seen in this figure, the critical path of the architecture passes through one multiplier and an adder. With each iteration, all the terms of the error locator and the discrepancy polynomial move one position to the left. Hence, the discrepancy coefficient required in the current iteration (denoted by d_0(r) in Figure 82.4) is always available at the leftmost processing element. At the end of 2t iterations, the error locator is available at processing elements t to 2t, and the error evaluator polynomial W(z) of Eq. (82.11) at processing elements numbered 0 to t − 1. More details on the reformulation leading to this algorithm and further architectural details are available in Ref. [21].
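The reformulation behind RiBM and its systolic datapath are best followed from Ref. [21]. As a purely behavioral reference, the sketch below implements the classical inversionless Berlekamp–Massey iteration (2t iterations, a discrepancy computation, and a conditional scratch-polynomial swap) and then forms the error evaluator as the first 2t coefficients of L(z)S(z), per Eq. (82.11). It is not the RiBM architecture itself, and both outputs emerge scaled by a common constant, which cancels in Forney's formula.

    # Inversionless Berlekamp-Massey: behavioral model only (not the systolic RiBM).
    def gf_mul(a, b, poly=0x11D):            # GF(256) multiply (see earlier sketch)
        result = 0
        while b:
            if b & 1:
                result ^= a
            b >>= 1
            a <<= 1
            if a & 0x100:
                a ^= poly
        return result

    def berlekamp_massey(syn, two_t):
        lam = [1] + [0] * two_t              # error locator estimate L(z), scaled
        b   = [1] + [0] * two_t              # scratch polynomial
        gamma, k = 1, 0
        for r in range(two_t):
            # Discrepancy: coefficient of z^r in L(z)S(z) (Eq. (82.12) extended to i = r).
            delta = 0
            for j in range(r + 1):
                delta ^= gf_mul(lam[j], syn[r - j])
            # L(z) <- gamma*L(z) - delta*z*b(z); subtraction is XOR in GF(2^m).
            new_lam = [gf_mul(gamma, lam[j]) ^ (gf_mul(delta, b[j - 1]) if j else 0)
                       for j in range(two_t + 1)]
            if delta != 0 and k >= 0:
                b, gamma, k = lam, delta, -k - 1     # old locator becomes the scratch poly
            else:
                b, k = [0] + b[:-1], k + 1           # b(z) <- z * b(z)
            lam = new_lam
        # Error evaluator: first 2t coefficients of L(z)S(z), per the key equation.
        omega = [0] * two_t
        for i in range(two_t):
            for j in range(i + 1):
                omega[i] ^= gf_mul(lam[j], syn[i - j])
        return lam, omega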
In addition to the BMA, the key equation can also be solved by using the well-known Euclid's algorithm [20] for determining the greatest common divisor (GCD) of two polynomials. RS decoder implementation based on Euclid's algorithm is a well-researched problem [22–24]. In general, if A(z) and B(z) are two polynomials with coefficients over a field, and if G(z) is the GCD of the two polynomials, then Euclid's algorithm expresses G(z) as a combination of A(z) and B(z), i.e.,

    G(z) = R(z) A(z) + Q(z) B(z)        (82.13)

Comparing Eq. (82.11) and Eq. (82.13), we see that solving the key equation is equivalent to finding the GCD G(z) (the error evaluator W(z)) and the coefficient polynomial R(z) (the error locator L(z)), given A(z) = S(z) (the syndrome polynomial) and B(z) = z^(2t). Euclid's algorithm finds the GCD through a process of iterative division of the two polynomials and their remainders. This is followed by a traceback of the iterations to find R(z). However, a modified form of Euclid's algorithm, called the extended Euclid's algorithm, determines both the GCD (error evaluator) and the coefficient polynomial (error locator) in the same set of iterations. The pseudocode of the extended Euclid's algorithm, based on Ref. [22], is shown in Figure 82.5. In this figure, a_i and b_i are the leading coefficients of R_i(z) and Q_i(z), respectively. This pseudocode can be implemented on a systolic array of 2t processing elements similar to that of the systolic BMA. The structure of one processing element [22] is shown in Figure 82.6. Further details on this implementation can be found in Ref. [22].
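For comparison, the following is a behavioral sketch of the extended Euclid (Sugiyama) approach: starting from z^(2t) and S(z), it repeatedly divides while tracking the coefficient polynomial of S(z), and stops once the remainder degree drops below t. This is the textbook formulation rather than the systolic array of Ref. [22], and, as with the inversionless BMA, the two outputs are scaled copies of L(z) and W(z) with a common scale factor that cancels in Forney's formula.

    # Extended Euclid (Sugiyama) key-equation solver: behavioral sketch only.
    def gf_mul(a, b, poly=0x11D):            # GF(256) multiply (see earlier sketch)
        result = 0
        while b:
            if b & 1:
                result ^= a
            b >>= 1
            a <<= 1
            if a & 0x100:
                a ^= poly
        return result

    def gf_inv(a):
        r = 1
        for _ in range(254):                 # a^254 = a^(-1) in GF(256)
            r = gf_mul(r, a)
        return r

    def deg(p):
        return max((i for i, c in enumerate(p) if c), default=-1)

    def poly_mul(a, b):
        out = [0] * (len(a) + len(b) - 1)
        for i, ai in enumerate(a):
            for j, bj in enumerate(b):
                out[i + j] ^= gf_mul(ai, bj)
        return out

    def poly_add(a, b):
        out = [0] * max(len(a), len(b))
        for i, c in enumerate(a):
            out[i] ^= c
        for i, c in enumerate(b):
            out[i] ^= c
        return out

    def poly_divmod(num, den):
        num, q = list(num), [0] * max(len(num), 1)
        dd, inv_lead = deg(den), gf_inv(den[deg(den)])
        for i in range(deg(num), dd - 1, -1):
            if num[i]:
                c = gf_mul(num[i], inv_lead)
                q[i - dd] = c
                for j in range(dd + 1):
                    num[i - dd + j] ^= gf_mul(c, den[j])
        return q, num                        # quotient, remainder

    def euclid_key_equation(syn, t):
        r_prev, r_curr = [0] * (2 * t) + [1], list(syn)   # z^(2t) and S(z)
        u_prev, u_curr = [0], [1]                         # coefficient polys of S(z)
        while deg(r_curr) >= t:                           # stop when deg(remainder) < t
            q, rem = poly_divmod(r_prev, r_curr)
            r_prev, r_curr = r_curr, rem
            u_prev, u_curr = u_curr, poly_add(u_prev, poly_mul(q, u_curr))
        return u_curr, r_curr    # scaled error locator L(z), scaled error evaluator W(z)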
Both the extended Euclid’s architecture and the RiBM architecture have the same critical path and require the same number of iterations for processing a syndrome polynomial. If the number of multipliers is taken as a representative of the hardware complexity, then the systolic RiBM architecture requires 6t multipliers whereas the systolic extended Euclid’s architecture requires 8t multipliers. In addition, the control circuitry of the Euclid’s architecture is more complex compared to the RiBM architecture. Pipelining the BM architecture and the extended Euclid’s architecture is discussed in Ref. [21] and Ref. [23], respectively.
Chien Search and Forney’s Formula
The roots of the error locator polynomial L(z) can be obtained from the Chien search procedure [25]. The error values at the error locations can be determined from the Forney formula [26]. Chien search involves the evaluation of the error locator polynomial at all the elements of the finite field and the flagging of those elements at which the polynomial evaluates to zero. Note that, from Eq. (82.7), the roots of the error locator are the inverses X_l^(−1) of the error locations X_l = α^(i_l). Since the error locator degree is constrained to a maximum of t, we require t multiplications and t + 1 additions to evaluate the error locator at one finite-field symbol value. Hence the complexity of Chien search is nt multiplications and n(t + 1) additions. Note that Chien search can be implemented via Horner's rule, as discussed in connection with syndrome computation. The highest throughput is obtained when all the evaluations are performed in parallel using n MACs, while minimum-area implementations perform all the evaluations serially using one MAC. The Chien search architecture is chosen on the basis of the throughput of the other blocks, such as the key equation solver architecture.
If deg(L(z)) ≤ t and all the roots of L(z) lie in the Galois field of operation, then the error values at these error locations are found using the Forney algorithm. Given the error location X_l, the error locator L(z), and the error evaluator W(z), the error value Y_l is given by Forney's formula [26] as

    Y_l = X_l^(1−m0) W(X_l^(−1)) / L′(X_l^(−1))        (82.14)

where L′(z) denotes the formal derivative of L(z).
The denominator in Forney's formula is obtained as part of the Chien search procedure. The Forney formula still requires an inverter; efficient inverter architectures are provided in Ref. [17]. The numerator in Eq. (82.14) is computed using Horner's rule. The computed error values are subtracted from the received word to obtain the transmitted codeword.
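Finally, a behavioral sketch of Chien search and Forney evaluation is given below. It evaluates L(z) and W(z) at the inverse of every candidate location, whereas a hardware Chien search steps through the field using one constant multiplier per coefficient. As before, the field polynomial 0x11D, α = 0x02, and m0 = 1 are illustrative assumptions.

    # Chien search and Forney's formula: behavioral sketch (assumes m0 = 1).
    def gf_mul(a, b, poly=0x11D):            # GF(256) multiply (see earlier sketch)
        result = 0
        while b:
            if b & 1:
                result ^= a
            b >>= 1
            a <<= 1
            if a & 0x100:
                a ^= poly
        return result

    def gf_pow(a, n):
        r = 1
        for _ in range(n):
            r = gf_mul(r, a)
        return r

    def gf_inv(a):
        return gf_pow(a, 254)                # a^254 = a^(-1) in GF(256)

    def poly_eval(p, x):
        # Horner's rule; p lists coefficients lowest power first.
        acc = 0
        for c in reversed(p):
            acc = gf_mul(acc, x) ^ c
        return acc

    def chien_forney(lam, omega, n, alpha=0x02):
        # Formal derivative of L(z): over GF(2^m), only odd-power terms survive.
        lam_deriv = [lam[i] if i % 2 == 1 else 0 for i in range(1, len(lam))]
        errors = {}
        for pos in range(n):                        # candidate error location X = alpha^pos
            x_inv = gf_inv(gf_pow(alpha, pos))
            if poly_eval(lam, x_inv) == 0:          # X^(-1) is a root of the error locator
                num = poly_eval(omega, x_inv)       # W(X^(-1)), the Forney numerator
                den = poly_eval(lam_deriv, x_inv)   # L'(X^(-1)), the Forney denominator
                errors[pos] = gf_mul(num, gf_inv(den))   # error value for the m0 = 1 case
        return errors                               # map: symbol position -> error value

The corrected codeword is obtained by XOR-ing each returned error value into the received symbol at the corresponding position; for m0 ≠ 1 the error value is additionally multiplied by X^(1−m0), as in Eq. (82.14).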
Advanced RS Decoders: Soft-Decision Decoders
Reliability information, either at the symbol level or at the bit level, can be used to improve on the bit-error performance obtained from the hard-decision decoders described thus far. The simplest soft-decision decoder is an erasure decoder [16] that employs received-symbol reliability information and a preset threshold to declare certain received symbols to be erasures. Each erasure reduces the minimum distance of the code by unity. With e erasures, the maximum number of correctable errors t′ among the unerased symbols is given by

    t′ = ⌊(n − k − e)/2⌋
A hard-decision decoder can perform erasure decoding with a modification to the initialization of the error locator and error evaluator polynomials. Hence, the complexity of an erasure decoder is similar to that of a hard-decision decoder. If the erased locations contain some of the erroneous received symbols, then the total number of errors that can be corrected is greater than that of the hard-decision decoder. The maximum number of received symbols that can be erased is, however, limited by the minimum distance of the code.

An extension of the erasure decoding idea is the generalized minimum distance (GMD) algorithm [27], which iteratively increases the number of erasures and checks for decodability at each iteration. This algorithm performs t erasure decoding trials, with the number of erasures increased by two for each trial. The complexity of GMD decoding can be reduced by using the results of the previous trial as an initialization of the current trial; this version of the GMD decoding algorithm is discussed in Ref. [28].

Another soft-decision decoding algorithm, primarily employed for RS codes over small fields, is the ordered statistics decoding (OSD) algorithm [29,30]. OSD uses bit-level reliabilities to reduce the parity-check matrix of the code to a form having unit-weight columns for low-reliability bits. This is followed by flipping low-reliability bits that do not satisfy the binary parity-check equations, thereby providing an initial estimate of the transmitted codeword. A search is then conducted over sets of high-reliability bits to obtain a codeword closer to the received word. The size of the search set of high-reliability bits is typically limited to two or three, as the search complexity is exponential in the number of these bits.
Algebraic soft-decision decoding [31] is a recently developed algorithm for soft-decoding RS codes. This technique assumes more than one received symbol, and corresponding reliability information, for each transmitted symbol. Coordinate pairs are then formed using the code-evaluation point and the received symbol. A bivariate (two-dimensional) finite-field polynomial is generated that passes through each of these coordinate points; the number of times the polynomial passes through a particular coordinate point is determined by the reliability of the received symbol. The curve is then factorized to obtain the transmitted data polynomial. Architectures for algebraic soft-decoding have been proposed in Refs. [32,33], and this remains an active area of research.