Floating-Point Arithmetic
Recent advances in VLSI have increased the feasibility of hardware implementations of floating-point arithmetic units. The main advantage of floating-point arithmetic is that its wide dynamic range virtually eliminates overflow for most applications.
Floating-point number systems. A floating-point number, A, consists of a significand (also called a fraction or a mantissa), Sa, and an exponent, Ea. The value of a number, A, is given by the equation

A = Sa · r^Ea
where r is the radix (or base) of the number system. Use of the binary radix (i.e., r = 2) gives maximum accuracy, but may require more frequent normalization than higher radices. The IEEE standard 754 single-precision (32-bit) floating-point format, which is widely implemented, has an 8-bit biased integer exponent which ranges between 0 and 255 [11]. The exponent is expressed in excess-127 code, so its effective value is determined by subtracting 127 from the stored value. Thus, the range of effective values of the exponent is –127 to 128, corresponding to stored values of 0 to 255, respectively. A stored exponent value of ZERO (Emin) serves as a flag for ZERO (if the significand is ZERO) and for denormalized numbers (if the significand is non-ZERO). A stored exponent value of 255 (Emax) serves as a flag for Infinity (if the significand is ZERO) and for “Not a Number” (if the significand is non-ZERO). The significand is a 25-bit sign-magnitude mixed number (the binary point is to the right of the most significant bit). The most significant bit is always a ONE except for denormalized numbers. More detail on floating-point formats and on the various considerations that arise in the implementation of floating-point arithmetic units is given in [7,12].
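As a rough illustration of this format, the following C sketch unpacks the sign, exponent, and fraction fields of a single-precision word and applies the Emin/Emax flag interpretations described above. The function name and the printed text are illustrative assumptions, not part of the standard.

```c
/* Minimal sketch: unpack an IEEE 754 single-precision word.
 * Field widths follow the format described in the text. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <math.h>

static void decode_single(float x)
{
    uint32_t w;
    memcpy(&w, &x, sizeof w);               /* view the 32-bit pattern      */

    uint32_t sign = w >> 31;                /* sign bit                     */
    uint32_t exp  = (w >> 23) & 0xFFu;      /* 8-bit biased exponent        */
    uint32_t frac = w & 0x7FFFFFu;          /* 23 stored fraction bits      */

    if (exp == 0 && frac == 0)
        printf("%sZERO\n", sign ? "-" : "+");
    else if (exp == 0)                      /* Emin flag: denormalized      */
        printf("denormalized, fraction bits 0x%06X\n", (unsigned)frac);
    else if (exp == 255 && frac == 0)       /* Emax flag: Infinity          */
        printf("%sInfinity\n", sign ? "-" : "+");
    else if (exp == 255)                    /* Emax flag: Not a Number      */
        printf("Not a Number\n");
    else                                    /* normalized: hidden leading ONE */
        printf("%ssignificand 1 + 0x%06X/2^23, effective exponent %d\n",
               sign ? "-" : "+", (unsigned)frac, (int)exp - 127);
}

int main(void)
{
    decode_single(6.5f);       /* 1.625 * 2^2 */
    decode_single(0.0f);
    decode_single(INFINITY);
    return 0;
}
```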
Floating-point addition. A flow chart for floating-point addition is shown in Figure 80.18. For this flowchart, the operands are assumed to be “unpacked” and normalized with magnitudes in the range [1, 2). On the flow chart, the operands are (Ea, Sa) and (Eb, Sb), the result is (Es, Ss), and the radix is 2. In step 1 the operand exponents are compared; if they are unequal, the significand of the number with the smaller exponent is shifted right in step 3 or 4 by the difference in the exponents to properly align the significands. For example, to add the decimal operands 0.867 × 10^5 and 0.512 × 10^4, the latter would be shifted right by 1 digit and 0.867 added to 0.0512 to give a sum of 0.9182 × 10^5. The addition of the significands is performed in step 5. Steps 6–8 test for overflow and correct if necessary by shifting the significand one position to the right and incrementing the exponent. Step 9 tests for a zero significand. The loop of steps 10–11 scales unnormalized (but non-ZERO) significands upward to normalize the result. Step 12 tests for underflow.
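The C sketch below follows the same sequence of steps for two unsigned, normalized operands. The fp_t type and the 30-bit fixed-point scaling of the significand are assumptions made for illustration only; steps 9–11 matter when the significand addition is an effective subtraction, which this unsigned sketch does not exercise, and no guard, round, or sticky bits are kept.

```c
/* Sketch of the addition flow: align, add, correct overflow, normalize,
 * test for underflow.  Representation: value = (sig / 2^30) * 2^exp,
 * with sig / 2^30 in [1, 2) for normalized numbers. */
#include <stdint.h>
#include <stdio.h>
#include <math.h>

typedef struct { int exp; uint32_t sig; } fp_t;

static fp_t fp_add(fp_t a, fp_t b)
{
    /* Steps 1-4: make a the operand with the larger exponent, align b.
     * Bits shifted out of b are simply discarded here. */
    if (a.exp < b.exp) { fp_t t = a; a = b; b = t; }
    int d = a.exp - b.exp;
    uint64_t bsig = (d < 32) ? ((uint64_t)b.sig >> d) : 0;

    /* Step 5: add the aligned significands (64 bits to keep the carry). */
    fp_t s = { a.exp, 0 };
    uint64_t sum = a.sig + bsig;

    /* Steps 6-8: significand overflow (sum >= 2): shift right, bump exponent. */
    if (sum >= (1ULL << 31)) { sum >>= 1; s.exp += 1; }

    /* Step 9: zero significand -> return the canonical ZERO. */
    if (sum == 0) return s;

    /* Steps 10-11: scale an unnormalized significand upward. */
    while (sum < (1ULL << 30)) { sum <<= 1; s.exp -= 1; }

    /* Step 12: exponent underflow test (Emin for single precision is -126). */
    if (s.exp < -126) fprintf(stderr, "underflow\n");

    s.sig = (uint32_t)sum;
    return s;
}

int main(void)
{
    fp_t a = { 3, 3u << 29 };   /* 1.5  * 2^3 = 12.0 */
    fp_t b = { 1, 5u << 28 };   /* 1.25 * 2^1 = 2.5  */
    fp_t s = fp_add(a, b);
    printf("%g\n", ldexp((double)s.sig, s.exp - 30));   /* prints 14.5 */
    return 0;
}
```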
Floating-point subtraction is implemented with a similar algorithm. Many refinements are possible to improve the speed of the addition and subtraction algorithms, but floating-point addition will, in general, be much slower than fixed-point addition as a result of the need for preaddition alignment and postaddition normalization.
Floating-point multiplication. The algorithm for floating-point multiplication forms the product of the operand significands and the sum of the operand exponents. For radix 2 floating-point numbers, the significand values are ≥ 1 and < 2. The product of two such numbers will be ≥ 1 and < 4. At most a single right shift is required to normalize the product.
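A minimal sketch of this algorithm follows, using the same illustrative fp_t representation as the addition sketch above (repeated here so the fragment stands alone). Truncating the product back to 30 fraction bits stands in for the rounding step discussed later.

```c
/* Sketch of floating-point multiplication: multiply significands, add
 * exponents, at most one right shift to normalize.
 * Representation: value = (sig / 2^30) * 2^exp, sig / 2^30 in [1, 2). */
#include <stdint.h>
#include <stdio.h>
#include <math.h>

typedef struct { int exp; uint32_t sig; } fp_t;

static fp_t fp_mul(fp_t a, fp_t b)
{
    fp_t p;
    p.exp = a.exp + b.exp;                       /* sum of the exponents      */
    uint64_t prod = (uint64_t)a.sig * b.sig;     /* prod / 2^60 lies in [1, 4) */

    if (prod >= (1ULL << 61)) {                  /* product >= 2: shift right once */
        prod >>= 1;
        p.exp += 1;
    }
    p.sig = (uint32_t)(prod >> 30);              /* truncate to 30 fraction bits */
    return p;
}

int main(void)
{
    fp_t a = { 1, 3u << 29 };                    /* 1.5 * 2^1 = 3.0 */
    fp_t b = { 2, 3u << 29 };                    /* 1.5 * 2^2 = 6.0 */
    fp_t p = fp_mul(a, b);
    printf("%g\n", ldexp((double)p.sig, p.exp - 30));   /* prints 18 */
    return 0;
}
```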
Floating-point division. The algorithm for floating-point division forms the quotient of the operand significands and the difference of the operand exponents. The quotient of two normalized significands will be ≥ 0.5 and < 2. At most a single left shift is required to normalize the quotient.
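The corresponding division sketch, under the same assumed representation, divides the significands, subtracts the exponents, and applies at most one left normalization shift; the quotient is simply truncated rather than rounded.

```c
/* Sketch of floating-point division: divide significands, subtract exponents,
 * at most one left shift to normalize.
 * Representation: value = (sig / 2^30) * 2^exp, sig / 2^30 in [1, 2). */
#include <stdint.h>
#include <stdio.h>
#include <math.h>

typedef struct { int exp; uint32_t sig; } fp_t;

static fp_t fp_div(fp_t a, fp_t b)
{
    fp_t q;
    q.exp = a.exp - b.exp;                               /* difference of exponents  */
    uint64_t quot = ((uint64_t)a.sig << 30) / b.sig;     /* quot / 2^30 in (0.5, 2)  */

    if (quot < (1ULL << 30)) {                           /* quotient < 1: shift left once */
        quot <<= 1;
        q.exp -= 1;
    }
    q.sig = (uint32_t)quot;
    return q;
}

int main(void)
{
    fp_t a = { 1, 1u << 30 };                            /* 1.0  * 2^1 = 2.0  */
    fp_t b = { 0, 5u << 28 };                            /* 1.25 * 2^0 = 1.25 */
    fp_t q = fp_div(a, b);
    printf("%g\n", ldexp((double)q.sig, q.exp - 30));    /* prints 1.6 */
    return 0;
}
```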
Floating-point rounding. All floating-point algorithms may require rounding to produce a result in the correct format. A variety of alternative rounding schemes have been developed for specific applications. Round to nearest, round toward positive infinity, round toward negative infinity, and round toward ZERO are required for implementations of the IEEE floating-point standard.
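The effect of the four required modes can be observed on a host that already implements the standard, using the C <fenv.h> interface to change the dynamic rounding mode. This is only a demonstration of the modes, not of a hardware rounding unit; the example value 1.0/3.0 is chosen so that the modes differ in the last bit of the single-precision result, and the FE_UPWARD, FE_DOWNWARD, and FE_TOWARDZERO macros may not be provided on every platform.

```c
/* Demonstrate the four IEEE 754 rounding modes via the C99 dynamic
 * rounding-mode interface. */
#include <fenv.h>
#include <stdio.h>

#pragma STDC FENV_ACCESS ON   /* tell the compiler the rounding mode may change */

int main(void)
{
    const int   modes[] = { FE_TONEAREST, FE_UPWARD, FE_DOWNWARD, FE_TOWARDZERO };
    const char *names[] = { "round to nearest",       "round toward +infinity",
                            "round toward -infinity", "round toward ZERO" };

    /* volatile operands keep the division from being folded at compile time */
    volatile float one = 1.0f, three = 3.0f;

    for (int i = 0; i < 4; i++) {
        fesetround(modes[i]);
        float third = one / three;        /* rounded per the current mode */
        printf("%-24s %.9f\n", names[i], third);
    }
    return 0;
}
```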