Floating-Point Arithmetic
Recent advances in VLSI have increased the feasibility of hardware implementations of floating-point arithmetic units. The main advantage of floating-point arithmetic is that its wide dynamic range virtually eliminates overflow for most applications.
Floating-point number systems. A floating-point number, A, consists of a significand (also called a fraction or a mantissa), Sa, and an exponent, Ea. The value of a number, A, is given by the equation

A = Sa · r^Ea
where r is the radix (or base) of the number system. Use of the binary radix (i.e., r = 2) gives maximum accuracy, but may require more frequent normalization than higher radices. The IEEE standard 754 single-precision (32-bit) floating-point format, which is widely implemented, has an 8-bit biased integer exponent which ranges between 0 and 255 [11]. The exponent is expressed in excess-127 code, so its effective value is determined by subtracting 127 from the stored value. Thus, the range of effective values of the exponent is –127 to 128, corresponding to stored values of 0 to 255, respectively. A stored exponent value of ZERO (Emin) serves as a flag for ZERO (if the significand is ZERO) and for denormalized numbers (if the significand is non-ZERO). A stored exponent value of 255 (Emax) serves as a flag for Infinity (if the significand is ZERO) and for “Not a Number” (if the significand is non-ZERO). The significand is a 25-bit sign-magnitude mixed number (the binary point is to the right of the most significant bit). The most significant bit is always a ONE except for denormalized numbers. More detail on floating-point formats and on the various considerations that arise in the implementation of floating-point arithmetic units is given in [7,12].
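As a rough illustration of this format, the following C sketch unpacks the sign, exponent, and fraction fields of a single-precision word and applies the Emin/Emax flag interpretations described above. The function name and the printed text are illustrative assumptions, not part of the standard.

```c
/* Minimal sketch: unpack an IEEE 754 single-precision word.
 * Field widths follow the format described in the text. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <math.h>

static void decode_single(float x)
{
    uint32_t w;
    memcpy(&w, &x, sizeof w);               /* view the 32-bit pattern      */

    uint32_t sign = w >> 31;                /* sign bit                     */
    uint32_t exp  = (w >> 23) & 0xFFu;      /* 8-bit biased exponent        */
    uint32_t frac = w & 0x7FFFFFu;          /* 23 stored fraction bits      */

    if (exp == 0 && frac == 0)
        printf("%sZERO\n", sign ? "-" : "+");
    else if (exp == 0)                      /* Emin flag: denormalized      */
        printf("denormalized, fraction bits 0x%06X\n", (unsigned)frac);
    else if (exp == 255 && frac == 0)       /* Emax flag: Infinity          */
        printf("%sInfinity\n", sign ? "-" : "+");
    else if (exp == 255)                    /* Emax flag: Not a Number      */
        printf("Not a Number\n");
    else                                    /* normalized: hidden leading ONE */
        printf("%ssignificand 1 + 0x%06X/2^23, effective exponent %d\n",
               sign ? "-" : "+", (unsigned)frac, (int)exp - 127);
}

int main(void)
{
    decode_single(6.5f);       /* 1.625 * 2^2 */
    decode_single(0.0f);
    decode_single(INFINITY);
    return 0;
}
```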
Floating-point addition. A flow chart for floating-point addition is shown in Figure 80.18. For this flowchart, the operands are assumed to be “unpacked” and normalized with magnitudes in the range [1, 2). On the flow chart, the operands are (Ea, Sa) and (Eb, Sb), the result is (Es, Ss), and the radix is 2. In step 1 the operand exponents are compared; if they are unequal, the significand of the number with the smaller exponent is shifted right in step 3 or 4 by the difference in the exponents to properly align the significands. For example, to add the decimal operands 0.867 × 10^5 and 0.512 × 10^4, the latter would be shifted right by 1 digit and 0.867 added to 0.0512 to give a sum of 0.9182 × 10^5. The addition of the significands is performed in step 5. Steps 6–8 test for overflow and correct if necessary by shifting the significand one position to the right and incrementing the exponent. Step 9 tests for a zero significand. The loop of steps 10–11 scales unnormalized (but non-ZERO) significands upward to normalize the result. Step 12 tests for underflow.
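The C sketch below follows the same sequence of steps for two unsigned, normalized operands. The fp_t type and the 30-bit fixed-point scaling of the significand are assumptions made for illustration only; steps 9–11 matter when the significand addition is an effective subtraction, which this unsigned sketch does not exercise, and no guard, round, or sticky bits are kept.

```c
/* Sketch of the addition flow: align, add, correct overflow, normalize,
 * test for underflow.  Representation: value = (sig / 2^30) * 2^exp,
 * with sig / 2^30 in [1, 2) for normalized numbers. */
#include <stdint.h>
#include <stdio.h>
#include <math.h>

typedef struct { int exp; uint32_t sig; } fp_t;

static fp_t fp_add(fp_t a, fp_t b)
{
    /* Steps 1-4: make a the operand with the larger exponent, align b.
     * Bits shifted out of b are simply discarded here. */
    if (a.exp < b.exp) { fp_t t = a; a = b; b = t; }
    int d = a.exp - b.exp;
    uint64_t bsig = (d < 32) ? ((uint64_t)b.sig >> d) : 0;

    /* Step 5: add the aligned significands (64 bits to keep the carry). */
    fp_t s = { a.exp, 0 };
    uint64_t sum = a.sig + bsig;

    /* Steps 6-8: significand overflow (sum >= 2): shift right, bump exponent. */
    if (sum >= (1ULL << 31)) { sum >>= 1; s.exp += 1; }

    /* Step 9: zero significand -> return the canonical ZERO. */
    if (sum == 0) return s;

    /* Steps 10-11: scale an unnormalized significand upward. */
    while (sum < (1ULL << 30)) { sum <<= 1; s.exp -= 1; }

    /* Step 12: exponent underflow test (Emin for single precision is -126). */
    if (s.exp < -126) fprintf(stderr, "underflow\n");

    s.sig = (uint32_t)sum;
    return s;
}

int main(void)
{
    fp_t a = { 3, 3u << 29 };   /* 1.5  * 2^3 = 12.0 */
    fp_t b = { 1, 5u << 28 };   /* 1.25 * 2^1 = 2.5  */
    fp_t s = fp_add(a, b);
    printf("%g\n", ldexp((double)s.sig, s.exp - 30));   /* prints 14.5 */
    return 0;
}
```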
Floating-point subtraction is implemented with a similar algorithm. Many refinements are possible to improve the speed of the addition and subtraction algorithms, but floating-point addition will, in general, be much slower than fixed-point addition as a result of the need for preaddition alignment and postaddition normalization.
Floating-point multiplication. The algorithm for floating-point multiplication forms the product of the operand significands and the sum of the operand exponents. For radix 2 floating-point numbers, the significand values are ≥ 1 and < 2. The product of two such numbers will be ≥ 1 and < 4. At most a single right shift is required to normalize the product.
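A minimal sketch of this algorithm follows, using the same illustrative fp_t representation as the addition sketch above (repeated here so the fragment stands alone). Truncating the product back to 30 fraction bits stands in for the rounding step discussed later.

```c
/* Sketch of floating-point multiplication: multiply significands, add
 * exponents, at most one right shift to normalize.
 * Representation: value = (sig / 2^30) * 2^exp, sig / 2^30 in [1, 2). */
#include <stdint.h>
#include <stdio.h>
#include <math.h>

typedef struct { int exp; uint32_t sig; } fp_t;

static fp_t fp_mul(fp_t a, fp_t b)
{
    fp_t p;
    p.exp = a.exp + b.exp;                       /* sum of the exponents      */
    uint64_t prod = (uint64_t)a.sig * b.sig;     /* prod / 2^60 lies in [1, 4) */

    if (prod >= (1ULL << 61)) {                  /* product >= 2: shift right once */
        prod >>= 1;
        p.exp += 1;
    }
    p.sig = (uint32_t)(prod >> 30);              /* truncate to 30 fraction bits */
    return p;
}

int main(void)
{
    fp_t a = { 1, 3u << 29 };                    /* 1.5 * 2^1 = 3.0 */
    fp_t b = { 2, 3u << 29 };                    /* 1.5 * 2^2 = 6.0 */
    fp_t p = fp_mul(a, b);
    printf("%g\n", ldexp((double)p.sig, p.exp - 30));   /* prints 18 */
    return 0;
}
```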
Floating-point division. The algorithm for floating-point division forms the quotient of the operand significands and the difference of the operand exponents. The quotient of two normalized significands will be ≥ 0.5 and < 2. At most a single left shift is required to normalize the quotient.
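The corresponding division sketch, under the same assumed representation, divides the significands, subtracts the exponents, and applies at most one left normalization shift; the quotient is simply truncated rather than rounded.

```c
/* Sketch of floating-point division: divide significands, subtract exponents,
 * at most one left shift to normalize.
 * Representation: value = (sig / 2^30) * 2^exp, sig / 2^30 in [1, 2). */
#include <stdint.h>
#include <stdio.h>
#include <math.h>

typedef struct { int exp; uint32_t sig; } fp_t;

static fp_t fp_div(fp_t a, fp_t b)
{
    fp_t q;
    q.exp = a.exp - b.exp;                               /* difference of exponents  */
    uint64_t quot = ((uint64_t)a.sig << 30) / b.sig;     /* quot / 2^30 in (0.5, 2)  */

    if (quot < (1ULL << 30)) {                           /* quotient < 1: shift left once */
        quot <<= 1;
        q.exp -= 1;
    }
    q.sig = (uint32_t)quot;
    return q;
}

int main(void)
{
    fp_t a = { 1, 1u << 30 };                            /* 1.0  * 2^1 = 2.0  */
    fp_t b = { 0, 5u << 28 };                            /* 1.25 * 2^0 = 1.25 */
    fp_t q = fp_div(a, b);
    printf("%g\n", ldexp((double)q.sig, q.exp - 30));    /* prints 1.6 */
    return 0;
}
```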
Floating-point rounding. All floating-point algorithms may require rounding to produce a result in the correct format. A variety of alternative rounding schemes have been developed for specific applications. Round to nearest, round toward positive infinity, round toward negative infinity, and round toward ZERO are required for implementations of the IEEE floating-point standard.
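The effect of the four required modes can be observed on a host that already implements the standard, using the C <fenv.h> interface to change the dynamic rounding mode. This is only a demonstration of the modes, not of a hardware rounding unit; the example value 1.0/3.0 is chosen so that the modes differ in the last bit of the single-precision result, and the FE_UPWARD, FE_DOWNWARD, and FE_TOWARDZERO macros may not be provided on every platform.

```c
/* Demonstrate the four IEEE 754 rounding modes via the C99 dynamic
 * rounding-mode interface. */
#include <fenv.h>
#include <stdio.h>

#pragma STDC FENV_ACCESS ON   /* tell the compiler the rounding mode may change */

int main(void)
{
    const int   modes[] = { FE_TONEAREST, FE_UPWARD, FE_DOWNWARD, FE_TOWARDZERO };
    const char *names[] = { "round to nearest",       "round toward +infinity",
                            "round toward -infinity", "round toward ZERO" };

    /* volatile operands keep the division from being folded at compile time */
    volatile float one = 1.0f, three = 3.0f;

    for (int i = 0; i < 4; i++) {
        fesetround(modes[i]);
        float third = one / three;        /* rounded per the current mode */
        printf("%-24s %.9f\n", names[i], third);
    }
    return 0;
}
```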