High-performance computing, 3D graphics, and signal processing rely on high-performance floating-point units. More than 80% of the operations in floating-point computation are additions and multiplications, and in most cases a multiplication is followed by an addition. This motivates a more efficient fused multiply-add unit that raises the overall throughput and power efficiency of a floating-point unit. A fused multiply-add (FMA) unit performs both the multiply and the add in a single iteration.
An FMA unit can reach higher throughput and lower power by operating at reduced precision, but at the cost of accuracy. Higher-precision formats offer higher accuracy but are more expensive in power and throughput. In many cases, however, an input supplied in a higher precision can be represented in a lower precision without losing accuracy. This makes it possible to accept higher-precision operands, perform the floating-point arithmetic in lower precision, and deliver the result in higher precision again.
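The idea that a higher-precision input can sometimes be held in a lower precision without loss can be illustrated in software. The following is a minimal sketch, not part of the proposed hardware: the helper names `fits_exactly` and `narrowest_format` are hypothetical, and the check is a simple round trip through Python's `struct` module.

```python
import struct

def fits_exactly(value: float, fmt: str) -> bool:
    """Return True if `value` survives a round trip through `fmt` unchanged.

    `fmt` is a struct format code: "e" = half, "f" = single, "d" = double.
    """
    try:
        (back,) = struct.unpack("<" + fmt, struct.pack("<" + fmt, value))
    except (OverflowError, struct.error):
        return False  # magnitude too large for the narrower format
    return back == value

def narrowest_format(value: float) -> str:
    """Name the narrowest IEEE 754 binary format that holds `value` exactly."""
    for name, fmt in (("half", "e"), ("single", "f"), ("double", "d")):
        if fits_exactly(value, fmt):
            return name
    return "double"  # hypothetical fallback, e.g. for NaN

if __name__ == "__main__":
    for x in (0.5, 1.0 + 2.0 ** -20, 0.1):
        print(x, "->", narrowest_format(x))
```

A value such as 0.5 needs only half precision, while 0.1 stored as a double cannot be narrowed without changing its value.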
This allows circuits to reach higher throughput at lower power by using a lower-precision datapath. For example, a double-precision fused multiply-add unit can perform one double-precision, two single-precision, or four half-precision operations in the same cycle.
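The lane arithmetic behind this example is that one 64-bit word holds one double, two singles, or four halves. A small sketch using Python's `struct` module shows the three layouts; the function name `lanes` and the mode strings are hypothetical, and the little-endian lane order is an assumption made only for illustration.

```python
import struct

# One 64-bit datapath word can carry any of the three operand layouts.
one_double = struct.pack("<d", 1.5)
two_singles = struct.pack("<2f", 1.5, -2.25)
four_halves = struct.pack("<4e", 1.5, -2.25, 0.5, 8.0)

def lanes(word: bytes, mode: str) -> tuple:
    """Unpack a 64-bit word as 1 double, 2 singles, or 4 halves."""
    layout = {"double": "<d", "single": "<2f", "half": "<4e"}[mode]
    return struct.unpack(layout, word)

if __name__ == "__main__":
    print(len(one_double), len(two_singles), len(four_halves))
    print(lanes(four_halves, "half"))
```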
This solution extends the work in [1], aiming to improve the fused multiply-add unit so that it implements double-precision operation compliant with the IEEE 754 standard [2]. The proposed solution performs one double-precision, two single-precision, or four half-precision operations in a single iteration. The multiplier will use radix-4 encoding to generate partial products, followed by Wallace tree compression; the final adder will be a carry-lookahead adder.
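The multiplier structure named above can be sketched behaviourally. The sketch below assumes that "radix-4 encoding" means radix-4 Booth recoding, models the Wallace tree as layers of 3:2 carry-save compressors, and lets Python's `sum` stand in for the carry-lookahead adder. All function names are hypothetical, and this is an integer model of the significand multiplier only, not the proposed RTL.

```python
BOOTH_DIGIT = {0: 0, 1: 1, 2: 1, 3: 2, 4: -2, 5: -1, 6: -1, 7: 0}

def booth_radix4_digits(multiplier, bits):
    """Recode an unsigned `bits`-wide multiplier into radix-4 digits in {-2..2}."""
    padded = multiplier << 1  # implicit 0 below the least significant bit
    # Windows overlap by one bit; the top window covers the zero extension.
    return [BOOTH_DIGIT[(padded >> i) & 0b111] for i in range(0, bits + 1, 2)]

def carry_save(a, b, c):
    """3:2 compressor: a + b + c == sum word + carry word, no carry propagation."""
    return a ^ b ^ c, ((a & b) | (a & c) | (b & c)) << 1

def reduce_rows(rows):
    """Compress partial-product rows three at a time until two remain."""
    rows = list(rows)
    while len(rows) > 2:
        nxt = []
        for i in range(0, len(rows) - 2, 3):
            nxt.extend(carry_save(*rows[i:i + 3]))
        nxt.extend(rows[len(rows) - len(rows) % 3:])  # leftover rows pass through
        rows = nxt
    return rows

def multiply(a, b, bits):
    """Multiply unsigned `bits`-wide integers via Booth rows and carry-save reduction."""
    digits = booth_radix4_digits(b, bits)
    rows = [(d * a) << (2 * k) for k, d in enumerate(digits)]
    return sum(reduce_rows(rows))  # final two-operand add

if __name__ == "__main__":
    print(booth_radix4_digits(0b10110101, 8))
    print(multiply(181, 77, 8), 181 * 77)
```

Radix-4 recoding roughly halves the number of partial-product rows compared with one row per multiplier bit, which is why it is commonly paired with a compression tree.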
The proposed architecture, shown in Figure 1, takes three operands: a, b, and c. The multiplication is performed in the first clock cycle, while the final exponent is calculated in parallel. The multiplication result is then added to the third operand, c, in the second cycle. Finally, the result is rounded and normalized in the third cycle.
Figure 1 - Proposed Trans Precision FMA Architecture
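The three-cycle flow can be mirrored by a small numerical model, given here as a sketch under stated assumptions: the names `split` and `fma_model` are hypothetical, inputs are assumed to be finite doubles whose result does not overflow, and exact integer arithmetic plus a single final rounding stands in for the hardware stages. It models only the numerical behaviour, not timing or the variable-precision lanes.

```python
import math
from fractions import Fraction

def split(x: float):
    """Return (m, e) with x == m * 2**e and m an integer significand."""
    frac, exp = math.frexp(x)           # x == frac * 2**exp, 0.5 <= |frac| < 1
    return int(frac * 2 ** 53), exp - 53  # exact: a double has at most 53 bits

def fma_model(a: float, b: float, c: float) -> float:
    (ma, ea), (mb, eb), (mc, ec) = split(a), split(b), split(c)
    # Stage 1: multiply the significands; the exponent sum is formed alongside.
    prod, eprod = ma * mb, ea + eb
    # Stage 2: align product and c to a common exponent, then add.
    e = min(eprod, ec)
    total = (prod << (eprod - e)) + (mc << (ec - e))
    # Stage 3: round once; float() of a Fraction rounds to nearest.
    return float(Fraction(total) * Fraction(2) ** e)

if __name__ == "__main__":
    a, b, c = 1.0 + 2.0 ** -30, 1.0 - 2.0 ** -30, -1.0
    print("fused:  ", fma_model(a, b, c))
    print("unfused:", a * b + c)
```

With these inputs the separately rounded product loses the low-order term before the addition, whereas the fused model keeps it because rounding happens only once at the end.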
The final product of the multiplication has twice as many bits as each operand. To obtain a result of the same width as the operands, the product must be truncated and rounded. The architecture introduces approximation into the floating-point calculation by turning off some of the multiplier's partial products in the datapath, under control of the certainty tracker described in the next paragraph. Approximation is applied only to the bits that would be truncated from the result.
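The effect of switching off low-order partial products can be explored with an integer model. This is a minimal sketch, assuming a plain (non-Booth) partial-product array and a hypothetical parameter `cut` that marks the lowest column still computed; the names `truncated_product` and `dropped_bound` are not from the source.

```python
def truncated_product(a: int, b: int, bits: int, cut: int) -> int:
    """Sum only the partial-product cells in columns k >= cut (weight 2**k)."""
    total = 0
    for j in range(bits):
        if (b >> j) & 1:
            row = a << j                      # partial product for multiplier bit j
            total += row & ~((1 << cut) - 1)  # cells below column `cut` are off
    return total

def dropped_bound(cut: int) -> int:
    """Upper bound on the value lost by switching off columns 0..cut-1."""
    # Column k holds at most k + 1 cells, so the loss is at most
    # sum((k + 1) * 2**k for k < cut) = (cut - 1) * 2**cut + 1 <= cut * 2**cut.
    return cut << cut

if __name__ == "__main__":
    a, b, bits, cut = 0xB5, 0x4D, 8, 6
    exact, approx = a * b, truncated_product(a, b, bits, cut)
    print(exact, approx, exact - approx, dropped_bound(cut))
```

Because `cut` stays at or below the operand width, only cells that feed the discarded lower half of the product are removed; the bound shows how far their absence can still disturb the kept bits through lost carries.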
The certainty tracker monitors precision using the exponent values, in parallel with all operations. It predicts which parts of the multiplier can be turned off to reduce power, and then checks whether the final output has the desired precision. If the result is not precise enough, the whole operation can be repeated without approximation.
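The approximate-then-verify loop can be sketched in software. The model below is a simplified stand-in, not the exponent-based tracker of [1] or of the proposed design: it assumes a worst-case error bound on the switched-off partial products and reruns exactly whenever that bound could reach the kept bits. The names `truncated_product` and `tracked_multiply` and the parameter `cut` are hypothetical.

```python
def truncated_product(a: int, b: int, bits: int, cut: int) -> int:
    """Sum only the partial-product cells in columns k >= cut."""
    mask = ~((1 << cut) - 1)
    return sum((a << j) & mask for j in range(bits) if (b >> j) & 1)

def tracked_multiply(a: int, b: int, bits: int, cut: int):
    """Return (upper product bits, reran) for unsigned `bits`-wide operands."""
    approx = truncated_product(a, b, bits, cut)
    worst = approx + (cut << cut)        # exact product lies in [approx, worst]
    if approx >> bits == worst >> bits:  # the error cannot reach the kept bits
        return approx >> bits, False
    return (a * b) >> bits, True         # uncertain: redo without approximation

if __name__ == "__main__":
    reruns = sum(tracked_multiply(a, b, 8, 4)[1]
                 for a in range(256) for b in range(256))
    print("exact reruns needed:", reruns, "of", 256 * 256)
```

In this model the returned upper bits always match the exact truncated product, and the rerun count gives a rough feel for how often the cheaper approximate pass would suffice for a given `cut`.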
The proposed solution aims to deliver a highly accurate, high-throughput floating-point FMA with minimal power.
[1] H. Kaul et al., “A 1.45GHz 52-to-162GFLOPS/W variable-precision floating-point fused multiply-add unit with certainty tracking in 32nm CMOS,” Dig. Tech. Pap. - IEEE Int. Solid-State Circuits Conf., vol. 55, pp. 182–183, 2012, doi: 10.1109/ISSCC.2012.6176987.
[2] Microprocessor Standards Committee, IEEE Standard for Floating-Point Arithmetic, IEEE Std 754-2019, 2019.
Muhammad Usman
Shaukat Ali
Muhammad Shoaib
Dr. Hassan Saif
Dr. Rashad Ramzan
National University of Computer and Emerging Sciences (FAST-NUCES) introduced the first specialized MS Integrated Circuit (IC) Design program in Pakistan in Spring 2020. To learn more about this nascent MS IC Design program, see the link below.
http://isb.nu.edu.pk/rfcs2/MS.htm
This design implements a variable-precision floating-point fused multiply-add (FMA) unit capable of performing one double-precision, two single-precision, or four half-precision FMA operations in each iteration. It also tries to minimize the power consumption of each iteration by using runtime precision selection.
processor