This is a proposal for IEEE PICO design contest. A Matrix Multiplier based on Systolic Array parallel computation with hybrid Modified Booth Wallace tree multiplier as processing element for AI on edge applications is proposed that enhance computational speed while reducing area and power consumption. The proposed design will have the big (O) of 2N-1 with power consumption < 3 W.
Machine learning is an emerging field that is now being widely used for consumer-based applications. Machine learning and Artificial Intelligence (AI) algorithms are implemented in two stages. The first stage is called training in which weights are set using forward and back propagation. This is a computation intensive and power-hungry task. This task is mainly handled on GPUs or ASICs.
The second stage is the process of running a trained neural network to classify data or estimate a value is called inference. Practically, inference is implemented on consumer end devices that have area and power constraints. When we implement inference on general purpose micro-controllers, the speed is highly compromised which may result in failure of real time processes i.e., object detection in Unmanned Aerial Vehicles (UAVs).
Inference employs large-scale matrix multiplication between input and weights. Conventional matrix is done in steps. By employing parallel computation algorithms, we are proposing a matrix multiplier that executes multiplications and accumulation operation at greater speed while saving area and power as compared to general-purpose micro-controllers for AI on edge applications.
In this project, we will implement parallel computation for energy efficient and fast matrix multiplication. The algorithm that we are using for matrix multiplication is systolic array as shown in Figure 1. It uses processing elements (PE) [1] [2]. Conventional matrix multiplication gives an output after steps whereas the output of systolic array is which is quite faster than conventional matrix multiplication [3]. For example, if we are multiplying two 3x3 matrices, we would employ 5x5 PEs and output would occur at T=8.
The computation of systolic array is implemented in such a way that for T=1; 1^{st} row and 1^{st} column of matrix A and B respectively is input into the array. The output is propagated diagonally as shown in Figure 1 as where _{ }is previous output. At T=2; 2^{nd} row and 2^{nd} column of matrix A and B is entered into the array. If A is our input matrix and B is our weights matrix then the input data flow propagates from top to bottom and weights data flow propagates from left to right.
As we are implementing an algorithm that does parallel computation with limited power and area, so we are integrating Modified Booth Wallace Multiplier as our PE in our matrix multiplication algorithm [4].
It consists of four major modules: Booth encoder, partial product generator, Wallace tree formation and a fast adder such as Carry Look ahead Adder as shown in Figure 2.
The Modified Booth algorithm reduces partial product by for Radix-4 encoding and by for Radix-8 encoding [5]. We are using Modified Booth to reduce area and by using Wallace tree, we are reducing our delay thus making Modified Booth Wallace Tree one of the fastest energy efficient multiplies.
Figure 1 Matrix multiplication using systolic array algorithm representation
Figure 2 Modified Booth Wallace multiplier algorithm representation
Figure 3 Proposed architecture for chip
The proposed design specifications are given below:
Table 1 Proposed Design Specifications
Power Consumption |
<3W |
Throughput |
1GOPs |
Big (O) |
2N-1 |
[1] Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, et al., “In-Datacenter Performance Analysis of a Tensor Processing Unit”, 44th Annual International Symposium on Computer Architecture, Association for Computing Machinery, pp. 1–12, 2017
[2] Zhijie Yang, Lei Wang, Dong Ding, Xiangyu Zhang, Yu Deng, et al., “Systolic Array Based Accelerator and Algorithm Mapping for Deep Learning Algorithms”, 15th IFIP International Conference on Network and Parallel Computing (NPC), pp.153-158, 2018
[3] Halil Snopce and Azir Aliu, “Comparison among Performance Measures for Parallel Matrix Multiplication Algorithms”, Research Journal of Applied Sciences, Engineering and Technology, (21): 4415-4422.
[4] Farrukh, Fasih Ud Din & Zhang, Chun & Yancao, Jiang & Zhang, Zhonghan & Ziqiang, Wang & Wang, Zhihua & Jiang, Hanjun, “Power Efficient Tiny Yolo CNN using Reduced Hardware Resources based on Booth Multiplier and WALLACE Tree Adders”, IEEE Open Journal of Circuits and Systems, PP. 1-1. 10.1109/OJCAS.2020.3007334, 2020
[5] Savita Nair, Ajit Sara,” A Review Paper on Comparison of Multipliers based on Performance Parameters”, International Journal of Computer Applications, Vol. 5, Issue. 4, pp. 6-9, 2014
This is a proposal for IEEE PICO design contest. A Matrix Multiplier based on Systolic Array parallel computation with hybrid Modified Booth Wallace tree multiplier as processing element for AI on edge applications is proposed that enhance computational speed while reducing area and power consumption. The proposed design will have the big (O) of 2N-1 with power consumption < 3 W.
acc
sky130A