Efabless Logo
MATRIX...
public project
MPW-8   

Matrix Multiplier using Systolic Array and modified Booth-Wallace Tree for AI on Edge Applications

Abstract

This is a proposal for IEEE PICO design contest. A Matrix Multiplier based on Systolic Array parallel computation with hybrid Modified Booth Wallace tree multiplier as processing element for AI on edge applications is proposed that enhance computational speed while reducing area and power consumption. The proposed design will have the big (O) of 2N-1 with power consumption < 3 W.

Motivation

Machine learning is an emerging field that is now being widely used for consumer-based applications. Machine learning and Artificial Intelligence (AI) algorithms are implemented in two stages. The first stage is called training in which weights are set using forward and back propagation. This is a computation intensive and power-hungry task. This task is mainly handled on GPUs or ASICs.

The second stage is the process of running a trained neural network to classify data or estimate a value is called inference. Practically, inference is implemented on consumer end devices that have area and power constraints. When we implement inference on general purpose micro-controllers, the speed is highly compromised which may result in failure of real time processes i.e., object detection in Unmanned Aerial Vehicles (UAVs).  

Inference employs large-scale matrix multiplication between input and weights. Conventional matrix is done in  steps. By employing parallel computation algorithms, we are proposing a matrix multiplier that executes multiplications and accumulation operation at greater speed while saving area and power as compared to general-purpose micro-controllers for AI on edge applications.

Proposed Solution

  • Matrix Multiplication Using Systolic Array

In this project, we will implement parallel computation for energy efficient and fast matrix multiplication. The algorithm that we are using for matrix multiplication is systolic array as shown in Figure 1. It uses  processing elements (PE) [1] [2]. Conventional matrix multiplication gives an output after  steps whereas the output of systolic array is  which is quite faster than conventional matrix multiplication [3]. For example, if we are multiplying two 3x3 matrices, we would employ 5x5 PEs and output would occur at T=8.

The computation of systolic array is implemented in such a way that for T=1; 1st row and 1st column of matrix A and B respectively is input into the array. The output is propagated diagonally as shown in Figure 1 as  where  is previous output. At T=2; 2nd row and 2nd column of matrix A and B is entered into the array. If A is our input matrix and B is our weights matrix then the input data flow propagates from top to bottom and weights data flow propagates from left to right.

  • Modified Booth Wallace Multiplier:

As we are implementing an algorithm that does parallel computation with limited power and area, so we are integrating Modified Booth Wallace Multiplier as our PE in our matrix multiplication algorithm [4].

It consists of four major modules: Booth encoder, partial product generator, Wallace tree formation and a fast adder such as Carry Look ahead Adder as shown in Figure 2.

The Modified Booth algorithm reduces partial product by  for Radix-4 encoding and by  for Radix-8 encoding [5]. We are using Modified Booth to reduce area and by using Wallace tree, we are reducing our delay thus making Modified Booth Wallace Tree one of the fastest energy efficient multiplies.

Block Diagram

Figure 1 Matrix multiplication using systolic array algorithm representation

Figure 2 Modified Booth Wallace multiplier algorithm representation

Figure 3 Proposed architecture for chip

Project Specifications

The proposed design specifications are given below:

Table 1 Proposed Design Specifications

Power Consumption

<3W

Throughput

1GOPs

Big (O)

2N-1

Team Members

  1. Shaheer Ashraf (Team Lead) NUCES (FAST-NU)
  2. Muhammad Uzair NUCES(FAST-NU)
  3. Faizan Khan NUCES (FAST-NU)
  4. Prof. Dr. Rashad Ramzan NUCES (FAST-NU)
  5. Dr. Hassan Saif NUCES (FAST-NU)

References

[1] Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, et al., “In-Datacenter Performance Analysis of a Tensor Processing Unit”, 44th Annual International Symposium on Computer Architecture, Association for Computing Machinery, pp. 1–12, 2017

[2] Zhijie Yang, Lei Wang, Dong Ding, Xiangyu Zhang, Yu Deng, et al., “Systolic Array Based Accelerator and Algorithm Mapping for Deep Learning Algorithms”, 15th IFIP International Conference on Network and Parallel Computing (NPC), pp.153-158, 2018

[3] Halil Snopce and Azir Aliu, “Comparison among Performance Measures for Parallel Matrix Multiplication Algorithms”, Research Journal of Applied Sciences, Engineering and Technology, (21): 4415-4422.

[4] Farrukh, Fasih Ud Din & Zhang, Chun & Yancao, Jiang & Zhang, Zhonghan & Ziqiang, Wang & Wang, Zhihua & Jiang, Hanjun, “Power Efficient Tiny Yolo CNN using Reduced Hardware Resources based on Booth Multiplier and WALLACE Tree Adders”, IEEE Open Journal of Circuits and Systems, PP. 1-1. 10.1109/OJCAS.2020.3007334, 2020

[5] Savita Nair, Ajit Sara,” A Review Paper on Comparison of Multipliers based on Performance Parameters”, International Journal of Computer Applications, Vol. 5, Issue. 4, pp. 6-9, 2014

Organization URL

http://isb.nu.edu.pk/rfcs2/

Description

This is a proposal for IEEE PICO design contest. A Matrix Multiplier based on Systolic Array parallel computation with hybrid Modified Booth Wallace tree multiplier as processing element for AI on edge applications is proposed that enhance computational speed while reducing area and power consumption. The proposed design will have the big (O) of 2N-1 with power consumption < 3 W.

Category

acc

Process

sky130A