harshvagadiya / MAC_Vectors_For_DNN_Accelerators

Maintained by harshvagadiya
The continuous progress of Deep Neural Network (DNN) models, marked by their escalating complexity and scale, together with the abundance of available training data, has generated an unprecedented need for computational resources. Executing these emerging workloads on general-purpose compute cores, however, presents notable difficulties in terms of memory usage and power consumption. Numerous approaches have been investigated to tackle these challenges, and among them DNN accelerator architectures have emerged as a prominent solution. Recent research advocates using Posit and Fixed Posit representations instead of Floating Point (FP) representations for inference on emerging workloads, enhancing system performance without loss in accuracy.

We propose to design a System on Chip (SoC) consisting of MAC Vectors for the Simba accelerator [1], the Gemmini accelerator [2], a Shared Global Buffer, a Controller, and IO interfaces. A noteworthy aspect of this architecture is the distribution of eight input words among eight Vector MAC units, where each unit consists of eight multipliers whose products are combined through an adder tree into a single word. The accumulation buffer plays a vital role: it stores this word by progressively adding it to the previous partial sum. Each Vector MAC unit is also equipped with its own weight buffer, facilitating parallelized operation.

Shared Global Buffer: The Shared Global Buffer is an SRAM memory connected to the external DRAM and shared between the two accelerators. By providing a shared memory space, it allows efficient data exchange and communication between the accelerators, promoting seamless collaboration during computation tasks. Utilizing the Shared Global Buffer can significantly reduce off-chip data accesses.
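As an illustration of the datapath described above, the following minimal Python model sketches one Vector MAC unit: eight multipliers, a three-level adder tree, and an accumulation buffer that adds each reduced word to the previous partial sum. Class and method names are illustrative assumptions, not the actual RTL, and the real hardware would operate on Posit, Fixed Posit, or FP words rather than Python integers.

```python
class VectorMAC:
    """Behavioral sketch of one Vector MAC unit (names are illustrative)."""

    def __init__(self):
        self.weight_buffer = [0] * 8   # per-unit weight buffer
        self.partial_sum = 0           # one accumulation-buffer entry

    def load_weights(self, weights):
        assert len(weights) == 8
        self.weight_buffer = list(weights)

    def step(self, inputs):
        """One cycle: eight multiplies, adder-tree reduction, accumulate."""
        assert len(inputs) == 8
        products = [x * w for x, w in zip(inputs, self.weight_buffer)]
        # Adder tree: pairwise reduction in log2(8) = 3 levels.
        level = products
        while len(level) > 1:
            level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
        # Accumulation buffer: add the single reduced word to the
        # previous partial sum.
        self.partial_sum += level[0]
        return self.partial_sum


# The eight input words are broadcast across eight Vector MAC units,
# each holding its own weights in its private weight buffer.
units = [VectorMAC() for _ in range(8)]
```

As a usage example, loading unit weights of all ones and stepping twice with inputs of all ones then all twos accumulates 8 and then 24 into the partial sum.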
Instead of frequently accessing data from the external DRAM, which involves higher latencies and consumes more power, the accelerators can leverage the shared buffer to store and retrieve intermediate data. This minimizes constant data transfers to and from the external DRAM, reducing data-access latency and lowering overall power consumption. The resulting optimized data flow improves system performance and energy efficiency, making the Shared Global Buffer a valuable asset in the overall accelerator architecture.

Controller: The Controller is responsible for deciding how the Shared Global Buffer is used and which number representation the Multiply-Accumulate (MAC) operation employs. It manages and arbitrates buffer access between the two accelerators, deciding which accelerator can use the buffer at a given time so that data exchanges occur seamlessly and without conflicts. By controlling access to the Shared Global Buffer, the Controller optimizes data flow, minimizing data-transfer overhead and improving overall system performance. The Controller also enables either one of the accelerator designs and determines the number representation for the MAC operation, giving us the flexibility to choose among Posit, Fixed Posit, and FP. By taking charge of these critical decisions, the Controller ensures that the accelerators work cohesively and efficiently, making the best use of the Shared Global Buffer and employing the most suitable number representation for MAC operations.
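The Controller's two roles described above, buffer arbitration and number-representation selection, can be sketched in a short behavioral Python model. This is a hypothetical sketch only: the class names, the lock-based arbitration, and the format enumeration are assumptions for illustration, not the actual controller design.

```python
from enum import Enum, auto
import threading


class NumFormat(Enum):
    """The three number representations the proposal supports for MAC."""
    POSIT = auto()
    FIXED_POSIT = auto()
    FP = auto()


class Controller:
    """Hypothetical behavioral model: arbitrates the Shared Global Buffer
    between the two accelerators and selects the MAC number format."""

    def __init__(self):
        self._buffer_lock = threading.Lock()
        self.active = None             # which accelerator holds the buffer
        self.mac_format = NumFormat.FP # default representation

    def select_format(self, fmt):
        """Choose Posit, Fixed Posit, or FP for the MAC operation."""
        self.mac_format = fmt

    def acquire_buffer(self, accel_name):
        """Grant the Shared Global Buffer to one accelerator at a time."""
        granted = self._buffer_lock.acquire(blocking=False)
        if granted:
            self.active = accel_name
        return granted

    def release_buffer(self, accel_name):
        """Release the buffer so the other accelerator can use it."""
        if self.active == accel_name:
            self.active = None
            self._buffer_lock.release()
```

For example, if "simba" holds the buffer, a request from "gemmini" is denied until "simba" releases it, modeling conflict-free, one-at-a-time access.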
This intelligent control and coordination contribute to the overall effectiveness and performance of the accelerator system, meeting the demands of various computation tasks with minimized off-chip data accesses.

References:
1. Brian Zimmer et al. 2020. A 0.32–128 TOPS, scalable multi-chip-module-based deep neural network inference accelerator with ground-referenced signaling in 16 nm. IEEE Journal of Solid-State Circuits 55, 4 (2020), 920–932.
2. H. Genc et al. 2021. Gemmini: Enabling Systematic Deep-Learning Architecture Evaluation via Full-Stack Integration. In 2021 58th ACM/IEEE Design Automation Conference (DAC), San Francisco, CA, USA, pp. 769–774. doi: 10.1109/DAC18074.2021.9586216.

Block Diagram: https://ieee-cas.org/system/files/webform/unic_cass_proposals/2115/Block_Diagram.pdf