Efabless

Factored MAC for...

public project

MPW-5

Repository

Description

Parallel Multiplier:

Multipliers dominate the datapath of the systolic array architecture. A parallel multiplier consists of three major parts: 1) partial product generation; 2) partial products reduction trees (e.g., Wallace tree); and 3) addition of the pair of final sum and carry rows of partial products using carry propagate adder (CPA) to get the final product.

Consider an M × N-bit multiplier such that x_M-1x_M-2… x₁x₀ and y_N-1x_N-2… y₁y₀ are the multiplier and multiplicand, respectively. The N rows of M-bit partial products PP_i,j can be expressed as:

PP_i,j = x_iy_i ∀i,j, 0 ≤ i ≤ M, 0 ≤ i ≤ N

In fast multipliers, the modified Booth encoding algorithm is employed to reduce the height of partial products. A radix-R=2^r, r>0, booth multiplier reduces the height of partial products from N to ⌈(N+1)/r⌉, reducing the size and enhancing the speed of the reduction tree. The existing systolic array architectures employ radix-4 booth multipliers which has ⌈(N+1)/2⌉ number of partial products.

Radix-8 Booth Multiplier:

In radix-8 booth multipliers the height of the partial products reduces to ⌈(N+1)/3⌉, which brings a noticeable reduction in the area of partial products reduction tree and critical path delay. However, booth multipliers with a radix higher than 2 requires hard multiples, that is not a power of two and are not obtainable using simple shift and complement operations. In radix-8 booth multiplier ±0, ±Y, ±2Y, ±3Y,±4Y multiples of the multiplicand Y are required, and the generation of hard multiple 3Y involves the addition of Y and 2Y using CPA which slows down the multiplier.

The radix-8 booth recoding partitions the multiplier X into sets of 4 bits, 3 adjacent binary bits x_3(i+1)-1x_3i+1x_3i and a borrow bit x_3i-1 from the previous set, each set encoded into the control lines s, d, t, q, n using:

s_i = (~(x_3(i+1)-1⊕ x_3i+1) ∧ (x_3i⊕ x_3i-1))

d_i = ((x_3i⊕x_3i+1) ∧ ~(x_3i⊕x_3i-1))

t_i= ((x_3(i+1)-1 ⊕ x_3i+1) ∧ (x_3i ⊕ x_3i-1))

q_i = ((x_3(i+1)-1 ⊕ x_3i+1) ∧ ~(x_3(i) ⊕ x_3i+1) ∧ ~(x_3i ⊕ x_3i-1))

ni = x_3(i+1)-1

for i=0, 1,2,…, ⌈N/3-1⌉-1,

where N is the width of multiplier Y and x_-1=0. These five control lines select a multiple of the multiplicand Y among ± 0, ± Y, ± 2Y, ± 3Y, ± 4Y. For an M × N radix-8 multiplier, 5⌈(N+1)/3⌉$ control signals are generated in parallel. Except the all-zero case, these five control lines use one-hot encoding and we can re-design the code to reduce 5 lines to 4 lines. However, this can incur an additional delay in the booth selection and we use this standard coding as it is. The complexity of radix-$8$ booth recoding logic is higher than that of radix-4, but we will show that the entire booth recording logic can be removed from individual multipliers in a systolic array.

Factored Radix-8 MAC:

A radix-8 booth multiplier involves the dedicated pre-processing of complex booth recoding on the multiplier X and 3Y=Y+2Y generation on the multiplicand Y. When the radix-8 booth multiplier is employed in the PEs of the systolic array, the logic for the aforementioned dedicated pre-processing on X and Y are replicated across PEs. The main idea of this design is to use the radix-8 booth encoding in designing the MACs of the systolic array and at the same time factor out this repeated dedicated pre-processing logic from the PEs and place them at the inputs.

Figure-1 shows the architecture of the factored Radix-8 MAC. The X inputs accepted from the left edge are pre-processed to generate (s,d,t,q,n) for MAC and the Y inputs accepted from the top edge are pre-processed and the resultant 3Ys along with Ys are given to MAC.

Factored MAC

In this way, this MAC achieves the advantages of the Radix-$8$ booth multipliers and at the same time eliminates the associated drawbacks by factoring out the complex booth encoding logic and hard multiple 3Y computations. This brings a significantly decreasing impact on the delay, area, and power.

However, this gain in the performance of the MAC is achieved but it may cost in terms of input registers because, in this radix-8, we forwards 3Y along with multiplicand Y to MAC compared with Y alone in the MAC of radix-4.

Further, the computing systems for machine learning are usually highly optimized for throughput. Achieving a high throughput of the overall system necessitates the pipelining of the multipliers datapath. The partial products reduction tree makes the major part of a parallel multiplier. The pipelining of a multiplier’s datapath requires the placement of balanced cut-set registers in the Wallace tree levels which is executed by registering all the partial product bits at that level. Thus, the number of partial product bits at any Wallace tree-level decides the size of the pipeline cut-set. Figure 2 below shows a summary of the pipeline cut-set sizes at various Wallace tree levels in radix-4 and factored radix-8 booth
multipliers. Since shifted partial products of various widths are compressed at each Wallace tree level, the widths of the intermediate partial products vary. In Figure 4, e_i and e^′_i
denote positive integers and note that e_i and e^′_i ≪ M.

Note: This work is taken from our project entitled "Factored Radix-8 Systolic Array for Tensor Processing", and if you find FSA useful in your research, please consider citing:

Ullah, I., Inayat, K., Yang, J.S. and Chung, J., 2020, July. Factored radix-8 systolic array
 for tensor processing. In 2020 57th ACM/IEEE Design Automation Conference (DAC) (pp. 1-6). IEEE. [Link]

Owner

KASHIF INAYAT

Organization URL

https://kashifinayat.com/

Description

This is a factored MAC, in which we have designed the factored Radix-8 Booth Multiplier (16 bits) and accumulation is performed with 32 bits carry propagation adder (CPA). A radix-8 booth multiplier involves the dedicated pre-processing of complex booth recoding on multiplier X and 3y=Y+2Y generation on the multiplicand Y. In systolic arrays architectures, when we deploy a MAC in processing elements, the logic for the aforementioned dedicated pre-processing on X and Y are replicated across PEs. To save the area, delay, and power in systolic array-based design, we factored out booth encoding/recoding and 3Y logic from the multiplier.

Git URL

https://github.com/KashifInayat/F-MAC.git

Process

sky130A

Labels

MPW-7

Open MPW