Hi,
I read the Ethos-U55 specification and I can't understand what "MACs (8x8): 32, 64, 128, 256" means.
Can anyone help me with a more detailed explanation?
Thanks.
Multiply-Accumulate Operation (MAC): computations in a neural network involve a multiplication followed by an addition, and are therefore referred to as Multiply-Accumulate operations (MACs). The 32/64/128/256 configurations of the U55 mean that the respective variant supports 32, 64, 128 or 256 8x8 MACs per clock cycle (cc). Note that the higher the MAC configuration, the greater the number of DPUs (dot product units) and adders, and the larger the area, power consumption, etc.
I'm not sure this is right, but my understanding is that one "8x8 MACs/cc" means there are 8 multiply-accumulate circuits operating in parallel in one clock cycle, and that each multiply-accumulate circuit takes two 8-bit numbers as inputs.
I also looked at the Cortex-M55 specification, which lists multiply-accumulate (MAC) throughput of up to 2 x 32-bit MACs/cycle, 4 x 16-bit MACs/cycle and 8 x 8-bit MACs/cycle. The user can configure how the MAC operates.
Could you please explain "8x8 MACs/cc" in more detail?
Let me explain in more detail using the 256 MAC config.
The 256 MAC config means the U55 can do 256 8x8 multiply-accumulates per cc, where "8x8" refers to an 8-bit by 8-bit multiplication. A MAC counts as two operations (mul + add).
Suppose you have a network with an 8-bit IFM (input feature map), for example:
INPUT : 10 * 10 * 3
Convolution : 128*3*3*3 (3x3x3 (HxWxD) kernel and 128 filters) (Valid pad, stride 1)
Output shape: 8*8*128
MACs = 8*8*128*3*3*3 = 221184 MACs
Now, 256 8x8 MACs can happen in one clock cycle. So you would need 221184/256 = 864 cc to compute these MACs.
Now say your IFM is 16 bit. This means that we can do 128 16x8 multiplications per cc, so we now need 221184/128 = 1728 cc for these MAC operations.
You can compute things for other configs similarly.
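The arithmetic above can be reproduced with a few lines of Python. This is only a rough sketch of the ideal case: the function names are made up for illustration, it assumes stride 1 and no padding (which is what gives the 8x8 output), and it assumes every MAC unit is busy on every cycle.

```python
def conv_output_size(in_size, kernel, stride=1, padding=0):
    """Output size along one spatial dimension of a convolution."""
    return (in_size + 2 * padding - kernel) // stride + 1


def conv_macs(out_h, out_w, out_c, k_h, k_w, in_c):
    """Each output element needs k_h * k_w * in_c multiply-accumulates."""
    return out_h * out_w * out_c * k_h * k_w * in_c


def ideal_cycles(macs, macs_per_cycle):
    """Lower bound on clock cycles, assuming full MAC utilisation."""
    return -(-macs // macs_per_cycle)  # ceiling division


# Example from the thread: 10x10x3 input, 128 filters of 3x3x3.
out_h = conv_output_size(10, 3)
out_w = conv_output_size(10, 3)
macs = conv_macs(out_h, out_w, 128, 3, 3, 3)

# 8-bit IFM: one cycle count per U55 MAC configuration.
cycles_per_config = {cfg: ideal_cycles(macs, cfg) for cfg in (32, 64, 128, 256)}

# 16-bit IFM on the 256 config: half the MACs per cycle (128 16x8 MACs/cc).
cycles_16bit_256 = ideal_cycles(macs, 256 // 2)

print(macs, cycles_per_config, cycles_16bit_256)
```

Real cycle counts will be higher than this, since the estimate ignores memory traffic and any under-filled DPUs.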
Note that there are other parameters, such as the number of DPUs per config, and to get this maximum performance all the DPUs need to be kept filled. Generally the width, height and depth should be a multiple of the microblock size, which is 2x2x8 for the 256 MAC config. You can refer to the Arm Ethos-U55 TRM for details.
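As a rough illustration of the microblock point, the sketch below checks whether a shape is a multiple of 2x2x8 and what it would round up to otherwise. The helper names are hypothetical, and applying the check to the output feature map's width, height and depth is my assumption; the TRM is the authority on exactly how blocks are scheduled.

```python
UBLOCK_256 = (2, 2, 8)  # microblock (width, height, depth) quoted above for the 256 MAC config


def round_up(value, multiple):
    """Smallest multiple of `multiple` that is >= value."""
    return -(-value // multiple) * multiple


def is_aligned(shape, ublock=UBLOCK_256):
    """True if every dimension is a whole number of microblocks."""
    return all(dim % m == 0 for dim, m in zip(shape, ublock))


def padded_shape(shape, ublock=UBLOCK_256):
    """Shape rounded up to whole microblocks in each dimension."""
    return tuple(round_up(dim, m) for dim, m in zip(shape, ublock))


aligned_example = is_aligned((8, 8, 128))       # output shape from the example above
unaligned_example = is_aligned((7, 7, 100))     # an arbitrary shape that does not fit
rounded_example = padded_shape((7, 7, 100))

print(aligned_example, unaligned_example, rounded_example)
```

The 8x8x128 output from the example is already a multiple of 2x2x8, which is one reason the ideal cycle count is a reasonable first estimate there.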