Slow U55 performance for 1-D convolution

I am executing the following simple one-dimensional convolution model on the U55: input 1x223x1x1, kernel 1x64x1x1, output 1x160x1x1. Inputs are 16-bit, filters 8-bit.
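
For reference, a minimal sketch of how such a model might be built and quantized (assuming Keras and the TFLite converter's experimental 16x8 mode; the exact layer setup and representative dataset here are my own illustration, not the original code):

    import numpy as np
    import tensorflow as tf

    # 1-D convolution expressed as Conv2D: input 223x1x1, kernel height 64,
    # valid padding -> output height 223 - 64 + 1 = 160.
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(223, 1, 1)),
        tf.keras.layers.Conv2D(filters=1, kernel_size=(64, 1), padding="valid"),
    ])

    def representative_data():
        for _ in range(100):
            yield [np.random.rand(1, 223, 1, 1).astype(np.float32)]

    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    converter.representative_dataset = representative_data
    # int16 activations with int8 weights ("inputs 16-bit, filters 8-bit").
    converter.target_spec.supported_ops = [
        tf.lite.OpsSet.EXPERIMENTAL_TFLITE_BUILTINS_ACTIVATIONS_INT16_WEIGHTS_INT8
    ]
    with open("conv1d_16x8.tflite", "wb") as f:
        f.write(converter.convert())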

It is shown as a Conv2D block in Netron and Vela.

[DEBUG] Operators in the model:
 {'CONV_2D'}
[DEBUG] Input layer serving_default_conv2d_input:0 has shape [  1 223   1   1]
[DEBUG] Output layer StatefulPartitionedCall:0 has shape [  1 160   1   1]
[DEBUG] The model has 1 outputs/classes

...

[DEBUG] main_split_1
[DEBUG]   0 <NpuStripe: name=StatefulPartitionedCall:0, ifm_box=<Box [0, 0, 0, 0] - [1, 223, 1, 1]>, ifm2_box=<Box [] - []>, ofm_box=<Box [0, 0, 0, 0] - [1, 160, 1, 1]>, weight_box=<Box [0, 0, 0, 0] - [1, 1, 1, 1]>, block_config=[32, 2, 4, 8]>
[DEBUG] Register-Level Command Stream: Input
[DEBUG] 0 Conv2D name=StatefulPartitionedCall:0: <NpuStripe: name=StatefulPartitionedCall:0, ifm_box=<Box [0, 0, 0, 0] - [1, 223, 1, 1]>, ifm2_box=<Box [] - []>, ofm_box=<Box [0, 0, 0, 0] - [1, 160, 1, 1]>, weight_box=<Box [0, 0, 0, 0] - [1, 1, 1, 1]>, block_config=[32, 2, 4, 8]>
[DEBUG]       IFM: h=223,w=1,c=1, region=1, NHWC, INT16, size=446, scale: 3.0515939215547405e-05, zero: 0
[DEBUG]          Stride y/x/c: 2/2/2, tiles: w0=1, h0=223, h1=223, base=['0x0', '0x0', '0x0', '0x0']
[DEBUG]          name=serving_default_conv2d_input:0_npu
[DEBUG]       OFM: h=160,w=1,c=1, region=1, NHWC, INT16, size=320, scale: 4.6055512939346954e-05, zero: 0
[DEBUG]          Stride y/x/c: 2/2/2, tiles: w0=1, h0=160, h1=160, base=['0x1c0', '0x0', '0x0', '0x0']
[DEBUG]          name=StatefulPartitionedCall:0_cpu
[DEBUG]       Kernel: w=1, h=64, stride=(1, 1), dilation=(1, 1)
[DEBUG]       NpuPadding(top=0, left=0, bottom=0, right=0)
[DEBUG]       Weights: (region=0, address=0x20, length=192)
[DEBUG]       Scales: (region=0, address=0x10, length=16)
[DEBUG]       NpuBlockTraversal.PART_KERNEL_FIRST
[DEBUG]       Block config: h=32,w=2,c=8, NpuResamplingMode.NONE, NpuRoundingMode.NATURAL

The problem is that it is processed on the U55-128 at an effective ~2 cycles per MAC (10240 MACs, 21539 NPU cycles according to the Vela output), while I would expect something closer to 64 MACs per cycle.

If the order of the dimensions is changed to input 1x1x223x1, kernel 1x1x64x1, output 1x1x160x1, Vela reports 13375 NPU cycles/batch. This is better, but still not as expected; I can run the calculation faster on the main core.
My understanding is that this comes down to the data ordering and the U55's internal buffer usage (Vela block config for the first case: h=32,w=2,c=8; for the second: h=1,w=64,c=8).
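
For concreteness, the gap works out as follows (the 64 MACs/cycle figure is the U55-128's peak rate for 16-bit activations):

    macs = 160 * 64      # 10240 MAC operations
    print(21539 / macs)  # data in H: ~2.10 cycles/MAC
    print(13375 / macs)  # data in W: ~1.31 cycles/MAC
    print(macs / 64)     # ideal at 64 MACs/cycle: 160 cycles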

So the question is: what should I do to get better compute performance in my case? Change the order of the dimensions? Use special Vela compiler configuration options? Use larger input vectors or filters?
It seems strange that the Vela compiler does not optimize this automatically.

  • Hi Oleg,

    I contacted the Vela team, and they have managed to recreate and analyse the behaviour described. Their analysis is as follows:

    Firstly, the Ethos-U is designed for block-based processing of deep convolutions. Therefore, it is important to note that this example network is not ideal for obtaining optimal hardware utilisation. To better utilise the MAC hardware both the IFM and OFM depths would need to be increased.

    However, given the example convolution (Input 1x223x1x1, kernel 1x64x1x1, output 1x160x1x1), the theoretical performance can be estimated using Table 4-131 (Convolution performance for 16-bit activations) in the Arm® Ethos™-U55 NPU Technical Reference Manual (TRM). The calculation is:

    MAC Count = OFM Height * Kernel Height = 160 * 64 = 10240 MAC operations

    CONV (kernel first) operations per cycle = Maximum MAC Count per cycle * IFM depth efficiency * OFM depth efficiency * Kernel efficiency = 64 * 1/8 * 1/8 * 1/2 = 0.5 MACs per cycle

    => Estimated cycle count = MAC Count / CONV (kernel first) operations per cycle = 10240 / 0.5 = 20480

    For the second example (which moves the data into the Width dimension rather than the Height), the performance improves because the kernel becomes 1x1x64x1 and, for maximum efficiency, the kernel width should be divisible by 2 (see Table 4-131). Hence, the operations per cycle for the second example = 64 * 1/8 * 1/8 = 1 MAC per cycle.
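
    Scripted, the same estimate covers both layouts (a small sketch of the arithmetic above; the efficiency factors are the ones quoted from Table 4-131 for this specific case):

        PEAK_MACS_PER_CYCLE = 64  # U55-128 peak rate for 16-bit activations

        def estimate_cycles(macs, ifm_depth_eff, ofm_depth_eff, kernel_eff=1.0):
            """TRM-style estimate: peak MAC rate derated per dimension."""
            macs_per_cycle = (PEAK_MACS_PER_CYCLE * ifm_depth_eff
                              * ofm_depth_eff * kernel_eff)
            return macs / macs_per_cycle

        macs = 160 * 64  # OFM height * kernel height = 10240

        # Case 1: data in H, kernel 64x1 -> kernel efficiency 1/2 applies.
        print(estimate_cycles(macs, 1/8, 1/8, 1/2))  # 20480.0 (measured: 21539)

        # Case 2: data in W, kernel 1x64 -> full kernel efficiency.
        print(estimate_cycles(macs, 1/8, 1/8))       # 10240.0 (measured: 13375)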

    In summary, to get the best hardware utilisation all operators should consider the constraints listed in the performance tables in sections 4.8.3 and 4.8.4 of the TRM.

    So yes, you'll need some sort of batching/blocking to take full advantage of the NPU's capabilities.
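
    As one illustration of such blocking (my own sketch, not a recipe from the Vela team): several independent signals can be stacked along the width axis, where a kernel of width 1 slides over each column separately, so a single CONV_2D does B signals' worth of work per pass:

        import tensorflow as tf

        B = 8  # number of independent 1-D signals packed together (illustrative)

        # Each width column is an independent signal; the (64, 1) kernel moves
        # over height only, so the B columns never mix.
        batched = tf.keras.Sequential([
            tf.keras.layers.Input(shape=(223, B, 1)),
            tf.keras.layers.Conv2D(filters=1, kernel_size=(64, 1), padding="valid"),
        ])
        batched.summary()  # output shape: (None, 160, B, 1)

    Whether this actually helps should be confirmed against the block config and cycle counts that Vela reports for the rewritten model.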

    Cheers,

    Ben
