I am executing the following simple 1-dimensional convolution model on the U55: input 1x223x1x1, kernel 1x64x1x1, output 1x160x1x1. Inputs are 16-bit, filters are 8-bit.
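For reference, the geometry above can be sanity-checked with a few lines of plain Python (numbers taken only from the shapes in the question; nothing here is Vela-specific):

```python
# Verify the 1-D convolution geometry: 223 inputs, 64-tap kernel, 'valid' padding.
ifm_len = 223     # input length (H axis of the NHWC 1x223x1x1 tensor)
kernel_len = 64   # kernel length
stride = 1

# 'valid' padding: output length = (ifm - kernel) // stride + 1
ofm_len = (ifm_len - kernel_len) // stride + 1
print(ofm_len)   # 160, matching the 1x160x1x1 output shape

# Total multiply-accumulates: one 64-tap dot product per output element
macs = ofm_len * kernel_len
print(macs)      # 10240, matching the MAC count in Vela's estimate
```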
It is shown as a Conv2D block in both Netron and Vela.
[DEBUG] Operators in the model: {'CONV_2D'}
[DEBUG] Input layer serving_default_conv2d_input:0 has shape [ 1 223 1 1]
[DEBUG] Output layer StatefulPartitionedCall:0 has shape [ 1 160 1 1]
[DEBUG] The model has 1 outputs/classes
...
[DEBUG] main_split_1
[DEBUG] 0 <NpuStripe: name=StatefulPartitionedCall:0, ifm_box=<Box [0, 0, 0, 0] - [1, 223, 1, 1]>, ifm2_box=<Box [] - []>, ofm_box=<Box [0, 0, 0, 0] - [1, 160, 1, 1]>, weight_box=<Box [0, 0, 0, 0] - [1, 1, 1, 1]>, block_config=[32, 2, 4, 8]>
[DEBUG] Register-Level Command Stream: Input
[DEBUG] 0 Conv2D name=StatefulPartitionedCall:0: <NpuStripe: name=StatefulPartitionedCall:0, ifm_box=<Box [0, 0, 0, 0] - [1, 223, 1, 1]>, ifm2_box=<Box [] - []>, ofm_box=<Box [0, 0, 0, 0] - [1, 160, 1, 1]>, weight_box=<Box [0, 0, 0, 0] - [1, 1, 1, 1]>, block_config=[32, 2, 4, 8]>
[DEBUG] IFM: h=223,w=1,c=1, region=1, NHWC, INT16, size=446, scale: 3.0515939215547405e-05, zero: 0
[DEBUG] Stride y/x/c: 2/2/2, tiles: w0=1, h0=223, h1=223, base=['0x0', '0x0', '0x0', '0x0']
[DEBUG] name=serving_default_conv2d_input:0_npu
[DEBUG] OFM: h=160,w=1,c=1, region=1, NHWC, INT16, size=320, scale: 4.6055512939346954e-05, zero: 0
[DEBUG] Stride y/x/c: 2/2/2, tiles: w0=1, h0=160, h1=160, base=['0x1c0', '0x0', '0x0', '0x0']
[DEBUG] name=StatefulPartitionedCall:0_cpu
[DEBUG] Kernel: w=1, h=64, stride=(1, 1), dilation=(1, 1)
[DEBUG] NpuPadding(top=0, left=0, bottom=0, right=0)
[DEBUG] Weights: (region=0, address=0x20, length=192)
[DEBUG] Scales: (region=0, address=0x10, length=16)
[DEBUG] NpuBlockTraversal.PART_KERNEL_FIRST
[DEBUG] Block config: h=32,w=2,c=8, NpuResamplingMode.NONE, NpuRoundingMode.NATURAL
The problem is that it runs on the U55-128 at roughly elementwise performance, ~2 cycles per MAC (10240 MACs, 21539 NPU cycles according to the Vela output), while I would expect something closer to 64 MACs per cycle.
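The gap can be quantified from the numbers quoted above. A small sketch (the 64-MACs/cycle peak is my assumption for a U55-128 with 16-bit activations, where the 128 8-bit MACs per cycle are effectively halved):

```python
macs = 10240            # from Vela's performance output
npu_cycles = 21539      # from Vela's performance output

cycles_per_mac = npu_cycles / macs          # ~2.1 cycles per MAC
# Assumed peak for 16-bit IFMs on U55-128: 64 MACs/cycle (half of the 8-bit rate)
peak_macs_per_cycle = 64
utilization = macs / (npu_cycles * peak_macs_per_cycle)

print(cycles_per_mac)   # ~2.1, i.e. well under 1% of the assumed peak
print(utilization)
```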
If the order of the dimensions is changed to input 1x1x223x1, kernel 1x1x64x1, output 1x1x160x1, Vela reports 13375 NPU cycles/batch. This is better, but still not what I expected; I can run the calculation faster on the main core. My understanding is that this all comes down to the data layout and the use of the U55's internal buffers (the Vela block config for the first case is h=32,w=2,c=8; for the second it is h=1,w=64,c=8).
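Putting the two layouts side by side (all numbers are the Vela estimates quoted above; the labels are mine):

```python
# Compare achieved MACs/cycle for the two tensor layouts reported by Vela.
macs = 160 * 64    # 10240 MACs regardless of layout

layouts = {
    "H-major (1x223x1x1, block h=32,w=2,c=8)": 21539,  # NPU cycles
    "W-major (1x1x223x1, block h=1,w=64,c=8)": 13375,  # NPU cycles
}
for name, cycles in layouts.items():
    print(name, round(macs / cycles, 2), "MACs/cycle")
```

Either way, throughput stays below 1 MAC/cycle; the W-major layout only improves things by roughly 1.6x.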
So the question is: what should I do to get better performance for this case? Change the order of the dimensions? Use special Vela compiler configuration options? Use larger input vectors or filters? It is strange that the Vela compiler does not optimize this automatically.