I have two questions about using the Vela compiler to compile my TensorFlow Lite model (Ethos-U55 with 128 MACs).

I used the Vela compiler to compile two models: one whose kernels have 16 channels and one with 17 channels.

Vela reports the allocation peak tensor size (bytes), total SRAM used (KiB), NPU cycles (cycles/batch), and batch inference time (ms),

and I observe the following:

"Tensor" here refers to the feature-map size. Our input is 128x128 = 16,384, and the model's channel count is set to 16, so the second layer's tensor size is 128x128x16 = 262,144. Because two feature maps are live (overlap) at once, the SRAM allocation is twice that: Allocation Peak Tensor Size = 262,144 x 2 = 524,288 bytes, Total SRAM used = 512 KiB. The final NPU cycle count is 2,795,457 cycles, and Batch Inference time = 5.59 ms at clock = 500 MHz.

If all other conditions stay the same and the model is changed to 17 channels, then after running we can see from Fig. 4.7 that the second layer's tensor size, which we expected to be 128x128x17 = 278,528, is actually 524,288. So whether or not the model's channel count is a multiple of 16 has a serious impact on memory-resource allocation: Allocation Peak Tensor Size = 1,064,960 bytes (roughly 524,288 x 2), Total SRAM used = 1040 KiB, the final NPU cycle count is 5,228,677 cycles, and Batch Inference time = 10.47 ms at clock = 500 MHz.
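To make the pattern I am seeing explicit, here is a small sketch of my working hypothesis: the compiler seems to round the channel dimension up to a multiple of 16 when sizing each feature map, and the peak is roughly two live feature maps. This is just my guess from the numbers above, not confirmed Vela behaviour, and the `padded_fm_bytes` helper is my own illustration (it assumes 8-bit elements).

```python
def padded_fm_bytes(h, w, c, align=16, bytes_per_elem=1):
    # Hypothesis: channel count is rounded up to a multiple of `align`
    # before the feature map's SRAM footprint is computed.
    padded_c = ((c + align - 1) // align) * align
    return h * w * padded_c * bytes_per_elem

for c in (16, 17):
    per_map = padded_fm_bytes(128, 128, c)
    # Assume two feature maps are live at the peak (input + output overlap).
    print(f"channels={c}: per map {per_map} B, ~peak {per_map * 2} B")
```

This reproduces the 524,288-byte figure for the 17-channel layer (17 rounds up to 32, so 128x128x32 = 524,288); the reported peak of 1,064,960 bytes is slightly more than 524,288 x 2, presumably due to additional buffers in the allocation.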

1. Does the Vela compiler allocate tensor SRAM in 16-channel units, based on the model's maximum channel count?

2. Why does the total SRAM used affect the NPU cycles so strongly? As the reports above show, the 17-channel model uses twice as much SRAM as the 16-channel model (same Vela compiler parameters, Shared SRAM memory mode), and the NPU cycle count scales almost in proportion to the SRAM used. Is this DMA-related?

Thank you!