I recently trained two models and used the Vela compiler to estimate their performance.
Here are their model architectures:
(I changed the right model's middle layer from a standard convolution (Conv2D_16x3x3x16(NHWC)) to a depthwise separable convolution (DepthwiseConv2D_1x3x3x16(NHWC) + Conv2D_16x1x1x16(NHWC)).)
I compiled them with the same system config and memory mode.
(I selected the accelerator ethos-u55-128 and added the --optimise Performance option.)
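For reference, the invocation described above would look roughly like the sketch below. The model filenames and the particular system-config/memory-mode names are assumptions (the post does not state them); --accelerator-config, --optimise, --system-config and --memory-mode are standard Vela command-line options:

```shell
# Hypothetical filenames; the same system config and memory mode for both runs.
vela model_std_conv.tflite \
    --accelerator-config ethos-u55-128 \
    --optimise Performance \
    --system-config Ethos_U55_High_End_Embedded \
    --memory-mode Shared_Sram

vela model_dws_conv.tflite \
    --accelerator-config ethos-u55-128 \
    --optimise Performance \
    --system-config Ethos_U55_High_End_Embedded \
    --memory-mode Shared_Sram
```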
Finally, here are their Vela reports:
(Model with all Standard Conv)
(Model with Depthwise Separable Conv)
I compared them and have some questions:
1. In "Total SRAM Bandwidth", why is the value for the model with the depthwise conv bigger than for the model with all standard convs?
Does this mean the depthwise conv transfers feature maps between external SRAM and the NPU's internal SRAM more frequently?
2. In "Neural network macs", the depthwise conv model's count is 3.5 times smaller than that of the all-standard-conv model, but "NPU cycles" only drops from 2814460 to 2535830.
It seems that replacing a standard conv with a depthwise conv doesn't improve the model's inference time very much.
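To put numbers on this, here is a small sketch of the per-layer MAC arithmetic for the swapped middle layer. The feature-map height and width cancel out of the ratio, so MACs are counted per output pixel; the kernel and channel sizes are taken from the layer names in the post:

```python
# Per-output-pixel MAC comparison for the swapped middle layer.
K, CIN, COUT = 3, 16, 16          # 3x3 kernel, 16 in/out channels (from the post)

# Standard convolution: Conv2D_16x3x3x16
macs_std = K * K * CIN * COUT     # 2304 MACs per output pixel

# Depthwise separable: DepthwiseConv2D_1x3x3x16 + Conv2D_16x1x1x16
macs_dw = K * K * CIN             # 144 MACs (depthwise part)
macs_pw = CIN * COUT              # 256 MACs (pointwise part)
macs_dws = macs_dw + macs_pw      # 400 MACs per output pixel

ratio = macs_std / macs_dws       # ~5.76x fewer MACs for this one layer
```

So for this single layer the theoretical reduction is about 5.8x; the ~3.5x figure in the Vela report is for the whole network, where all the other layers are unchanged, which is presumably why the network-wide ratio is smaller.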
I read the Ethos-U55 NPU Technical Reference Manual (https://developer.arm.com/documentation/102420/0200/?lang=en) and found the table in section 4.8 "Operator and Performance".
According to this table, a depthwise conv only uses 16 MACs per cycle.
So I think the reason is that DepthwiseConv has lower MAC utilization than a standard Conv. Are there other reasons?
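The MAC-utilization hypothesis can be sanity-checked with a back-of-the-envelope cycle estimate. This sketch assumes ideal utilization: a standard conv running at the full 128 MACs/cycle of ethos-u55-128, and the depthwise conv limited to the 16 MACs/cycle from the TRM table (real behaviour also depends on memory bandwidth and scheduling, so treat this as a lower bound, not a prediction):

```python
# Idealized cycles per output pixel for the swapped layer, at peak rates.
MACS_PER_CYCLE_CONV = 128   # ethos-u55-128 peak for standard conv (assumption)
MACS_PER_CYCLE_DW = 16      # depthwise rate cited from TRM section 4.8

macs_std = 3 * 3 * 16 * 16                        # Conv2D_16x3x3x16
cycles_std = macs_std / MACS_PER_CYCLE_CONV       # 18 cycles

macs_dw = 3 * 3 * 16                              # DepthwiseConv2D_1x3x3x16
macs_pw = 16 * 16                                 # Conv2D_16x1x1x16
cycles_dws = (macs_dw / MACS_PER_CYCLE_DW         # 9 cycles (depthwise)
              + macs_pw / MACS_PER_CYCLE_CONV)    # + 2 cycles (pointwise)

speedup = cycles_std / cycles_dws                 # ~1.6x, despite ~5.8x fewer MACs
```

Even under these ideal assumptions the layer only gets about 1.6x faster in cycles despite having roughly 5.8x fewer MACs, which is consistent with the small "NPU cycles" reduction in the report.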
Whilst these numbers roughly match those of Vela's performance estimator, the best way to determine the real performance would be to run on an FPGA, or even the FVP. The FVP performance results show that the two networks are almost identical in performance: