Hi,
I would appreciate it if someone could explain the role of the Ethos-U55/U65 shared buffer. From reading the Vela implementation, I assume its role is to hold ONE set of IFM/IFM2/OFM blocks (plus the LUT, etc.); for example, it holds one IFM block and one OFM block in a Conv2D operation. But I'm not sure that is right. If it is, I cannot understand the role of NPU_SET_BLOCKDEP: if there is only one IFM block and one OFM block in the shared buffer, why does the NPU have to take care of a block dependency whose value is 2 or 3?
Best regards,
Hi Tiva,
Think of the shared buffer as the internal memory of the NPU. Higher MAC configurations of the Ethos-U (e.g. U55-128 or U55-256) have a bigger shared buffer than lower MAC configurations (U55-32 or U55-64), because the NPU has more data to process and hence needs more memory. The size of the shared buffer for a given MAC configuration is fixed by Arm. The purpose of the shared buffer is to store the data that the NPU is processing: IFMs, accumulators and look-up tables, as well as data transferred to/from memory. As per table 4-141 of the Ethos-U55 TRM, the shared buffer has 5 buffers (IFM, IFM2, Accumulator, Output, LUT), and each of these buffers, with the exception of the LUT, has two entries.
NPU_SET_BLOCKDEP describes the dependency between jobs. The rationale is that the output of one ML operator is the input of the following operator, and you can have a maximum of 3 outstanding operations in which the NPU writes the input data before that data is read by the next operation. That's why the maximum value of NPU_SET_BLOCKDEP is 3.
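To make the idea of a dependency between consecutive operations more concrete, here is a minimal Python sketch. It is a toy model only, not Vela's algorithm and not the hardware behaviour: the `Region` and `Op` types and the `toy_blockdep` function are hypothetical names invented for this illustration. It assumes that a larger BLOCKDEP value permits more overlap between two operations, and it conservatively returns 0 whenever the memory ranges of the two operations touch.

```python
from dataclasses import dataclass
from typing import Optional

MAX_BLOCKDEP = 3  # maximum value of NPU_SET_BLOCKDEP mentioned above


@dataclass(frozen=True)
class Region:
    """Half-open byte range [start, end) in memory (hypothetical helper)."""
    start: int
    end: int

    def overlaps(self, other: "Region") -> bool:
        return self.start < other.end and other.start < self.end


@dataclass(frozen=True)
class Op:
    """Hypothetical stand-in for one NPU operation."""
    name: str
    ifm: Region  # where the operation reads its input feature map
    ofm: Region  # where the operation writes its output feature map


def toy_blockdep(prev: Optional[Op], cur: Op) -> int:
    """Toy policy for choosing a BLOCKDEP value for `cur`.

    Assumption: 0 means "no overlap with the previous operation" and
    MAX_BLOCKDEP means "as much overlap as allowed".
    """
    if prev is None:
        return 0  # first operation: nothing to overlap with
    reads_prev_output = cur.ifm.overlaps(prev.ofm)
    clobbers_prev_data = cur.ofm.overlaps(prev.ifm) or cur.ofm.overlaps(prev.ofm)
    if reads_prev_output or clobbers_prev_data:
        # A real compiler may pick an intermediate value from the block
        # sizes; this sketch simply takes the most conservative choice.
        return 0
    return MAX_BLOCKDEP  # independent memory ranges: no hazard to guard


conv1 = Op("conv1", ifm=Region(0, 100), ofm=Region(100, 200))
conv2 = Op("conv2", ifm=Region(100, 200), ofm=Region(200, 300))  # reads conv1's OFM
conv3 = Op("conv3", ifm=Region(400, 500), ofm=Region(500, 600))  # unrelated memory

print(toy_blockdep(None, conv1), toy_blockdep(conv1, conv2), toy_blockdep(conv2, conv3))
```

The point of the sketch is that the value is about the relationship between two consecutive operations' data in memory, rather than about how many blocks fit in the shared buffer at once.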
Lastly, the management of the shared buffer is done solely by the Vela compiler. If you want to run a NN on an SoC with an Ethos-U, you have to compile your model with Vela, and the compiler will schedule the execution of your model on the hardware. You don't need to think about the shared buffer of the Ethos-U. If you have further questions, you can get in touch by emailing us at support-ml@arm.com
Thanks, George
Hi Tiva, In addition to George Gekov's reply, you can refer to section 4.9.1, "Shared buffer", of the U55 TRM for additional details.
Hi Gekov, Singh,
Thank you very much for your replies. I have read the TRM and the Vela source, but I'm not sure whether Vela emits optimal code that satisfies constraints such as BLOCKDEP. I've sent an email to Arm.
Juudzi
Hi Juudzi, Yes, the command stream generated by Vela is optimal, and it makes use of the NPU_SET_BLOCKDEP command. If you compile your model with the --verbose-register-command-stream CLI option, you can see the command stream generated by the compiler, including the NPU_SET_BLOCKDEP commands and their parameters.
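As a rough sketch of how you could check this yourself, the Python snippet below builds a Vela command line with that option and keeps only the lines of its output that mention NPU_SET_BLOCKDEP. The helper names (`vela_command`, `blockdep_lines`, `show_blockdep`) and the model path are hypothetical. It assumes that `vela` is on your PATH and that the verbose command stream is printed to standard output, and it makes no assumption about the exact layout of each printed line.

```python
import subprocess
from typing import Iterable, List


def vela_command(model_path: str, accelerator: str = "ethos-u55-128") -> List[str]:
    """Build the argument list for a Vela run that prints the command stream."""
    return [
        "vela",
        model_path,
        "--accelerator-config", accelerator,
        "--verbose-register-command-stream",
    ]


def blockdep_lines(output: Iterable[str]) -> List[str]:
    """Keep only the output lines that mention NPU_SET_BLOCKDEP."""
    return [line.rstrip("\n") for line in output if "NPU_SET_BLOCKDEP" in line]


def show_blockdep(model_path: str) -> None:
    """Run Vela and print the matching lines (assumes output goes to stdout)."""
    result = subprocess.run(
        vela_command(model_path), capture_output=True, text=True, check=True
    )
    for line in blockdep_lines(result.stdout.splitlines()):
        print(line)
```

Calling `show_blockdep("model.tflite")` with the path to your own model should then list the BLOCKDEP commands that the compiler emitted.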
Best regards, George