How to make Ethos-U NPU work on an ARM Cortex-A + Cortex-M processor?

I have a question about how to make the Ethos-U NPU work on an ARM Cortex-A + Cortex-M processor. First, I found ethos-u-linux-driver-stack and ethos-u-core-software on https://git.mlplatform.org/.

1. I know ethos-u-linux-driver-stack is the Ethos-U kernel driver. Should it be integrated into the Linux OS running on Cortex-A or into the Linux OS running on Cortex-M? I am not clear about which core it needs to run on.

2. For ethos-u-core-software, how do I run it? I didn't find detailed steps for running it. Does it run on the NPU or on one of the cores?

3. Apart from the above two repos, is any other repo necessary to make the Ethos-U NPU work on an ARM Cortex-A + Cortex-M processor?

Thanks in advance for your suggestions.

0 Kristofer Jonsson over 2 years ago in reply to alisonw

I thought I had responded to this question, but I can't find my answer in the thread.

The SHRAM is built into the NPU and its size can't be changed by software. The memory is typically used for storage of weights, biases and temporary data. It is a small memory that will not fit a network like SSD MobileNet. Instead, small portions of the weights and biases are copied to the SHRAM as they are needed.

There are 8 "logical memory channels". Which logical channel is used for which data (IFM, OFM, weights, biases, etc.) is encoded in the command stream by Vela. The driver maps the logical channels to the physical DMA interfaces (M0 and M1) in the region config registers.

Consequently, which DMA interface is used depends on both Vela and the driver. This is subject to change, but as of today, with the default settings in Vela and the driver, the command stream, weights and biases go over M1, and all other data goes over M0. This allows the TFLu model (command stream, weights and biases) to be moved from fast memory (SRAM) to slower memory (flash or DRAM) without congesting the M0 interface.

0 alisonw over 2 years ago in reply to Kristofer Jonsson

Kristofer, please help confirm my comments below.

1. The fast memory (SRAM) you mentioned in "This allows the TFLu model (command stream, weights and biases) to be moved from fast memory (SRAM) to slower memory (flash or DRAM) without congesting the M0 interface." is the SHRAM built into the NPU, right?

2. Now, the current process is SHRAM <-> DMA & M1 or M0 <-> slower memory (flash or DRAM), right?

Maybe I didn't understand your last sentence clearly.

0 Kristofer Jonsson over 2 years ago in reply to alisonw
SHRAM - Built into the NPU. DMA is used to move data between external memory and the internal SHRAM.

SRAM - Fast memory outside of the NPU.

Flash/DRAM - Slow memory outside of the NPU.

TFLu uses two buffers, the model and the arena. For optimal performance both should be placed in SRAM (or a similar memory technology). However, SRAM is expensive, so the model can be moved to DRAM or flash at the cost of some performance.

0 alisonw over 2 years ago in reply to Kristofer Jonsson

Kristofer, I want to set particular SRAM and DRAM addresses for the U65 on our processor. How should I do this?

0 Kristofer Jonsson over 2 years ago in reply to alisonw
I am not sure what you mean by setting the SRAM and DRAM address for U65. Could you please elaborate a bit more on what problem you have?

If, for example, you are wondering how to place the model and arena buffers in memory, then perhaps this information will help.

Running an inference on the TFLu framework requires three memory regions.

1. ITCM. This is where code is placed.

2. Model. This buffer contains constant data like weights, biases and the Ethos-U command stream.

3. Arena. This is a heap, used for read-write data like IFM, OFM, activations etc.

For the tests we have upstreamed to MLPlatform we have defined two additional buffers.

1. IFM data.

2. Expected OFM data. Used to verify that the inference completed successfully.

Please have a look at the baremetal example application. Each buffer is named with a section attribute.

Model: https://review.mlplatform.org/plugins/gitiles/ml/ethos-u/ethos-u-core-platform.git/+/refs/tags/21.05-rc1/applications/baremetal/models/keyword_spotting_cnn_small_int8/model.h#30

Arena: https://review.mlplatform.org/plugins/gitiles/ml/ethos-u/ethos-u-core-platform.git/+/refs/tags/21.05-rc1/applications/baremetal/main.cpp#46

IFM: https://review.mlplatform.org/plugins/gitiles/ml/ethos-u/ethos-u-core-platform.git/+/refs/tags/21.05-rc1/applications/baremetal/models/keyword_spotting_cnn_small_int8/input.h#25

OFM: https://review.mlplatform.org/plugins/gitiles/ml/ethos-u/ethos-u-core-platform.git/+/refs/tags/21.05-rc1/applications/baremetal/models/keyword_spotting_cnn_small_int8/output.h#25

The sections named by these attributes are placed in memory by the scatter file (ArmClang) or linker script (GCC). To change where the buffers are placed in memory, edit the scatter file or linker script.

ArmClang: https://review.mlplatform.org/plugins/gitiles/ml/ethos-u/ethos-u-core-platform.git/+/refs/tags/21.05-rc1/targets/corstone-300/platform.scatter

GCC: https://review.mlplatform.org/plugins/gitiles/ml/ethos-u/ethos-u-core-platform.git/+/refs/tags/21.05-rc1/targets/corstone-300/platform.ld

+1 alisonw over 2 years ago in reply to Kristofer Jonsson

Hi Kristofer, thanks for sharing. I have looked at the Corstone-300 code for reference. It seems the IFM, the model (weights, biases and the Ethos-U command stream) and the OFM are located in SRAM. If ETHOSU_FAST_MEMORY_SIZE is defined, the tensor arena is located in SRAM; otherwise it is located in DDR. I think ETHOSU_FAST_MEMORY_SIZE indicates whether there is a fast memory, such as SRAM. Please correct me if anything is wrong.

Actually, in my current code, the IFM, model, OFM and tensor arena are all located in DDR. I think only the AXI M1 interface is used, because in our system M1 is connected to DDR and M0 is connected to fast memory (on-chip SRAM). So I want to know whether I need to change Vela or the driver to make this case work. For example, if the model is located in DDR and AXI M1 is used, do I need to configure anything in Vela or the driver?

Moreover, as there is fast memory in our system, should I place the tensor arena in fast memory (on-chip SRAM)? I think the fast memory is not large enough for a model such as MobileNet v1.

0 alisonw over 2 years ago in reply to alisonw

Hi Kristofer, besides the above questions, I am also confused about two hardware registers.

The first register is REGIONCFG. What does a region mean? What data should be located in regions 0-7? Does each region need to be configured to connect to AXI0 or AXI1?

The second register is BASEPx. According to the code, the addresses of the weights, input tensor and other buffers are written to BASEPx. Is there a rule for which buffer address goes into which BASEPx? For example, must the weight buffer address be written to BASEP0?

Please give me some guidance.

0 Kristofer Jonsson over 2 years ago in reply to alisonw
Fast memory

As mentioned in previous comments, there are two TFLu buffers, model and arena, that need to be placed in memory. For a system with SRAM and DRAM there are three combinations that make sense.

1. Model and arena in SRAM. Best performance.

2. Model in DRAM, arena in SRAM. Weight-bound networks will be affected most.

3. Model and arena in DRAM.

Vela allocates a buffer inside of the arena. This buffer contains temporary data that the NPU will access frequently, and should for optimal performance be placed in SRAM.

However, for alternative 3 the arena will be placed in DRAM. For this option Vela can be configured to split the temporary data into an "arena buffer" and a "fast memory buffer". The Ethos-U will redirect the "fast memory buffer" to a memory area in SRAM.

The fast memory feature is a bit complicated and requires synchronized changes in several places:

Vela needs to be configured to enable spilling (fast memory buffer).

The application needs to be compiled with a fast memory buffer. The size of this buffer must be equal to or larger than the size Vela has been configured with.
https://git.mlplatform.org/ml/ethos-u/ethos-u-core-platform.git/tree/targets/corstone-300/target.cpp?h=21.05-rc2#n56

The fast memory buffer must be registered with the Ethos-U driver.
https://git.mlplatform.org/ml/ethos-u/ethos-u-core-platform.git/tree/targets/corstone-300/target.cpp?h=21.05-rc2#n161

Base pointers

Vela takes a tflite file as input and produces another, optimized tflite file as output. During the optimization phase Vela controls which input tensor each kind of data is placed in, like this:

1. Command stream.

2. Weights and biases.

3. 'Arena' data. IFM, OFM, activations.

4. 'Fast' data.

The Ethos-U NPU driver writes the address of the command stream to the QBASE register. The addresses of input tensors 2-4 are written to the BASEP<nr> registers. If spilling has been enabled, then the driver will override the 'fast' tensor address before the BASEP<nr> register is written.
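
The register writes described above can be sketched roughly as follows. This is only an illustration, not the driver's code: the function name and the dictionary result are hypothetical, and the mapping of input tensors 2-4 onto BASEP0-BASEP2 in that order is an assumption (it is consistent with weights and biases using base pointer 0, but please check it against the driver source).

```python
def npu_register_writes(cmd_stream, weights, arena, fast, fast_memory=None):
    """Return the address each register would receive (hypothetical helper).

    cmd_stream, weights, arena, fast: addresses of input tensors 1-4.
    fast_memory: address of a fast memory buffer registered with the driver,
                 or None when spilling is not enabled.
    """
    if fast_memory is not None:
        # Spilling enabled: the 'fast' tensor address is overridden
        # before the base pointer register is written.
        fast = fast_memory
    return {
        "QBASE": cmd_stream,  # input tensor 1: command stream
        "BASEP0": weights,    # input tensor 2 (assumed index 0)
        "BASEP1": arena,      # input tensor 3 (assumed index 1)
        "BASEP2": fast,       # input tensor 4 (assumed index 2)
    }


# Made-up addresses, purely for illustration.
writes = npu_register_writes(0x1000, 0x2000, 0x8000, 0x9000, fast_memory=0x31000000)
print({name: hex(addr) for name, addr in writes.items()})
```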

Region configuration

The Ethos-U NPU has two AXI interfaces, M0 and M1. The REGIONCFG register controls which AXI interface each base pointer is routed over.

For example, with the current Vela implementation weights and biases are accessed over base pointer 0. In the region config you can control whether base pointer 0 should use M0 or M1.

The default region config is defined in the file linked below. Please note that AXI_LIMITS 0 and 1 are routed to M0, and AXI_LIMITS 2 and 3 to M1.

https://git.mlplatform.org/ml/ethos-u/ethos-u-core-driver.git/tree/src/ethosu_config.h?h=21.05-rc2#n28
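
As a rough sketch of how such a region config could be packed into one register value: the layout below (two bits per region, region 0 in the lowest bits, each field selecting AXI_LIMITS 0-3) is an assumption that should be checked against the TRM and the driver header linked above. The function names and the example values are hypothetical and are not the driver defaults.

```python
def pack_regioncfg(limits_per_region):
    """Pack eight AXI_LIMITS selections (0-3) into a single register value.

    Assumed layout: region n occupies bits [2n+1:2n].
    """
    if len(limits_per_region) != 8:
        raise ValueError("expected one entry for each of the 8 regions")
    value = 0
    for region, limits in enumerate(limits_per_region):
        if not 0 <= limits <= 3:
            raise ValueError("AXI_LIMITS selection must be in the range 0-3")
        value |= limits << (2 * region)
    return value


def axi_interface(limits):
    """AXI_LIMITS 0 and 1 are routed to M0, AXI_LIMITS 2 and 3 to M1."""
    return "M0" if limits in (0, 1) else "M1"


# Hypothetical example: region 0 over M1, all other regions over M0.
example = [3, 0, 1, 1, 1, 1, 1, 1]
print(hex(pack_regioncfg(example)))
print([axi_interface(limits) for limits in example])
```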

0 alisonw over 2 years ago in reply to Kristofer Jonsson

Hi Kristofer, I have a new question about placing the model and arena in memory. The MobileNet model needs more memory for the arena, and our SRAM is not large enough for it. In that case, is the only option to put both model and arena in DRAM? Is there any other way to improve the performance?

0 Kristofer Jonsson over 2 years ago in reply to alisonw

Hi

We have written some basic instructions here about the different memory configurations.

https://git.mlplatform.org/ml/ethos-u/ethos-u-core-platform.git/about/#model-and-arena-in-sram

For optimal performance both model and arena should be placed in SRAM. If they don't both fit, our recommendation is to move the model to DRAM and leave the arena in SRAM.

If that still doesn't fit, then there are two options depending on your NPU.

For Ethos-U55 the only option is to pay a performance penalty and place both model and arena in DRAM.

For Ethos-U65 there is the option to enable spilling. Spilling means that both the model and the arena are placed in DRAM, and you reserve a smaller memory area in SRAM. Vela will use the spilling buffer as a cache, and will generate extra instructions to copy frequently accessed data between the arena and the spilling buffer. There will still be a performance impact, but it will be lower compared to not using spilling.
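
The decision steps above can be summarised in a small sketch. It assumes that "fits" simply means the buffer sizes are no larger than the SRAM available for them, which ignores anything else sharing that SRAM. The function name, the "U55"/"U65" strings and the result format are hypothetical.

```python
def choose_placement(model_bytes, arena_bytes, sram_bytes, npu):
    """Pick a memory configuration following the recommendations above."""
    if npu not in ("U55", "U65"):
        raise ValueError("npu must be 'U55' or 'U65'")
    if model_bytes + arena_bytes <= sram_bytes:
        # Best performance: both buffers in SRAM.
        return {"model": "SRAM", "arena": "SRAM", "spilling": False}
    if arena_bytes <= sram_bytes:
        # Move the model out and keep the arena in SRAM.
        return {"model": "DRAM", "arena": "SRAM", "spilling": False}
    # The arena does not fit either: both buffers go to DRAM.
    # Only the Ethos-U65 can soften the penalty with a spilling buffer in SRAM.
    return {"model": "DRAM", "arena": "DRAM", "spilling": npu == "U65"}


# Made-up sizes: 4 MiB model, 2 MiB arena, 1 MiB SRAM.
print(choose_placement(4 << 20, 2 << 20, 1 << 20, "U65"))
```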

0 alisonw over 2 years ago in reply to Kristofer Jonsson

Hi, Kristofer, thank you very much for your clarification. It's very useful for me.

I have some other questions. While verifying the Ethos-U65, I ran into the following two issues.

1. For mobilenet-ssd models, the inference doesn't complete and no interrupt is generated. Only part of the command stream is executed. I tried reading the QREAD register during the inference; it stopped increasing at some offset.

2. In some test cases, the interrupt is generated during the inference, but the command stream has not been fully executed. The STATUS register then reads 0x2. Normally, the interrupt should be generated once the whole command stream has been executed, and the STATUS register should read 0xFFFF0022.

I don't have a solution for these two issues. One idea is to find out exactly which command in the command stream is executing when the issues occur.

I tried adding "--verbose-register-command-stream" when compiling the model with Vela, but I can't make sense of the following log. How can I find the exact command in the command stream?

Code:   Command:                      Param:  Payload:
0x0123  cmd0.NPU_SET_PARALLEL_MODE    0       -
0x010f  cmd0.NPU_SET_IFM_REGION       1       -
0x4000  cmd1.NPU_SET_IFM_BASE0        0       0x00008000 (32768)
0x4001  cmd1.NPU_SET_IFM_BASE1        0       0x00000000 (0)
0x4002  cmd1.NPU_SET_IFM_BASE2        0       0x00000000 (0)
0x4003  cmd1.NPU_SET_IFM_BASE3        0       0x00000000 (0)
0x010b  cmd0.NPU_SET_IFM_HEIGHT0_M1   31      -
0x010c  cmd0.NPU_SET_IFM_HEIGHT1_M1   31      -
0x010a  cmd0.NPU_SET_IFM_WIDTH0_M1    31      -
0x0104  cmd0.NPU_SET_IFM_DEPTH_M1     2       -
0x4006  cmd1.NPU_SET_IFM_STRIDE_C     0       0x00000001 (1)
....

Could you help to give me some insights? Thank you very much.

0 Kristofer Jonsson over 2 years ago in reply to alisonw

Reading the QREAD register is a good start. QREAD is the offset in bytes from the start of the command stream. Each cmd0 command occupies 4 bytes and each cmd1 command 8 bytes, so adding up these sizes over the Vela listing should make it possible to determine which command hangs.
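
A minimal sketch of that counting, assuming each line of the Vela listing has the form "code cmd0.NAME param payload" as in the log quoted in the question. The helper names are hypothetical. QREAD may point at the stalled command or just past it, so it is worth looking at the neighbouring commands as well.

```python
SAMPLE_LOG = """\
Code: Command: Param: Payload:
0x0123 cmd0.NPU_SET_PARALLEL_MODE 0 -
0x010f cmd0.NPU_SET_IFM_REGION 1 -
0x4000 cmd1.NPU_SET_IFM_BASE0 0 0x00008000 (32768)
0x4001 cmd1.NPU_SET_IFM_BASE1 0 0x00000000 (0)
0x4002 cmd1.NPU_SET_IFM_BASE2 0 0x00000000 (0)
0x4003 cmd1.NPU_SET_IFM_BASE3 0 0x00000000 (0)
0x010b cmd0.NPU_SET_IFM_HEIGHT0_M1 31 -
"""

# cmd0 commands are 4 bytes long, cmd1 commands are 8 bytes long.
CMD_SIZE_BYTES = {"cmd0": 4, "cmd1": 8}


def command_offsets(log_text):
    """Return a list of (byte_offset, size, name) for each command line."""
    commands = []
    offset = 0
    for line in log_text.splitlines():
        fields = line.split()
        if len(fields) < 2 or "." not in fields[1]:
            continue  # header or blank line
        kind, name = fields[1].split(".", 1)
        if kind not in CMD_SIZE_BYTES:
            continue  # not a cmd0/cmd1 line
        size = CMD_SIZE_BYTES[kind]
        commands.append((offset, size, name))
        offset += size
    return commands


def command_at(log_text, qread):
    """Return (byte_offset, name) of the command covering offset qread, or None."""
    for offset, size, name in command_offsets(log_text):
        if offset <= qread < offset + size:
            return (offset, name)
    return None


print(command_at(SAMPLE_LOG, 20))
```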

If the STATUS register contains 0x00000002, then the NPU is stopped and an IRQ has been raised. The command stream has not reached the end, nor has an error interrupt been raised. It is difficult to say what is causing the hang, but a possible cause could be weight stream corruption, or a DMA job reading or writing an illegal address.
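
To make the two STATUS values from this thread easier to compare, here is a small decoding sketch. The bit positions are assumptions inferred from those two values (bit 0 running state, bit 1 IRQ raised, bit 5 end of command stream reached, bits 16-31 IRQ history mask) and should be verified against the TRM. The function name is hypothetical, and the remaining status bits are deliberately left out.

```python
def decode_status(value):
    """Decode a few STATUS fields (assumed bit positions, see the note above)."""
    return {
        "running": bool(value & 0x1),                 # bit 0: 0 = stopped, 1 = running
        "irq_raised": bool((value >> 1) & 0x1),       # bit 1
        "cmd_end_reached": bool((value >> 5) & 0x1),  # bit 5
        "irq_history_mask": (value >> 16) & 0xFFFF,   # bits 16-31
    }


print(decode_status(0x00000002))  # the hang case from this thread
print(decode_status(0xFFFF0022))  # the normal completion case
```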

Debugging this kind of error is usually easier on a model. I don't know if you have built an FVP (Fixed Virtual Platform) model of your hardware, or if you could try running the same network on the Corstone-300 FVP with Ethos-U65?

https://developer.arm.com/tools-and-software/open-source-software/arm-platforms-software/arm-ecosystem-fvps