This discussion has been locked.

You can no longer post new replies to this discussion. If you have a question you can start a new discussion

How to make Ethos-U NPU work on an ARM Cortex-A + Cortex-M processor?

I have a question about how to make Ethos-U NPU work on a ARM Cortex-A + Cortex-M processor. First, I found ethos-u-linux-driver-stack and ethos-u-core-software on https://git.mlplatform.org/.

1. I know ethos-u-linux-driver-stack is Ethos-U kernel driver. Should it be integrated into the Linux OS running on Cortex-A or be integrated into the Linux OS running on Cortex-M? I am nor clear about which core it need to perform on.

2. For ethos-u-core-software, how to run it? I did't find the detail steps to run it. Does it run on NPU or any core?

3. Except the above two repos, is there any other repo necessory to make Ethos-U NPU work on an ARM Cortex-A + Cortex-M processor?

Thanks for your suggestion in advance.

Top replies

0 alisonw over 3 years ago in reply to alisonw

Hi, Kristofer, could you help to reply my question?
Cancel
Up 0 Down

Cancel
0 Kristofer Jonsson over 3 years ago in reply to alisonw
Hi Alison

Sorry for the late reply. I have been on Christmas holiday for the last two weeks and this is my first day after the holiday.

The function ethosu_invoke() takes a pointer to struct custom_data_s [1], which is a dynamic array of driver actions. Vela will typically place an optimizer config followed by a command stream.

Optimizer config is used by the driver to verify that the command stream has been generated for the correct NPU.

Command stream [2] contains the actual command stream embedded for the NPU. This command stream is passed to handle_command_stream().

[1] https://git.mlplatform.org/ml/ethos-u/ethos-u-core-driver.git/tree/src/ethosu_driver.c#n71
[2] https://git.mlplatform.org/ml/ethos-u/ethos-u-core-driver.git/tree/src/ethosu_driver.c#n363

Best regards
Kristofer
Cancel
Up 0 Down

Cancel
0 alisonw over 3 years ago in reply to Kristofer Jonsson

Hi, Kristofer, it doesn't matter. Happy New Year! :) Thank you very much for your detailed explanation. According to the explanation, the loop [1] will first run case OPTIMIZER_CONFIG and verify that the command stream has been generated for the correct NPU, then data_ptr will point to the actual command stream. The loop [1] will run case COMMAND_STREAM and handle_command_stream. Is it right?

[1] https://git.mlplatform.org/ml/ethos-u/ethos-u-core-driver.git/tree/src/ethosu_driver.c#n351
Cancel
Up 0 Down

Cancel
0 Kristofer Jonsson over 3 years ago in reply to alisonw

That is correct.

Happy New Year!
Cancel
Up 0 Down

Cancel
0 alisonw over 3 years ago

Hi, Kristofer, recently I met a problem and had some ideas as below. I really hope you could give me some suggestions.

1. I want to try object detection demo on Cortex-M using tflite micro + SSD model. Then I found operator DETECTION_POSTPROCESS is not supported in tflite micro based on ethos-u 20.08 and 20.11 releases. I also checked the latest tensorflow source code, this operator is supported in tflite micro. So my question is when or which version of ethos-u release will integrate the tensorflow source code that this operator is supported in tflite micro. I suppose this operator will be supported in vela tool, right?

2. As I met such problems for tflite micro, I want to ask a question. Why don't we use tflite on Cortex-A directly instead of tflite micro on Cortex-M? Is there any necessary binding between ethos-u65 and tflite micro?
Cancel
Up 0 Down

Cancel
0 Kristofer Jonsson over 3 years ago in reply to alisonw

Hi

1. Next release of Arm Ethos-U is planned for end of February, which I am sure will include the detection postprocess operator.

Until then you can use the fetch_externals.py script to download the latest versions of all repos.

https://git.mlplatform.org/ml/ethos-u/ethos-u.git/about/

$ ./fetch_externals.py fetch

I don't know if Vela has support for detection postprocess in their plans or not.

2. Tensorflow Lite for microcontrollers (TFLu) has been designed to run on Cortex-M. I suppose it would be possible to compile a bare metal app for Cortex-A, but I guess what you wonder is, if Arm Ethos-U can be driven directly from Linux?

In theory, yes, it would be possible to drive the Arm Ethos-U directly from Linux. However, operators that cannot be mapped to the NPU would need to be executed on the CPU, either in kernel space or some kind of user space fallback. Implementing this fallback mechanism is doable, but I don't think it is a trivial task to do. The IRQ handling on Linux might also introduce latency, degrading the performance.
Cancel
Up 0 Down

Cancel
0 alisonw over 3 years ago in reply to Kristofer Jonsson

I don't have a clear understanding about how to handle the operators that can be executed on NPU and need to be executed on CPU. Are there many interactions when executing the operators on NPU and CPU? For example, the first operator executes on NPU, then the second operator executes on CPU... If so, there are too many interactions between NPU and CPU. It may be a big expense.

Actually, my idea is to run TFLite on Cortex-A. When it needs to run on NPU, Cortex-A will send this request to Cortex-M. Cortex-M will execute ethosu_invoke() and handle IRQ. I mean the core_driver still execute on Cortex-M.
Cancel
Up 0 Down

Cancel
0 Kristofer Jonsson over 3 years ago in reply to alisonw

Arm has analyzed the most common AI networks in the embedded space and tried to map the operators to the Arm Ethos-U. How well the this maps for you depends on what networks you want to run.

The software stack for Arm Ethos-U has been designed to fall back to Cortex-M for operators that are not supported by the NPU. Running TLFu on Cortex-A and dispatching custom operators to Cortex-M and the NPU could be possible, but is nothing we have planned to implement. In the Linux Driver Stack for Ethos-U we have provided an example how a Linux user space process can dispatch inferences to an Arm Ethos-U subsystem.
Cancel
Up 0 Down

Cancel
0 alisonw over 3 years ago
Hi, Kristofer,

Recently I have some hardware questions about ethos-u65. Could you give me some guide?

For field shram_size in CONFIG register, I see there are two values, SHRAM_48kB and SHRAM_96kB.

Is the size fixed or changeable by resetting some register?

Does SHRAM mean shared RAM? Is it SRAM?

Is it the RAM in NPU itself or the RAM on processor/microcontroller?

For DMA controller, I see there are several channels.

For command channel, it mentions NPU uses this channel to read the command stream, normally from external flash. Is it ok for the command steam located in DRAM?

For IFM channel and weight channel, the data is transferred from external memory to the shared RAM. Is the shared RAM big enough to save the ifm and weight for a vela model generated by vela tool + ssd mobilenet tflite model?

For Arm AMBA 5 AXI interfaces, there are two read/write master M0 and M1.

Is M0 used by the NPU to access higher data rates memory, such as on-chip SRAM? And is M1 used by the NPU to access memory, such as DRAM?

When do M0 work and when does M1 work? Should we configure them in software? I am quite unclear about it.
Cancel
Up 0 Down

Cancel
0 Kristofer Jonsson over 3 years ago in reply to alisonw

I thought I had responded to this question, but I can't find my answer in the thread.

The SHRAM is built into the NPU and the size can't be changed by software. The memory is typically used for storage of weights, biases and temporary data. This is a small memory that will not fit a network like ssd mobilenet. Instead small portions of the weighs and biases are copied to the SHRAM as they are needed.

There are 8 "logical memory channels". Which logical channel that is used for what data (ifm, ofm, weights, biases etc) is coded in the command stream by Vela. The driver maps the logical channels to the physical DMA interfaces (M0 and M1) in the region config registers.

Consequently which DMA interface that is used depends both on Vela and the driver. Now this is subject to change, but as of today with the default settings in Vela and the driver, the command stream, weights and biases will go over M1, and all other data over M0. This allows the TFLu model (command stream, weights and biases) to be moved from fast memory (SRAM) to slower memory (flash or DRAM) without congesting the M0 interface.
Cancel
Up 0 Down

Cancel
0 alisonw over 3 years ago in reply to Kristofer Jonsson

Kristofer, please help to confirm my below comments.

1. The fast memory (SRAM) you mentioned in "This allows the TFLu model (command stream, weights and biases) to be moved from fast memory (SRAM) to slower memory (flash or DRAM) without congesting the M0 interface." is the SHRAM built into the NPU, right?

2. Now, the current process is SHRAM <-> DMA & M1 or M0 <-> slower memory (flash or DRAM), right?

Maybe I didn't understand your last sentence clearly.
Cancel
Up 0 Down

Cancel
0 Kristofer Jonsson over 3 years ago in reply to alisonw
SHRAM - Built into the NPU. DMA is used to move data between external memory and the internal SHRAM.

SRAM - Fast memory outside of the NPU.

Flash/DRAM - Slow memory outside of the NPU.

The TFLu has two buffer, model and arena. For optimal performance both the model and the arena should be placed in SRAM (or similar memory technology), however SRAM is expensive and at the cost of performance the model could be moved to DRAM or flash.
Cancel
Up 0 Down

Cancel
0 alisonw over 3 years ago in reply to Kristofer Jonsson

Kristofer, I want to set the particular SRAM and DRAM address for U65 on our processor, how should I do?
Cancel
Up 0 Down

Cancel
0 Kristofer Jonsson over 3 years ago in reply to alisonw
I am not sure what you mean with setting the SRAM and DRAM address for U65. Could you please elaborate a bit more on what problem you have?

If you for example wonder about how to place the model and arena buffers in memory, then perhaps this information might help you.

Running an inference on the TFLu framework requires three memory regions.

ITCM. This is where code is placed.

Model. This buffer contains constant data like weights, biases and the Ethos-U command stream.

Arena. This is a heap, used for read write data like IFM, OFM, activations etc.

For the tests we have upstreamed to MLPlatform we have defined two additional buffers.

IFM data.

Expected OFM data. Used to verify that the inference completed successfully.

Please have a look at the baremetal example application. Each buffer is named with a section attribute.

Model: https://review.mlplatform.org/plugins/gitiles/ml/ethos-u/ethos-u-core-platform.git/+/refs/tags/21.05-rc1/applications/baremetal/models/keyword_spotting_cnn_small_int8/model.h#30

Arena: https://review.mlplatform.org/plugins/gitiles/ml/ethos-u/ethos-u-core-platform.git/+/refs/tags/21.05-rc1/applications/baremetal/main.cpp#46

IFM: https://review.mlplatform.org/plugins/gitiles/ml/ethos-u/ethos-u-core-platform.git/+/refs/tags/21.05-rc1/applications/baremetal/models/keyword_spotting_cnn_small_int8/input.h#25

OFM: https://review.mlplatform.org/plugins/gitiles/ml/ethos-u/ethos-u-core-platform.git/+/refs/tags/21.05-rc1/applications/baremetal/models/keyword_spotting_cnn_small_int8/output.h#25

The section attributes are placed in memory by the scatter file (ArmClang) or linker script (GCC). To change where the buffers are placed in memory you need to edit the scatter file or linker script.

ArmClang: https://review.mlplatform.org/plugins/gitiles/ml/ethos-u/ethos-u-core-platform.git/+/refs/tags/21.05-rc1/targets/corstone-300/platform.scatter

GCC: https://review.mlplatform.org/plugins/gitiles/ml/ethos-u/ethos-u-core-platform.git/+/refs/tags/21.05-rc1/targets/corstone-300/platform.ld
Cancel
Up 0 Down

Cancel
+1 alisonw over 3 years ago in reply to Kristofer Jonsson

Hi, Kristofer, thanks for your sharing. I have a reference about the codes on corstone-300. It seems ifm, model(weights, biases and the Ethos-U command stream), ofm are located in SRAM. If ETHOSU_FAST_MEMORY_SIZE is defined, tensor arena will be located in SRAM, or else it will be located in DDR. I think ETHOSU_FAST_MEMORY_SIZE means whether there is a fast memory, such as SRAM. Please correct me, if there is anything wrong.

Actually, in my current codes, ifm, model, ofm and tensor arena are all located in DDR. I think only AXI M1 inference is used, because M1 is connected to DDR, M0 is connected to fast memory (on-chip SRAM) in our system. So I want to know whether I need to change Vela tool or driver to make this case work. For example, model is located at DDR, and AXI M1 is used, should I make some configuration in Vela tool or driver for it?

Moreover, as there is fast memory in our system, should I locate tensor arena in fast memory (on-chip SRAM)? I think the fast memory is not large enough for the model, such as mobilenet v1.
Cancel
Up -1 Down

Cancel