I have a question about how to make the Ethos-U NPU work on an ARM Cortex-A + Cortex-M processor. First, I found ethos-u-linux-driver-stack and ethos-u-core-software on https://git.mlplatform.org/.
1. I know ethos-u-linux-driver-stack is the Ethos-U kernel driver. Should it be integrated into the Linux OS running on Cortex-A, or into the Linux OS running on Cortex-M? I am not clear about which core it needs to run on.
2. For ethos-u-core-software, how do I run it? I didn't find detailed steps for running it. Does it run on the NPU or on one of the cores?
3. Apart from the above two repos, is any other repo necessary to make the Ethos-U NPU work on an ARM Cortex-A + Cortex-M processor?
Thanks in advance for your suggestions.
The Linux driver stack for Arm Ethos-U is provided as an example of how an Arm Cortex-A running Linux can dispatch inferences to an Arm Ethos-U subsystem (Arm Cortex-M, Arm Ethos-U, SRAM, …). This is a so-called Asymmetric MultiProcessing (AMP) system, which requires a small amount of shared memory and an external hardware block (e.g. the Arm MHU) to trigger IRQs on the remote CPU.
The Linux driver stack currently contains a user space application, a user space driver library, and a kernel driver. It is important to note that the kernel driver does not drive the Arm Ethos-U NPU directly. Instead, it sends a message to an Arm Cortex-M in the Arm Ethos-U subsystem, and that Cortex-M drives the NPU.
The setup of the AMP communication is platform dependent and is done in the DTB file.
All software running on the Arm Cortex-M is referred to as core. The code that runs on the NPU is referred to as command stream.
The Arm Ethos-U (sub)system requires an Arm Ethos-U, some SRAM and an Arm Cortex-M to drive the NPU. The (sub)system is highly customizable. A customer may choose which Arm Cortex-M to use, the amount of SRAM, which peripherals to attach, which software to run etc.
Because of this high degree of flexibility, Arm can’t provide a ready-packaged software image to boot on the Arm Cortex-M. Core software only contains the software components needed to run an inference using the TensorFlow Lite for Microcontrollers framework.
You would need to write your own main() function to initialize the platform. You will also need to define a scatter file (Arm Clang) or a linker script (GCC) that describes the memory layout.
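As a rough illustration of the kind of main() described above, here is a minimal C++ sketch that compiles on a host. Every name in it (Platform, init_clocks, init_uart, init_npu, run_inference, firmware_entry) is hypothetical and not part of core software. On a real target the body of firmware_entry() would live in main() and the hooks would program actual hardware.

```cpp
#include <string>
#include <vector>

// Hypothetical record of which bring-up steps have run. On real hardware
// these steps would program clocks, UART and NPU registers instead.
struct Platform {
    std::vector<std::string> log;
    bool npu_ready = false;
};

// Hypothetical platform hooks; none of these names come from core software.
void init_clocks(Platform& p) { p.log.push_back("clocks"); }
void init_uart(Platform& p)   { p.log.push_back("uart"); }
void init_npu(Platform& p)    { p.log.push_back("npu"); p.npu_ready = true; }

// Placeholder for running one inference through the TFLu runtime.
bool run_inference(const Platform& p) { return p.npu_ready; }

// On a target this would be the body of main().
int firmware_entry(Platform& p) {
    init_clocks(p);
    init_uart(p);  // bring up logging before anything that may fail
    init_npu(p);   // the NPU must be initialized before the first inference
    return run_inference(p) ? 0 : 1;
}
```

The memory layout itself is not shown here, since it belongs in the scatter file or linker script rather than in code.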
All software publicly available for Arm Ethos-U can be downloaded with fetch_externals.py from the Arm Ethos-U repository.
Another repo worth mentioning is Vela, which takes a tflite file as input and produces another, optimized tflite file as output. The optimized tflite file contains custom operators that are executed on the Arm Ethos-U.
The Arm Ethos-U IP has only recently been released, so it will still take some time before you can buy an Arm Ethos-U capable platform to test on. There are currently no virtual platforms available for download on the Arm website that include the Arm Ethos-U, but hopefully that will change in the not too distant future.
Kristofer, thank you very much for your detailed reply. It's very useful for me. I have two more questions.
1. The TensorFlow Lite for Microcontrollers framework you mentioned is the common source at https://github.com/tensorflow/tensorflow, right? I checked the tensorflow directory in ethos-u/core_software, and it seems to be the common one, with no other external patches.
2. Is OpenAMP used for IPC between Cortex-A and Cortex-M? From the AMP communication you described, it seems the current code is sufficient.
1. Yes, it is the TensorFlow framework from GitHub. The build system under tensorflow/lite/micro/tools/make/ is used to produce a static library, including CMSIS-NN and the Ethos-U driver. There is, to my knowledge, one small patch that has not yet reached upstream; it adjusts the build flags and a few paths to CMSIS-NN.
2. OpenAMP is in the plan, but for the moment the communication is defined in linux_driver_stack/kernel/ethosu_core_interface.h and does not make use of virtio and rpmsg. It does, however, use the Linux kernel mailbox APIs to abstract which hardware block is used to trigger IRQs on the remote CPU.
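To make the idea of "shared memory plus a doorbell IRQ" more concrete, here is a minimal host-side C++ sketch of a single-producer/single-consumer queue. It is not the layout from ethosu_core_interface.h. SharedQueue, queue_send, queue_receive and the ring_doorbell callback are all hypothetical names. A real implementation would also need memory barriers and cache maintenance, which are omitted here.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <functional>

// Hypothetical byte queue standing in for a region of shared memory.
// Not the layout defined in ethosu_core_interface.h.
struct SharedQueue {
    static constexpr std::size_t kSize = 64;
    std::array<std::uint8_t, kSize> data{};
    std::uint32_t read = 0;   // advanced only by the consumer
    std::uint32_t write = 0;  // advanced only by the producer

    std::size_t used() const { return (write + kSize - read) % kSize; }
    std::size_t free_space() const { return kSize - 1 - used(); }
};

// Producer side (the role of the Linux kernel driver in this sketch):
// copy the message into the queue, then ring the doorbell.
bool queue_send(SharedQueue& q, const std::uint8_t* msg, std::size_t len,
                const std::function<void()>& ring_doorbell) {
    if (len > q.free_space()) return false;  // queue full, caller retries later
    for (std::size_t i = 0; i < len; ++i) {
        q.data[(q.write + i) % SharedQueue::kSize] = msg[i];
    }
    q.write = static_cast<std::uint32_t>((q.write + len) % SharedQueue::kSize);
    ring_doorbell();  // stand-in for triggering an IRQ on the remote CPU
    return true;
}

// Consumer side (the role of the Cortex-M): drain up to max_len bytes,
// for example from the doorbell IRQ handler.
std::size_t queue_receive(SharedQueue& q, std::uint8_t* out, std::size_t max_len) {
    std::size_t n = q.used() < max_len ? q.used() : max_len;
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = q.data[(q.read + i) % SharedQueue::kSize];
    }
    q.read = static_cast<std::uint32_t>((q.read + n) % SharedQueue::kSize);
    return n;
}
```

In this sketch the doorbell is just a callback. On Linux that is the step the kernel mailbox APIs would abstract.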
Thanks a lot for your reply.
Hi Kristofer, I am still confused about the communication between Cortex-A and Cortex-M. As you mentioned, the current communication uses the Linux kernel mailbox APIs, and virtio/rpmsg are not used. I want to know whether the current code is sufficient to accomplish the communication between Cortex-A and Cortex-M. Do virtio and rpmsg need to be used?
The code published on MLPlatform is sufficient for Linux to dispatch inferences to the Arm Cortex-M.
The reason for moving towards virtio/rpmsg (OpenAMP on the Arm Cortex-M side) is that those are Linux native APIs. They provide standard, well designed and well tested communication channels. With that said, what we have published so far is fully functional.
For reference you could follow the call chain from ethosu_inference_create() to see how the kernel driver creates an inference and dispatches it to the Arm Ethos-U subsystem. On the Core side the message is received and handled in MessageProcess::handleMessage().
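The receiving end of that call chain is essentially a dispatch on message type. The following C++ sketch shows the general pattern only. It does not reproduce MessageProcess::handleMessage(), and MsgType, Message and handle_message are hypothetical names with made-up values.

```cpp
#include <cstdint>
#include <string>

// Hypothetical message types; the real ones are defined by the driver stack.
enum class MsgType : std::uint32_t { Ping = 1, InferenceRequest = 2 };

struct Message {
    MsgType type;
    std::uint32_t payload;  // e.g. a handle identifying the inference job
};

// Hypothetical handler: decode the type and return a description of the
// response the firmware would send back to Linux.
std::string handle_message(const Message& m) {
    switch (m.type) {
    case MsgType::Ping:
        return "pong";
    case MsgType::InferenceRequest:
        // Here the firmware would run the network and report its status.
        return "inference_response:" + std::to_string(m.payload);
    }
    return "unknown";  // unrecognized type, e.g. from a newer driver
}
```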
Kristofer, thanks for your reply, I got it. Another question: what do you suggest for the Cortex-M, an OS or bare metal? How about running FreeRTOS on Cortex-M?
I would like to divide this answer into two parts.
The Arm Ethos-U NPU driver is OS agnostic and does not use any OS specific primitives (like mutexes, queues, etc). Consequently, it can be paired with any RTOS. You spawn a thread that drives the TFLu runtime. I would even go one step further and say that using an RTOS is recommended. So yes, you can use FreeRTOS or any other RTOS you prefer.
What is not (yet) supported are user facing OS APIs. These APIs would be used by applications to schedule inferences, and implemented by drivers to run inferences on the physical device. The APIs would provide hardware abstraction (you do not know which hardware accelerates your network) and scheduling (multiple applications can share multiple NPUs).
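A minimal C++ sketch of such a thread body is shown below, written so that it also runs on a host. InferenceJob, TaskContext and inference_task are hypothetical names. The std::deque stands in for an RTOS queue and the invoke callback stands in for the TFLu runtime. With FreeRTOS, a function of this void(void*) shape is what would be handed to xTaskCreate().

```cpp
#include <cstdint>
#include <deque>
#include <functional>

// Hypothetical job description: identifies which inference to run.
struct InferenceJob {
    std::uint32_t id;
};

struct TaskContext {
    std::deque<InferenceJob> pending;                 // stand-in for an RTOS queue
    std::function<bool(const InferenceJob&)> invoke;  // stand-in for the TFLu runtime
    std::uint32_t completed = 0;
    std::uint32_t failed = 0;
};

// Task entry point with the void(void*) shape most RTOSes expect.
// A real task would block on the queue forever; this sketch returns when
// the queue is empty so that it can be exercised on a host.
void inference_task(void* arg) {
    auto* ctx = static_cast<TaskContext*>(arg);
    while (!ctx->pending.empty()) {
        InferenceJob job = ctx->pending.front();
        ctx->pending.pop_front();
        if (ctx->invoke(job)) {
            ++ctx->completed;
        } else {
            ++ctx->failed;
        }
    }
}
```

Because the driver itself uses no OS primitives, the only RTOS-specific parts in such a design are the task creation and the queue.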
Kristofer, according to my understanding of the current code, TensorFlow Lite APIs (for example, interpreter.Invoke) are used by the application and call the ethosu driver in micro/. Are the TensorFlow Lite APIs one kind of the user facing OS APIs you mentioned? I am not clear about which user facing OS APIs are not (yet) supported. Please correct me and guide me.
User facing APIs would be part of the OS and should be generic enough to support multiple frameworks (TFLu, TVM, etc). They should allow multiple applications to share NPU resources and ideally provide hardware abstraction (the application is unaware of which hardware accelerates the network). An application would make an OS call to run an inference, instead of directly calling interpreter.Invoke().
These APIs don't exist today and we do not yet have a clear picture of what they would look like, or if this even is the right way to go. Hardware abstraction might also be difficult to achieve, because networks might have been optimized for a specific hardware.
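Purely to illustrate the two properties discussed (hardware abstraction and sharing of devices), here is a speculative C++ sketch. It does not describe any existing or planned API. InferenceDevice, FakeDevice and InferenceService are invented names, and the "first idle device" policy is only an assumption made to keep the example small.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical device abstraction; an NPU or a CPU fallback could implement it.
class InferenceDevice {
public:
    virtual ~InferenceDevice() = default;
    virtual bool busy() const = 0;
    virtual std::string run(const std::string& network) = 0;
};

// Fake device used only to make the sketch runnable on a host.
class FakeDevice : public InferenceDevice {
public:
    FakeDevice(std::string name, bool busy) : name_(std::move(name)), busy_(busy) {}
    bool busy() const override { return busy_; }
    std::string run(const std::string& network) override {
        return network + " ran on " + name_;
    }
private:
    std::string name_;
    bool busy_;
};

// Hypothetical OS service: applications submit a network and never name a device.
class InferenceService {
public:
    void add_device(std::unique_ptr<InferenceDevice> d) {
        devices_.push_back(std::move(d));
    }

    // Picks the first idle device; returns an empty string if all are busy.
    std::string run_inference(const std::string& network) {
        for (auto& d : devices_) {
            if (!d->busy()) return d->run(network);
        }
        return std::string();
    }
private:
    std::vector<std::unique_ptr<InferenceDevice>> devices_;
};
```

The difficulty mentioned above shows up even here: a network optimized for one specific device could not simply be handed to whichever device happens to be idle.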