TinyML is a branch of machine learning (ML) focused on deploying ML models to low-power, resource-constrained IoT devices. Deploying ML models on IoT devices has several benefits, including reduced latency and improved privacy, since all data is processed on-device. TinyML gained traction in 2019 when Google’s TensorFlow team released the TensorFlow Lite for Microcontrollers (TFLM) library.
TensorFlow Lite for Microcontrollers homepage
The initial use cases and TFLM examples focused on running quantized 8-bit keyword detection and person detection models on Arm Cortex-M4 based development boards, such as the Arduino Nano 33 BLE Sense and SparkFun Edge. The examples leveraged the Cortex-M4 CPU’s Signed Multiply Accumulate Dual (SMLAD) instruction to accelerate the multiply-accumulate (MAC) operations the models require during on-device inference (see the sketch below).
The table above summarizes the number of MACs, RAM and flash memory requirements, and inference latency of both models on the Arduino Nano 33 BLE Sense.
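To illustrate how SMLAD helps with these MAC-heavy workloads: the instruction multiplies two pairs of packed signed 16-bit values and adds both products to a 32-bit accumulator in a single instruction, performing two MACs at once. Below is a minimal sketch using the CMSIS-Core __SMLAD intrinsic; it assumes a DSP-capable Cortex-M core and the CMSIS headers, and the function itself is illustrative rather than code from the TFLM examples.

```cpp
#include <stddef.h>
#include <stdint.h>

#include "cmsis_compiler.h"  // CMSIS-Core header that provides __SMLAD

// Dot product of two int16 vectors using SMLAD: each call performs two
// signed 16x16 multiplies and accumulates both products in one instruction.
// Assumes len is even and the arrays are 32-bit aligned.
int32_t DotProductQ15(const int16_t* a, const int16_t* b, size_t len) {
  const uint32_t* pa = (const uint32_t*)a;  // two packed int16 per load
  const uint32_t* pb = (const uint32_t*)b;
  uint32_t acc = 0;
  for (size_t i = 0; i < len / 2; ++i) {
    // acc += a[2i] * b[2i] + a[2i+1] * b[2i+1]
    acc = __SMLAD(pa[i], pb[i], acc);
  }
  return (int32_t)acc;
}
```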
Modern microcontrollers equipped with Arm's Ethos-U55 Neural Processing Unit (NPU) are capable of running more complex models that were originally developed for mobile applications. The Ethos-U55 NPU supports bit-accurate output for 44 TensorFlow Lite operators and can be configured to perform 32, 64, 128, or 256 MAC operations per cycle. This blog demonstrates the performance benefits of the NPU by running two TinyML applications on a modern microcontroller equipped with an Ethos-U55 NPU. We’ll benchmark the inference latency of the applications’ ML models by running them on the microcontroller with and without the Ethos-U55 NPU.
The TinyML applications were deployed to a Seeed Studio Grove Vision AI Module V2 development board, which is based on Himax’s WiseEye2 HX6538 microcontroller and integrates 16 MB of external flash memory. The Himax WiseEye2 HX6538 pairs an Arm Cortex-M55 CPU with an Ethos-U55 NPU, both running at 400 MHz, and includes 512 KB of Tightly Coupled Memory (TCM) and 2 MB of SRAM.
Photo of a Seeed Studio Grove - Vision AI Module V2 development board
The board has a 15-pin Camera Serial Interface (CSI) connected to the Himax WiseEye2 HX6538 MCU that can be used with an OmniVision OV5647-based camera module. Applications running on the board can capture RGB images at resolutions of 160x120, 320x240, or 640x480 pixels in real time from the camera module.
A 2.8” TFT screen and a 3.7V LiPo battery were connected to the board to create a portable battery-operated device.
The two computer vision applications deployed to the board continuously capture an image from the camera module, run ML inference on it, and display the results on the TFT screen. The first application uses two ML models to predict key points on a person’s face, and the second uses an ML model to predict key points of a person’s pose.
Both applications use the TFLM library alongside the Ethos-U custom operator, which enables offloading ML operations to the NPU. The quantized 8-bit TensorFlow Lite models used in the applications must first be compiled with Arm’s vela compiler. The vela compiler collapses the operations supported by the NPU into custom Ethos-U operators that can be dispatched to the NPU for efficient execution. Any operations not supported by the NPU remain in the graph and fall back to running on the CPU.
(Left) Graph of the 8-bit quantized TensorFlow Lite model. (Right) Graph of the same model compiled with vela.
In the example above, the vela compiler converted all TensorFlow Lite operators apart from the pad operations into Ethos-U custom operators, which run on the Ethos-U55 NPU. The unconverted pad operators fall back to running on the CPU.
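To give a feel for the integration, the sketch below shows how an application might set up TFLM with the Ethos-U custom operator registered alongside the pad fallback from the example above. The arena size and model symbol are placeholders, and the exact MicroInterpreter constructor varies between TFLM versions; the model itself would first be compiled offline with vela (for example, `vela model_int8.tflite --accelerator-config ethos-u55-64`).

```cpp
#include <cstdint>

#include "tensorflow/lite/micro/micro_interpreter.h"
#include "tensorflow/lite/micro/micro_mutable_op_resolver.h"
#include "tensorflow/lite/schema/schema_generated.h"

// Placeholder: the vela-compiled model flatbuffer, linked into flash.
extern const unsigned char g_model_data[];

// Placeholder arena size; must cover the model's tensor RAM requirements.
constexpr int kArenaSize = 600 * 1024;
alignas(16) static uint8_t tensor_arena[kArenaSize];

tflite::MicroInterpreter* SetUpInterpreter() {
  const tflite::Model* model = tflite::GetModel(g_model_data);

  // Register only what the compiled graph uses: the Ethos-U custom
  // operator (fused NPU sub-graphs) and PAD as the CPU fallback.
  static tflite::MicroMutableOpResolver<2> resolver;
  resolver.AddEthosU();
  resolver.AddPad();

  static tflite::MicroInterpreter interpreter(model, resolver, tensor_arena,
                                              kArenaSize);
  interpreter.AllocateTensors();
  return &interpreter;
}
```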
This application captures a 320x240 image from the camera module and then estimates 468 key points of a person’s face. It can be used as a feature extraction layer for applications that need to identify familiar faces, monitor attention, identify emotions, or assist medical diagnosis. The application uses two ML models: the Google MediaPipe BlazeFace (short-range) model first identifies the locations of faces in the image; once a face is detected, the Google MediaPipe Face Mesh model identifies the 468 key points on the largest face in the image.
Play the video below to see the models in action on the development board; the application achieves just under 11 frames per second (FPS) when using the Ethos-U55 NPU.
Recorded demo of Face Mesh model running with the Ethos-U55 enabled
The MediaPipe BlazeFace (short-range) model requires an RGB 128x128 image as its input and performs 31 million MAC operations per inference. An 8-bit quantized version of the BlazeFace (short-range) model in TensorFlow Lite format was obtained from Katsuya Hyodo’s PINTO_model_zoo GitHub repository.
The table above summarizes the RAM and flash memory requirements, and compares the inference latency of the Cortex-M55 CPU alone versus the Cortex-M55 CPU with the Ethos-U55 NPU:
The Ethos-U55 NPU accelerates inference by 109 times!
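As an aside on preprocessing: before each BlazeFace inference, the captured 320x240 frame is resized to 128x128 and its pixels are quantized to the model’s int8 input. The sketch below is illustrative and assumes the model expects pixels normalized to [0, 1]; the actual scale and zero point must be read from the input tensor’s quantization parameters.

```cpp
#include <cstddef>
#include <cstdint>

// Quantize 8-bit RGB pixels into an int8 input tensor:
//   q = round(normalized_pixel / scale) + zero_point
// scale and zero_point come from the model's input tensor metadata.
void QuantizeRgbInput(const uint8_t* rgb, size_t num_values, float scale,
                      int32_t zero_point, int8_t* out) {
  for (size_t i = 0; i < num_values; ++i) {
    float normalized = rgb[i] / 255.0f;  // assumed [0, 1] normalization
    int32_t q = static_cast<int32_t>(normalized / scale + 0.5f) + zero_point;
    if (q < -128) q = -128;  // clamp to the int8 range
    if (q > 127) q = 127;
    out[i] = static_cast<int8_t>(q);
  }
}
```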
The MediaPipe Face Mesh model requires an RGB 192x192 image of a cropped face with 25% padding as its input and performs 36.8 million MAC operations per inference. A 16-bit floating-point version of the model was downloaded from GitHub and converted to an 8-bit quantized model with the tflite2tensorflow tool.
The table above summarizes the RAM and flash memory requirements, and compares the inference latency of the Cortex-M55 CPU alone versus the Cortex-M55 CPU with the Ethos-U55 NPU:
The Ethos-U55 NPU accelerates inference by 103 times.
Offloading ML computation to the Ethos-U55 allows this application to perform over 10 inferences per second. If it were deployed to the Cortex-M55 CPU alone, the application would only be able to perform an inference every 8 seconds when a face is visible.
This application captures a 320x240 image from the camera module and then estimates 17 key points of a person’s pose for each person detected in the image. It can be used as a feature extraction layer for applications that need to detect falls or movement, or as an input to a human-machine interface.
Play the video below to see the models in action on the development board; the application achieves just over 10 FPS when using the Ethos-U55 NPU.
Recorded demo of Pose Estimation running with the Ethos-U55 enabled
The YOLOv8n-pose model was exported to an 8-bit quantized TensorFlow Lite model with a 256x256 RGB input using DeGirum's fork of the Ultralytics YOLOv8 GitHub repository. DeGirum’s modifications optimize the exported model for microcontrollers by removing the transpose operations and separating the model’s seven outputs, which improves accuracy under quantization. The model requires 728 million MAC operations per inference.
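Each of the seven output tensors carries its own scale and zero point, which is why separating them improves quantized accuracy: every output gets quantization parameters fitted to its own value range rather than sharing one scale. Reading a value back out is then a one-liner (a generic sketch, not code from the application):

```cpp
#include <cstdint>

// Dequantize an int8 output value using its tensor's quantization
// parameters: real = scale * (q - zero_point).
inline float Dequantize(int8_t q, float scale, int32_t zero_point) {
  return scale * static_cast<float>(static_cast<int32_t>(q) - zero_point);
}
```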
The Ethos-U55 NPU accelerates inference by 611 times! Offloading ML computation to the Ethos-U55 allows this application to perform over 10 inferences per second. If it were deployed to the Cortex-M55 CPU alone, the application would only be able to perform an inference once every 62 seconds. The reduced inference latency allows your application to react to human movement much faster.
This blog demonstrates that ML models developed for mobile applications, requiring tens to hundreds of millions of multiply-accumulate (MAC) operations per inference, can be deployed to a modern microcontroller equipped with an Arm Ethos-U55 NPU. These models require significantly more MAC operations, RAM, and flash memory than the example applications included in the TFLM library. Using the NPU enables the applications to perform multiple inferences per second, versus a single inference every few seconds, or even every minute, without it. This enables applications to react much faster to the surrounding environment while running one or more ML models that are more complex than those used in early TinyML applications.
Learn more:
The applications shown were based on the work of the Himax team, which ported the Face Mesh and YOLOv8n-pose models to the Himax WiseEye2 HX6538 microcontroller. Their work inspired me to create a compact, portable, battery-operated device with a TFT display and rechargeable battery.