TinyML is a branch of machine learning (ML) focused on deploying ML models to low-power, resource-constrained IoT devices. Deploying ML models on IoT devices has several benefits, including reduced latency and improved privacy, since all data is processed on-device. TinyML gained traction in 2019 when Google’s TensorFlow team released the TensorFlow Lite for Microcontrollers (TFLM) library.
TensorFlow Lite for Microcontrollers homepage
The initial use cases and TFLM examples focused on running quantized 8-bit keyword detection and person detection models on Arm Cortex-M4 based development boards, such as the Arduino Nano 33 BLE Sense and SparkFun Edge. The examples leveraged the Cortex-M4 CPU’s Signed Multiply Accumulate Dual (SMLAD) instruction to accelerate the multiply-accumulate (MAC) operations the models require during on-device inference (see the sketch below).
The table above summarizes the number of MACs, RAM and flash memory requirements, and inference latency of both models on the Arduino Nano 33 BLE Sense.
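To illustrate how SMLAD helps with these MAC-heavy workloads: the instruction multiplies two pairs of packed signed 16-bit values and adds both products to a 32-bit accumulator in a single instruction, performing two MACs at once. Below is a minimal sketch using the CMSIS-Core __SMLAD intrinsic; it assumes a DSP-capable Cortex-M core and the CMSIS headers, and the function itself is illustrative rather than code from the TFLM examples.

```cpp
#include <stddef.h>
#include <stdint.h>

#include "cmsis_compiler.h"  // CMSIS-Core header that provides __SMLAD

// Dot product of two int16 vectors using SMLAD: each call performs two
// signed 16x16 multiplies and accumulates both products in one instruction.
// Assumes len is even and the arrays are 32-bit aligned.
int32_t DotProductQ15(const int16_t* a, const int16_t* b, size_t len) {
  const uint32_t* pa = (const uint32_t*)a;  // two packed int16 per load
  const uint32_t* pb = (const uint32_t*)b;
  uint32_t acc = 0;
  for (size_t i = 0; i < len / 2; ++i) {
    // acc += a[2i] * b[2i] + a[2i+1] * b[2i+1]
    acc = __SMLAD(pa[i], pb[i], acc);
  }
  return (int32_t)acc;
}
```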
Modern microcontrollers equipped with Arm's Ethos-U55 Neural Processing Unit (NPU) are capable of running more complex models that were originally developed for mobile applications. The Ethos-U55 NPU supports bit-accurate output for 44 TensorFlow Lite operators and can be configured to perform 32, 64, 128, or 256 MAC operations per cycle. This blog demonstrates the performance benefits of the NPU by running two TinyML applications on a modern microcontroller equipped with an Ethos-U55 NPU. We’ll benchmark the inference latency of the applications’ ML models by running them on the microcontroller with and without the Ethos-U55 NPU.
The TinyML applications were deployed to a Seeed Studio Grove Vision AI Module V2 development board, which is based on Himax’s WiseEye2 HX6538 microcontroller and integrates 16 MB of external flash memory. The Himax WiseEye2 HX6538 pairs an Arm Cortex-M55 CPU with an Ethos-U55 NPU, both running at 400 MHz, and includes 512 KB of Tightly Coupled Memory (TCM) and 2 MB of SRAM.
Photo of a Seeed Studio Grove - Vision AI Module V2 development board
The board has a 15-pin Camera Serial Interface (CSI) connected to the Himax WiseEye2 HX6538 MCU that can be used with an OmniVision OV5647-based camera module. Applications running on the board can capture RGB images at resolutions of 160x120, 320x240, or 640x480 pixels in real time from the camera module.
A 2.8” TFT screen and a 3.7V LiPo battery were connected to the board to create a portable battery-operated device.
The two computer vision applications deployed to the board continuously capture an image from the camera module, run ML inference on it, and display the results on the TFT screen. The first application uses two ML models to predict key points on a person’s face, and the second uses an ML model to predict key points of a person’s pose.
Both applications use the TFLM library alongside the Ethos-U custom operator, which enables offloading ML operations to the NPU. The quantized 8-bit TensorFlow Lite models used in the applications must first be compiled with Arm’s vela compiler. The vela compiler collapses the operations supported by the NPU into custom Ethos-U operators that can be dispatched to the NPU for efficient execution. Any operations not supported by the NPU remain in the graph and fall back to running on the CPU.
(Left) Graph of the 8-bit quantized TensorFlow Lite model. (Right) Graph of the same model compiled with vela.
In the example above, the vela compiler converted all TensorFlow Lite operators apart from the pad operations into Ethos-U custom operators, which run on the Ethos-U55 NPU. The unconverted pad operators fall back to running on the CPU.
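To give a feel for the integration, the sketch below shows how an application might set up TFLM with the Ethos-U custom operator registered alongside the pad fallback from the example above. The arena size and model symbol are placeholders, and the exact MicroInterpreter constructor varies between TFLM versions; the model itself would first be compiled offline with vela (for example, `vela model_int8.tflite --accelerator-config ethos-u55-64`).

```cpp
#include <cstdint>

#include "tensorflow/lite/micro/micro_interpreter.h"
#include "tensorflow/lite/micro/micro_mutable_op_resolver.h"
#include "tensorflow/lite/schema/schema_generated.h"

// Placeholder: the vela-compiled model flatbuffer, linked into flash.
extern const unsigned char g_model_data[];

// Placeholder arena size; must cover the model's tensor RAM requirements.
constexpr int kArenaSize = 600 * 1024;
alignas(16) static uint8_t tensor_arena[kArenaSize];

tflite::MicroInterpreter* SetUpInterpreter() {
  const tflite::Model* model = tflite::GetModel(g_model_data);

  // Register only what the compiled graph uses: the Ethos-U custom
  // operator (fused NPU sub-graphs) and PAD as the CPU fallback.
  static tflite::MicroMutableOpResolver<2> resolver;
  resolver.AddEthosU();
  resolver.AddPad();

  static tflite::MicroInterpreter interpreter(model, resolver, tensor_arena,
                                              kArenaSize);
  interpreter.AllocateTensors();
  return &interpreter;
}
```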
This application captures a 320x240 image from the camera module and then estimates 468 key points of a person’s face. It can be used as a feature extraction layer for applications that need to identify familiar faces, monitor attention, identify emotions, or assist medical diagnosis. The application uses two ML models: the Google MediaPipe BlazeFace (short-range) model first identifies the locations of faces in the image; once a face is detected, the Google MediaPipe Face Mesh model identifies the 468 key points on the largest face in the image.
Play the video below to see the models in action on the development board; the application achieves just under 11 frames per second (FPS) when using the Ethos-U55 NPU.
Recorded demo of Face Mesh model running with the Ethos-U55 enabled
The MediaPipe BlazeFace (short-range) model requires an RGB 128x128 image as its input and performs 31 million MAC operations per inference. An 8-bit quantized version of the BlazeFace (short-range) model in TensorFlow Lite format was obtained from Katsuya Hyodo’s PINTO_model_zoo GitHub repository.
The table above summarizes the RAM and flash memory requirements, and compares the inference latency of the Cortex-M55 CPU alone versus the Cortex-M55 CPU with the Ethos-U55 NPU:
The Ethos-U55 NPU accelerates inference by 109 times!
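As an aside on preprocessing: before each BlazeFace inference, the captured 320x240 frame is resized to 128x128 and its pixels are quantized to the model’s int8 input. The sketch below is illustrative and assumes the model expects pixels normalized to [0, 1]; the actual scale and zero point must be read from the input tensor’s quantization parameters.

```cpp
#include <cstddef>
#include <cstdint>

// Quantize 8-bit RGB pixels into an int8 input tensor:
//   q = round(normalized_pixel / scale) + zero_point
// scale and zero_point come from the model's input tensor metadata.
void QuantizeRgbInput(const uint8_t* rgb, size_t num_values, float scale,
                      int32_t zero_point, int8_t* out) {
  for (size_t i = 0; i < num_values; ++i) {
    float normalized = rgb[i] / 255.0f;  // assumed [0, 1] normalization
    int32_t q = static_cast<int32_t>(normalized / scale + 0.5f) + zero_point;
    if (q < -128) q = -128;  // clamp to the int8 range
    if (q > 127) q = 127;
    out[i] = static_cast<int8_t>(q);
  }
}
```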
The MediaPipe Face Mesh model requires an RGB 192x192 image of a cropped face with 25% padding as its input and performs 36.8 million MAC operations per inference. A 16-bit floating-point version of the model was downloaded from GitHub and converted to an 8-bit quantized model with the tflite2tensorflow tool.
The table above summarizes the RAM and flash memory requirements, and compares the inference latency of the Cortex-M55 CPU alone versus the Cortex-M55 CPU with the Ethos-U55 NPU:
The Ethos-U55 NPU accelerates inference by 103 times.
Offloading ML computation to the Ethos-U55 allows this application to perform over 10 inferences per second. If it were deployed to the Cortex-M55 CPU alone, the application would only be able to perform an inference every 8 seconds when a face is visible.
This application captures a 320x240 image from the camera module and then estimates 17 key points of a person’s pose for each person detected in the image. It can be used as a feature extraction layer for applications that need to detect falls or movement, or as an input to a human-machine interface.
Play the video below to see the models in action on the development board; the application achieves just over 10 FPS when using the Ethos-U55 NPU.
Recorded demo of Pose Estimation running with the Ethos-U55 enabled
The YOLOv8n-pose model was exported to an 8-bit quantized TensorFlow Lite model with a 256x256 RGB input using DeGirum's fork of the Ultralytics YOLOv8 GitHub repository. DeGirum’s modifications optimize the exported model for microcontrollers by removing the transpose operations and separating the model’s seven outputs, which improves accuracy under quantization. The model requires 728 million MAC operations per inference.
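Each of the seven output tensors carries its own scale and zero point, which is why separating them improves quantized accuracy: every output gets quantization parameters fitted to its own value range rather than sharing one scale. Reading a value back out is then a one-liner (a generic sketch, not code from the application):

```cpp
#include <cstdint>

// Dequantize an int8 output value using its tensor's quantization
// parameters: real = scale * (q - zero_point).
inline float Dequantize(int8_t q, float scale, int32_t zero_point) {
  return scale * static_cast<float>(static_cast<int32_t>(q) - zero_point);
}
```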
The Ethos-U55 NPU accelerates inference by 611 times! Offloading ML computation to the Ethos-U55 allows this application to perform over 10 inferences per second. If it were deployed to the Cortex-M55 CPU alone, the application would only be able to perform an inference once every 62 seconds. The reduced inference latency allows your application to react to human movement much faster.
This blog demonstrates that ML models developed for mobile applications, requiring tens to hundreds of millions of multiply-accumulate (MAC) operations per inference, can be deployed to a modern microcontroller equipped with an Arm Ethos-U55 NPU. These models require significantly more MAC operations, RAM, and flash memory than the example applications included in the TFLM library. Using the NPU enables the applications to perform multiple inferences per second, versus a single inference every few seconds, or even every minute, without it. This enables applications to react much faster to the surrounding environment while running one or more ML models that are more complex than those used in early TinyML applications.
Learn more:
The applications shown were based on the work of the Himax team, which ported the Face Mesh and YOLOv8n-pose models to the Himax WiseEye2 HX6538 microcontroller. Their work inspired me to create a compact, portable, battery-operated device with a TFT display and rechargeable battery.