Arm is forging a path to the future with solutions designed to support the rapid development of AI. One challenge is to make the emerging technology available to the community. In this blog, we present the Arm ML Inference Advisor (Arm MLIA) and show you how it is used to improve model performance on Arm IP. We also explain some of the work leading up to it, and why it matters.
Designing neural networks is a challenge; ask anyone who has done it. You need to understand a number of complex concepts to get it right. In the ML space, many are familiar with high-level APIs such as TensorFlow and PyTorch. These powerful tools help us set up a pipeline for our use cases: training, tweaking, and generating the runtime. When the model is compiled for deployment, the assumption is often that that's the end of the story: you tuned the model parameters during training, so your ML pipeline is optimized. But what happens when you deploy the model on a hardware target? Can we influence performance at the processor level? Today we are here to learn the rest of that story.
Low-level processor architecture is not an easy topic. It's easy to accept that it is important, but harder to understand why. Learning about the hardware that runs the inference may not be a priority for ML developers, while embedded software developers may struggle to understand the machine learning model optimization space. With the Arm ML Inference Advisor, we aim to close that gap and make Arm ML IP accessible to developers at all levels of abstraction. Before getting into the tool's capabilities, let's spend some time understanding the hardware perspective.
At Arm, we work hard to enable maximum performance, power and area efficiency for ML inference from Cloud to Edge to Endpoint.
Aspects like these can make a significant difference in the performance of a neural network. An example is the Arm® Ethos® processor, an NPU (Neural Processing Unit) that is paired with an Arm® Cortex®-M processor. This NPU is designed to run machine learning networks on embedded devices, which means your network performs best when as many operators as possible run on the NPU. For example, the Ethos®-U65 combined with the AI-capable Cortex®-M55 processor provides a 1916x uplift in ML performance over existing Cortex®-M CPUs for the quantized keyword spotting MicroNet model.
Now that we have established why model optimization for hardware targets matters, let's talk about how you can get started with enhancing the performance of your neural network.
Arm MLIA is a tool used to analyze how neural networks run on Arm, and to apply optimizations to the given model. It sprang from a need to gather Arm's efforts in these areas into one tool and make them available to a wide range of developers with varying skill sets. The two main inputs are a model file (Keras or TensorFlow Lite) and the hardware configuration you intend to deploy on. Arm MLIA analyzes the combination and generates advice on how to improve the model. It uses two base commands, check and optimize. The first lets you inspect those main parameters and what their combination could mean for inference. The second applies optimizations to the model. The following image describes how these capabilities are applied.
[Image: the check and optimize command workflows]
All targets are simulated using a backend. A backend can be explained as technology capable of emulating the behavior of the target hardware, predicting its performance, or both.
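Before looking at concrete examples, it may help to see the overall shape of the command line. The sketch below is illustrative only, and the angle-bracket values are placeholders; run mlia --help (or mlia check --help and mlia optimize --help) on your installation for the authoritative set of options.

# List the available sub-commands and global options
mlia --help

# General shape of the two base commands (illustrative; <model> and <target> are placeholders)
mlia check <model> --target-profile <target> --compatibility
mlia optimize <model> --target-profile <target> --pruning --clustering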
Let's have a look at what that can look like. We focus on two use cases here, using the Ethos-U55 as the target for the DS-CNN Small keyword spotting model from the Hello Edge paper. It is available in the Arm ML Zoo, a collection of machine learning models optimized for Arm IP.
Analyzing a neural network for performance is all about identifying the bottlenecks. Arm MLIA provides an operator compatibility report for most targets supported by the tool. This means it identifies any operators in the network that do not have an optimized implementation on the given target and are therefore at risk of slowing down your inference. A compatibility table communicates which operators are able to run on the NPU. The remaining operators fall back to the software implementation that runs on the CPU, resulting in lower performance. By replacing those operators, the inference will run faster. You can check and compare the performance with a separate command, shown further below.
mlia check -t ethos-u55-256 \
    ../ML-zoo/models/keyword_spotting/ds_cnn_small/model_package_tf/model_archive/model_source/saved_model/ds_cnn_small/ \
    --compatibility
The command creates a compatibility table, which displays on which IP each layer will run. Additionally, the ratio of NPU-compatible to non-compatible operators is reported. In this example, we see that the NPU supports 100% of the operators in the DS-CNN Small model, making it well suited to run on the given target.
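To compare how the model actually performs on the target, the check command can also produce a performance estimate. To the best of our understanding of the current CLI this is done with the --performance flag (verify with mlia check --help on your installation), for example:

mlia check -t ethos-u55-256 \
    ../ML-zoo/models/keyword_spotting/ds_cnn_small/model_package_tf/model_archive/model_source/saved_model/ds_cnn_small/ \
    --performance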
Arm MLIA offers a workflow for applying optimization techniques to the given model. It can be used to try different combinations of hardware with your model, and to see what type of optimizations would benefit your use case. Here we show an example with pruning, clustering, and quantization. In the end, MLIA outputs the optimized model*, along with a performance report on how much improvement you can expect to see as a result. This end-to-end approach lowers the barrier for developers to apply hardware optimization techniques to their networks, without needing hardware access. This workflow can result in as much as 1.2-2x improved model performance. At the same time, it can reduce the model size by up to 4 times thanks to quantization, while maintaining model accuracy.
Here is an example:
# Custom optimization parameters: pruning=0.6, clustering=16
mlia optimize ../ML-zoo/models/keyword_spotting/ds_cnn_small/model_package_tf/model_archive/model_source/saved_model/ds_cnn_small/ \
    --target-profile ethos-u55-256 \
    --pruning \
    --pruning-target 0.6 \
    --clustering \
    --clustering-target 16
The performance uplift is displayed in a table, and at the very end further advice is offered. A Keras model is currently required as input for the pruning and clustering optimizations, while the check command also accepts TensorFlow Lite models.
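If you want to inspect the optimized model on its own afterwards, one option is to point the check command at the TensorFlow Lite file that the optimize step produces. The path below is only a placeholder; use the location reported in the tool's own output:

# <optimized_model.tflite> is a placeholder for the file produced by `mlia optimize`
mlia check -t ethos-u55-256 <optimized_model.tflite> --performance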
*Note that the optimized model will not preserve accuracy and is meant to be used for further development and debugging.
We hope that we have piqued your interest in model optimization on Arm IP and in taking your neural networks to the next level. At Arm, we work hard to support tomorrow's AI technology and to make it available to all developers who contribute to that mission. One part of this is providing the right tools. As we approach the end of this blog, we hope you would like to try it out for yourself. Arm MLIA is open source and available through pip.
pip install mlia
To run the commands mentioned above, you can download the models from the Arm Model Zoo.
git clone https://github.com/ARM-software/ML-zoo.git
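The model files in this repository are stored with Git Large File Storage (LFS), as the note below explains. If LFS is not yet set up on your machine, a typical sequence looks like this (standard Git LFS commands, not specific to MLIA):

# Enable Git LFS support for your Git installation
git lfs install
# Inside the cloned repository, fetch the model files tracked by LFS
cd ML-zoo && git lfs pull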
Note: For the files in the Model Zoo to be cloned correctly, you may need to configure Git Large File Storage (LFS), as sketched above.

We are dedicated to improving the Arm MLIA. Future work includes adding more types of Arm IP to the tool and automating some of the optimization advice. New suggestions and feedback are always welcome. Get in touch if you need help or have questions: send an email to mlia@arm.com or use the AI and ML forum, marking your post with the MLIA tag.