Arm is forging a path to the future with solutions designed to support the rapid development of AI. One challenge is to make the emerging technology available to the community. In this blog, we present the Arm ML Inference Advisor (Arm MLIA) and show you how it is used to improve model performance on Arm IP. We also explain some of the work leading up to it, and why it matters.
Designing networks is a challenge; ask anyone who has done it. You need to understand a number of complex concepts to get it right. In the ML space, many are familiar with high-level APIs such as TensorFlow and PyTorch. These powerful tools help us set up a pipeline for our use cases: training, tweaking, and generating the runtime. When the model is compiled for deployment, the assumption is that that's the end of the story. You did the work to tune the model parameters during training, and now your ML pipeline is optimized. But what happens when you deploy the model on a hardware target? Can we influence performance at the processor level? Today we are here to learn the rest of that story.
Low-level processor architecture is not an easy topic. It's easy to accept that it is important, but challenging to understand why. Learning about the hardware that runs the inference may not be a priority for ML developers. At the same time, embedded software developers may struggle to understand the machine learning model optimization space. With the Arm ML Inference Advisor, we aim to close that gap and make Arm ML IP available to developers at all levels of abstraction. Before getting into the tool's capabilities, let's spend some time understanding this hardware perspective.
At Arm, we work hard to enable maximum performance, power and area efficiency for ML inference from Cloud to Edge to Endpoint.
These aspects can make a significant difference to the performance of a neural network. An example is the Arm® Ethos® processor, an implementation of a Neural Processing Unit (NPU) that is paired with an Arm® Cortex®-M processor. This NPU is designed to run machine learning networks on embedded devices, meaning that your network performs best when as many operators as possible run on the NPU. For example, the Ethos®-U65 combined with the AI-capable Cortex®-M55 processor provides a 1916x uplift in ML performance over existing Cortex®-M CPUs for the quantized keyword spotting MicroNet model.
Now that we have established why model optimization for hardware targets matters, let's talk about how you can get started with enhancing the performance of your neural network.
Arm MLIA is a tool for analyzing how neural networks run on Arm IP and for applying optimizations to a given model. It sprang from a need to gather Arm's efforts in these areas into one tool and make them available to a wide range of developers with varying skill sets. The two main inputs are a model file (Keras or TensorFlow Lite) and the configuration of the hardware you intend to deploy on. Arm MLIA analyzes the combination and generates advice on how to improve the model. It has two base commands, check and optimize. The first lets you look at those main parameters and what the combination could mean for inference; the second applies optimizations to the model. These capabilities are applied as follows:
check: reports which operators in the model are compatible with the target and estimates inference performance.
optimize: applies optimizations such as pruning and clustering to the model and reports the resulting improvement.
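Once the tool is installed (installation is covered at the end of this post), the built-in help is a quick way to see which options each command accepts; the exact set of flags depends on the version you have installed.

mlia check --help
mlia optimize --help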
All targets are simulated using a backend. A backend can be explained as technology that is capable of emulating the behavior of the target hardware, predicting its performance, or both.
Let's have a look at what that can look like in practice. We focus on two use cases here, using the Ethos-U55 as the target for the DS-CNN Small keyword spotting model from the Hello Edge paper. It is available in the Arm ML-Zoo, which is a collection of machine learning models optimized for Arm IP.
Analyzing a neural network for performance is all about identifying the bottlenecks. Arm MLIA provides an operator compatibility report for most targets supported by the tool. This means that we identify any operators in the network that don't have an optimized implementation on the given target and are therefore at risk of slowing down your inference. A compatibility table shows which operators can run on the NPU; the remaining operators fall back to the software implementation on the CPU, resulting in lower performance. By replacing those operators, the inference runs faster. You can then check and compare the performance with a separate command, shown after the compatibility example below.
mlia check -t ethos-u55-256 \
  ../ML-zoo/models/keyword_spotting/ds_cnn_small/model_package_tf/model_archive/model_source/saved_model/ds_cnn_small/ \
  --compatibility
The command creates a compatibility table, which displays which IP each layer will run on. Additionally, the ratio of NPU-compatible to non-compatible operators is reported. In this example, we see that the NPU supports 100% of the operators in the DS-CNN Small model, making it well suited to run on the given target.
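To get the corresponding performance estimate, the same check command can be run with the performance flag; the flag shown here reflects recent MLIA releases and may differ in older versions.

mlia check -t ethos-u55-256 \
  ../ML-zoo/models/keyword_spotting/ds_cnn_small/model_package_tf/model_archive/model_source/saved_model/ds_cnn_small/ \
  --performance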
Arm MLIA offers a workflow for applying optimization techniques to the given model. It can be used to try different combinations of hardware with your model and to see what type of optimizations would benefit your use case. Here we show an example with pruning, clustering, and quantization. At the end, MLIA outputs the optimized model*, along with a performance report on how much improvement you can expect to see as a result. This end-to-end approach lowers the barrier for developers to apply hardware-aware optimization techniques to their networks without needing access to hardware. This workflow can improve model performance by as much as 1.2-2x and, thanks to quantization, reduce model size by up to 4x while maintaining model accuracy.
Here is an example:
# Custom optimization parameters: pruning=0.6, clustering=16
mlia optimize ../ML-zoo/models/keyword_spotting/ds_cnn_small/model_package_tf/model_archive/model_source/saved_model/ds_cnn_small/ \
  --target-profile ethos-u55-256 \
  --pruning \
  --pruning-target 0.6 \
  --clustering \
  --clustering-target 16
The performance uplift is displayed in a table, and at the very end further advice is offered. A Keras model is currently required as input for the pruning and clustering optimizations, while the check command also supports TensorFlow Lite.
*Note that the optimized model will not preserve accuracy and is meant to be used for further development and debugging.
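If you want to compare the optimized model against the original yourself, the TensorFlow Lite file that MLIA writes out can be fed back into the check command shown earlier. The path below is a placeholder for wherever the tool reports it saved the optimized model.

mlia check -t ethos-u55-256 <path-to-optimized-model>.tflite --performance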
We hope that we have piqued your interest in model optimization on Arm IP and in taking your neural networks to the next level. At Arm, we work hard to support tomorrow's AI technology and to make it available to all developers who contribute to that mission. One part of this is providing the right tools. As we get closer to the end of this blog, we hope you will try it out for yourself. Arm MLIA is open source and available through pip.
pip install mlia
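Installing into a fresh virtual environment keeps MLIA and its dependencies separate from the rest of your system; a minimal sketch on Linux or macOS looks like this.

python3 -m venv mlia-venv
source mlia-venv/bin/activate
pip install mlia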
To run the commands mentioned above, you can download the models from the Arm Model Zoo.
git clone https://github.com/ARM-software/ML-zoo.git
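The model files in this repository are stored with Git Large File Storage, so if the cloned files show up as small pointer files instead of the actual weights, enabling LFS and pulling again usually resolves it; a minimal sketch, assuming the git-lfs package is already installed on your system.

git lfs install
cd ML-zoo
git lfs pull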
Note: For the files in the Model Zoo to be cloned correctly, you may need to configure Git Large File Storage (LFS), as described above.

We are dedicated to improving Arm MLIA. Future work includes adding more types of Arm IP to the tool and automating some of the optimization advice. New suggestions and feedback are always welcome. Get in touch if you need help or have questions: send an email to mlia@arm.com or use the AI and ML forum, marking your post with the MLIA tag.