Unlocking AI on Arm Microcontrollers with Deep Learning Model Optimization

April 21, 2020

*** All content in this blog provided by Davis Sawyer, Co-founder & CPO at Deeplite.ai ***

The emergence of AI and deep learning on embedded devices and platforms has created opportunities for exciting new ways to make products more intelligent. In domains such as computer vision and natural language, deep neural networks (DNNs) have become the de facto tool for performing complex tasks, even outperforming humans at recognizing objects in images. At the same time, DNNs have become much more complicated and computationally demanding in recent years as they take on ever more ambitious use cases such as semantic segmentation and facial recognition. This has rendered many state-of-the-art model architectures impractical for everyday devices. For the billions of microcontrollers (MCUs) currently in use, this complexity is what ultimately prevents people from running AI on their devices.

It is no secret that deep learning has a size problem. For example, MegatronLM, a massive transformer model for language tasks, weighs in at over 8 billion parameters (that is 33GB of memory) and requires roughly 500 V100 GPUs running for 9 days to train. Although this is an extreme example, most modern DNN models demand far more resources than the low-power computing hardware that is prevalent all around us in edge devices such as phones, cars, and sensors can provide.
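The 33GB figure follows from simple arithmetic on the parameter count. The sketch below assumes 8.3 billion parameters (the commonly cited size for this model, not stated exactly in this post) stored as 32-bit floats; the function name is illustrative.

```python
def weight_memory_bytes(num_params: int, bytes_per_param: int = 4) -> int:
    """Memory needed to store the weights alone (no activations or optimizer state)."""
    return num_params * bytes_per_param

# Assumption: 8.3 billion parameters, each a 4-byte FP32 value.
megatron_params = 8_300_000_000
megatron_gb = weight_memory_bytes(megatron_params) / 1e9

print(f"{megatron_gb:.1f} GB")
```

The same function applies to any model: halving `bytes_per_param` (for example, FP16 instead of FP32) halves the weight storage without changing the parameter count.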

Figure 1: The growth of deep learning model complexity

For most deep learning teams, the initial focus is on creating a model that achieves high accuracy on their use case. Training a model to return a desired result with high accuracy is an accomplishment in and of itself. Model size, latency, and power consumption considerations come next. These constraints depend on the available hardware and are very difficult to meet without degrading the accuracy of the original model. This has created a chasm when it comes to production deployment of deep learning models, because multiple design metrics must be balanced before a model can run practically on the target hardware.

Fortunately, there is a solution. These barriers can be overcome with effective deep learning model optimization. Model optimization not only enables AI engineers to rapidly create highly compact, high-performance models, it also allows AI teams to deploy those models on the highly effective processors readily available for battery-powered and resource-limited devices like Arm Cortex-M MCUs. Model optimization enables mission-critical and real-time tasks, previously dependent on cloud connection and server-class hardware, to run locally within a standard MCU, thereby driving significant improvements in throughput and inference latency.

To demonstrate the impact of model optimization, let us consider the following deployment in a smart factory setting where we are using a low-power camera, Arm MCU and a convolutional neural net (CNN) to classify positive and negative images of product samples for quality control:

Arm Cortex-M4 (256KB of on chip memory, 1MB of flash)
MobileNetv1 CNN trained on proprietary binary classification dataset
Target optimization metrics are model size (in MB) and accuracy.

Figure 2: Optimization workflow for DNN deployment on Arm hardware

Today, the most common way to optimize a deep learning model is through a costly process of trial and error. Engineers can spend weeks to months applying pruning, hyperparameter tuning, quantization or, more often, some combination of these to find a commercially practical model for their hardware. At Deeplite, we are breaking the mold with an automated, push-button optimization process for DNN models.

Users simply reference their pretrained model, a dataset, and some constraints (like size or accuracy in this example) and press run. Our on-premise software engine uses proprietary design space exploration algorithms to converge efficiently on a new model architecture optimized for the specified constraints of their deployment.

What used to take weeks to months of manual effort can now be completed in a few hours or days with one easy-to-use software engine. This is extremely valuable for a manufacturing process that can require either an individual or a multi-step approach to inspect and ensure product quality. Often, defects are not detected until the end of the process, but automated optical inspection with rapid throughput of DNN models enables early detection (as highlighted within Step 2 of the following smart manufacturing process diagram).

1. Subjective and/or manual inspection of intermediates

2. Smart Camera: low-power camera and MCU enables automated optical analysis of product quality.

3. QR/barcode tracking of final product

Figure 3: Automated inspection pipeline for a smart manufacturing process.

A multi-objective approach to model optimization allows engineers to focus on accuracy and use Deeplite to create a production-ready model for inference. Focusing on Step 2 above, the initial MobileNetv1 model (approximately 12.8MB and 92% accurate on the validation dataset) must run on the low-power camera with the Arm Cortex-M4. The AI engineer therefore needed a model whose parameters fit within the 256KB of on-chip memory, with less than a 2% loss in Top-1 accuracy.
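These requirements translate into two concrete targets: a minimum compression ratio and an accuracy floor. The sketch below derives both from the figures quoted above, assuming "12.8MB" means 12,800,000 bytes and "256KB" means 256 * 1024 bytes; the variable names are illustrative.

```python
KIB = 1024  # assumption: "256KB" means 256 * 1024 bytes

initial_size_bytes = 12_800_000   # "approximately 12.8MB" from the text
memory_budget_bytes = 256 * KIB   # on-chip memory of the Cortex-M4 part
initial_top1 = 92.0               # percent, on the validation dataset
max_top1_drop = 2.0               # percentage points

# Minimum compression ratio for the weights alone to fit the budget.
required_ratio = initial_size_bytes / memory_budget_bytes
# Lowest Top-1 accuracy the optimized model is allowed to reach.
accuracy_floor = initial_top1 - max_top1_drop

print(f"need at least {required_ratio:.1f}x smaller, Top-1 above {accuracy_floor:.1f}%")
```

This counts weights only. Activations and runtime buffers also compete for on-chip memory, so the practical size target is likely tighter than this lower bound.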

To achieve such accuracy retention and size reduction, methods like pruning, hyperparameter tuning and basic quantization just do not cut it. This is often because these techniques do not fully explore the network design space (number of filters, layers, operations, kernel size) required to find a feasible solution for real-world tasks. Industry applications require a repeatable, reproducible method of architecture search.
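To make these design-space knobs concrete, here is a minimal sketch of how filter count and kernel size drive the weight count of a depthwise-separable convolution, the building block of MobileNetv1. The function name and the layer shapes are hypothetical and chosen only for illustration; this is not Deeplite's search procedure.

```python
def separable_conv_params(in_channels: int, out_channels: int, kernel_size: int) -> int:
    """Weights in one depthwise-separable block (biases and batch norm ignored)."""
    depthwise = kernel_size * kernel_size * in_channels  # one k x k filter per input channel
    pointwise = in_channels * out_channels               # 1 x 1 convolution that mixes channels
    return depthwise + pointwise

# Hypothetical layer shapes, for illustration only.
baseline = separable_conv_params(in_channels=512, out_channels=512, kernel_size=3)
narrower = separable_conv_params(in_channels=256, out_channels=256, kernel_size=3)

print(baseline, narrower, round(baseline / narrower, 2))
```

Halving the channel width cuts this block's weights by nearly 4x because the pointwise term scales with the product of input and output channels. With every layer offering several such choices, the number of candidate architectures grows quickly, which is why manual tuning rarely covers the space.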

The Deeplite engine implements unique attention mechanisms to parse a pre-trained model architecture and identify meaningful sensitivities and network transformations that drastically reduce the design space required to find an acceptable solution. In addition, applying knowledge distillation helps preserve accuracy for high-fidelity and mission-critical use cases.
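The post does not specify how knowledge distillation is applied, so the following is only a sketch of the widely used soft-target formulation (a temperature-softened KL term plus ordinary cross-entropy), not necessarily what the Deeplite engine implements. The function names, the logits, and the `temperature` and `alpha` values are illustrative assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, label, temperature=4.0, alpha=0.5):
    """Weighted sum of a soft-target KL term and a hard-label cross-entropy term."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # KL(teacher || student) on temperature-softened distributions, scaled by T^2.
    kl = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student) if t > 0)
    soft_term = (temperature ** 2) * kl
    # Ordinary cross-entropy against the ground-truth class index.
    hard_term = -math.log(softmax(student_logits)[label])
    return alpha * soft_term + (1 - alpha) * hard_term

# Hypothetical logits for a binary good/defective classifier.
teacher = [3.0, -1.0]
print(distillation_loss([2.5, -0.5], teacher, label=0))  # student agrees with the teacher
print(distillation_loss([-0.5, 2.5], teacher, label=0))  # student disagrees
```

The first call returns a much smaller loss than the second, so minimizing this loss pushes the compact student model toward the larger teacher's output distribution as well as the correct label.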

Our engine can rapidly converge on a new network design using deterministic approaches. It reduces the design space based on the user’s defined constraints and creates a new, highly compact network. In this case, the Deeplite engine automatically found and returned a new architecture of roughly 144KB with only a 1.84% drop in Top-1 accuracy, as well as a significant reduction in the number of MAC operations.

Furthermore, we found that the optimized model generalized better on unseen data, as training an initial large model and then applying Deeplite’s optimization engine has a regularizing effect on the model. We were able to exceed the metrics required for deployment and enable the user to run their classification model locally on the Arm Cortex-M4. By performing inference directly on the MCU, the user achieved significant savings in key areas such as latency, bandwidth, and inference cost.

Additionally, Deeplite is interoperable with AI frameworks like PyTorch, TensorFlow and ONNX, as well as low-level tools like Arm NN and CMSIS-NN.

This enables design teams to easily port their model to new hardware or stay within their existing configuration. In partnership with Arm, Deeplite’s approach to deep learning model optimization enables AI teams to leverage highly efficient hardware such as the Arm Cortex-M4 for their AI tasks at the edge.

By automating the development cycle for production-ready model architectures, Deeplite can also drastically accelerate time-to-market and enhance the productivity of model development teams through multi-objective optimization. Lastly, by coupling model optimization with Arm’s low-power hardware like the Cortex-M3, M4 or M55, users benefit from an unprecedented force multiplier in terms of throughput and energy savings.

Model	Size (bytes)	GMACs	Parameters (millions)	Top-1 accuracy
Initial	12836104	0.583	3.21	92.443%
Optimized	144186 (89.04x)	0.112 (5.21x)	0.14 (22.26x)	90.607% (-1.84%)

Table 1: Summary of optimization results for MobileNetv1 on proprietary dataset.
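The figures in Table 1 can be checked against the deployment constraints with a few lines of arithmetic. This is a minimal sketch using only the table values; the dictionary layout and variable names are illustrative, and 256KB is assumed to mean 256 * 1024 bytes.

```python
# Values copied from Table 1.
initial = {"size_bytes": 12_836_104, "gmacs": 0.583, "top1": 92.443}
optimized = {"size_bytes": 144_186, "gmacs": 0.112, "top1": 90.607}

size_ratio = initial["size_bytes"] / optimized["size_bytes"]
mac_ratio = initial["gmacs"] / optimized["gmacs"]
top1_drop = initial["top1"] - optimized["top1"]  # percentage points

# Deployment constraints from the text (256KB assumed to be 256 * 1024 bytes).
fits_on_chip = optimized["size_bytes"] <= 256 * 1024
within_accuracy_budget = top1_drop < 2.0

print(f"size: {size_ratio:.1f}x, MACs: {mac_ratio:.2f}x, Top-1 drop: {top1_drop:.2f} points")
print(f"fits in 256KB: {fits_on_chip}, accuracy budget met: {within_accuracy_budget}")
```

Both checks pass. The recomputed size ratio comes out at about 89.0x, within rounding of the reported figure.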

Figure 4: Sample binary classification images for quality inspection

The result of highly optimized models and new AI use cases is rapidly unfolding into the next generation of intelligent products. Deeplite empowers industry leaders in smart manufacturing, automotive, and consumer devices to break beyond conventional limits and deploy their AI models where no AI teams have gone before. Together, Deeplite and Arm are working hand in hand to unlock the potential of edge IoT by enabling on-device inference on MCUs. For more details or to get in touch:

Visit Deeplite.ai
