Arm has worked with Meta to introduce support for Arm platforms in ExecuTorch, a new end-to-end solution for enabling on-device AI for PyTorch.
At Arm, we are big proponents of efficient and easy development of AI workloads. We have worked hard to get the latest and greatest models from PyTorch running on our platforms.
Historically, while PyTorch has been the platform of choice for many new neural networks coming out of research teams, converting these workloads into something that runs efficiently on Arm platforms has been a labor-intensive, manual process. This has been due to limitations in export flows and a long tail of ML operators, which make mapping to embedded and resource-constrained systems difficult.
With the introduction of ExecuTorch, a codebase released by Meta that builds on the significant developments of PyTorch 2.0, it's now a lot easier to capture and run state-of-the-art networks on anything from Arm CPUs in the server space, to Arm CPUs and GPUs in the mobile space, to Cortex-M microprocessors and Ethos-U NPUs in embedded applications.
We have worked closely with Meta to introduce preliminary support for our devices into ExecuTorch, building on the significant investments we have already made in Tensor Operator Set Architecture (TOSA) to capture neural networks, and our Ethos NPUs that accelerate key ML workloads on mobile and embedded platforms.
Today we've released a TOSA compilation flow and runtime delegate for ExecuTorch, with limited prototype support for the Ethos-U55, enabling graphs to be exported directly from the PyTorch Python environment to Ethos-U enabled platforms such as the Corstone-300.
We’re looking forward to continuing this work to enable a capable export path across a range of machine learning use cases.
For typical users of PyTorch, this means that a small but growing list of networks can be captured as standalone models which run efficiently on Cortex-M and Ethos-U enabled platforms.
There are two main components:
1) An ahead-of-time flow which allows neural networks to be captured as a standalone file. The image below is a simplified export view. See the 'Exporting to ExecuTorch' documentation for more information.
2) An on-device ExecuTorch runtime which dispatches work to Cortex-M and Ethos-U.
In addition, the ExecuTorch API provides functionality to partially delegate graphs, alongside a full set of CPU operators, allowing for incremental offload to the Ethos-U. This is hugely beneficial when developing networks and transitioning them to edge devices.
For example, the TOSA/Ethos-U55 flow does not need to compile the entire graph at once, and support can be progressively added. This is especially useful for large and complex graphs.
This flexibility has direct benefits for PyTorch users. With the consistent ExecuTorch API, users can compile and deploy their models while, under the hood, as much of the graph as possible is delegated to the powerful Ethos-U55 and the rest continues to execute on the Arm CPU. This improves the developer experience, allowing fast iteration and broad model coverage, and delivers performance gains without requiring any changes to user code.
It can be as simple as adding a small number of lines to an existing PyTorch model in Python, which will capture a full representation of the model as a .pte file, as sketched below.
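To make that concrete, here is a minimal sketch of the ahead-of-time flow. The exact module paths, and in particular the commented-out ArmPartitioner name, are assumptions that vary between ExecuTorch versions; consult the 'Exporting to ExecuTorch' documentation for the flow that matches your release.

```python
import torch
from executorch import exir

class SimpleModel(torch.nn.Module):
    def forward(self, x):
        return torch.nn.functional.relu(x + 1.0)

model = SimpleModel().eval()
example_inputs = (torch.randn(1, 4),)

# 1. Capture the model with the PyTorch 2 export machinery.
exported_program = torch.export.export(model, example_inputs)

# 2. Lower to the Edge dialect. A backend partitioner (for example the
#    Arm/TOSA one; the name below is hypothetical) can then claim the
#    subgraphs it supports, with everything else falling back to CPU operators.
edge = exir.to_edge(exported_program)
# edge = edge.to_backend(ArmPartitioner(compile_spec))  # hypothetical name

# 3. Serialize the program to a standalone .pte file for the on-device runtime.
with open("simple_model.pte", "wb") as f:
    f.write(edge.to_executorch().buffer)
```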
Critically for new model development, this flow will be developed to work both for standard networks and for custom torch.nn.Module subclasses written to process other workloads.
For more comprehensive examples and documentation on deploying this on the Arm Corstone-300 or other Ethos-enabled platforms, we've prepared an example application which embeds the exported .pte file and executes it on the Ethos-U55 NPU.
Many Arm platforms need quantization to realize the full performance benefits of Ethos acceleration or new CPU instructions, so we're working to introduce TOSA-compliant quantization alongside graph consumption. With the ExecuTorch EXIR graph infrastructure, we were able to easily parse the whole quantized graph and read all of the quantization information required to calculate the rescale parameters when lowering quantized operators to TOSA.
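As an illustration, the sketch below walks a PT2E-quantized FX graph and collects per-tensor quantization parameters; it is a simplified sketch rather than the Arm backend's actual implementation, and deriving the TOSA RESCALE multiplier and shift values from these scales involves additional fixed-point arithmetic not shown here.

```python
import torch

def collect_qparams(graph_module: torch.fx.GraphModule):
    """Collect per-tensor scale/zero-point pairs from a converted PT2E graph."""
    qparams = {}
    for node in graph_module.graph.nodes:
        # Matches both quantize_per_tensor and dequantize_per_tensor nodes,
        # which share the (input, scale, zero_point, qmin, qmax, dtype) signature.
        if node.op == "call_function" and "quantize_per_tensor" in str(node.target):
            _, scale, zero_point, qmin, qmax, dtype = node.args
            qparams[node.name] = {
                "scale": scale,
                "zero_point": zero_point,
                "qrange": (qmin, qmax),
                "dtype": dtype,
            }
    return qparams
```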
For the initial quantization support, lowering PyTorch FX graphs to TOSA, we used the new PyTorch 2 Export post-training quantization to produce a quantized MobileNetV2, which covers most of the common operators such as Add, Linear, Convolution, and ReLU. Given its very simple API, we were able to achieve the quantization step in around 3-4 lines of code with a TOSA quantization flow subclassed from the XNNPACK quantizer. To support more complicated graphs, a custom quantizer class can also be constructed to satisfy the majority of neural network models.
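The sketch below shows that flow using the stock XNNPACKQuantizer; the TOSA quantization flow described above subclasses it, and its exact class name and import path are version-dependent details we don't reproduce here.

```python
import torch
import torchvision
from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e
from torch.ao.quantization.quantizer.xnnpack_quantizer import (
    XNNPACKQuantizer,
    get_symmetric_quantization_config,
)

model = torchvision.models.mobilenet_v2(weights="DEFAULT").eval()
example_inputs = (torch.randn(1, 3, 224, 224),)

# NOTE: the graph-capture entry point has moved between PyTorch releases
# (capture_pre_autograd_graph, export_for_training, ...); adjust as needed.
captured = torch.export.export_for_training(model, example_inputs).module()

# The quantization step itself is only a few lines.
quantizer = XNNPACKQuantizer().set_global(get_symmetric_quantization_config())
prepared = prepare_pt2e(captured, quantizer)
prepared(*example_inputs)  # calibrate with representative inputs
quantized = convert_pt2e(prepared)
```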
These graphs naturally feed into the Ethos-U compiler to produce models which can directly execute on the Ethos-U hardware.
The simple MobileNetV2 example (PyTorch → TOSA → Vela) has shown the feasibility and simplicity of running quantized PyTorch models on TOSA-compliant neural network accelerator hardware. As next steps, Arm, Meta, and the wider PyTorch community will continue to add support for operator lowerings from PyTorch to TOSA. Alongside this, Arm continues to develop new flows which support TOSA, providing a reliable way to add new compilation targets.
Take a look at the demo we’ve provided which gives a glimpse at the possibilities of using PyTorch and ExecuTorch to export graphs to Arm platforms with TOSA, and let us know your thoughts.
Arm is looking forward to adding more networks and operators, working closely with Meta to extend support for the full set of ExecuTorch features. We look forward to a future where machine learning models can be easily developed with PyTorch and ExecuTorch and, via TOSA, seamlessly deployed to billions of Arm-based devices.