This blog post is published on behalf of Per Åstrand and Fredrik Knutsson
AI is getting leaner. No longer confined to the cloud or powerful smartphones, the next generation of intelligence is moving into the tiniest devices: smart sensors, wearables, and industrial systems that run on milliwatts and kilobytes. These environments are unforgiving because every cycle and every byte counts. Bringing modern AI into this space has long required trade-offs between accuracy, efficiency, and developer productivity.
With the General Availability (GA) release of ExecuTorch 1.0, those trade-offs are beginning to disappear. As part of the PyTorch ecosystem, ExecuTorch bridges the gap between innovation and embedded deployment, empowering developers to run state-of-the-art models on Arm-based edge devices, from power-efficient microcontrollers paired with Arm Ethos-U NPUs to high-performance industrial solutions built on Arm CPUs.
ExecuTorch brings PyTorch's strengths directly to the edge, built on three guiding principles:

- Portability: the same model runs across a wide range of devices and platforms.
- Productivity: developers stay in the PyTorch toolchain from authoring through deployment.
- Performance: a lightweight runtime that takes full advantage of the underlying hardware.
With ExecuTorch 1.0, deploying AI at the edge is not just possible, it is PyTorch end-to-end. No new frameworks, no conversion, just a direct, optimized path from research to production.
ExecuTorch streamlines the journey from PyTorch models to efficient embedded execution. As shown in the diagram below, models can be exported and lowered through different backend delegates that map onto the full spectrum of Arm edge IPs, from NPUs to Cortex-A and Cortex-M CPUs. This unified stack means developers can begin in PyTorch and seamlessly deploy across diverse Arm-based edge devices with consistent tooling, predictable performance, and a clear path from research to production.
Figure 1. How Arm's edge IPs are supported via ExecuTorch backend delegation.
The TOSA (Tensor Operator Set Architecture) 1.0 specification and tooling, released earlier this year, establishes a consistent foundation for deploying AI workloads across Arm NPUs and neural technology. Within ExecuTorch, the TOSA backend ensures predictable behavior by lowering the majority of edge operators (int8 and float32) into a common, portable form. Any operators not yet covered fall back to CPU execution through reference kernels. This unified path makes TOSA the backbone for scalable embedded AI, enabling seamless acceleration across Arm’s NPU family.
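To make the delegation and fallback behavior concrete, here is a minimal sketch that lowers a small model and prints the resulting graph. It reuses the EthosUPartitioner and EthosUCompileSpec names from the full example later in this post; the import paths are an assumption and may differ between ExecuTorch releases. Since Ethos-U executes int8 graphs, an unquantized float32 model like this one largely stays on the CPU path, which is exactly what inspecting the lowered graph reveals.

```python
import torch
from executorch.exir import to_edge_transform_and_lower
# Assumed import paths; see the full Ethos-U example later in this post.
from executorch.backends.arm.ethosu import EthosUCompileSpec, EthosUPartitioner

class SmallNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(3, 8, kernel_size=3)
        self.relu = torch.nn.ReLU()

    def forward(self, x):
        return self.relu(self.conv(x))

exported = torch.export.export(SmallNet().eval(), (torch.randn(1, 3, 32, 32),))
edge = to_edge_transform_and_lower(
    exported,
    partitioner=[EthosUPartitioner(EthosUCompileSpec("ethos-u85-256"))],
)
# Delegated subgraphs appear as lowered backend modules; any operator the
# backend cannot handle stays in the graph and falls back to CPU kernels.
print(edge.exported_program().graph_module)
```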
ExecuTorch 1.0 delivers production-quality support for the Ethos-U family of NPUs, designed specifically for ultra-low-power AI acceleration. Key highlights include:

- Int8 post-training quantization through the standard PT2E flow, using the EthosUQuantizer.
- Automatic delegation of supported operators to the NPU via the EthosUPartitioner, with CPU fallback for everything else.
- Support for advanced architectures, including transformer-based workloads, on Ethos-U85 targets.
This makes Ethos-U the most complete path today for running advanced PyTorch models at the microcontroller level. For example, transformer-based architectures like the Conformer model can now run on Ethos-U85-based edge devices, something unthinkable just a few years ago.
ExecuTorch 1.0 also introduces support for Arm’s upcoming neural technology, which will feature in 2026 Arm GPUs, through the new VGF backend. This enables ahead-of-time export and execution of neural networks that will power use cases like Neural Super Sampling (NSS), denoising, and ML-driven rendering on future Arm GPUs. While the details are covered in a dedicated blog post, the important takeaway is that the same ExecuTorch and TOSA infrastructure supporting Ethos-U today also extends to next-generation neural graphics acceleration. This ensures that developers can start experimenting now and be ready for what is coming next, all in the same Python and PyTorch-based development flow.
Arm’s Cortex-M CPUs are the backbone of the embedded world, shipping in billions of devices annually. With ExecuTorch integrating CMSIS-NN, even CPU-only inference benefits from optimized kernels. Accelerated support for Cortex-M is already available, with further improvements underway to expand efficiency across the family.
For higher-performance Linux-based platforms, ExecuTorch 1.0 brings the same seamless flow to Cortex-A CPUs. Using XNNPACK, tuned by Arm KleidiAI, developers can achieve peak performance for edge workloads. This allows models to scale naturally from microcontrollers up to Cortex-A–based edge compute without changing the workflow.
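As a sketch of how little the workflow changes on Cortex-A, the same export step is simply paired with the XNNPACK partitioner (the tiny MLP and file name here are illustrative; KleidiAI-tuned micro-kernels are picked up inside XNNPACK automatically on Arm):

```python
import torch
from executorch.exir import to_edge_transform_and_lower
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner

# Any exportable PyTorch model works; a tiny MLP keeps the sketch short.
model = torch.nn.Sequential(torch.nn.Linear(128, 64), torch.nn.ReLU()).eval()
example_inputs = (torch.randn(1, 128),)

exported = torch.export.export(model, example_inputs)
# Delegate supported subgraphs to XNNPACK for Cortex-A execution.
executorch_program = to_edge_transform_and_lower(
    exported, partitioner=[XnnpackPartitioner()]
).to_executorch()

# Serialize to a .pte artifact, same as for any other ExecuTorch backend.
with open("model_xnnpack.pte", "wb") as f:
    f.write(executorch_program.buffer)
```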
Not every developer has hardware on hand, and getting started with edge AI should not depend on waiting for silicon. Across different Arm backends, developers can evaluate and validate their models even before hardware is available:

- For Ethos-U, Arm’s Corstone Fixed Virtual Platforms (FVPs) simulate complete microcontroller systems, so lowered models can be run and profiled entirely in software.
- For TOSA, the reference model provides a bit-accurate check of a lowered graph’s numerical behavior.
- For the upcoming neural technology in Arm GPUs, the VGF backend already supports ahead-of-time export today, ahead of silicon availability.
This layered approach makes it possible to prototype, validate, and refine AI workloads today, without needing physical hardware in hand. This is well aligned with the productivity goals of ExecuTorch.
One of the strengths of the Arm integration in ExecuTorch 1.0 is that the entire flow, from model lowering down to the backend compiler, is implemented in Python. This makes the backends transparent and hackable:

- Every stage of the ahead-of-time flow (export, quantization, partitioning, compilation) can be inspected, stepped through, and debugged with ordinary Python tooling.
- Quantization behavior is configured in code, through the EthosUQuantizer and its quantization configs.
- Partitioners and compile specs are plain Python objects, so they can be adjusted or extended for custom hardware configurations.
This design choice keeps the developer experience close to PyTorch, while still unlocking the efficiency of Arm hardware. It means that if you need to adapt ExecuTorch for your own model, workflow, or hardware configuration, the tools are right there at your fingertips.
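For instance, switching the Ethos-U quantizer from per-channel to per-tensor weight quantization is a one-line experiment in the same Python flow. This sketch reuses the names from the full example below; the import paths are an assumption and may vary by release:

```python
# Assumed import paths; these helpers appear in the Ethos-U example below.
from executorch.backends.arm.ethosu import EthosUCompileSpec
from executorch.backends.arm.quantizer import (
    EthosUQuantizer,
    get_symmetric_quantization_config,
)

compile_spec = EthosUCompileSpec("ethos-u85-256")
quantizer = EthosUQuantizer(compile_spec)
# Per-tensor instead of per-channel weight quantization: a one-line change,
# possible because the whole lowering flow is plain Python.
quantizer.set_global(get_symmetric_quantization_config(is_per_channel=False))
```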
For deployment, ExecuTorch provides a lightweight C++ runtime that is easy to integrate into any application. The runtime is designed for portability and efficiency, delivering predictable performance on resource-constrained devices while keeping the deployment footprint small. Together, the Python development flow and the C++ runtime create a seamless bridge from experimentation to production.
Figure 2. ExecuTorch has an efficient C++-based runtime and a hackable AoT compilation flow.
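While production deployment uses the C++ runtime, a quick way to sanity-check a .pte artifact without leaving Python is the runtime's Python bindings. This is a sketch based on the Runtime API in current ExecuTorch releases; the artifact name reuses the XNNPACK sketch above:

```python
import torch
from executorch.runtime import Runtime

# Load the serialized program and run its "forward" method, mirroring what
# the embedded C++ runtime does on-device.
runtime = Runtime.get()
program = runtime.load_program("model_xnnpack.pte")
method = program.load_method("forward")
outputs = method.execute([torch.randn(1, 128)])
print(outputs)
```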
Here is what an Ethos-U-focused workflow looks like with ExecuTorch:

1. Export the trained PyTorch model with torch.export.
2. Quantize it to int8 with the EthosUQuantizer through the PT2E flow, calibrating on representative data.
3. Partition and lower the graph with the EthosUPartitioner, delegating supported subgraphs to the NPU.
4. Serialize the result to a .pte artifact and deploy it with the ExecuTorch runtime.
Turning a PyTorch model into a .pte artifact ready for deployment is straightforward. The snippet below captures the representative steps involved.
```python
import random

import torch
import torchaudio

import executorch.exir

# NOTE: the exact import paths for the Arm backend helpers may vary between
# ExecuTorch releases; the names below match the APIs used in this post.
from executorch.backends.arm.ethosu import EthosUCompileSpec, EthosUPartitioner
from executorch.backends.arm.quantizer import (
    EthosUQuantizer,
    get_symmetric_quantization_config,
)
from torch.ao.quantization.quantize_pt2e import convert_pt2e, prepare_pt2e

# Conformer model with the same hyperparameters used during training.
# Conformer, vocab_size, collate_fn and path_to_checkpoint come from the
# ASR example linked at the end of this post.
model = Conformer(num_classes=vocab_size)
dataset = torchaudio.datasets.LIBRISPEECH("./data", download=True)

# Pick 100 random indexes for calibration
calibration_set = torch.utils.data.Subset(
    dataset, random.sample(range(len(dataset)), 100)
)
calibration_loader = torch.utils.data.DataLoader(
    calibration_set, batch_size=1, shuffle=False, collate_fn=collate_fn
)

# Load the checkpoint data for the model weights
checkpoint = torch.load(path_to_checkpoint, weights_only=True)
model.load_state_dict(checkpoint["model"])
model.eval()

# Use one calibration batch as example inputs for torch.export
feats, feat_lens, *_ = next(iter(calibration_loader))
example_inputs = (feats, feat_lens)

exported_program = torch.export.export(model, example_inputs, strict=True)
graph_module = exported_program.module()

# Create the quantizer and use the ExecuTorch PT2E flow to quantize the model
compile_spec = EthosUCompileSpec("ethos-u85-256")
quantizer = EthosUQuantizer(compile_spec)
quantizer.set_global(get_symmetric_quantization_config(is_per_channel=True))
quantized_graph_module = prepare_pt2e(graph_module, quantizer)

# Do the post-training quantization calibration using the dataset
for feats, feat_lens, *_ in calibration_loader:
    quantized_graph_module(feats, feat_lens)

# Quantization parameters are captured and the model is re-exported
quantized_exported_program = torch.export.export(
    convert_pt2e(quantized_graph_module), example_inputs, strict=True
)

# Create a partitioner that delegates the parts it can accelerate to the backend
edge_program_manager = executorch.exir.to_edge_transform_and_lower(
    quantized_exported_program,
    partitioner=[EthosUPartitioner(compile_spec)],
)

# Create the artifact representation of the quantized model
executorch_program_manager = edge_program_manager.to_executorch(
    config=executorch.exir.ExecutorchBackendConfig(extract_delegate_segments=False)
)

# And save to disk for deployment on the Ethos-U85 target
executorch.exir.save_pte_program(
    executorch_program_manager, "conformer_quantized_ethos-u85-256.pte"
)
```
This end-to-end flow demonstrates how ExecuTorch 1.0 makes Ethos-U deployment production-ready, while keeping developers in a familiar and flexible PyTorch environment. For complete examples, see the ExecuTorch documentation and the PTQ part of the ASR example.
ExecuTorch 1.0 is just the beginning. The Arm roadmap includes:
Together, these advances will make AI on Arm not only more efficient but also more accessible to developers everywhere. ExecuTorch 1.0 already marks a milestone, delivering efficient CPU paths, a seamless PyTorch-to-Arm workflow, and simple prototyping on emulated hardware. The future of AI at the edge is here. Head over to the ExecuTorch site to start deploying your PyTorch models with ExecuTorch today.