Accelerate LLM Inference with ONNX Runtime on Arm Neoverse-powered Microsoft Cobalt 100

Na Li
October 1, 2025
4 minute read time.
This blog post is co-authored by Na Li, Nobel Chowdary Mandepudi, and Koray Ozkal

Arm optimizations in ONNX Runtime, combined with Arm Neoverse-powered Microsoft Azure Cobalt 100 processors, expand AI performance for large language model (LLM) inference, delivering up to 1.9x higher performance and 2.8x better price/performance compared to AMD Genoa-based instances.

As demand for scalable, cost-efficient LLM inference continues to rise, optimizing every layer of the stack becomes essential, from cloud compute infrastructure to runtime libraries. It starts with selecting the right compute. Microsoft Azure Cobalt 100 processors, powered by Arm Neoverse, unlock new opportunities for cost-efficient, high-performance cloud computing. Built on the Arm Neoverse N2 architecture, Cobalt 100-based Microsoft Azure instances are optimized for modern scale-out workloads. In this blog, we take a closer look at how Microsoft Cobalt 100 processors and Arm’s ONNX Runtime optimizations deliver significant performance gains for running LLMs.

Arm and Microsoft Supercharge AI Performance

To empower developers building large-scale AI applications, Arm has partnered with Microsoft to optimize the ONNX Runtime generative AI (GenAI) stack for Microsoft Cobalt 100. ONNX Runtime is a high-performance engine for running machine learning (ML) models across platforms and frameworks. By integrating Arm’s KleidiAI technology directly into the Microsoft Linear Algebra Subprograms (MLAS) backend of ONNX Runtime, GenAI workloads can now take full advantage of Arm’s architectural efficiency.

These Arm-optimized enhancements accelerate critical operations such as general matrix multiplication (GEMM) and convolution, and support multiple precision formats including int4, int8, bf16, and fp32. This enables faster, more efficient LLM execution on CPU-only infrastructure, without requiring code changes.
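The int4 path mentioned above depends on weight quantization. As a rough illustration only, and not the actual MLAS or KleidiAI kernel implementation, a block-wise symmetric int4 quantizer maps each block of fp32 weights to 4-bit integers in [-8, 7] plus one fp32 scale per block:

```python
# Simplified sketch of block-wise symmetric int4 quantization.
# This is illustrative only; the real MLAS/KleidiAI kernels use
# packed storage layouts and optimized Arm instructions.

def quantize_int4_block(weights, block_size=32):
    """Quantize a flat list of fp32 weights into int4 values
    (range [-8, 7]) plus one fp32 scale per block."""
    quantized, scales = [], []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        # Scale so the largest magnitude maps to 7; guard zero blocks.
        scale = max(abs(w) for w in block) / 7.0 or 1.0
        scales.append(scale)
        quantized.append([max(-8, min(7, round(w / scale))) for w in block])
    return quantized, scales

def dequantize_int4_block(quantized, scales):
    """Recover approximate fp32 weights from int4 values and scales."""
    return [q * s for block, s in zip(quantized, scales) for q in block]

# Tiny worked example with made-up weights.
weights = [0.12, -0.40, 0.33, 0.07, -0.21, 0.49, -0.05, 0.18]
q, s = quantize_int4_block(weights, block_size=8)
approx = dequantize_int4_block(q, s)
max_err = max(abs(a - b) for a, b in zip(weights, approx))
```

Storing 4-bit integers plus a per-block scale in place of 32-bit floats is what shrinks memory traffic enough for CPU-only LLM inference to benefit.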

Boost Inference Performance with Arm KleidiAI Libraries

As a first step in our testing, we set out to measure the impact of KleidiAI optimizations on LLM inference. To capture performance across a range of configurations, we benchmarked the following Microsoft Azure Cobalt 100 instance sizes:

  • Standard_D8pls_v6 (8 vCPUs)
  • Standard_D16pls_v6 (16 vCPUs)
  • Standard_D32pls_v6 (32 vCPUs)
  • Standard_D64pls_v6 (64 vCPUs)

For these tests, we used the Phi-4-mini-instruct-onnx model (int4 quantization), downloaded from Hugging Face. Performance was measured by token generation rate across two runtime versions, with input and output lengths ranging from 16 to 128 tokens:

  • ONNX Runtime v1.21.0 (baseline, no Arm-specific optimizations) 
  • ONNX Runtime v1.22.0 (includes the KleidiAI-optimized MLAS backend) 
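A token-generation-rate measurement like this boils down to a small timing loop. The sketch below uses a stand-in `generate_token` callable in place of a real ONNX Runtime generator step (the actual generate API differs); the measurement logic is the point:

```python
import time

def measure_tokens_per_second(generate_token, num_tokens=128):
    """Time a token-generation callable and return tokens/second.
    `generate_token` is a stand-in for one decode step of a real
    ONNX Runtime GenAI generation loop."""
    start = time.perf_counter()
    for _ in range(num_tokens):
        generate_token()
    elapsed = time.perf_counter() - start
    return num_tokens / elapsed

# Stand-in decode step: sleep ~1 ms to simulate per-token latency.
baseline_rate = measure_tokens_per_second(lambda: time.sleep(0.001), 64)
```

In a real benchmark, the stand-in would be replaced by the runtime's decode step, and the same harness run once per runtime version and instance size.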

Results showed KleidiAI optimizations on ONNX Runtime delivered consistent performance uplifts, from 28% to 51%, across different instance sizes.

Figure 1: Performance uplift enabled by Arm KleidiAI technology. Grey and blue bars show raw throughput (tokens per second, left axis); the line plot indicates the performance uplift from KleidiAI (right axis).
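The quoted uplift is the usual ratio of optimized to baseline throughput. With hypothetical tokens-per-second numbers chosen for illustration (not the measured Azure results behind Figure 1), the computation looks like:

```python
def percent_uplift(baseline_tps, optimized_tps):
    """Relative throughput gain (%) of the optimized runtime
    over the baseline runtime."""
    return (optimized_tps / baseline_tps - 1.0) * 100.0

# Hypothetical tokens/second pairs (baseline, optimized), one per
# instance size -- illustration only, not the blog's measured data.
measurements = {
    "D8pls_v6":  (10.0, 12.8),
    "D16pls_v6": (18.0, 24.5),
    "D32pls_v6": (30.0, 43.0),
    "D64pls_v6": (40.0, 60.4),
}
uplifts = {name: percent_uplift(b, o)
           for name, (b, o) in measurements.items()}
```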

Microsoft Cobalt 100 Outperforms AMD Genoa on Performance & Efficiency

We next compared Microsoft Cobalt 100 against x86 alternatives in a real-world LLM inference scenario using the Phi-4-mini model, which features improved multilingual support, reasoning, math, and function calling. For the comparison, we chose the INT4 version of the model since it offers a scalable and efficient option for serving models on CPU-based instances.

Performance was measured across:

  • Arm-based Cobalt 100 (Standard_D32pls_v6)
  • x86-based AMD Genoa (Standard_D32as_v6)

Cobalt 100 delivered about 1.9x faster token generation throughput compared to AMD Genoa, highlighting the performance benefits of Arm for scalable and cost-efficient LLM inference in the cloud.

Figure 2: Comparison of token generation rate between Arm-based Cobalt 100 and AMD Genoa.

When performance is weighed against instance pricing [1], the Arm-powered Cobalt 100 instance delivers 2.8x higher performance-per-dollar compared to AMD Genoa, making it the clear cost-effective choice for large-scale LLM inference on CPUs.

Figure 3: Comparison of token per dollar between Arm-based Cobalt 100 and AMD Genoa.
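Performance-per-dollar combines throughput with the instance's hourly price, so a throughput edge at a lower hourly rate compounds. The figures below are hypothetical stand-ins chosen to reproduce the 1.9x and 2.8x ratios; real prices live on the Azure pricing calculator and vary by region:

```python
def tokens_per_dollar(tokens_per_second, price_per_hour):
    """Tokens generated per dollar of instance time."""
    return tokens_per_second * 3600.0 / price_per_hour

# Hypothetical throughput (tokens/s) and hourly prices (USD) --
# illustration only, not actual Azure pricing or measured results.
cobalt_tps, cobalt_price = 43.0, 1.10   # Arm-based Standard_D32pls_v6
genoa_tps, genoa_price = 22.6, 1.63     # x86-based Standard_D32as_v6

perf_ratio = cobalt_tps / genoa_tps
value_ratio = (tokens_per_dollar(cobalt_tps, cobalt_price)
               / tokens_per_dollar(genoa_tps, genoa_price))
```

Note how the price/performance gap (about 2.8x here) exceeds the raw throughput gap (about 1.9x) whenever the faster instance is also the cheaper one.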

These results demonstrate that ONNX Runtime running on Arm-based Microsoft Cobalt 100 processors forms a powerful stack for GenAI workloads in production, combining performance and cost-efficiency at scale.

Get Started: Build Your AI Application on Arm

With Arm Neoverse N2 CPUs at their core, Microsoft Azure Cobalt 100-powered virtual machines (VMs) deliver the right mix of performance, cost efficiency, and scale compared to AMD-based instances.

Ready to Begin?

  • Try our demo and follow our Learning Path to experience ONNX Runtime on Arm-powered Microsoft Cobalt 100 processors.

Migration from x86 to Microsoft Cobalt 100 is easy!

  • Simplify your Migration to Microsoft Cobalt 100 with detailed guides.

Helpful Resources:

  • Arm Software Ecosystem Dashboard: Explore supported software tools and frameworks.
  • Arm Developer Hub: Access tutorials, SDKs, technical resources, forums, and community discussions. 

Embrace the power, efficiency, and flexibility of Arm Neoverse and experience a new level of performance for your workloads. Visit the Microsoft Azure Portal to launch Cobalt 100 VMs for your workloads today!

Footnotes:

[1] Calculated prices are based on https://azure.microsoft.com/en-us/pricing/calculator/ as of Aug. 6, 2025.
