Accelerate LLM Inference with ONNX Runtime on Arm Neoverse-powered Microsoft Cobalt 100

Na Li
October 1, 2025
4 minute read time.
This blog post is co-authored by Na Li, Nobel Chowdary Mandepudi, and Koray Ozkal

Arm optimizations in ONNX Runtime, combined with Arm Neoverse-powered Microsoft Azure Cobalt 100 processors, expand AI performance for large language model (LLM) inference, delivering up to 1.9x higher performance and 2.8x better price/performance compared to AMD Genoa-based instances.

As demand for scalable, cost-efficient LLM inference continues to rise, optimizing every layer of the stack becomes essential, from cloud compute infrastructure to runtime libraries. It starts with selecting the right compute. Microsoft Azure Cobalt 100 processors, powered by Arm Neoverse, unlock new opportunities for cost-efficient, high-performance cloud computing. Built on the Arm Neoverse N2 architecture, Cobalt 100-based Microsoft Azure instances are optimized for modern scale-out workloads. In this blog, we take a closer look at how Microsoft Cobalt 100 processors and Arm’s ONNX Runtime optimizations deliver significant performance gains for running LLMs.

Arm and Microsoft Supercharge AI Performance

To empower developers building large-scale AI applications, Arm has partnered with Microsoft to optimize the ONNX Runtime generative AI (GenAI) stack for Microsoft Cobalt 100. ONNX Runtime is a high-performance engine for running machine learning (ML) models across platforms and frameworks. By integrating Arm’s KleidiAI technology directly into the Microsoft Linear Algebra Subprograms (MLAS) backend of ONNX Runtime, GenAI workloads can now take full advantage of Arm’s architectural efficiency.

These Arm-optimized enhancements accelerate critical operations such as general matrix multiplication (GEMM) and convolution, and support multiple precision formats including int4, int8, bf16, and fp32. This enables faster, more efficient LLM execution on CPU-only infrastructure, without requiring code changes.
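The int4 path mentioned above depends on weight quantization. As a rough illustration only, and not the actual MLAS or KleidiAI kernel implementation, a block-wise symmetric int4 quantizer maps each block of fp32 weights to 4-bit integers in [-8, 7] plus one fp32 scale per block:

```python
# Simplified sketch of block-wise symmetric int4 quantization.
# This is illustrative only; the real MLAS/KleidiAI kernels use
# packed storage layouts and optimized Arm instructions.

def quantize_int4_block(weights, block_size=32):
    """Quantize a flat list of fp32 weights into int4 values
    (range [-8, 7]) plus one fp32 scale per block."""
    quantized, scales = [], []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        # Scale so the largest magnitude maps to 7; guard zero blocks.
        scale = max(abs(w) for w in block) / 7.0 or 1.0
        scales.append(scale)
        quantized.append([max(-8, min(7, round(w / scale))) for w in block])
    return quantized, scales

def dequantize_int4_block(quantized, scales):
    """Recover approximate fp32 weights from int4 values and scales."""
    return [q * s for block, s in zip(quantized, scales) for q in block]

# Tiny worked example with made-up weights.
weights = [0.12, -0.40, 0.33, 0.07, -0.21, 0.49, -0.05, 0.18]
q, s = quantize_int4_block(weights, block_size=8)
approx = dequantize_int4_block(q, s)
max_err = max(abs(a - b) for a, b in zip(weights, approx))
```

Storing 4-bit integers plus a per-block scale in place of 32-bit floats is what shrinks memory traffic enough for CPU-only LLM inference to benefit.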

Boost Inference Performance with Arm KleidiAI Libraries

As a first step in our testing, we set out to measure the impact of KleidiAI optimizations on LLM inference. To capture performance across a range of configurations, we benchmarked the following Microsoft Azure Cobalt 100 instance sizes:

  • Standard_D8pls_v6 (8 vCPUs)
  • Standard_D16pls_v6 (16 vCPUs)
  • Standard_D32pls_v6 (32 vCPUs)
  • Standard_D64pls_v6 (64 vCPUs)

For these tests, we used the Phi-4-mini-instruct-onnx model (int4 quantization), downloaded from Hugging Face. Performance was measured by token generation rate across two runtime versions, with input and output lengths ranging from 16 to 128 tokens:

  • ONNX Runtime v1.21.0 (baseline, no Arm-specific optimizations) 
  • ONNX Runtime v1.22.0 (includes the KleidiAI-optimized MLAS backend) 
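A token-generation-rate measurement like this boils down to a small timing loop. The sketch below uses a stand-in `generate_token` callable in place of a real ONNX Runtime generator step (the actual generate API differs); the measurement logic is the point:

```python
import time

def measure_tokens_per_second(generate_token, num_tokens=128):
    """Time a token-generation callable and return tokens/second.
    `generate_token` is a stand-in for one decode step of a real
    ONNX Runtime GenAI generation loop."""
    start = time.perf_counter()
    for _ in range(num_tokens):
        generate_token()
    elapsed = time.perf_counter() - start
    return num_tokens / elapsed

# Stand-in decode step: sleep ~1 ms to simulate per-token latency.
baseline_rate = measure_tokens_per_second(lambda: time.sleep(0.001), 64)
```

In a real benchmark, the stand-in would be replaced by the runtime's decode step, and the same harness run once per runtime version and instance size.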

Results showed KleidiAI optimizations on ONNX Runtime delivered consistent performance uplifts, from 28% to 51%, across different instance sizes.

Figure 1: Performance uplift enabled by Arm KleidiAI technology. Grey and blue bars show raw throughput (tokens per second, left axis); the line plot indicates the performance uplift from KleidiAI (right axis).
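The quoted uplift is the usual ratio of optimized to baseline throughput. With hypothetical tokens-per-second numbers chosen for illustration (not the measured Azure results behind Figure 1), the computation looks like:

```python
def percent_uplift(baseline_tps, optimized_tps):
    """Relative throughput gain (%) of the optimized runtime
    over the baseline runtime."""
    return (optimized_tps / baseline_tps - 1.0) * 100.0

# Hypothetical tokens/second pairs (baseline, optimized), one per
# instance size -- illustration only, not the blog's measured data.
measurements = {
    "D8pls_v6":  (10.0, 12.8),
    "D16pls_v6": (18.0, 24.5),
    "D32pls_v6": (30.0, 43.0),
    "D64pls_v6": (40.0, 60.4),
}
uplifts = {name: percent_uplift(b, o)
           for name, (b, o) in measurements.items()}
```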

Microsoft Cobalt 100 Outperforms AMD Genoa on Performance & Efficiency

We next compared Microsoft Cobalt 100 against x86 alternatives in a real-world LLM inference scenario using the Phi-4-mini model, which features improved multilingual support, reasoning, math, and function calling. For the comparison, we chose the INT4 version of the model since it offers a scalable and efficient option for serving models on CPU-based instances.

Performance was measured across:

  • Arm-based Cobalt 100 (Standard_D32pls_v6)
  • x86-based AMD Genoa (Standard_D32as_v6)

Cobalt 100 delivered about 1.9x faster token generation throughput compared to AMD Genoa, highlighting the performance benefits of Arm for scalable and cost-efficient LLM inference in the cloud.

Figure 2: Comparison of token generation rate between Arm-based Cobalt 100 and AMD Genoa.

When performance is weighed against instance pricing [1], the Arm-powered Cobalt 100 instance delivers 2.8x higher performance-per-dollar compared to AMD Genoa, making it the clear cost-effective choice for large-scale LLM inference on CPUs.

Figure 3: Comparison of token per dollar between Arm-based Cobalt 100 and AMD Genoa.
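Performance-per-dollar combines throughput with the instance's hourly price, so a throughput edge at a lower hourly rate compounds. The figures below are hypothetical stand-ins chosen to reproduce the 1.9x and 2.8x ratios; real prices live on the Azure pricing calculator and vary by region:

```python
def tokens_per_dollar(tokens_per_second, price_per_hour):
    """Tokens generated per dollar of instance time."""
    return tokens_per_second * 3600.0 / price_per_hour

# Hypothetical throughput (tokens/s) and hourly prices (USD) --
# illustration only, not actual Azure pricing or measured results.
cobalt_tps, cobalt_price = 43.0, 1.10   # Arm-based Standard_D32pls_v6
genoa_tps, genoa_price = 22.6, 1.63     # x86-based Standard_D32as_v6

perf_ratio = cobalt_tps / genoa_tps
value_ratio = (tokens_per_dollar(cobalt_tps, cobalt_price)
               / tokens_per_dollar(genoa_tps, genoa_price))
```

Note how the price/performance gap (about 2.8x here) exceeds the raw throughput gap (about 1.9x) whenever the faster instance is also the cheaper one.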

These results demonstrate that ONNX Runtime running on Arm-based Microsoft Cobalt 100 processors forms a powerful stack for GenAI workloads in production, combining performance and cost-efficiency at scale.

Get Started: Build Your AI Application on Arm

With Arm Neoverse N2 CPUs at their core, Microsoft Azure Cobalt 100-powered virtual machines (VMs) deliver the right mix of performance, cost efficiency, and scale compared to AMD-based instances.

Ready to Begin?

  • Try our demo and follow our Learning Path to experience ONNX Runtime on Arm-powered Microsoft Cobalt 100 processors.

Migration from x86 to Microsoft Cobalt 100 is easy!

  • Simplify your Migration to Microsoft Cobalt 100 with detailed guides.

Helpful Resources:

  • Arm Software Ecosystem Dashboard: Explore supported software tools and frameworks.
  • Arm Developer Hub: Access tutorials, SDKs, technical resources, forums, and community discussions. 

Embrace the power, efficiency, and flexibility of Arm Neoverse and experience a new level of performance for your workloads. Visit the Microsoft Azure Portal to launch Cobalt 100 VMs for your workloads today!

Footnotes:

[1] Calculated prices are based on https://azure.microsoft.com/en-us/pricing/calculator/ as of Aug. 6, 2025.
