Arm optimizations in ONNX Runtime, combined with Arm Neoverse-powered Microsoft Azure Cobalt 100 processors, expand AI performance for large language model (LLM) inference, delivering up to 1.9x higher performance and 2.8x better price/performance compared to AMD Genoa-based instances.
As demand for scalable, cost-efficient LLM inference continues to rise, optimizing every layer of the stack becomes essential, from cloud compute infrastructure to runtime libraries. It starts with selecting the right compute platform. Microsoft Azure Cobalt 100 processors, powered by Arm Neoverse, unlock new opportunities for cost-efficient, high-performance cloud computing. Built on the Arm Neoverse N2 architecture, Cobalt 100-based Microsoft Azure instances are optimized for modern scale-out workloads. In this blog, we take a closer look at how Microsoft Cobalt 100 processors and Arm’s ONNX Runtime optimizations deliver significant performance gains for running LLMs.
Arm and Microsoft Supercharge AI Performance
To empower developers building large-scale AI applications, Arm has partnered with Microsoft to optimize the ONNX Runtime generative AI (GenAI) stack for Microsoft Cobalt 100. ONNX Runtime is a high-performance engine for running machine learning (ML) models across platforms and frameworks. By integrating Arm’s KleidiAI technology directly into the Microsoft Linear Algebra Subprograms (MLAS) backend of ONNX Runtime, GenAI workloads can now take full advantage of Arm’s architectural efficiency.
These Arm-optimized enhancements accelerate critical compute kernels, such as matrix multiplication (GEMM) and convolution, and support multiple precision formats, including int4, int8, bf16, and fp32. This enables faster, more efficient LLM execution on CPU-only infrastructure, without requiring code changes.
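To illustrate what “no code changes” means in practice, here is a minimal sketch of CPU-only inference with the standard ONNX Runtime Python API. The model path and input shape are placeholders; on an Arm-based instance running onnxruntime v1.22.0 or later, the MLAS backend dispatches to the KleidiAI-optimized kernels automatically.

```python
# Minimal sketch: CPU-only ONNX Runtime inference on an Arm instance.
# "model.onnx" and the input shape are placeholders for your own model.
import numpy as np
import onnxruntime as ort

# On onnxruntime >= 1.22.0 running on Arm, the MLAS backend transparently
# uses KleidiAI-optimized kernels for GEMM-heavy operators. No API changes
# are needed compared to x86.
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

input_name = session.get_inputs()[0].name
dummy_input = np.random.rand(1, 3, 224, 224).astype(np.float32)  # example shape
outputs = session.run(None, {input_name: dummy_input})
print(outputs[0].shape)
```

The same script runs unmodified on x86 and Arm instances, which is what makes migrating inference workloads to Cobalt 100 straightforward.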
Boost Inference Performance with Arm KleidiAI Libraries
As a first step in our testing, we set out to measure the impact of KleidiAI optimizations on LLM inference. To capture performance across a range of configurations, we benchmarked multiple Microsoft Azure Cobalt 100 instance sizes.
For these tests, we used the Phi-4-mini-instruct-onnx model (int4 quantization) downloaded from Hugging Face. Performance was measured by token generation rate, comparing baseline results from ONNX Runtime v1.21.0, which lacks Arm-specific optimizations, against ONNX Runtime v1.22.0, which integrates the KleidiAI-optimized MLAS backend. Input and output lengths ranged from 16 to 128 tokens.
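For readers who want to reproduce a similar measurement, below is a minimal sketch assuming the onnxruntime-genai Python package (import name onnxruntime_genai) and a local copy of the Hugging Face model; the model directory path is a placeholder, and the exact generator API varies slightly between package versions.

```python
# Sketch of a token-generation-rate measurement with onnxruntime-genai.
# The model path and prompt are placeholders; the generator API differs
# slightly across onnxruntime-genai versions, so treat this as a template.
import time
import onnxruntime_genai as og

model = og.Model("Phi-4-mini-instruct-onnx")  # local model directory (placeholder)
tokenizer = og.Tokenizer(model)

prompt_tokens = tokenizer.encode("Explain the benefits of CPU inference.")
params = og.GeneratorParams(model)
params.set_search_options(max_length=len(prompt_tokens) + 128)

generator = og.Generator(model, params)
generator.append_tokens(prompt_tokens)

start, generated = time.perf_counter(), 0
while not generator.is_done():
    generator.generate_next_token()
    generated += 1
elapsed = time.perf_counter() - start

print(f"{generated / elapsed:.1f} tokens/sec")
```

Running the same script against onnxruntime v1.21.0 and v1.22.0 isolates the uplift contributed by the KleidiAI-optimized kernels.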
Results showed that the KleidiAI optimizations in ONNX Runtime delivered consistent performance uplifts of 28% to 51% across the different instance sizes.
Figure 1: Performance uplift enabled by Arm KleidiAI technology. Grey and blue bars show raw throughput (tokens per second, left axis); the line plot indicates the performance uplift from KleidiAI (right axis).
Microsoft Cobalt 100 Outperforms AMD Genoa on Performance & Efficiency
We next compared Microsoft Cobalt 100 against x86 alternatives in a real-world LLM inference scenario using the Phi-4-mini model, which features improved multilingual support, reasoning, math, and function calling. For the comparison, we chose the INT4 version of the model since it offers a scalable and efficient option for serving models on CPU-based instances.
Performance was measured across comparable Cobalt 100 and AMD Genoa-based instances.
Cobalt 100 delivered about 1.9x faster token generation throughput compared to AMD Genoa, highlighting the performance benefits of Arm for scalable and cost-efficient LLM inference in the cloud.
Figure 2: Comparison of token generation rate between Arm-based Cobalt 100 and AMD Genoa.
When factoring performance against instance pricing[1], the Arm-powered Cobalt 100 instance delivers 2.8x higher performance-per-dollar compared to AMD Genoa, making it the clear choice for cost-effective, large-scale LLM inference on CPUs.
Figure 3: Comparison of tokens per dollar between Arm-based Cobalt 100 and AMD Genoa.
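The price/performance metric itself is simple to reproduce. The sketch below shows the arithmetic with clearly labeled placeholder values; the real inputs are the measured throughput from the benchmark above and the hourly instance prices from the Azure pricing calculator.

```python
# Worked sketch of the tokens-per-dollar metric. The throughput and hourly
# price values below are hypothetical placeholders, not measured results.
def tokens_per_dollar(tokens_per_sec: float, price_per_hour: float) -> float:
    # Tokens generated in one hour, divided by the cost of that hour.
    return tokens_per_sec * 3600 / price_per_hour

cobalt = tokens_per_dollar(tokens_per_sec=20.0, price_per_hour=1.00)  # hypothetical
genoa = tokens_per_dollar(tokens_per_sec=10.5, price_per_hour=1.47)   # hypothetical
print(f"Relative price/performance: {cobalt / genoa:.1f}x")
```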
These results demonstrate that ONNX Runtime running on Arm-based Microsoft Cobalt 100 processors forms a powerful stack for GenAI workloads in production, combining performance and cost efficiency at scale.
Get Started: Build Your AI Application on Arm
With Arm Neoverse N2 CPUs at their core, Microsoft Azure Cobalt 100-powered virtual machines (VMs) deliver the right mix of performance, cost efficiency, and scale, outperforming comparable AMD-based instances.
Ready to Begin?
Migration from x86 to Microsoft Cobalt 100 is easy!
Helpful Resources:
Embrace the power, efficiency, and flexibility of Arm Neoverse and experience a new level of performance for your workloads. Visit the Microsoft Azure Portal to launch Cobalt 100 VMs for your workloads today!
Footnotes:
[1] Prices are based on the Azure pricing calculator (https://azure.microsoft.com/en-us/pricing/calculator/) as of Aug. 6, 2025.