Redefining Datacenter Performance for AI: The Arm Neoverse Advantage

Shivangi Agrawal
September 8, 2025
8 minute read time.

This blog is co-authored by Shivangi Agrawal (Product Manager, Infrastructure) and Rohit Gupta (Senior Manager, Ecosystem Development, Infrastructure) at Arm.

AI is reshaping the datacenter landscape. From large language models (LLMs) such as LLaMA and GPT to real-time inference engines, recommendation systems, and retrieval-augmented generation (RAG) pipelines, AI workloads are now the defining measure of infrastructure performance. Traditional general-purpose processors, designed primarily for scalar workloads and batch processing, are struggling to keep pace with the data intensity, compute diversity, and mathematical complexity of AI-driven applications. The new imperative is clear: deliver uncompromising performance and scalability while maintaining sustainable efficiency.

Arm Neoverse: Purpose-Built for Modern Infrastructure

Arm Neoverse is a family of compute platforms purpose-built for datacenter and infrastructure workloads: cloud, AI/ML inference and training, 5G and edge networking, high-performance computing (HPC), and beyond. The Neoverse architecture is designed to scale, offering the flexibility to support a broad spectrum of workloads while consistently delivering best-in-class efficiency per watt. This balance makes it the foundation of choice for hyperscalers, cloud providers, and enterprises seeking to future-proof their infrastructure.

Why Neoverse V-Series Leads for AI Workloads

At the center of the AI compute revolution is the demand for CPU architectures that deliver not only raw performance but also uncompromising energy efficiency. Arm Neoverse CPUs and Compute Subsystems (CSS) are engineered to meet this challenge head-on, combining scalability, flexibility, and power efficiency to serve the full spectrum of modern infrastructure workloads. Designed for both AI training and inference, Neoverse platforms provide a robust foundation for hyperscalers, cloud service providers, and enterprise AI deployments. The Neoverse V-series, in particular, is optimized for maximum single-threaded performance, making it ideal for latency-sensitive inference and compute-intensive training workloads. The following architectural innovations establish the Neoverse V-series as a leading choice for next-generation AI platforms that require high throughput, predictable latency, and sustainable performance per watt.

  • Wider execution pipelines: Pipelining is a fundamental technique in CPU microarchitecture that improves performance by dividing instruction execution into sequential stages, allowing multiple instructions to be processed in parallel at different stages. Each stage is responsible for a specific function: fetch (retrieve the next instruction from memory), decode (interpret the instruction and determine what resources it needs), execute (perform arithmetic, logic, or branch operations), memory access (read from or write to memory, if needed), and writeback (save the result to the appropriate register). Think of a CPU pipeline like an assembly line: while one instruction is being decoded, another is being fetched, and a third is being executed, all in parallel. This boosts throughput, reduces idle CPU cycles, increases instruction-level parallelism, improves core utilization, and enables better latency hiding for memory-bound code.
  • Enhanced branch prediction and speculative execution: Branch prediction is a technique used in CPU microarchitecture to anticipate the outcome of conditional instructions (e.g., if, loop, switch) before the actual condition is evaluated. Since instructions are fetched and pipelined in advance, a wrong guess (a misprediction) can stall or flush the pipeline, hurting performance. A correct prediction, on the other hand, allows the CPU to continue fetching and executing instructions without delay, preserving throughput (a small branchless-code sketch follows this list).
  • Out-of-order execution window: In modern CPU microarchitecture, the out-of-order (OoO) execution window refers to the number of in-flight instructions the processor can track, schedule, and execute independently of program order. Rather than waiting for one instruction to complete before starting the next (in-order execution), OoO execution allows the CPU to reorder instructions, so long as data dependencies and control-flow integrity are preserved. A wider OoO window means more instructions can be examined simultaneously, greater flexibility to schedule independent work while waiting on slower operations (e.g., memory loads, cache misses, pipeline bubbles), better utilization of execution ports (ALUs, vector units, load/store), and an enhanced ability to hide latency, especially in memory-bound or branch-heavy code.
  • Vector processing: Vector processing is a method of performing operations on multiple data elements simultaneously using a single instruction, useful for applications such as AI/ML inference, image and signal processing, scientific simulations, and cryptographic operations. In particular, the INT8 matrix multiply (I8MM) extension enables fast matrix multiplication using 8-bit integer operands, which is especially beneficial for quantized neural networks, while FMMLA (floating-point matrix multiply-accumulate) instructions handle low-precision floating-point (especially BF16 or FP16) matrix math. These are part of GEMM (general matrix multiplication)-optimized pipelines in PyTorch, TFLite, oneDNN, and the Arm Compute Library.
  • Single Instruction Multiple Data (SIMD): SIMD (Single Instruction, Multiple Data) is a CPU execution model that allows a single instruction to operate on multiple pieces of data simultaneously. It’s a foundational technique for accelerating data-parallel workloads, and is widely used in modern processors including Arm Neoverse cores.
  • Scalable Vector Extension (SVE): Scalable Vector Extension (SVE) is an Arm-developed SIMD extension to the AArch64 architecture. It provides a flexible, vector-length-agnostic (VLA) model for high-performance vector processing, where the vector width can be chosen at hardware design time (and constrained at runtime), ranging from 128 to 2048 bits in 128-bit increments, so the same code runs unchanged across implementations (a vector-length-agnostic loop sketch follows this list).
  • Larger L2 cache support: Neoverse V1 introduced a high-performance OoO pipeline with up to 1 MB of private L2 cache per core. Neoverse V2 then doubled the private L2 cache and supported wider mesh connectivity. Neoverse V3 offers 3 MB of L2 cache per core with ECC, tuned for AI-first datacenters with enterprise-class reliability, improving data locality and reducing DRAM pressure.
  • Load/store bandwidth: Load/store bandwidth refers to the rate at which a CPU core can read (load) and write (store) data to and from memory (registers, caches, or DRAM). With every generation of the Neoverse platform, load/store bandwidth has increased.
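
To make the misprediction cost described above concrete, here is a small illustrative C sketch (the function names and threshold logic are hypothetical, not taken from any Arm library). The first loop uses a data-dependent branch that a predictor guesses poorly on random input; the second turns the condition into arithmetic, so there is nothing to mispredict.

```c
#include <stdint.h>
#include <stddef.h>

// Branchy version: the 'if' compiles to a conditional branch. On random
// data the predictor is wrong roughly half the time, and each misprediction
// flushes speculatively executed instructions from the pipeline.
int64_t sum_above_branchy(const int32_t *v, size_t n, int32_t threshold) {
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (v[i] > threshold)
            sum += v[i];
    }
    return sum;
}

// Branchless version: the condition becomes a data dependency (compare plus
// multiply or conditional select), so there is nothing to mispredict.
int64_t sum_above_branchless(const int32_t *v, size_t n, int32_t threshold) {
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        int64_t keep = (v[i] > threshold);  // 0 or 1
        sum += keep * v[i];
    }
    return sum;
}
```

On predictable (for example, sorted) data the branchy version is just as fast, which is exactly the behavior the branch predictor is designed to exploit.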
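
For the SVE bullet above, the following minimal sketch shows a vector-length-agnostic loop using the ACLE intrinsics from <arm_sve.h>. The function name is illustrative, and the code assumes an SVE-enabled compiler target (for example, -march=armv8-a+sve); the same binary runs unchanged on any implementation from 128 to 2048 bits because the loop steps by the hardware vector length reported by svcntw().

```c
#include <arm_sve.h>
#include <stdint.h>

// Vector-length-agnostic SAXPY: y[i] = a * x[i] + y[i].
// The loop step (svcntw) and the tail predicate (svwhilelt) adapt to the
// hardware vector length at run time, so no per-width code paths are needed.
void saxpy_sve(float a, const float *x, float *y, int64_t n) {
    for (int64_t i = 0; i < n; i += (int64_t)svcntw()) {
        svbool_t pg = svwhilelt_b32_s64(i, n);   // active lanes this iteration
        svfloat32_t vx = svld1_f32(pg, x + i);   // masked load of x
        svfloat32_t vy = svld1_f32(pg, y + i);   // masked load of y
        vy = svmla_n_f32_m(pg, vy, vx, a);       // vy += vx * a (active lanes)
        svst1_f32(pg, y + i, vy);                // masked store back to y
    }
}
```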

Impact on real-life workloads with Arm Neoverse V-series

The generational improvements in the Neoverse V-series allow higher throughput for applications such as LLMs, BERT, ResNet, XGBoost, LightGBM, data analytics (e.g., Spark), vector math (GEMM, convolutions), cryptographic workloads (AES-GCM), and many others. There has been a 30-40% uplift in AI inference performance from Neoverse V1 to V2, and greater than 2x IPC improvements in specific ML benchmarks. Early Neoverse V3 benchmarks show a double-digit performance jump over the previous generation. Let's look at some specific workloads:

LLaMA (Large Language Model from Meta AI):

A LLaMA workload refers to the process of running inference on a pre-trained LLaMA model, generating responses to text prompts by predicting the next tokens in a sequence using deep learning operations.

| Phase | Operations Performed | Dominant Compute Task | Arm Architecture / Microarchitecture Stressed |
| --- | --- | --- | --- |
| Data Preparation | Tokenization, formatting | Data transformation | CPU scalar ALUs, memory controller, cache |
| Model Loading | Reading weights, model setup | Memory operations | DRAM bandwidth, memory controller, L1/L2 cache |
| Prefill/Encoder | Batched matrix multiplication | GEMM (matrix multiply) | SIMD units (SVE/NEON), I8MM/SDOT instructions, L2/L3 cache, KleidiAI-optimized kernels |
| Decode/Generation | Iterative token generation | GEMM, sampling | SIMD units, I8MM/SDOT, memory controller, thread/core scheduler |
| Post-Processing | Detokenization | String handling | CPU scalar units, cache, OS syscalls |

LLM workloads are common in chatbots, document summarization, and other generative AI applications. Arm CPUs support specialized SIMD instructions such as I8MM (8-bit integer matrix multiply-accumulate) and SDOT (signed dot product), which significantly speed up the quantized matrix multiplications that dominate LLaMA model inference. These instructions allow efficient low-bit integer math with high throughput and reduced power consumption compared to traditional floating-point operations.
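
As an illustration of how these instructions are exercised (a minimal sketch, not a production kernel), the code below computes the int8 dot product at the heart of a quantized GEMM using the NEON dot-product intrinsic vdotq_s32, which maps to SDOT. The function name is illustrative, and the code assumes a compiler target with the dot-product extension (for example, -march=armv8.2-a+dotprod).

```c
#include <arm_neon.h>
#include <stdint.h>

// Quantized dot product of two int8 vectors, the inner primitive of an
// int8 GEMM. Each vdotq_s32 call multiplies sixteen int8 pairs and
// accumulates them into four int32 lanes.
int32_t dot_product_s8(const int8_t *a, const int8_t *b, int n) {
    int32x4_t acc = vdupq_n_s32(0);
    int i = 0;
    for (; i + 16 <= n; i += 16) {
        int8x16_t va = vld1q_s8(a + i);
        int8x16_t vb = vld1q_s8(b + i);
        acc = vdotq_s32(acc, va, vb);       // acc[j] += sum of 4 int8 products
    }
    int32_t sum = vaddvq_s32(acc);          // horizontal add across lanes
    for (; i < n; i++)                      // scalar tail for leftover elements
        sum += (int32_t)a[i] * (int32_t)b[i];
    return sum;
}
```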

Redis (In-memory database):

Redis is a fast, open-source, in-memory NoSQL database primarily used as a key-value store. Unlike traditional relational databases that store data on disk, Redis keeps data directly in RAM, enabling extremely low-latency and high-throughput operations. Redis workloads can be characterized as memory-bound, CPU-bound, or network-bound depending on the nature of the request patterns, dataset size, and concurrency levels. Common Redis workloads include caching, real-time analytics, session management, and message brokering.
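
The table below characterizes where these workload classes land on the microarchitecture. As a minimal sketch of the cache-style access pattern, here is a SET/GET round trip using the hiredis C client; it assumes a Redis server on 127.0.0.1:6379 and an illustrative key name, and omits most error handling. Each round trip is a small unit of work dominated by command parsing, hashing, and memory access on Redis's single main thread, which is why single-thread IPC and cache behavior matter so much.

```c
#include <stdio.h>
#include <hiredis/hiredis.h>

// Minimal cache-style round trip: one SET and one GET against a local Redis.
int main(void) {
    redisContext *c = redisConnect("127.0.0.1", 6379);
    if (c == NULL || c->err) {
        fprintf(stderr, "connection error\n");
        return 1;
    }

    redisReply *r = redisCommand(c, "SET session:%s %s", "42", "cached-value");
    if (r) freeReplyObject(r);

    r = redisCommand(c, "GET session:%s", "42");
    if (r != NULL && r->type == REDIS_REPLY_STRING)
        printf("value: %s\n", r->str);       // the cached value read back
    if (r) freeReplyObject(r);

    redisFree(c);
    return 0;
}
```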

| Redis Operation Aspect | Dominant Compute Task | Arm Architecture / Microarchitecture Stressed Components |
| --- | --- | --- |
| Memory-bound workloads (dataset in RAM) | Memory access, address pointer calculations | Cache hierarchy (L1/L2 cache), memory controller, prefetch units |
| CPU-bound workloads (complex commands) | Integer arithmetic, command parsing | Integer ALUs, pipeline, branch predictor, instruction decoder |
| Single-thread command execution | Instruction decoding, branch prediction | CPU core IPC, pipeline efficiency, branch misprediction penalty |
| Auxiliary/background tasks | Context switching, synchronization | Multi-core interconnect, cache coherence, context switching |
| Network-bound workloads | I/O processing, DMA data transfer | Network interfaces, DMA controllers, interrupt controllers |

SPECjbb (Java Business Benchmark):

SPECjbb (Standard Performance Evaluation Corporation Java Business Benchmark) is a server-side Java benchmark that measures performance by simulating a three-tier client-server business application. It models key workloads typical of enterprise Java applications, focusing on middle-tier business logic rather than network or disk I/O. SPECjbb is designed to stress Java Virtual Machine (JVM) performance, especially in server environments with many threads and complex object manipulation.

| SPECjbb Step | Dominant Compute Task | Arm Architecture / Microarchitecture Stressed Components |
| --- | --- | --- |
| Transaction Generation | Random number generation, branch prediction | Branch predictor, integer ALUs, instruction decoder |
| Business Logic Execution | Object creation/deletion, integer arithmetic | Integer ALUs, load/store units, memory hierarchy (L1/L2 cache) |
| Data Structure Manipulation | Pointer chasing, memory access, hashing | Cache hierarchy, memory controller, TLB (translation lookaside buffer) |
| Synchronization and Threading | Lock management, context switching | CPU pipeline management, multi-core interconnect, cache coherence |
| Garbage Collection (JVM overhead) | Memory scanning, pointer updates | Memory subsystem, branch prediction, arithmetic units |

The road ahead

As AI becomes the dominant driver of datacenter architecture, infrastructure must evolve beyond one-size-fits-all design thinking. The Neoverse V-series shows how workload-optimized design, combining wider pipelines, advanced branch prediction, deeper out-of-order execution, scalable vectors, SIMD acceleration, and expanded memory systems, translates directly into measurable gains across AI inference, enterprise software, and real-time services.

For hyperscalers, cloud providers, and enterprises, the choice is no longer between peak performance and efficiency. With Arm Neoverse, both are achievable together. Generation-over-generation improvements demonstrate that sustainable performance per watt, coupled with workload-aware microarchitecture, is the path to scaling AI responsibly.

Looking forward, Neoverse is positioned not just as a CPU family, but as the architectural foundation for the AI-first datacenter era, delivering the scalability, flexibility, and efficiency required to power the next decade of innovation.
