This blog is co-authored by Shivangi Agarwal (Product Manager, Infrastructure) and Rohit Gupta (Senior Manager Ecosystem Development, Infrastructure) at Arm.
AI is reshaping the datacenter landscape. From large language models (LLMs) such as LLaMA and GPT to real-time inference engines, recommendation systems, and retrieval-augmented generation (RAG) pipelines, AI workloads are now the defining measure of infrastructure performance. Traditional general-purpose processors, designed primarily for scalar workloads and batch processing, are struggling to keep pace with the data intensity, compute diversity, and mathematical complexity of AI-driven applications. The new imperative is clear: deliver uncompromising performance and scalability while maintaining sustainable efficiency.
Arm Neoverse is a family of compute platforms purpose-built for datacenter and infrastructure workloads: cloud, AI/ML inference and training, 5G and edge networking, high-performance computing (HPC), and beyond. The Neoverse architecture is designed to scale, offering the flexibility to support a broad spectrum of workloads while consistently delivering best-in-class efficiency per watt. This balance makes it the foundation of choice for hyperscalers, cloud providers, and enterprises seeking to future-proof their infrastructure.
At the center of the AI compute revolution is the demand for CPU architectures that deliver not only raw performance but also uncompromising energy efficiency. Arm's Neoverse CPUs and Compute Subsystems (CSS) are engineered to meet this challenge head-on, combining scalability, flexibility, and power efficiency to serve the full spectrum of modern infrastructure workloads. Designed for both AI training and inference, Neoverse platforms provide a robust foundation for hyperscalers, cloud service providers, and enterprise AI deployments. The Neoverse V-series, in particular, is optimized for maximum single-threaded performance, making it ideal for latency-sensitive inference and compute-intensive training workloads. The architectural innovations described below make the V-series the platform of choice for next-generation AI workloads that demand high throughput, predictable latency, and sustainable performance per watt.
Generational improvements in the Neoverse V-series deliver higher throughput for applications such as LLMs, BERT, ResNet, XGBoost, LightGBM, data analytics (e.g., Spark), vector math (GEMM, convolutions), cryptographic workloads (AES-GCM), and many others. Neoverse V2 delivers a 30-40% uplift in AI inference performance over V1, with more than 2x IPC improvement on specific ML benchmarks, and early Neoverse V3 benchmarks show a further double-digit percentage gain over the previous generation. Let's look at some specific workloads:
A LLaMA workload refers to running inference on a pre-trained LLaMA model: generating responses to text prompts by predicting the next tokens in a sequence using deep learning operations. The table below breaks inference into phases and maps each to the parts of the Arm architecture it stresses.
| Phase | Operations Performed | Dominant Compute Task | Arm Architecture/Microarchitecture Stressed |
| --- | --- | --- | --- |
| Data Preparation | Tokenization, formatting | Data transformation | CPU scalar ALUs, memory controller, cache |
| Model Loading | Reading weights, model setup | Memory operations | DRAM bandwidth, memory controller, L1/L2 cache |
| Prefill/Encode | Batched matrix multiplication | GEMM (matrix multiply) | SIMD units (SVE/NEON), I8MM/SDOT instructions, L2/L3 cache, KleidiAI-optimized kernels |
| Decode/Generation | Iterative token generation | GEMM, sampling | SIMD units, I8MM/SDOT, memory controller, thread/core scheduler |
| Post-Processing | Detokenization | String handling | CPU scalar units, cache, OS syscalls |
LLM inference is common to chatbots, document summarization, and other generative AI applications. Neoverse CPUs support specialized SIMD instructions such as I8MM (8-bit Integer Matrix Multiply-Accumulate) and SDOT (signed dot product), which significantly speed up the quantized matrix multiplications that dominate LLaMA inference. These instructions enable efficient low-bit integer math with high throughput and reduced power consumption compared to traditional floating-point operations.
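To make this concrete, here is a minimal sketch of how these instructions are reached from C through NEON intrinsics. It is illustrative only, not a production kernel: real inference libraries such as KleidiAI tile, pack, and quantize far more aggressively, and the build flags assume a compiler targeting an Armv8.6-A (or later) core with the I8MM and dot-product extensions.

```c
// Quantized int8 math with SDOT and I8MM (SMMLA) via NEON intrinsics.
// Build with something like: gcc -O3 -march=armv8.6-a+i8mm dot.c
#include <arm_neon.h>
#include <stdint.h>
#include <stdio.h>

// Dot product of two 16-element int8 vectors: one SDOT instruction
// produces four 4-way int8 dot products accumulated into int32 lanes.
static int32_t dot16_sdot(const int8_t *a, const int8_t *b) {
    int8x16_t va = vld1q_s8(a);        // load 16 signed bytes
    int8x16_t vb = vld1q_s8(b);
    int32x4_t acc = vdupq_n_s32(0);    // zeroed int32 accumulators
    acc = vdotq_s32(acc, va, vb);      // SDOT: 4 lanes x (4 int8 MACs each)
    return vaddvq_s32(acc);            // horizontal add of the 4 lanes
}

// I8MM: one SMMLA instruction multiplies a 2x8 int8 tile of A by an
// 8x2 int8 tile of B (stored as two rows of 8), accumulating a 2x2
// int32 result -- the inner step of a quantized GEMM.
static int32x4_t tile_smmla(int32x4_t acc2x2, int8x16_t a2x8, int8x16_t b2x8) {
    return vmmlaq_s32(acc2x2, a2x8, b2x8);
}

int main(void) {
    int8_t a[16], b[16];
    for (int i = 0; i < 16; i++) { a[i] = (int8_t)i; b[i] = 2; }
    printf("dot = %d\n", dot16_sdot(a, b));  // 2*(0+1+...+15) = 240

    int32x4_t t = tile_smmla(vdupq_n_s32(0), vld1q_s8(a), vld1q_s8(b));
    printf("smmla C[0][0] = %d\n", vgetq_lane_s32(t, 0));  // 2*(0+...+7) = 56
    return 0;
}
```

A single SMMLA retires 16 multiply-accumulates, which is where the quoted generational IPC gains on quantized inference come from.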
Redis is a fast, open-source, in-memory NoSQL database primarily used as a key-value store. Unlike traditional relational databases that store data on disk, Redis keeps data directly in RAM, enabling extremely low-latency, high-throughput operations. Redis workloads can be characterized as memory-bound, CPU-bound, or network-bound depending on request patterns, dataset size, and concurrency levels. Common Redis workloads include caching, real-time analytics, session management, and message brokering. The table below maps these aspects to the components they stress; a minimal client sketch follows it.
| Redis Operation Aspect | Operations Performed | Arm Architecture/Microarchitecture Stressed Components |
| --- | --- | --- |
| Memory-bound workloads (dataset in RAM) | Memory access, address pointer calculations | Cache hierarchy (L1/L2 cache), memory controller, prefetch units |
| CPU-bound workloads (complex commands) | Integer arithmetic, command parsing | Integer ALUs, pipeline, branch predictor, instruction decoder |
| Single-thread command execution | Instruction decoding, branch prediction | CPU core IPC, pipeline efficiency, branch misprediction penalty |
| Auxiliary/background tasks | Context switching, synchronization | Multi-core interconnect, cache coherence, context switching |
| Network-bound workloads | I/O processing, DMA data transfer | Network interfaces, DMA controllers, interrupt controllers |
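For a feel of how these paths get exercised, below is a minimal client sketch using the open-source hiredis C library. It assumes a Redis server listening on localhost:6379. The pipelining (queueing many commands before reading any replies) is what shifts a run from network-bound round-trips toward the CPU- and memory-bound command processing described above.

```c
// Pipelined SET traffic against a local Redis server via hiredis
// (https://github.com/redis/hiredis). Build with: gcc redis_pipe.c -lhiredis
#include <hiredis/hiredis.h>
#include <stdio.h>

int main(void) {
    redisContext *c = redisConnect("127.0.0.1", 6379);
    if (c == NULL || c->err) {
        fprintf(stderr, "connect failed: %s\n", c ? c->errstr : "out of memory");
        return 1;
    }

    // Queue 1000 SET commands without waiting for replies (pipelining),
    // keeping the server's command parser and memory allocator busy.
    for (int i = 0; i < 1000; i++)
        redisAppendCommand(c, "SET key:%d %d", i, i);

    // Drain the replies: each redisGetReply consumes one queued response.
    for (int i = 0; i < 1000; i++) {
        redisReply *r;
        if (redisGetReply(c, (void **)&r) != REDIS_OK) break;
        freeReplyObject(r);
    }

    redisFree(c);
    return 0;
}
```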
SPECjbb (Standard Performance Evaluation Corporation Java Business Benchmark) measures server-side Java performance by simulating a three-tier client-server business application. It models workloads typical of enterprise Java applications, focusing on middle-tier business logic rather than network or disk I/O, and it stresses Java Virtual Machine (JVM) performance, especially in server environments with many threads and complex object manipulation. The table below breaks the benchmark into steps; a pointer-chasing sketch follows it.
| SPECjbb Step | Operations Performed | Arm Architecture/Microarchitecture Stressed Components |
| --- | --- | --- |
| Transaction Generation | Random number generation, branch prediction | Branch predictor, integer ALUs, instruction decoder |
| Business Logic Execution | Object creation/deletion, integer arithmetic | ALUs (integer units), load/store units, memory hierarchy (L1/L2 cache) |
| Data Structure Manipulation | Pointer chasing, memory access, hashing | Cache hierarchy, memory controller, TLB (translation lookaside buffer) |
| Synchronization and Threading | Lock management, context switching | CPU pipeline management, multi-core interconnect, cache coherence |
| Garbage Collection (JVM overhead) | Memory scanning, pointer updates | Memory subsystem, branch prediction, arithmetic units |
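SPECjbb itself runs on a JVM, but the pointer-chasing behavior in the Data Structure Manipulation row is easy to reproduce in isolation. The following C microbenchmark is a rough, illustrative sketch, assuming a Linux/POSIX environment: because each load depends on the previous one, the measured time reflects cache-hierarchy and TLB latency rather than ALU throughput, which is exactly why this step stresses the memory subsystem.

```c
// Pointer-chasing microbenchmark: dependent loads through a random
// cyclic permutation defeat hardware prefetchers, exposing raw
// cache/TLB latency.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1u << 22)     /* 4M entries * 8 B = 32 MB, larger than typical LLC slices */
#define ITERS 10000000L  /* dependent loads to time */

int main(void) {
    size_t *next = malloc(N * sizeof *next);
    if (!next) return 1;
    for (size_t i = 0; i < N; i++) next[i] = i;

    // Sattolo's algorithm: an in-place shuffle that always yields a
    // single N-cycle, so the chase visits every entry before repeating.
    srand(42);
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;
        size_t t = next[i]; next[i] = next[j]; next[j] = t;
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    size_t p = 0;
    for (long i = 0; i < ITERS; i++) p = next[p];  // each load waits on the last
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (double)(t1.tv_nsec - t0.tv_nsec);
    printf("avg dependent-load latency: %.1f ns (checksum %zu)\n", ns / ITERS, p);
    free(next);
    return 0;
}
```

On a core with large TLBs and deep out-of-order windows, the reported latency drops noticeably; that is the microarchitectural headroom the table rows point at.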
As AI becomes the dominant driver of datacenter architecture, infrastructure must evolve beyond one-size-fits-all design thinking. The Neoverse V-series shows how workload-optimized design (wider pipelines, advanced branch prediction, deeper out-of-order execution, scalable vectors, SIMD acceleration, and expanded memory systems) translates directly into measurable gains across AI inference, enterprise software, and real-time services.
For hyperscalers, cloud providers, and enterprises, the choice is no longer between peak performance and efficiency. With Arm Neoverse, both are achievable together. Generation-over-generation improvements demonstrate that sustainable performance per watt, coupled with workload-aware microarchitecture, is the path to scaling AI responsibly.
Looking forward, Neoverse is positioned not just as a CPU family, but as the architectural foundation for the AI-first datacenter era, delivering the scalability, flexibility, and efficiency required to power the next decade of innovation.