We are excited to see that Google Axion processors, powered by the Arm Neoverse V2 CPU platform, are now generally available on Google Cloud. The first Axion-based cloud VMs, C4A, deliver giant leaps in performance for CPU-based AI inference and general-purpose cloud workloads.
Axion CPUs continue Google Cloud’s custom silicon initiatives, aiming to boost workload performance and energy efficiency, and mark a significant step in reshaping the cloud landscape for AI. Google chose Arm Neoverse for its performance, efficiency, and flexibility to innovate, and Neoverse’s strong software ecosystem and broad industry adoption ensure easy integration with existing applications.
The Arm Neoverse V2 platform features new hardware extensions, such as SVE/SVE2, BF16, and I8MM, which significantly enhance machine learning performance compared to the previous-generation Neoverse N1. These extensions improve vector processing, BFloat16 operations, and integer matrix multiplication, enabling V2-based CPUs to perform up to 4x more MAC operations per cycle than the N1.
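As a concrete illustration of what one of these extensions buys: a single SMMLA instruction from the I8MM extension multiplies a 2x8 tile of int8 values by an 8x2 tile and accumulates a 2x2 int32 result, which is 32 multiply-accumulates in one instruction. The numpy sketch below shows only the semantics of that operation, not the instruction itself:

```python
import numpy as np

# Each 128-bit vector holds a 2x8 tile of int8 values. SMMLA (I8MM)
# multiplies the first tile by the transpose of the second and
# accumulates the 2x2 result into int32 lanes: 32 MACs per instruction.
a = np.random.randint(-128, 128, size=(2, 8), dtype=np.int8)
b = np.random.randint(-128, 128, size=(2, 8), dtype=np.int8)
acc = np.zeros((2, 2), dtype=np.int32)

acc += a.astype(np.int32) @ b.astype(np.int32).T
print(acc)
```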
Modern AI is built on open source and features numerous leading open-source projects. At Arm, we have collaborated with our partners over recent years to enhance the performance of these projects. In many instances, we have used Arm Kleidi technology to improve performance on Arm Neoverse; it is accessible through the Arm Compute Library and the KleidiAI library.
The Llama family of models, developed by Meta, consists of state-of-the-art large language models designed for various generative tasks, with sizes ranging from 1 billion to 405 billion parameters. These models are optimized for performance and can be fine-tuned for specific applications, making them versatile across natural language processing tasks.
Llama.cpp is a C++ implementation that enables efficient inference of these models across different hardware platforms. It supports the Q4_0 quantization scheme, which reduces model weights to 4-bit integers.
To demonstrate the capability of Arm-based server CPUs for LLM inferencing, Arm software teams and our partners optimized the int4 kernels in llama.cpp to leverage these newer instructions. Specifically, we added three new quantization formats: Q4_0_4_4 for devices with only NEON support, Q4_0_4_8 for devices with SVE/SVE2 and I8MM support, and Q4_0_8_8 for devices with 256-bit SVE support.
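For intuition, here is a minimal numpy sketch of how a single Q4_0 block dequantizes, assuming ggml’s block layout (32 weights per block: one fp16 scale followed by 16 bytes of packed 4-bit values). The Q4_0_4_4/Q4_0_4_8/Q4_0_8_8 variants interleave several such blocks for efficient vector loads, but the arithmetic is the same:

```python
import numpy as np

def dequantize_q4_0_block(block: bytes) -> np.ndarray:
    """Dequantize one 18-byte Q4_0 block (ggml layout) into 32 floats."""
    scale = np.frombuffer(block[:2], dtype=np.float16)[0]  # fp16 scale
    qs = np.frombuffer(block[2:18], dtype=np.uint8)        # 16 packed bytes
    lo = (qs & 0x0F).astype(np.int8) - 8   # weights 0..15 (low nibbles)
    hi = (qs >> 4).astype(np.int8) - 8     # weights 16..31 (high nibbles)
    return np.float32(scale) * np.concatenate([lo, hi]).astype(np.float32)

# Example: a block whose nibbles are all 8 dequantizes to zeros.
block = np.float16(0.25).tobytes() + bytes([0x88] * 16)
print(dequantize_q4_0_block(block))
```

Subtracting 8 re-centers the unsigned nibbles around zero, so the quantized weights can feed signed int8 arithmetic, like the SMMLA operation sketched earlier, with the scale applied afterwards.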
It should come as no surprise, then, that Axion-based VMs deliver up to 2x better performance than current-generation x86 instances for prompt processing and token generation.
We ran the Llama 3.1 8B model on all instances using the recommended 4-bit quantization scheme for each instance. Axion numbers were generated on a c4a-standard-48 instance with the Q4_0_4_8 quantization scheme, while Ampere Altra numbers were generated with Q4_0_4_4 on a t2a-standard-48 instance. The x86 numbers were generated on c4-standard-48 (Intel Emerald Rapids) and c3d-standard-60 (AMD Genoa) instances using the Q4_0 quantization format. On all instances, the thread count was set to 48.
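If you would like to run a comparable measurement yourself, a sketch along the following lines drives llama.cpp’s llama-bench utility. The model filename and token counts are illustrative assumptions, and the flags assume a llama.cpp build from the period when the Q4_0_4_8 format was upstream:

```python
import subprocess

# Illustrative paths and settings; adjust for your build and model file.
cmd = [
    "./llama-bench",
    "-m", "llama-3.1-8b-q4_0_4_8.gguf",  # model requantized to Q4_0_4_8
    "-t", "48",                          # 48 threads, matching the VMs above
    "-p", "512",                         # prompt-processing tokens
    "-n", "128",                         # generated tokens
]
# llama-bench reports prompt-processing and token-generation rates (tokens/s).
subprocess.run(cmd, check=True)
```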
Running BERT on C4A VMs showcased impressive speed-ups, significantly reducing latency and increasing throughput. In this case, we ran the MLPerf BERT model with PyTorch 2.2.1 in Single Stream mode (where the batch size is 1) on various GCP instances and measured the 90th-percentile latency.
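As a rough sketch of that methodology (not the MLPerf harness itself), the snippet below times single-query BERT inference in PyTorch and reports the 90th-percentile latency; the Hugging Face checkpoint and 384-token sequence length are stand-in assumptions:

```python
import time
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

# Stand-in for the MLPerf BERT model (MLPerf uses BERT-large on SQuAD).
tok = AutoTokenizer.from_pretrained("bert-large-uncased")
model = AutoModel.from_pretrained("bert-large-uncased").eval()

inputs = tok("What workloads run well on Axion?", padding="max_length",
             max_length=384, return_tensors="pt")

latencies = []
with torch.inference_mode():
    for _ in range(100):               # Single Stream: one query at a time
        start = time.perf_counter()
        model(**inputs)
        latencies.append(time.perf_counter() - start)

print(f"p90 latency: {np.percentile(latencies, 90) * 1000:.1f} ms")
```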
Moreover, Google Axion’s capabilities extend beyond LLMs to image recognition models, with ResNet-50 benefiting from the hardware’s advanced features. The BF16 and I8MM instructions enable fast reduced-precision computation, showcasing Axion’s performance advantages over x86-based instances.
Here, we ran the MLPerf ResNet-50 model with PyTorch 2.2.1 in Single Stream mode (where the batch size is 1) on various GCP instances.
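The following is a minimal single-stream timing sketch for ResNet-50. The DNNL_DEFAULT_FPMATH_MODE setting is the usual way to request BF16 fast math from oneDNN (which backs PyTorch CPU operators, via the Arm Compute Library on aarch64), and the warm-up and iteration counts are illustrative:

```python
import os
# Ask oneDNN to use BF16 fast math where the hardware supports it.
# This must be set before torch initializes oneDNN.
os.environ.setdefault("DNNL_DEFAULT_FPMATH_MODE", "BF16")

import time
import torch
from torchvision.models import resnet50

model = resnet50(weights=None).eval()   # random weights; timing only
x = torch.randn(1, 3, 224, 224)         # Single Stream: batch size 1

with torch.inference_mode():
    for _ in range(10):                 # warm-up
        model(x)
    start = time.perf_counter()
    for _ in range(100):
        model(x)
    elapsed = time.perf_counter() - start

print(f"mean latency: {elapsed / 100 * 1000:.1f} ms")
```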
XGBoost, a leading machine-learning library for regression, classification, and ranking problems, takes 24% to 48% less time to train and predict on Axion than on comparable x86 instances on Google Cloud.
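A comparison along those lines can be reproduced with a few lines of standard XGBoost; the synthetic dataset, tree_method, thread count, and round count below are illustrative assumptions rather than the benchmark’s exact configuration:

```python
import time
import numpy as np
import xgboost as xgb

# Synthetic regression data, a stand-in for the benchmark's dataset.
rng = np.random.default_rng(0)
X = rng.standard_normal((200_000, 50)).astype(np.float32)
y = X @ rng.standard_normal(50) + rng.standard_normal(200_000)

dtrain = xgb.DMatrix(X, label=y)

start = time.perf_counter()
booster = xgb.train({"tree_method": "hist", "nthread": 48},
                    dtrain, num_boost_round=100)
print(f"train: {time.perf_counter() - start:.2f} s")

start = time.perf_counter()
booster.predict(dtrain)
print(f"predict: {time.perf_counter() - start:.2f} s")
```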
Google Cloud C4A VMs are an excellent choice for AI inference, capable of handling a wide range of workloads, from traditional machine learning tasks like XGBoost to generative AI applications such as Llama. This blog highlights how Axion-based VMs outperform previous-generation Neoverse N1-based VMs and other x86 alternatives on Google Cloud.
To maximize your experience with Google Axion, we have assembled a variety of resources:
• Migration to Axion with Arm Learning Paths: Simplify your shift to Axion instances using detailed guides and best practices.
• Arm Software Ecosystem Dashboard: Keep informed about the latest software support available for Arm.
• Arm Developer Hub: Whether you're new to Arm or seeking resources to develop high-performing software solutions, the Arm Developer Hub offers everything you need to build better software and provide rich experiences across billions of devices. Engage with our growing global developer community through downloads, learning opportunities, and discussions.
Try C4A on Google Cloud