We are excited to see that Google Axion processors, powered by the Arm Neoverse V2 CPU platform, are now generally available on Google Cloud. The first Axion-based cloud VMs, C4A, deliver giant leaps in performance for CPU-based AI inferencing and general-purpose cloud workloads.
Axion CPUs continue Google Cloud's custom silicon initiatives, which aim to boost workload performance and energy efficiency, and mark a significant advancement in reworking the cloud landscape for AI. Google chose Arm Neoverse for its performance, efficiency, and flexibility to innovate. A strong software ecosystem and broad industry adoption also ensure easy integration with existing applications.
The Arm Neoverse V2 platform features new hardware extensions, such as SVE/2, BF16, and I8MM, which significantly enhance machine learning performance compared to the previous-generation Neoverse N1. These extensions improve vector processing, BFloat16 operations, and integer matrix multiplication, enabling V2-based CPUs to perform up to 4x more MAC operations per cycle than the N1.
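To check whether a given VM exposes these extensions, one option is to read the CPU feature flags that Linux reports. The following is a minimal sketch, assuming an arm64 Linux guest where `/proc/cpuinfo` contains a `Features` line with flags named `sve`, `sve2`, `bf16`, and `i8mm`. The helper names `parse_features` and `ml_extension_report` are hypothetical and exist only for this illustration.

```python
from pathlib import Path

# Flag names as assumed to appear on the "Features" line of /proc/cpuinfo
# on arm64 Linux. Other operating systems report features differently.
ML_FLAGS = ("sve", "sve2", "bf16", "i8mm")


def parse_features(cpuinfo_text):
    """Return the flags on the first 'Features' line, or an empty set."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() == "features":
            return set(value.split())
    return set()


def ml_extension_report(cpuinfo_text):
    """Map each ML-related flag to True/False depending on its presence."""
    features = parse_features(cpuinfo_text)
    return {flag: flag in features for flag in ML_FLAGS}


if __name__ == "__main__":
    path = Path("/proc/cpuinfo")
    text = path.read_text() if path.exists() else ""
    print(ml_extension_report(text))
```

On a machine without a `Features` line (for example an x86 instance), every entry in the report is simply `False`.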
AI development is built on open source, with numerous leading open-source projects at its core. At Arm, we have collaborated with our partners over recent years to enhance the performance of these projects. In many instances, we have used Arm Kleidi technology, which is accessible through the Arm Compute Library and the KleidiAI library, to improve performance on Arm Neoverse.
Llama, developed by Meta, is a family of state-of-the-art large language models designed for a variety of generative tasks, with sizes ranging from 1 billion to 405 billion parameters. These models are optimized for performance and can be fine-tuned for specific applications, making them versatile across natural language processing tasks.
Llama.cpp is a C++ implementation that enables efficient inference of these models across different hardware platforms. It supports the Q4_0 quantization scheme, which reduces model weights to 4-bit integers.
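To give a feel for what reducing weights to 4-bit integers means, here is a simplified sketch of block-wise 4-bit quantization. It is not the llama.cpp implementation and does not reproduce its storage layout: the block length of 32, the symmetric code range of -8 to 7, and the function names are all assumptions made for illustration.

```python
BLOCK_SIZE = 32  # assumed block length for this illustration


def quantize_block(weights):
    """Quantize one block of floats to 4-bit signed codes plus one scale."""
    amax = max(abs(w) for w in weights)
    # One scale per block; 7 is the largest positive 4-bit signed code.
    scale = amax / 7.0 if amax else 1.0
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate float weights from the codes and the scale."""
    return [scale * c for c in codes]


def quantize(weights):
    """Split a flat weight list into blocks and quantize each block."""
    return [
        quantize_block(weights[start:start + BLOCK_SIZE])
        for start in range(0, len(weights), BLOCK_SIZE)
    ]
```

Each block stores one floating-point scale and a 4-bit code per weight, so the reconstruction error per weight is at most half of the block's scale in this sketch.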
To demonstrate the capability of Arm-based server CPUs for LLM inferencing, Arm software teams and our partners optimized the int4 kernels in llama.cpp to leverage these newer instructions. Specifically, we added three new quantization formats: Q4_0_4_4 for devices with only NEON support, Q4_0_4_8 for devices with SVE/2 and I8MM support, and Q4_0_8_8 for devices with 256-bit SVE support.
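The mapping from CPU capabilities to quantization format described above can be sketched as a small selection helper. This is an illustration only: `pick_q4_format` is a hypothetical function, the flag names (`asimd` for NEON, `sve`, `sve2`, `i8mm`) assume Linux-style feature strings, the SVE vector length is passed in by the caller, and the priority order and the `Q4_0` fallback are assumptions rather than documented llama.cpp behavior.

```python
def pick_q4_format(features, sve_bits=0):
    """Suggest a 4-bit format name from CPU flags and SVE vector length.

    features: set of CPU flag strings (assumed Linux-style names).
    sve_bits: SVE vector length in bits, or 0 if unknown or unsupported.
    """
    has_sve = "sve" in features or "sve2" in features
    if has_sve and sve_bits == 256:
        return "Q4_0_8_8"  # devices with 256-bit SVE
    if has_sve and "i8mm" in features:
        return "Q4_0_4_8"  # devices with SVE/2 and I8MM
    if "asimd" in features:
        return "Q4_0_4_4"  # devices with only NEON
    return "Q4_0"  # assumed fallback for everything else


if __name__ == "__main__":
    # A CPU with SVE2 and I8MM at a 128-bit vector length.
    print(pick_q4_format({"asimd", "sve", "sve2", "i8mm", "bf16"}, sve_bits=128))
    # A NEON-only CPU.
    print(pick_q4_format({"asimd"}))
```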
So, it should not come as a surprise that Axion-based VMs deliver up to 2x better performance compared to current generation x86 instances for prompt processing and token generation.
We ran the Llama 3.1 8B model on all instances using the recommended 4-bit quantization scheme for each instance. Axion numbers were generated on a c4a-standard-48 instance with the Q4_0_4_8 quantization scheme, while Ampere Altra numbers were generated with Q4_0_4_4 on a t2a-standard-48 instance. The x86 numbers were generated on c4-standard-48 (Intel Emerald Rapids) and c3d-standard-60 (AMD Genoa) using the Q4_0 quantization format. On all instances, the number of threads was set to 48.
Running BERT on C4A VMs showcased impressive speed-ups, reducing latency and increasing throughput significantly. In this case, we ran the MLPerf BERT model with PyTorch 2.2.1 in Single Stream mode (where the batch size is 1) on various GCP instances and measured the 90th percentile latency.
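As a rough illustration of the metric used here, the sketch below times one query at a time (batch size 1) and reports the 90th percentile of the recorded latencies. It is not the MLPerf harness: `single_stream_latencies` and `percentile_nearest_rank` are hypothetical helpers, `run_once` stands in for a single model inference call, and the nearest-rank percentile definition is an assumption that may differ from the one MLPerf uses.

```python
import math
import time


def percentile_nearest_rank(samples, pct):
    """Return the pct-th percentile of samples using the nearest-rank rule."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def single_stream_latencies(run_once, num_queries):
    """Issue queries one at a time and record each latency in seconds."""
    latencies = []
    for _ in range(num_queries):
        start = time.perf_counter()
        run_once()  # placeholder for one batch-size-1 inference
        latencies.append(time.perf_counter() - start)
    return latencies


if __name__ == "__main__":
    # Stand-in workload; replace with a real model call.
    samples = single_stream_latencies(lambda: sum(range(10_000)), 100)
    print(f"p90 latency: {percentile_nearest_rank(samples, 90) * 1e3:.3f} ms")
```

Reporting a high percentile rather than the mean captures tail latency, which matters for interactive, one-request-at-a-time serving.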
Moreover, Google Axion's capabilities extend beyond LLMs to image recognition models, with ResNet-50 benefiting from the hardware's advanced features. BF16 and I8MM instructions accelerate the reduced-precision matrix operations these models rely on, enabling faster inference and showcasing Axion's performance advantages over x86-based instances.
Here, we ran the MLPerf ResNet-50 PyTorch model with PyTorch 2.2.1 in Single Stream mode (where the batch size is 1) on various GCP instances.
XGBoost, a leading machine learning library of algorithms for regression, classification, and ranking problems, takes 24% to 48% less time to train and predict on Axion than on similar x86 instances on Google Cloud.
Google Cloud C4A VMs are an excellent choice for AI inference, capable of handling a wide range of workloads from traditional machine learning tasks like XGBoost to generative AI applications such as Llama. This blog highlights how Axion-based VMs outperform previous-generation Neoverse N1-based VMs and x86 alternatives on Google Cloud.
To maximize your experience with Google Axion, we have assembled a variety of resources:
• Migration to Axion with Arm Learning Paths: Simplify your shift to Axion instances using detailed guides and best practices.
• Arm Software Ecosystem Dashboard: Keep informed about the latest software support available for Arm.
• Arm Developer Hub: Whether you're new to Arm or seeking resources to develop high-performing software solutions, the Arm Developer Hub offers everything you need to build better software and provide rich experiences across billions of devices. Engage with our growing global developer community through downloads, learning opportunities, and discussions.
Try C4A on Google Cloud