Harness the Power of Retrieval-Augmented Generation with Arm Neoverse-powered Google Axion Processors

Na Li
April 7, 2025
8 minute read time.
This blog post was co-authored by Na Li and Koray Ozkal.

Curious how to prevent AI chatbots from occasionally providing outdated or inaccurate answers? Retrieval-Augmented Generation (RAG) offers a powerful solution for enhancing their accuracy and relevance. 

This blog explores the performance benefits of RAG and provides pointers for building a RAG application on Arm® Neoverse™-based Google Axion processors for optimized AI workloads. In our testing, Google Axion processors delivered up to 2.5x higher performance and 64% cost savings compared to x86 alternatives. They accelerate RAG inference, enabling faster knowledge lookups, lower-latency responses, and more efficient AI inference, all of which are critical for real-time, dynamic AI applications.

Understanding RAG: An Efficient Approach to AI Text Generation

RAG is a popular AI framework for retrieving relevant external knowledge in real-time, enhancing the quality and relevance of the generated text for large language models (LLMs). Instead of relying solely on static, pre-trained datasets, RAG dynamically integrates up-to-date information with external resources, resulting in more precise and contextually relevant outputs. This makes RAG highly effective for real-world applications such as customer support chatbots, agentic tools, and dynamic content generation.
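To make the retrieve-then-generate flow concrete, here is a minimal, self-contained sketch in Python. The bag-of-words "embedding", the cosine ranking, and the example documents are illustrative stand-ins for a real embedding model and vector database; in a production RAG system, the final prompt would be sent to an LLM for generation.

```python
import math
from collections import Counter

def embed(text):
    """Toy 'embedding': a bag-of-words count vector (stand-in for a real model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, documents, k=1):
    """Retrieval stage: rank external documents by similarity to the query."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def build_prompt(query, context_chunks):
    """Generation stage input: combine retrieved context with the user query."""
    context = "\n".join(context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

# Tiny illustrative knowledge base (not a real vector database).
docs = [
    "Google Axion is an Arm Neoverse V2 based CPU for Google Cloud.",
    "Quantization reduces model size by storing weights in fewer bits.",
]
query = "Which CPU powers Google Axion?"
top = retrieve(query, docs)
prompt = build_prompt(query, top)
```

Because only the knowledge base changes between runs, updating what the chatbot "knows" means updating `docs`, not the model weights.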

When Should You Choose RAG over Fine-Tuning or Re-Training?

Foundation LLMs have reshaped AI with human-like text generation, but their effectiveness depends on access to the up-to-date information your organization requires. Re-training and fine-tuning a pre-trained LLM are two popular options for integrating additional knowledge. Re-training an LLM is a resource-intensive and complex process. Fine-tuning, on the other hand, adjusts an LLM by training it on specific datasets, tailoring the model's weights to perform better on the task. However, it still requires periodic redeployment to stay current.

In general, it’s essential to evaluate the capabilities and limitations of LLMs when integrating them into your AI strategy.

Key considerations include: 

  • Training dataset limitations: LLMs may struggle to provide accurate or current information on topics not included in their training dataset.
  • High resource requirements: Re-training these large models requires extensive compute power and engineering resources, making frequent updates impractical.
  • Restricted access to internal knowledge: Since business-critical data is protected behind firewalls, LLMs cannot incorporate proprietary information through periodic retraining, potentially limiting their relevance for internal use.

The RAG Advantage

Instead of modifying the LLM, RAG updates only the knowledge base with external data sources, combining dynamic information retrieval with the generative capabilities of language models. If your domain knowledge frequently evolves, RAG is an ideal solution for maintaining accuracy and relevance, and for mitigating hallucinations.

RAG In Action: A Side-by-Side Comparison

Now let's look at an example comparing a chatbot powered by a general-purpose LLM (left) with one enhanced by RAG (right). On the left, the chatbot struggles to provide an accurate response to the user’s query due to outdated information or a lack of domain-specific knowledge. The RAG-enhanced chatbot, however, delivers a precise and relevant response by retrieving the latest information from the uploaded document.

Image 1: An example of a chatbot powered by a general-purpose LLM (left) and a chatbot enhanced with RAG (right).

Why Choose Google Axion for RAG?

Arm Neoverse-based Google Axion processors provide a strong platform for LLM inference, offering high performance and efficiency for your RAG applications.

Optimized AI Acceleration: Arm Neoverse-based CPUs are built with high-throughput vector processing and matrix multiplication capabilities, essential for handling RAG efficiently.

Efficiency and Scalability for Cloud: Arm Neoverse-based CPUs are engineered to maximize performance per watt, providing a balance of high-speed processing and power efficiency. This makes them particularly suited for RAG applications that require both rapid inference and cost-effectiveness in the cloud. Neoverse processors are also designed to scale across AI workloads, ensuring seamless integration across various RAG use cases.

Software Ecosystem for AI Developers: For developers looking to leverage the latest AI features on Arm-based infrastructure, Arm® Kleidi technology enhances performance and improves efficiency for RAG applications. Already integrated into open-source AI and ML frameworks like PyTorch, TensorFlow, and llama.cpp, Arm Kleidi enables developers to achieve out-of-the-box inference performance by default—eliminating the need for vendor add-ons or complex optimizations.

The combination of these features translates into significant performance gains: the first Google Axion-based cloud VM, C4A, delivers substantial improvements for CPU-based AI inferencing and general-purpose cloud workloads compared to x86 alternatives, making C4A VMs a great choice for running RAG applications on Google Cloud [Reference 1,2].

Google Axion Performance Benchmarks: Faster Processing & Higher Efficiency

Inference with a RAG system involves two key stages: information retrieval and response generation.

  1. Information retrieval: The system searches a vector database to find relevant content based on a user’s query.
  2. Response generation: The retrieved content is combined with a user query to generate a precise, contextually relevant response.

Generally, the retrieval speed depends on the database size and search algorithm efficiency. Optimized algorithms can return results within milliseconds when running on Arm Neoverse-based CPUs. The retrieved information is then combined with the user's input to construct a new prompt, which is sent to an LLM for inference and response generation. Compared to retrieval, response generation takes longer, and overall inference latency in RAG systems is heavily influenced by the speed of LLM inferencing.
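The latency breakdown described above can be sketched as a simple sum of the retrieval, prompt-processing, and token-generation stages. The throughput figures in the example below are illustrative placeholders, not measured results; they only show why generation dominates end-to-end latency when retrieval takes milliseconds.

```python
def rag_latency_s(retrieval_ms, n_input_tokens, n_output_tokens,
                  prompt_speed_tps, gen_speed_tps):
    """Approximate end-to-end RAG inference latency in seconds.

    retrieval_ms     -- vector-database lookup time (typically milliseconds)
    prompt_speed_tps -- prompt-processing throughput, tokens/second
    gen_speed_tps    -- token-generation throughput, tokens/second
    """
    return (retrieval_ms / 1000.0
            + n_input_tokens / prompt_speed_tps
            + n_output_tokens / gen_speed_tps)

# Illustrative numbers only (not measured): a 5 ms lookup, a 2048-token
# prompt at 400 tok/s, and 256 output tokens at 20 tok/s.
total = rag_latency_s(5, 2048, 256, 400, 20)  # 0.005 + 5.12 + 12.8 seconds
```

With these placeholder speeds, retrieval contributes well under 0.1% of the total, which is why the cost analysis below excludes it.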

We evaluated RAG inferencing performance across multiple Google Cloud VMs using the llama.cpp benchmark and the Llama 3.1 8B model with the Q4_0 quantization scheme. We conducted all tests using 48 threads, with an input token size of 2048 and an output token size of 256. Below are the test configurations:

  • Google Axion (C4A, Neoverse V2): tested on a c4a-standard-48 instance.
  • Intel Xeon (C4, Emerald Rapids): tested on a c4-standard-48 instance.
  • AMD EPYC (C3D, Genoa): tested on a c3d-standard-60 instance with 48 cores enabled.

Faster Processing & Higher Efficiency with Google Axion Processors: Up to 2.5x Higher Performance and 64% Cost Savings Compared to x86 Alternatives

Inferencing performance was measured based on prompt processing speed and token generation speed. The benchmark results below reveal that Google Axion-based C4A VMs achieve up to 2.5x higher performance in both prompt processing and token generation compared to current-generation x86 instances [Figure 1].

Google Axion Rag Benchmarks

Figure 1: Performance comparison of prompt processing (left) and token generation (right) with current generation of x86 instances when running the Llama 3.1 8B/Q4 model.

Cost Efficiency: Lowering RAG Inference Costs

To evaluate instance costs for inference tasks, we measured the latency from prompt submission to response generation. Several factors affect latency, including retrieval speed, prompt processing efficiency, token generation rate, input and output token sizes, and user batch size. Since information retrieval latency is typically in the millisecond range and negligible compared to the other factors, it was excluded from our calculations. We selected a batch size of 1 to ensure a fair comparison at the single-user level and, for consistency with the benchmarks above, set the input and output token sizes to 2048 and 256, respectively.

We first calculated the latency for prompt processing and token generation from the prompt encoding speed and token generation speed, computed the cost per request using the instance pricing chart on Google Cloud [Reference 3], and then normalized the numbers to the maximum cost across all three instances. The results indicate that Axion-based VMs provide up to 64% cost savings, requiring only about one-third of the cost to process each request compared to current-generation x86 instances [Figure 2].
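The cost calculation described above can be sketched as follows. The hourly prices and token speeds in this example are hypothetical placeholders, not the measured benchmark data or published pricing; they are included only to show the latency-times-price calculation and the normalization to the most expensive instance.

```python
def cost_per_request(price_per_hour, n_input, n_output, pp_tps, tg_tps):
    """Cost model from the text: per-request latency times hourly price.

    Latency = prompt-processing time + token-generation time; retrieval
    time is omitted as negligible. Throughputs are tokens/second.
    """
    latency_s = n_input / pp_tps + n_output / tg_tps
    return latency_s / 3600.0 * price_per_hour

# Hypothetical per-instance figures purely to demonstrate the method
# (these are NOT the measured speeds or actual Google Cloud prices).
instances = {
    "c4a-standard-48": cost_per_request(2.0, 2048, 256, 500, 25),
    "c4-standard-48":  cost_per_request(2.4, 2048, 256, 250, 12),
    "c3d-standard-60": cost_per_request(2.6, 2048, 256, 220, 11),
}
max_cost = max(instances.values())
normalized = {name: c / max_cost for name, c in instances.items()}
```

Normalizing to the most expensive instance makes the comparison price-chart independent: each entry reads directly as a fraction of the worst-case cost per request.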

Normalized Cost Comparison

Figure 2: Normalized cost comparison of processing an inference request with RAG.1

Get Started: Build Your RAG Application on Arm

With Arm Neoverse at its core, Google Axion-powered instances deliver high performance at a lower cost, enabling enterprises to build scalable, efficient RAG applications while minimizing infrastructure expenses compared to x86 alternatives.

To help you get started, we have developed a step-by-step demo and a Learning Path [Reference 4] that walks you through building a basic RAG system using LLM and data sources of your choice.

Ready to Begin?

  • Try our demo and follow our Learning Path to experience RAG on Arm-powered Google Axion processors.

If you are new to the Arm ecosystem, here are additional resources to help with your journey:

  • Migration to Axion with Arm Learning Paths: Simplify your shift to Axion instances using detailed guides and best practices.
  • Arm Software Ecosystem Dashboard: Stay informed about the latest software supported on Arm.
  • Arm Developer Hub: Whether you're new to Arm or seeking resources to develop high-performing software solutions, the Arm Developer Hub offers everything you need to build better software and deliver rich experiences across billions of devices. Engage with our growing global developer community through resources, learning opportunities, and discussions.

                                                                                  Try C4A on Google Cloud

Join Us at Google Cloud Next 2025 to Experience the Power of Arm Neoverse-based Google Axion Processors

We're thrilled to showcase the power of Google Axion processors at Google Cloud Next this week in Las Vegas from April 9-11. Attendees can experience firsthand the unmatched performance and efficiency that Google Axion processors bring to cloud workloads through live demonstrations, interactive breakout sessions, and expert-led discussions.

Join us at Booth #1611, meet with Arm specialists, and explore how the Arm Cloud Migration initiative can streamline your migration to Google Axion-based C4A VMs. Begin your migration journey today and unlock the full potential of your cloud and AI workloads with Arm Neoverse.

References:

  1. Google Axion processors are now generally available on Google Cloud. See https://cloud.google.com/blog/products/compute/try-c4a-the-first-google-axion-processor for more details.
  2. https://community.arm.com/arm-community-blogs/b/servers-and-cloud-computing-blog/posts/ai-inference-on-google-axion-cpu
  3. https://cloud.google.com/compute/vm-instance-pricing, as of 03/05/2025.
  4. https://learn.arm.com/learning-paths/servers-and-cloud-computing/rag/

Footnotes:

[1] Cost calculation is based on published instance pricing as of 2025/03/05, https://cloud.google.com/compute/vm-instance-pricing.
