Curious how to prevent AI chatbots from occasionally providing outdated or inaccurate answers? Retrieval-Augmented Generation (RAG) offers a powerful solution for enhancing their accuracy and relevance.
This blog explores the performance benefits of RAG and provides pointers for building a RAG application on Arm® Neoverse™-based Google Axion processors for optimized AI workloads. In our testing, Google Axion processors delivered up to 2.5x higher performance and 64% cost savings compared to x86 alternatives. This acceleration improves RAG performance end to end, enabling faster knowledge lookups, lower-latency responses, and more efficient AI inference, which is critical for real-time, dynamic AI applications.
RAG is a popular AI framework for retrieving relevant external knowledge in real-time, enhancing the quality and relevance of the text generated by large language models (LLMs). Instead of relying solely on static, pre-trained datasets, RAG dynamically integrates up-to-date information from external resources, resulting in more precise and contextually relevant outputs. This makes RAG highly effective for real-world applications such as customer support chatbots, agentic tools, and dynamic content generation.
When Should You Choose RAG over Fine-Tuning or Re-Training?
Foundation LLMs have reshaped AI with human-like text generation, but their effectiveness depends on having the up-to-date information your organization requires. Re-training and fine-tuning are two popular options for integrating additional knowledge into a pre-trained LLM. Re-training an LLM is a resource-intensive and complex process. Fine-tuning, on the other hand, adjusts an LLM by training it on specific datasets, tailoring the model's weights to perform better on the target task. However, it still requires periodic redeployment to stay current.
In general, it’s essential to evaluate an LLM's capabilities and limitations when integrating it into your AI strategy.
Key considerations include:
The RAG Advantage
Instead of modifying the LLM, RAG updates only the knowledge base with external data sources, combining dynamic information retrieval with the generative capabilities of language models. If your domain knowledge frequently evolves, RAG is an ideal solution for maintaining accuracy and relevance, and for mitigating hallucinations.
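To make this concrete, here is a minimal, hypothetical sketch of the point above: keeping a RAG system current means updating the document store, not the model weights. The in-memory store and the substring search below are illustrative stand-ins only; a production system would use a vector database with embedding-based retrieval.

```python
# Minimal sketch: with RAG, new knowledge is ingested into the knowledge
# base, and the LLM itself stays frozen. Illustrative only; a real system
# would use a vector database and semantic (embedding) search.

class KnowledgeBase:
    def __init__(self):
        self.documents = []

    def add(self, text):
        """Ingest a new or updated document; no model retraining needed."""
        self.documents.append(text)

    def search(self, query):
        """Naive keyword match (stand-in for semantic retrieval)."""
        q_words = query.lower().split()
        return [d for d in self.documents
                if any(w in d.lower() for w in q_words)]

kb = KnowledgeBase()
kb.add("Policy v1: refunds are processed within 30 days.")
# Knowledge changed? Just add the new document; the LLM is untouched.
kb.add("Policy v2: refunds are processed within 14 days.")
print(kb.search("refund policy"))
```

A real deployment would also version or expire stale documents so that retrieval favors the newest policy, but the key property is the same: freshness comes from the data layer, not from re-training.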
RAG In Action: A Side-by-Side Comparison
Now let's look at an example comparing a chatbot powered by a general-purpose LLM (left) with one enhanced by RAG (right). On the left, the chatbot struggles to provide accurate responses to the user’s query due to outdated information or a lack of domain-specific knowledge. However, the RAG-enhanced chatbot delivers a precise and relevant response by retrieving the latest information from the uploaded document.
Image 1: An example of a chatbot powered by a general-purpose LLM (left) and a chatbot enhanced with RAG (right).
Arm Neoverse-based Google Axion processors provide a great platform for running AI inferencing of LLMs, offering high performance and efficiency for running your RAG applications.
Optimized AI Acceleration: Arm Neoverse-based CPUs are built with high-throughput vector processing and matrix multiplication capabilities, essential for handling RAG workloads efficiently.
Efficiency and Scalability for Cloud: Arm Neoverse-based CPUs are engineered to maximize performance per watt, providing a balance of high-speed processing and power efficiency. This makes them particularly suited for RAG applications that require both rapid inference and cost-effectiveness in the cloud. Neoverse processors are also designed to scale across AI workloads, ensuring seamless integration across various RAG use cases.
Software Ecosystem for AI Developers: For developers looking to leverage the latest AI features on Arm-based infrastructure, Arm® Kleidi technology enhances performance and improves efficiency for RAG applications. Already integrated into open-source AI and ML frameworks like PyTorch, TensorFlow, and llama.cpp, Arm Kleidi enables developers to achieve out-of-the-box inference performance by default—eliminating the need for vendor add-ons or complex optimizations.
The combination of these features translates into significant performance gains: the first Google Axion-based cloud VM, C4A, delivers substantial improvements for CPU-based AI inferencing and general-purpose cloud workloads compared to x86 alternatives, making C4A VMs a great choice for running RAG applications on Google Cloud [Reference 1, 2].
Google Axion Performance Benchmarks: Faster Processing & Higher Efficiency
Inference with a RAG system involves two key stages: information retrieval and response generation.
Generally, the retrieval speed depends on the database size and search algorithm efficiency. Optimized algorithms can return results within milliseconds when running on Arm Neoverse-based CPUs. The retrieved information is then combined with the user's input to construct a new prompt, which is sent to an LLM for inference and response generation. Compared to retrieval, response generation takes longer, and overall inference latency in RAG systems is heavily influenced by the speed of LLM inferencing.
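As a toy illustration of these two stages, the sketch below uses a simple keyword-overlap retriever in place of a real vector search, then constructs the new prompt that would be sent to the LLM. The documents, query, and prompt template are invented for this example.

```python
# Toy illustration of the two RAG inference stages:
# (1) retrieval, (2) prompt construction for response generation.
import re

def _words(text):
    """Lowercase word tokens with punctuation stripped."""
    return set(re.findall(r"\w+", text.lower()))

def retrieve(query, documents, top_k=1):
    """Rank documents by word overlap with the query; a stand-in
    for an embedding-based vector search."""
    q = _words(query)
    ranked = sorted(documents, key=lambda d: len(q & _words(d)), reverse=True)
    return ranked[:top_k]

def build_prompt(query, context_docs):
    """Combine retrieved context with the user's input into a new prompt."""
    context = "\n".join(context_docs)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

docs = [
    "C4A VMs are powered by Google Axion, an Arm Neoverse V2 based CPU.",
    "Object storage buckets offer eleven nines of durability.",
]
query = "Which CPU powers C4A VMs?"
prompt = build_prompt(query, retrieve(query, docs))
print(prompt)  # this prompt is what gets sent to the LLM
```

The resulting prompt carries the retrieved context alongside the user's question, which is why generation quality improves without touching the model.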
We evaluated RAG inferencing performance across multiple Google Cloud VMs using the llama.cpp benchmark and the Llama 3.1 8B model with the Q4_0 quantization scheme. We conducted all tests using 48 threads, with an input token size of 2048 and an output token size of 256. Below are the test configurations:
Google Axion (C4A, Neoverse V2): Evaluated on c4a-standard-48 instance.
Intel Xeon (C4, Emerald Rapids): Evaluated on c4-standard-48 instance.
AMD EPYC (C3D, Genoa): Tested on c3d-standard-60 with 48 cores enabled.
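For reference, a run with the configuration above could be invoked with llama.cpp's llama-bench tool roughly as follows. The model filename is a placeholder, and exact flags may vary by llama.cpp version, so treat this as a sketch rather than the exact command used in our testing.

```python
# Hedged sketch of a llama-bench invocation matching the test setup.
# The .gguf filename is a placeholder; run on a machine with llama.cpp built.
cmd = [
    "llama-bench",
    "-m", "Llama-3.1-8B-Q4_0.gguf",  # placeholder model file
    "-t", "48",    # 48 threads
    "-p", "2048",  # prompt (input) token size
    "-n", "256",   # generated (output) token size
]
# subprocess.run(cmd) would execute the benchmark; here we just show it.
print(" ".join(cmd))
```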
Inferencing performance was measured based on prompt processing speed and token generation speed. The benchmark results below reveal that Google Axion-based C4A VMs achieve up to 2.5x higher performance in both prompt processing and token generation compared to current-generation x86 instances [Figure 1].
Figure 1: Performance comparison of prompt processing (left) and token generation (right) against the current generation of x86 instances when running the Llama 3.1 8B/Q4 model.
Cost Efficiency: Lowering RAG Inference Costs
To evaluate instance costs for inference tasks, we measured the latency from prompt submission to response generation. Several factors affect latency, including retrieval speed, prompt processing efficiency, token generation rate, input and output token sizes, and user batch size. Since information retrieval latency is typically in the millisecond range and negligible compared to other factors, it was excluded from our calculations. We selected a batch size of 1 to ensure a fair comparison at the single-user level. To maintain consistency with the benchmark, we set the input and output token sizes to 2048 and 256, respectively. We first calculated the latency for prompt processing and token generation from the measured prompt processing and token generation speeds, computed the cost per request using the instance pricing chart on Google Cloud [Reference 3], and then normalized the numbers to the maximum cost across all three instances. The results indicate that Axion-based VMs provide up to 64% cost savings, requiring only about one-third of the cost to process each request compared to current-generation x86 instances [Figure 2].
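The arithmetic behind this comparison can be sketched as follows. The speeds and hourly prices below are hypothetical placeholders, not the measured benchmark numbers or published prices; only the formula mirrors the methodology described above.

```python
# Sketch of the cost-per-request calculation: latency from measured speeds,
# cost from hourly pricing, then normalization to the most expensive instance.
# All speeds and prices below are made-up placeholders.

INPUT_TOKENS = 2048
OUTPUT_TOKENS = 256

def cost_per_request(pp_speed, tg_speed, hourly_price):
    """Latency (s) = prompt time + generation time; cost = latency * $/s."""
    latency_s = INPUT_TOKENS / pp_speed + OUTPUT_TOKENS / tg_speed
    return latency_s * (hourly_price / 3600.0)

instances = {
    # name: (prompt tokens/s, generated tokens/s, $/hour) -- placeholders
    "c4a-standard-48": (400.0, 40.0, 2.0),
    "c4-standard-48":  (200.0, 20.0, 2.2),
    "c3d-standard-60": (220.0, 22.0, 2.4),
}

costs = {name: cost_per_request(*cfg) for name, cfg in instances.items()}
max_cost = max(costs.values())
normalized = {name: c / max_cost for name, c in costs.items()}

for name, n in sorted(normalized.items(), key=lambda kv: kv[1]):
    print(f"{name}: {n:.2f}")
```

With real measured speeds and published prices substituted in, this normalization yields the relative cost-per-request figures reported in Figure 2.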
Figure 2: Normalized cost comparison of processing an inference request with RAG.1
With Arm Neoverse at its core, Google Axion-powered instances deliver high performance at a lower cost, enabling enterprises to build scalable, efficient RAG applications while minimizing infrastructure expenses compared to x86 alternatives.
To help you get started, we have developed a step-by-step demo and a Learning Path [Reference 4] that walks you through building a basic RAG system using LLM and data sources of your choice.
Ready to Begin?
If you are new to the Arm ecosystem, here are additional resources to help with your journey:
Try C4A on Google Cloud
We're thrilled to showcase the power of Google Axion processors at Google Cloud Next this week in Las Vegas from April 9-11. Attendees can experience firsthand the unmatched performance and efficiency that Google Axion processors bring to cloud workloads through live demonstrations, interactive breakout sessions, and expert-led discussions.
Join us at Booth #1611, meet with Arm specialists, and explore how the Arm Cloud Migration initiative can streamline your migration to Google Axion-based C4A VMs. Begin your migration journey today and unlock the full potential of your cloud and AI workloads with Arm Neoverse.
References:
Footnotes:
[1] Cost calculation is based on published instance pricing as of 2025/03/05, https://cloud.google.com/compute/vm-instance-pricing.