Introduction
Generative AI is playing a key role in the tech sector, and businesses have begun integrating Large Language Models (LLMs) into their applications on both cloud and edge. Many frameworks and libraries have emerged since the introduction of generative AI, but PyTorch stands out as a popular deep learning framework used by many organizations and as a library of choice for their AI applications. Through Arm Kleidi technology, Arm is optimizing PyTorch to accelerate the performance of LLMs running on Arm-based processors, and is simplifying how developers access Kleidi technology by integrating it directly into PyTorch.
In this blog, we use a demo application to show the performance uplift KleidiAI delivers when running an LLM on PyTorch. The demo application runs Llama 3.1 on an Arm Neoverse V2-based AWS Graviton4 R8g.4xlarge EC2 instance. Readers can recreate this demo themselves using this Learning Path.
Figure 1: Demo Dashboard
Demo Application
Our demo application is an LLM-based chatbot that can answer a wide variety of questions from the user. The demo runs the Meta Llama 3.1 model using the PyTorch framework on Arm and is exposed as a browser application with a Streamlit frontend. Streamlit feeds into the Torchchat framework, which runs PyTorch and serves as the LLM backend. Output from Torchchat feeds into the attention layer and generates tokens, which are streamed to the frontend using the OpenAI framework's streaming function and displayed in the browser application for the user. The architecture of the demo is shown in Figure 2 below.
Figure 2: Demo Architecture
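To illustrate the streaming path from backend to browser, here is a minimal sketch of a Streamlit frontend consuming an OpenAI-compatible streaming endpoint. The base URL, port, and model name are assumptions for illustration rather than the demo's exact configuration, and each streamed chunk is counted as one token as an approximation.

```python
import time

import streamlit as st
from openai import OpenAI

# Point the OpenAI client at a local OpenAI-compatible server (assumed URL/port)
# instead of api.openai.com; the API key is unused by a local backend.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

st.title("Llama 3.1 chatbot on Arm")
prompt = st.chat_input("Ask me anything")

if prompt:
    st.chat_message("user").write(prompt)
    start = time.perf_counter()
    stats = {"first_token": None, "tokens": 0}

    def token_stream():
        # Request a streamed completion and yield text deltas to Streamlit.
        for chunk in client.chat.completions.create(
            model="llama3.1",  # assumed model name exposed by the backend
            messages=[{"role": "user", "content": prompt}],
            stream=True,
        ):
            if chunk.choices and chunk.choices[0].delta.content:
                if stats["first_token"] is None:
                    stats["first_token"] = time.perf_counter() - start
                stats["tokens"] += 1  # approximation: one chunk ~ one token
                yield chunk.choices[0].delta.content

    with st.chat_message("assistant"):
        st.write_stream(token_stream())

    elapsed = time.perf_counter() - start
    st.caption(f"Time to first token: {stats['first_token']:.2f} s | "
               f"decode speed: {stats['tokens'] / elapsed:.1f} tokens/s")
```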
The demo application measures and displays the following performance metrics after each LLM inference.
Time to generate first token (sec): For LLM inferencing, it is important to generate the first token quickly to minimize latency and provide prompt output to the user.
Decode Speed/Text Generation (tokens/sec): Tokens per second is the rate at which the GenAI model generates tokens. The industry standard for an interactive chatbot allows at most 100 milliseconds per subsequent token, which means the decode speed must be at least 10 tokens/sec. This is important for a good user experience in real-time applications.
Cost to generate a million tokens ($): Using the decode speed and the hourly cost of the EC2 instance on the AWS cloud, we can calculate the cost of generating one million tokens, a popular comparison metric. Since the hourly cost is fixed, a higher decode speed makes it less expensive to generate one million tokens (see the worked example after this list).
Total time to generate the prompt (sec): This is the total time taken to generate the complete response to the prompt, including all of its tokens.
Total cost to generate the prompt ($): This is calculated from the total time to generate the complete response, the decode speed, and the machine cost in the cloud.
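As a worked example of these metrics, the short sketch below computes the decode speed and cost figures from measured values. All of the numbers, including the hourly instance price, are placeholders for illustration rather than measured results or quoted AWS rates.

```python
# Illustrative inputs (placeholders, not measured results or quoted AWS prices)
time_to_first_token_s = 0.8     # seconds until the first token appears
total_generation_time_s = 12.0  # seconds to generate the complete response
tokens_generated = 400          # number of output tokens in the response
instance_cost_per_hour = 2.00   # $/hour for the EC2 instance

# A 100 ms per-token budget implies a minimum decode speed of 1 / 0.1 = 10 tokens/sec.
decode_speed = tokens_generated / total_generation_time_s   # tokens/sec
cost_per_second = instance_cost_per_hour / 3600.0           # $/sec
cost_per_million_tokens = cost_per_second / decode_speed * 1_000_000
total_response_cost = total_generation_time_s * cost_per_second

print(f"Time to first token:     {time_to_first_token_s:.2f} s")
print(f"Decode speed:            {decode_speed:.1f} tokens/sec")
print(f"Cost per million tokens: ${cost_per_million_tokens:.2f}")
print(f"Total cost for response: ${total_response_cost:.4f}")
```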
Figure 3 shows a sample response and can be used to validate the chatbot against the metrics shown. The time to generate the first token is less than 1 second and the decode rate is 33 tokens/sec, both of which are highly satisfactory and meet the industry standard for interactive chatbots.
Figure 3: Demo with Sample Response and Metrics
KleidiAI Optimizations for PyTorch
The KleidiAI libraries provide several optimizations for Arm. For model loading, Kleidi provides a new operator in the PyTorch ATen layer that packs the model weights in memory in a format the KleidiAI GEMM kernels can use to improve performance. Similarly, model execution is optimized through another ATen operator, which performs quantized matmul operations on the previously packed model weights.
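To give a feel for what weight packing and quantized matmul involve, here is a toy, pure-PyTorch sketch of per-channel INT4 quantization, nibble packing, and a matmul that dequantizes on the fly. This is only a conceptual illustration; the actual KleidiAI operators implement this inside optimized ATen kernels with their own packing formats.

```python
import torch

def quantize_int4(w: torch.Tensor):
    # Symmetric per-output-channel quantization into the signed 4-bit range [-8, 7].
    scale = w.abs().amax(dim=1, keepdim=True) / 7.0
    q = torch.clamp(torch.round(w / scale), -8, 7).to(torch.int8)
    return q, scale

def pack_int4(q: torch.Tensor) -> torch.Tensor:
    # Pack two 4-bit values per byte: even columns in the low nibble, odd in the high.
    u = (q + 8).to(torch.uint8)          # shift to the unsigned range 0..15
    return u[:, 0::2] | (u[:, 1::2] << 4)

def unpack_int4(packed: torch.Tensor) -> torch.Tensor:
    low = (packed & 0x0F).to(torch.int8) - 8
    high = (packed >> 4).to(torch.int8) - 8
    return torch.stack((low, high), dim=2).flatten(1)

def int4_matmul(x: torch.Tensor, packed: torch.Tensor, scale: torch.Tensor):
    # Dequantize the packed weights and multiply (real kernels fuse these steps).
    w = unpack_int4(packed).to(x.dtype) * scale
    return x @ w.t()

w = torch.randn(16, 32)   # weights: [out_features, in_features]
x = torch.randn(4, 32)    # activations
q, scale = quantize_int4(w)
packed = pack_int4(q)     # half the memory footprint of the int8 tensor
err = (int4_matmul(x, packed, scale) - x @ w.t()).abs().max().item()
print(f"max abs error vs fp32 matmul: {err:.3f}")
```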
For our demo, the model is downloaded from Meta's Hugging Face repository. It is packed in memory using the INT4 kernel layout and then quantized using the optimized INT4 KleidiAI kernels for PyTorch.
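For readers who want a rough idea of what this looks like in code, the sketch below applies torchao's quantization API to a stand-in model. It assumes a torchao build that exposes quantize_ and int8_dynamic_activation_int4_weight (names and defaults vary across versions); in the demo, it is the patched PyTorch, Torchchat, and Torchao builds from the Learning Path that actually route these quantized matmuls to the KleidiAI kernels on Arm.

```python
import torch
import torch.nn as nn

# Assumed API: quantize_ and int8_dynamic_activation_int4_weight are present in
# recent torchao releases, but exact names and arguments vary by version.
from torchao.quantization import quantize_, int8_dynamic_activation_int4_weight

# Stand-in for the Llama 3.1 model used in the demo; quantize_ rewrites the
# nn.Linear weights in place into a packed 4-bit format with 8-bit dynamic
# activations, the scheme the KleidiAI INT4 kernels accelerate.
model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096))
quantize_(model, int8_dynamic_activation_int4_weight(group_size=32))

with torch.inference_mode():
    out = model(torch.randn(1, 4096))
print(out.shape)
```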
The implementation of these optimizations in the demo stack is shown in Figure 4 below. For more details about the KleidiAI optimizations used in this demo, please see our webinar.
Figure 4: KleidiAI Optimizations for PyTorch Implementation
These KleidiAI optimizations can be applied to PyTorch, Torchchat, and Torchao using the patches included in our Learning Path. You can use these patches to see the LLM inference performance gains of PyTorch on Arm for your own workloads. To replicate and test this demo on your own Arm machine, you can follow this Learning Path.
Performance
To demonstrate the performance benefits of KleidiAI, we run the same chatbot application using PyTorch, first without and then with the KleidiAI optimizations. The tokens/sec and time-to-first-token results before and after the KleidiAI optimizations are shown in the graphs below.
Figure 5: Performance Comparison
As you can see, integrating the KleidiAI libraries into existing GenAI technology stacks can deliver a significant performance uplift in both the token generation rate and the time to generate the first token across different GenAI models.
Conclusion
Running LLM inference on CPUs is practical and effective for real-time workloads such as chatbots. We demonstrated this using Llama.cpp in a previous blog. In this blog, we showed the strong LLM inference performance that can be achieved using the KleidiAI libraries for PyTorch on Arm. As we showed using an AWS Graviton4-based R8g instance with Neoverse V2 cores, KleidiAI delivers a massive performance improvement for running LLM inference with PyTorch on Arm. Developers can take advantage of Arm's KleidiAI optimizations for PyTorch today, for new or existing AI applications.