Demoing LLM Inference with PyTorch on Arm using Llama and AWS Graviton4

Nobel Chowdary Mandepudi
September 16, 2024
4 minute read time.

Generative AI is playing a key role in the tech sector, and businesses have begun integrating Large Language Models (LLMs) into their applications in both the cloud and at the edge. Many frameworks and libraries have emerged since the introduction of generative AI, but PyTorch stands out as a popular deep learning framework and a library of choice for many organizations' AI applications. Through its Kleidi technology, Arm is working to optimize PyTorch and accelerate LLM performance on Arm-based processors, and is simplifying developer access to Kleidi by integrating it directly into PyTorch.

In this blog post, we use a demo application to show the performance uplift KleidiAI provides when running an LLM on PyTorch. The demo application runs Llama 3.1 on an Arm Neoverse V2-based AWS Graviton4 R8g.4xlarge EC2 instance. Readers can recreate this demo themselves using this Learning Path.

 Figure 1: Demo Dashboard

Demo Application

Our demo application is an LLM-based chatbot that can answer a wide variety of questions from the user. The demo runs the Meta Llama 3.1 model using the PyTorch framework on Arm and is exposed as a browser application with a Streamlit frontend. Streamlit feeds into the Torchchat framework, which runs PyTorch and serves as the LLM backend. Output from Torchchat feeds into the attention layer and generates tokens. These tokens are streamed to the frontend using the OpenAI framework's streaming function and displayed in the browser application for the user. The architecture of the demo is shown in Figure 2 below.

Figure 2: Demo Architecture
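To make this data flow concrete, the following is a minimal sketch of the frontend pattern described above, assuming a Torchchat backend serving an OpenAI-compatible chat endpoint on localhost. The URL, port, and model name are placeholders rather than the demo's exact configuration; the Learning Path contains the real setup.

```python
# Minimal Streamlit chatbot sketch (illustrative, not the demo's exact code).
# Assumptions: Torchchat serves an OpenAI-compatible API at localhost:5000
# and the served model is registered under the name "llama3.1".
import streamlit as st
from openai import OpenAI

client = OpenAI(base_url="http://localhost:5000/v1", api_key="not-needed")

st.title("LLM Chatbot on Arm")

if prompt := st.chat_input("Ask me anything"):
    st.chat_message("user").write(prompt)
    with st.chat_message("assistant"):
        # Stream tokens from the PyTorch/Torchchat backend as they are generated
        stream = client.chat.completions.create(
            model="llama3.1",
            messages=[{"role": "user", "content": prompt}],
            stream=True,
        )
        st.write_stream(chunk.choices[0].delta.content or "" for chunk in stream)
```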

The demo application measures and displays the following performance metrics after the LLM inference.

Time to generate first token (sec): For LLM inference, it is important to generate the first token quickly to minimize latency and provide prompt output to the user.

Decode speed, or text generation rate (tokens/sec): The rate at which the GenAI model generates tokens. The time to generate each subsequent token should be at most 100 milliseconds, the industry standard for an interactive chatbot, which means the decode speed must be at least 10 tokens/sec. This is important for a good user experience in real-time applications.

Cost to generate one million tokens ($): Using the decode speed and the hourly cost of an EC2 instance on the AWS cloud, we can calculate the cost of generating 1 million tokens, a popular comparison metric (a worked example follows this list). Since the hourly cost is fixed, a higher decode speed makes it cheaper to generate one million tokens.

Total time to generate the response (sec): The total time taken to generate the complete response to the prompt, including all of its tokens.

Total cost to generate the response ($): Calculated from the total time to generate the complete response, the decode speed, and the hourly machine cost in the cloud.
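As a worked example of how these metrics relate, the short Python sketch below derives the decode speed, the cost per million tokens, and the total response cost from illustrative timing values and an assumed hourly instance price; none of the numbers are measurements from the demo.

```python
# Illustrative metric calculations; the input values are assumptions, not results.
hourly_cost_usd = 0.95          # example hourly price for the EC2 instance
time_to_first_token_s = 0.8     # seconds until the first token appears
num_generated_tokens = 200      # tokens produced for the response
total_generation_time_s = 6.0   # seconds to produce the complete response

decode_speed = num_generated_tokens / total_generation_time_s   # tokens/sec
# 100 ms per token is the interactive-chatbot budget, i.e. at least 10 tokens/sec
meets_interactive_budget = decode_speed >= 10

cost_per_second = hourly_cost_usd / 3600
cost_per_million_tokens = (1_000_000 / decode_speed) * cost_per_second
total_response_cost = total_generation_time_s * cost_per_second

print(f"Decode speed: {decode_speed:.1f} tokens/sec "
      f"(interactive budget met: {meets_interactive_budget})")
print(f"Cost per 1M tokens: ${cost_per_million_tokens:.2f}")
print(f"Total cost for this response: ${total_response_cost:.6f}")
```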

Figure 3 shows a sample response and can be used to validate the chatbot against the metrics described above. The time to generate the first token is under 1 second and the decode rate is 33 tokens/sec, both of which are highly satisfactory and meet the industry standards for interactive chatbots.

 Figure 3: Demo with Sample Response and Metrics

KleidiAI Optimizations for PyTorch

The KleidiAI libraries provide several optimizations for Arm. For model loading, Kleidi provides a new operator in the Torch ATen layer that packs the model weights in memory in a format the KleidiAI GEMM kernels can use to improve performance. Model execution is optimized through another ATen operator, which performs quantized matmul operations on the previously packed model weights.

For our demo, the model is downloaded from the Meta Hugging Face repository, packed in memory using the INT4 kernel layout, and then quantized using the optimized INT4 KleidiAI kernels for PyTorch.
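For illustration, the sketch below shows INT4 weight-only quantization through the standard torchao API, which is the general mechanism the KleidiAI-optimized kernels plug into. The model identifier is a placeholder, and the demo itself drives this path through Torchchat with the patches described in the Learning Path rather than through this exact code.

```python
# Hedged sketch of INT4 weight-only quantization with torchao (illustrative only).
# The model ID is a placeholder; the demo uses Torchchat plus the KleidiAI patches.
import torch
from transformers import AutoModelForCausalLM
from torchao.quantization import quantize_, int4_weight_only

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",  # placeholder model identifier
    torch_dtype=torch.bfloat16,
)

# Pack the linear-layer weights into an INT4 layout; with the KleidiAI patches,
# the matmuls on these packed weights are routed to KleidiAI GEMM kernels on Arm.
quantize_(model, int4_weight_only())
```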

The architecture of this demo is shown in Figure 4 below. For more details about the KleidiAI optimizations used in this demo, please see our webinar.

Figure 4: KleidiAI Optimizations for PyTorch Implementation

These KleidiAI optimizations can be applied to PyTorch, Torchchat, and Torchao using the patches¹ included in our Learning Path. You can use these patches to see the LLM inference performance gains of PyTorch on Arm for your own workloads. To replicate and test this demo on your own Arm machine, follow this Learning Path.

Performance 

To demonstrate the performance benefits of KleidiAI, we ran the same chatbot application on PyTorch, first without and then with the KleidiAI optimizations. The tokens/sec and time-to-first-token results before and after the KleidiAI optimizations are shown in the graphs below.

Figure 5: Performance Comparison

As the graphs show, integrating the KleidiAI libraries into an existing GenAI technology stack delivers a substantial uplift in both token generation rate and time to generate the first token across different GenAI models.

Conclusion 

Running LLM inference on CPUs is practical and effective for real-time workloads such as chatbots. We demonstrated this using Llama.cpp in a previous blog. In this blog post, we showed the strong LLM inference performance that can be achieved using the KleidiAI libraries for PyTorch on Arm. As demonstrated on an AWS Graviton4-based R8g instance with Neoverse V2 cores, KleidiAI delivers a major performance improvement for LLM inference with PyTorch on Arm. Developers can take advantage of Arm's KleidiAI optimizations for PyTorch today, for new or existing AI applications.

1. The Arm KleidiAI PyTorch patches are in the process of being merged into upstream PyTorch and will be available in a future official PyTorch release.