Introduction
Generative AI is playing a key role in the tech sector, and businesses have begun integrating Large Language Models (LLMs) into their applications on both cloud and edge. Many frameworks and libraries have emerged since the introduction of generative AI, but PyTorch stands out as a popular deep learning framework used by many organizations and as a library of choice for their AI applications. Through Arm Kleidi technology, Arm is optimizing PyTorch to accelerate the performance of LLMs running on Arm-based processors, and is simplifying how developers access Kleidi technology by integrating it directly into PyTorch.
In this blog, we use a demo application to show the performance uplift KleidiAI delivers when running an LLM on PyTorch. The demo application runs Llama 3.1 on an Arm Neoverse V2-based AWS Graviton4 R8g.4xlarge EC2 instance. Readers can recreate this demo themselves using this Learning Path.
Figure 1: Demo Dashboard
Demo Application
Our demo application is an LLM-based chatbot that can answer a wide variety of questions from the user. The demo runs the Meta Llama 3.1 model using the PyTorch framework on Arm and is exposed as a browser application with a Streamlit frontend. Streamlit feeds into the Torchchat framework, which runs PyTorch and serves as the LLM backend. Output from Torchchat feeds into the attention layer and generates tokens, which are streamed to the frontend using the OpenAI framework's streaming function and displayed in the browser application for the user. The architecture of the demo is shown in Figure 2 below.
Figure 2: Demo Architecture
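To illustrate the streaming path from backend to browser, here is a minimal sketch of a Streamlit frontend consuming an OpenAI-compatible streaming endpoint. The base URL, port, and model name are assumptions for illustration rather than the demo's exact configuration, and each streamed chunk is counted as one token as an approximation.

```python
import time

import streamlit as st
from openai import OpenAI

# Point the OpenAI client at a local OpenAI-compatible server (assumed URL/port)
# instead of api.openai.com; the API key is unused by a local backend.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

st.title("Llama 3.1 chatbot on Arm")
prompt = st.chat_input("Ask me anything")

if prompt:
    st.chat_message("user").write(prompt)
    start = time.perf_counter()
    stats = {"first_token": None, "tokens": 0}

    def token_stream():
        # Request a streamed completion and yield text deltas to Streamlit.
        for chunk in client.chat.completions.create(
            model="llama3.1",  # assumed model name exposed by the backend
            messages=[{"role": "user", "content": prompt}],
            stream=True,
        ):
            if chunk.choices and chunk.choices[0].delta.content:
                if stats["first_token"] is None:
                    stats["first_token"] = time.perf_counter() - start
                stats["tokens"] += 1  # approximation: one chunk ~ one token
                yield chunk.choices[0].delta.content

    with st.chat_message("assistant"):
        st.write_stream(token_stream())

    elapsed = time.perf_counter() - start
    st.caption(f"Time to first token: {stats['first_token']:.2f} s | "
               f"decode speed: {stats['tokens'] / elapsed:.1f} tokens/s")
```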
The demo application measures and displays the following performance metrics after each LLM inference.
Time to generate first token (sec): For LLM inferencing, it is important to generate the first token quickly to minimize latency and provide prompt output to the user.
Decode Speed/Text Generation (tokens/sec): Tokens per second is the rate at which the GenAI model generates tokens. The industry standard for an interactive chatbot allows at most 100 milliseconds per subsequent token, which means the decode speed must be at least 10 tokens/sec. This is important for a good user experience in real-time applications.
Cost to generate a million tokens ($): Using the decode speed and the hourly cost of the EC2 instance on the AWS cloud, we can calculate the cost of generating one million tokens, a popular comparison metric. Since the hourly cost is fixed, a higher decode speed makes it less expensive to generate one million tokens (see the worked example after this list).
Total time to generate the prompt (sec): This is the total time taken to generate the complete response to the prompt, including all of its tokens.
Total cost to generate the prompt ($): This is calculated from the total time to generate the complete response, the decode speed, and the machine cost in the cloud.
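As a worked example of these metrics, the short sketch below computes the decode speed and cost figures from measured values. All of the numbers, including the hourly instance price, are placeholders for illustration rather than measured results or quoted AWS rates.

```python
# Illustrative inputs (placeholders, not measured results or quoted AWS prices)
time_to_first_token_s = 0.8     # seconds until the first token appears
total_generation_time_s = 12.0  # seconds to generate the complete response
tokens_generated = 400          # number of output tokens in the response
instance_cost_per_hour = 2.00   # $/hour for the EC2 instance

# A 100 ms per-token budget implies a minimum decode speed of 1 / 0.1 = 10 tokens/sec.
decode_speed = tokens_generated / total_generation_time_s   # tokens/sec
cost_per_second = instance_cost_per_hour / 3600.0           # $/sec
cost_per_million_tokens = cost_per_second / decode_speed * 1_000_000
total_response_cost = total_generation_time_s * cost_per_second

print(f"Time to first token:     {time_to_first_token_s:.2f} s")
print(f"Decode speed:            {decode_speed:.1f} tokens/sec")
print(f"Cost per million tokens: ${cost_per_million_tokens:.2f}")
print(f"Total cost for response: ${total_response_cost:.4f}")
```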
Figure 3 shows a sample response and can be used to validate the chatbot against the metrics shown. The time to generate the first token is less than 1 second and the decode rate is 33 tokens/sec, both of which are highly satisfactory and meet the industry standard for interactive chatbots.
Figure 3: Demo with Sample Response and Metrics
KleidiAI Optimizations for PyTorch
The KleidiAI libraries provide several optimizations for Arm. For model loading, Kleidi provides a new operator in the PyTorch ATen layer that packs the model weights in memory in a format the KleidiAI GEMM kernels can use to improve performance. Similarly, model execution is optimized through another ATen operator, which performs quantized matmul operations on the previously packed model weights.
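To give a feel for what weight packing and quantized matmul involve, here is a toy, pure-PyTorch sketch of per-channel INT4 quantization, nibble packing, and a matmul that dequantizes on the fly. This is only a conceptual illustration; the actual KleidiAI operators implement this inside optimized ATen kernels with their own packing formats.

```python
import torch

def quantize_int4(w: torch.Tensor):
    # Symmetric per-output-channel quantization into the signed 4-bit range [-8, 7].
    scale = w.abs().amax(dim=1, keepdim=True) / 7.0
    q = torch.clamp(torch.round(w / scale), -8, 7).to(torch.int8)
    return q, scale

def pack_int4(q: torch.Tensor) -> torch.Tensor:
    # Pack two 4-bit values per byte: even columns in the low nibble, odd in the high.
    u = (q + 8).to(torch.uint8)          # shift to the unsigned range 0..15
    return u[:, 0::2] | (u[:, 1::2] << 4)

def unpack_int4(packed: torch.Tensor) -> torch.Tensor:
    low = (packed & 0x0F).to(torch.int8) - 8
    high = (packed >> 4).to(torch.int8) - 8
    return torch.stack((low, high), dim=2).flatten(1)

def int4_matmul(x: torch.Tensor, packed: torch.Tensor, scale: torch.Tensor):
    # Dequantize the packed weights and multiply (real kernels fuse these steps).
    w = unpack_int4(packed).to(x.dtype) * scale
    return x @ w.t()

w = torch.randn(16, 32)   # weights: [out_features, in_features]
x = torch.randn(4, 32)    # activations
q, scale = quantize_int4(w)
packed = pack_int4(q)     # half the memory footprint of the int8 tensor
err = (int4_matmul(x, packed, scale) - x @ w.t()).abs().max().item()
print(f"max abs error vs fp32 matmul: {err:.3f}")
```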
For our demo, the model is downloaded from Meta's Hugging Face repository. It is packed in memory using the INT4 kernel layout and then quantized using the optimized INT4 KleidiAI kernels for PyTorch.
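For readers who want a rough idea of what this looks like in code, the sketch below applies torchao's quantization API to a stand-in model. It assumes a torchao build that exposes quantize_ and int8_dynamic_activation_int4_weight (names and defaults vary across versions); in the demo, it is the patched PyTorch, Torchchat, and Torchao builds from the Learning Path that actually route these quantized matmuls to the KleidiAI kernels on Arm.

```python
import torch
import torch.nn as nn

# Assumed API: quantize_ and int8_dynamic_activation_int4_weight are present in
# recent torchao releases, but exact names and arguments vary by version.
from torchao.quantization import quantize_, int8_dynamic_activation_int4_weight

# Stand-in for the Llama 3.1 model used in the demo; quantize_ rewrites the
# nn.Linear weights in place into a packed 4-bit format with 8-bit dynamic
# activations, the scheme the KleidiAI INT4 kernels accelerate.
model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096))
quantize_(model, int8_dynamic_activation_int4_weight(group_size=32))

with torch.inference_mode():
    out = model(torch.randn(1, 4096))
print(out.shape)
```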
The implementation of these optimizations in the demo stack is shown in Figure 4 below. For more details about the KleidiAI optimizations used in this demo, please see our webinar.
Figure 4: KleidiAI Optimizations for PyTorch Implementation
These KleidiAI optimizations can be applied to PyTorch, Torchchat, and Torchao using the patches included in our Learning Path. You can use these patches to see the LLM inference performance gains of PyTorch on Arm for your own workloads. To replicate and test this demo on your own Arm machine, you can follow this Learning Path.
Performance
To demonstrate the performance benefits of KleidiAI, we run the same chatbot application using PyTorch, first without and then with the KleidiAI optimizations. The tokens/sec and time-to-first-token results before and after the KleidiAI optimizations are shown in the graphs below.
Figure 5: Performance Comparison
As you can see, integrating the KleidiAI libraries into existing GenAI technology stacks can deliver a significant performance uplift in both the token generation rate and the time to generate the first token across different GenAI models.
Conclusion
Running LLM inference on CPUs is practical and effective for real-time workloads such as chatbots. We demonstrated this using Llama.cpp in a previous blog. In this blog, we showed the strong LLM inference performance that can be achieved using the KleidiAI libraries for PyTorch on Arm. As we showed using an AWS Graviton4-based R8g instance with Neoverse V2 cores, KleidiAI delivers a massive performance improvement for running LLM inference with PyTorch on Arm. Developers can take advantage of Arm's KleidiAI optimizations for PyTorch today, for new or existing AI applications.