Faster PyTorch Inference using Kleidi on Arm Neoverse

Ashok Bhat
September 16, 2024
4 minute read time.

PyTorch is a widely used open-source library for machine learning. At Arm, along with our partners, we have been enhancing PyTorch’s inference performance over the past few years. In this blog post, we will describe how PyTorch inference performance on Arm Neoverse has been improved using Kleidi technology, available in the Arm Compute Library and KleidiAI library.

PyTorch offers two primary execution modes: Eager Mode and Graph Mode. Eager Mode is a dynamic execution mode where operations are executed immediately as they are written in Python code, making it ideal for experimentation and debugging. Graph Mode, on the other hand, compiles a sequence of PyTorch operations into a static computation graph before execution, enabling performance optimization and efficient hardware acceleration. The torch.compile function provides a convenient way to convert your PyTorch code into Graph Mode, often leading to significant speedups.
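
As a minimal illustration of the two modes, the sketch below runs the same model eagerly and through torch.compile. The first compiled call incurs one-time compilation overhead; subsequent calls use the optimized graph.

```python
import torch
import torch.nn as nn

# Any nn.Module works the same way; a small MLP keeps the example short.
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10)).eval()
x = torch.randn(32, 128)

with torch.inference_mode():
    eager_out = model(x)               # Eager mode: ops execute as they are hit

compiled_model = torch.compile(model)  # Graph mode: capture and optimize
with torch.inference_mode():
    compiled_out = compiled_model(x)   # first call triggers compilation

print(torch.allclose(eager_out, compiled_out, atol=1e-5))
```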

PyTorch Eager mode - Up to 3x improvement in CPU Inference 

PyTorch Eager mode is optimized for Arm Neoverse processors with Arm Compute Library (ACL) kernels using oneDNN. To understand how, let us look at the PyTorch software stack.

Figure 1: PyTorch Software Stack

FX Graph in PyTorch is an intermediate representation that is used to visualize and optimize PyTorch models.
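
For example, torch.fx.symbolic_trace captures a callable into an FX GraphModule whose graph can be printed and transformed:

```python
import torch

def fn(x, w, b):
    return torch.relu(x @ w + b)

# symbolic_trace records the Python-level ops into an FX graph.
gm = torch.fx.symbolic_trace(fn)
print(gm.graph)  # shows placeholder, call_function, and output nodes
```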

ATen is the foundational tensor library that underpins the PyTorch framework. It provides the core Tensor class and a vast array of mathematical operations that form the building blocks of PyTorch models.
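
The ATen operators sit directly beneath the familiar Python API; the same matrix multiply can be invoked through either entry point:

```python
import torch

a, b = torch.randn(4, 8), torch.randn(8, 16)

out_py   = torch.mm(a, b)           # the Python-level API...
out_aten = torch.ops.aten.mm(a, b)  # ...dispatches to this ATen operator
print(torch.allclose(out_py, out_aten))
```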

oneDNN is a performance library that provides optimized implementations of common deep learning primitives for various hardware architectures, including Arm and x86. On these architectures, ATen uses oneDNN as a performance-enhancing backend: when PyTorch encounters a supported operation, it delegates the computation to oneDNN, which executes it more efficiently using hardware-specific optimizations.
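
One way to observe this delegation is oneDNN's verbose mode. A minimal sketch, assuming a PyTorch build with oneDNN enabled (and, on Arm Neoverse, ACL); each delegated primitive is then logged with the implementation chosen for it:

```python
import os
os.environ["DNNL_VERBOSE"] = "1"  # set before torch runs any oneDNN primitive

import torch
import torch.nn as nn

conv = nn.Conv2d(3, 16, kernel_size=3).eval()
with torch.inference_mode():
    conv(torch.randn(1, 3, 32, 32))
# On ACL-enabled Arm builds, the logged implementation field mentions
# "acl" for primitives backed by Arm Compute Library kernels.
```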

Arm Compute Library (ACL), first released in 2016, provides Arm-optimized implementations of key machine learning primitives, including convolution, pooling, activation functions, fully connected layers, and normalization. These primitives use the ML-specific features and instructions available on Arm Neoverse cores to achieve high performance. We have integrated Arm Compute Library into oneDNN so that ATen operations are accelerated on Arm.

Arm Neoverse CPUs include hardware extensions that accelerate ML, such as NEON, SVE and SVE2, BF16, and I8MM. These extensions efficiently handle vector processing, BFloat16 arithmetic, and matrix multiplication.
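
On Linux, you can check which of these extensions the CPU exposes by reading the feature flags in /proc/cpuinfo. The flag names below are the standard aarch64 hwcap names (for example, asimd corresponds to NEON and asimddp to the DOT-product instructions):

```python
# Read the aarch64 feature flags from /proc/cpuinfo (Linux only).
with open("/proc/cpuinfo") as f:
    flags = set()
    for line in f:
        if line.startswith("Features"):
            flags.update(line.split(":", 1)[1].split())

for feature in ("asimd", "asimddp", "sve", "sve2", "bf16", "i8mm"):
    print(f"{feature:8s} {'yes' if feature in flags else 'no'}")
```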

Figure 2: Performance uplift using Eager mode for various models

PyTorch Graph mode (using torch.compile) - Up to a further 2x improvement over Eager mode

PyTorch 2.0 introduced torch.compile to speed up PyTorch code over the default eager mode. Unlike eager mode, torch.compile pre-compiles the entire model into a single graph optimized for the target hardware platform. From PyTorch 2.3.1 onwards, the official AArch64 wheels include the torch.compile optimizations. These optimizations deliver up to 2x better performance than Eager mode for TorchBench model inference across various natural language processing (NLP), computer vision (CV), and recommendation models on AWS Graviton3-based Amazon EC2 instances. Further details of the optimizations are available in the PyTorch blog “Accelerated PyTorch inference with torch.compile on AWS Graviton processors”.
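
A simple way to gauge the uplift on your own workload is to time the same module in both modes. A minimal sketch (results vary by model, batch size, and instance type):

```python
import time
import torch
import torch.nn as nn

model = nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True).eval()
x = torch.randn(8, 128, 256)
compiled = torch.compile(model)

def bench(fn, warmup=3, iters=20):
    with torch.inference_mode():
        for _ in range(warmup):          # warmup also covers compilation
            fn(x)
        start = time.perf_counter()
        for _ in range(iters):
            fn(x)
    return (time.perf_counter() - start) / iters

print(f"eager:    {bench(model) * 1e3:.2f} ms/iter")
print(f"compiled: {bench(compiled) * 1e3:.2f} ms/iter")
```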

Figure 3: Performance uplift in Compile mode for various models

What is next: Improved GenAI inference performance with the KleidiAI library

So far, we have looked at how Arm Compute Library enhances PyTorch inference performance in both eager and compile modes. Now, let us look at what is coming soon to PyTorch. Arm is currently working to improve LLM inference performance in PyTorch, with Llama and Gemma as key LLM examples.

Optimal INT4 kernels

Earlier this year, Arm software teams and partners optimized the INT4 and INT8 kernels in llama.cpp to leverage the newer DOT and MMLA instructions. On AWS Graviton3 processors, these kernels delivered a 2.5x improvement in prompt evaluation over the existing GEMM MMLA kernels, as well as a 2x improvement in text generation over the default vec_dot kernel. These optimized kernels are also part of the KleidiAI library.

The KleidiAI library, announced at Computex 2024, is an open-source library of optimized micro-kernels for AI tasks on Arm CPUs. Think of a micro-kernel as a small piece of software that boosts the performance of a specific ML operation. Developers can use these micro-kernels by including just the relevant .c and .h files along with a common header file; there is no need to include the rest of the library.

Figure 4: Kleidi technology integration with PyTorch

We have introduced two new ATen operations, torch.ops.aten._kai_weights_pack_int4() and torch.ops.aten._kai_input_quant_mm_int4(), which use the highly optimized packing and GEMM micro-kernels available in the KleidiAI library. gpt-fast leverages these PyTorch operators to (1) quantize the weights to INT4 using symmetric per-channel quantization, storing an additional array of quantization scales, and (2) dynamically quantize the activation matrix and execute the INT8 matrix multiplication of activations and weights using the AArch64 I8MM extension.
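
Conceptually, the flow looks like the sketch below. The operator names come from the integration described above, but their argument lists are assumptions for illustration only; these experimental ops are not in stock PyTorch wheels, and the real signatures may differ.

```python
import torch

weight = torch.randn(4096, 4096)    # FP32 linear-layer weights
activations = torch.randn(1, 4096)  # FP32 activations

# (1) One-off at model load: symmetric per-channel INT4 quantization of the
#     weights, packed (together with their quantization scales) into the
#     layout the KleidiAI GEMM micro-kernel expects. Signature assumed.
packed = torch.ops.aten._kai_weights_pack_int4(weight)

# (2) Per forward pass: dynamically quantize the activations to INT8, then
#     run the matmul against the packed INT4 weights using the AArch64
#     I8MM extension. Signature assumed.
out = torch.ops.aten._kai_input_quant_mm_int4(activations, packed)
```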

Figure 5: 4-bit quantized LLM model performance boost with KleidiAI integration in PyTorch

With this approach, we can improve the inference performance of Llama by up to 18x and Gemma by up to 14x compared to the default PyTorch implementation available today.

Conclusion

Arm and its partners have improved PyTorch inference performance on Arm Neoverse using Kleidi technology, available in the Arm Compute Library and the KleidiAI library. We see up to 3x uplift in Eager mode and a further up to 2x in Graph mode (using torch.compile). In addition, work is in progress to improve GenAI model (Llama and Gemma) inference performance by up to 18x.

Resources:

https://pytorch.org/tutorials/recipes/inference_tuning_on_aws_graviton.html
https://pytorch.org/blog/accelerated-pytorch-inference/
https://pytorch.org/blog/optimized-pytorch-w-graviton/
