PyTorch is a widely used open-source library for machine learning. At Arm, along with our partners, we have been enhancing PyTorch’s inference performance over the past few years. In this blog post, we will describe how PyTorch inference performance on Arm Neoverse has been improved using Kleidi technology, available in the Arm Compute Library and KleidiAI library.
PyTorch offers two primary execution modes: Eager Mode and Graph Mode. Eager Mode is a dynamic execution mode where operations are executed immediately as they are written in Python code, making it ideal for experimentation and debugging. Graph Mode, on the other hand, compiles a sequence of PyTorch operations into a static computation graph before execution, enabling performance optimization and efficient hardware acceleration. The torch.compile function provides a convenient way to convert your PyTorch code into Graph Mode, often leading to significant speedups.
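For readers new to torch.compile, the difference between the two modes can be illustrated with a minimal sketch (the model and shapes below are illustrative only):

```python
import torch
import torch.nn as nn

# A small model used purely for illustration.
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10)).eval()
x = torch.randn(8, 128)

# Eager mode: each operation executes immediately as the Python code runs.
with torch.inference_mode():
    eager_out = model(x)

# Graph mode: torch.compile captures the model as a graph and optimizes it for
# the underlying hardware; the first call triggers compilation.
compiled_model = torch.compile(model)
with torch.inference_mode():
    compiled_out = compiled_model(x)
```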
PyTorch Eager mode is optimized for Arm Neoverse processors with Arm Compute Library (ACL) kernels, integrated through oneDNN. To understand how, let us look at the PyTorch software stack.
Figure 1: PyTorch Software Stack
FX Graph in PyTorch is an intermediate representation that is used to visualize and optimize PyTorch models.
ATen is the foundational tensor library that underpins the PyTorch framework. It provides the core Tensor class and a vast array of mathematical operations that form the building blocks of PyTorch models.
oneDNN is a performance library that provides optimized implementations of common deep learning primitives for various hardware architectures, including Arm and x86. On these architectures, ATen uses oneDNN as a performance-enhancing backend. This means that when PyTorch encounters a supported operation, it delegates the computation to oneDNN, which can execute it more efficiently using hardware-specific optimizations.
Arm Compute Library, first released in 2016, provides Arm-optimized implementations of key machine learning primitives, including convolution, pooling, activation functions, fully connected layers, and normalization. These primitives leverage ML-specific hardware features and instructions available on Arm Neoverse cores to achieve high performance. We have integrated Arm Compute Library into oneDNN so that ATen operations are accelerated on Arm.
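As a quick way to see this stack in action, the sketch below runs a single convolution with oneDNN verbose logging enabled so the dispatched kernels are printed. This is a minimal check, assuming a standard AArch64 PyTorch build with oneDNN/ACL support:

```python
# Minimal sketch to check that ATen ops are dispatched through oneDNN on this build.
# Run with oneDNN verbose logging enabled, for example:
#   DNNL_VERBOSE=1 python check_onednn.py
import torch

# True when PyTorch was built with oneDNN support (historically exposed as "mkldnn").
print("oneDNN available:", torch.backends.mkldnn.is_available())

# A single convolution; with DNNL_VERBOSE=1 the log lines show which
# oneDNN/ACL kernel handled the computation.
x = torch.randn(1, 3, 224, 224)
conv = torch.nn.Conv2d(3, 64, kernel_size=3, padding=1)
with torch.inference_mode():
    conv(x)
```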
Arm Neoverse CPUs include hardware extensions that help accelerate ML. These include NEON, SVE/SVE2, BF16, and I8MM, which accelerate machine learning tasks through efficient vector processing, BFloat16 operations, and matrix multiplication.
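For example, on cores with the BF16 extension the oneDNN fast-math path can be enabled through an environment variable, as described in the AWS Graviton inference tuning recipe linked at the end of this post. The snippet below is a sketch; the exact variable name and behavior should be verified against that guide:

```python
# Sketch: opting in to the BF16 fast-math kernels on cores with the BF16 extension.
# This environment variable is read by oneDNN; it is normally exported in the shell
# before starting Python (shown here with os.environ purely for illustration).
import os

os.environ["DNNL_DEFAULT_FPMATH_MODE"] = "BF16"  # allow BF16 fast-math for FP32 GEMMs

import torch  # import after the environment is configured

model = torch.nn.Linear(1024, 1024).eval()
x = torch.randn(32, 1024)
with torch.inference_mode():
    y = model(x)
```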
Figure 2: Performance uplift using Eager mode for various models
PyTorch 2.0 introduced torch.compile to enhance the speed of PyTorch code compared to the default eager mode. Unlike eager mode, torch.compile pre-compiles the entire model into a single graph optimized for specific hardware platforms. From PyTorch 2.3.1 onwards, the official AArch64 wheel includes torch.compile optimizations. These optimizations can deliver up to 2x better performance over Eager mode for TorchBench model inference across various natural language processing (NLP), computer vision (CV), and recommendation models on AWS Graviton3-based Amazon EC2 instances. Further details of the optimizations are available in the PyTorch blog “Accelerated PyTorch inference with torch.compile on AWS Graviton processors”.
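As a rough way to observe this uplift on your own workload (this is not the TorchBench methodology behind the numbers above), a simple timing sketch might look like this:

```python
import time
import torch

def benchmark(fn, x, warmup=10, iters=100):
    # Warm-up runs absorb one-time costs such as torch.compile's compilation.
    with torch.inference_mode():
        for _ in range(warmup):
            fn(x)
        start = time.perf_counter()
        for _ in range(iters):
            fn(x)
    return (time.perf_counter() - start) / iters

# An illustrative transformer layer; substitute your own model here.
model = torch.nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True).eval()
x = torch.randn(4, 128, 256)

eager_ms = benchmark(model, x) * 1e3
compiled_ms = benchmark(torch.compile(model), x) * 1e3
print(f"eager: {eager_ms:.2f} ms/iter, compiled: {compiled_ms:.2f} ms/iter")
```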
Figure 3: Performance uplift in Compile mode for various models
So far, we have looked at how Arm Compute Library enhances PyTorch inference performance in both eager and compile modes. Now, let us look at what is coming soon to PyTorch. Arm is currently working to improve LLM inference performance in PyTorch, with Llama and Gemma as key examples.
Optimal INT4 kernels
Earlier this year, Arm software teams and partners optimized the int4 and int8 kernels implemented in llama.cpp to leverage the newer DOT and MLA instructions. On AWS Graviton3 processors, these kernels resulted in a 2.5x improvement in prompt evaluation over the existing GEMM MMLA kernels, as well as a 2x improvement in text generation over the default vec_dot kernel. These new optimized kernels are also part of the KleidiAI library.
The KleidiAI library, announced at Computex 2024, is an open-source library with optimized micro-kernels for AI tasks on Arm CPUs. Think of a micro-kernel as a small piece of software that boosts the performance of a specific ML operation. Developers can use these micro-kernels by including the relevant .c and .h files along with a common header file; there is no need to include the rest of the library.
Figure 4: Kleidi technology integration with PyTorch
We have introduced two new ATen operations, torch.ops.aten._kai_weights_pack_int4() and torch.ops.aten._kai_input_quant_mm_int4(), which use highly optimized packing and GEMM kernels available in the KleidiAI library. gpt-fast leverages these PyTorch operators to (1) quantize the weights to INT4 using symmetric per-channel quantization and store an additional array containing the quantization scales, and (2) dynamically quantize the activation matrix and execute an INT8 matrix multiplication of the activation matrix and weights using the AArch64 I8MM extension.
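These operators are still work in progress and are not part of the stable PyTorch API; the sketch below only illustrates the intended two-step flow described above, and the argument lists are assumptions for illustration:

```python
import torch

# Illustrative shapes for one linear layer of an LLM block.
out_features, in_features = 4096, 4096
weight = torch.randn(out_features, in_features)
activations = torch.randn(1, in_features)

# Step 1 (offline): quantize the FP32 weights to INT4 with symmetric per-channel
# quantization and pack them, together with the per-channel scales, into the
# layout expected by the KleidiAI GEMM micro-kernels.
# NOTE: the argument list shown here is an assumption for illustration.
packed_weight = torch.ops.aten._kai_weights_pack_int4(weight)

# Step 2 (at inference time): dynamically quantize the activation matrix to INT8
# and run the INT8 matrix multiplication using the AArch64 I8MM extension.
# NOTE: the argument list shown here is an assumption for illustration.
output = torch.ops.aten._kai_input_quant_mm_int4(activations, packed_weight)
```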
Figure 5: 4-bit quantized LLM model performance boost with KleidiAI integration in PyTorch
With this approach, we can improve the inference performance of Llama by up to 18x and Gemma by up to 14x compared to the default PyTorch implementation available today.
Arm and its partners have improved PyTorch inference performance on Arm Neoverse using Kleidi technology available in the Arm Compute Library and KleidiAI library. We see up to a 2x uplift in Eager mode and a further up to 2x in Graph mode (using torch.compile). In addition, work is in progress to improve generative AI model (Llama and Gemma) inference by up to 18x.
https://pytorch.org/tutorials/recipes/inference_tuning_on_aws_graviton.html
https://pytorch.org/blog/accelerated-pytorch-inference/
https://pytorch.org/blog/optimized-pytorch-w-graviton/