This blog is co-authored by Willen Yang and Tianyu Li
Artificial Intelligence (AI) has been making waves across industries, with its exponential growth prominently represented by the advent of Large Language Models (LLMs). These models have revolutionized the way we interact with technology, offering unprecedented capabilities in natural language understanding and generation. While GPUs have been instrumental in training generative AI models, for inference there are more options beyond GPUs and accelerators. CPUs, which have long been used in traditional AI and machine learning (ML) use cases, can handle a wide range of tasks and are more flexible in terms of deployment, making them an attractive option for businesses and developers looking to integrate LLMs into their products and services. In this blog post, we will explore the capabilities of Arm Neoverse N2 based Alibaba Yitian710 CPUs running industry-standard Large Language Models (LLMs), such as LLaMa3 [1] and Qwen1.5 [2], with flexibility and scalability. Additionally, the blog presents a comparative analysis that showcases the key advantages of the Yitian710 CPUs over other CPU-based server platforms.
GEMM (General Matrix Multiplication) is a fundamental operation used extensively in deep learning computations, including those within LLMs. It multiplies two input matrices to produce one output matrix. The Armv8.6-A architecture adds the SMMLA instruction, which multiplies the 2x8 matrix of signed 8-bit integer values in the first source vector by the 8x2 matrix of signed 8-bit integer values in the second source vector. The resulting 2x2 32-bit integer matrix product is then added to the 32-bit integer matrix accumulator in the destination vector. This is equivalent to performing an 8-way dot product per destination element. The SMMLA instruction is supported by the Arm Neoverse N2-based Alibaba Yitian710 CPUs.
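The effect of SMMLA can be illustrated with its ACLE intrinsic, vmmlaq_s32. The sketch below is a minimal, self-contained example, not the actual llama.cpp kernel; the data layout comments and compile flags are assumptions for illustration.

```c
// Minimal sketch of the SMMLA operation via the ACLE intrinsic vmmlaq_s32.
// Requires the I8MM extension (e.g. compile with: gcc -march=armv8.6-a+i8mm).
// Not the actual llama.cpp GEMM kernel; values and layout are illustrative.
#include <arm_neon.h>
#include <stdio.h>

int main(void) {
    // First source: a 2x8 matrix of signed 8-bit values packed row-major
    // into one 128-bit register (row 0 in lanes 0-7, row 1 in lanes 8-15).
    int8_t a_rows[16] = {1, 2, 3, 4, 5, 6, 7, 8,
                         1, 1, 1, 1, 1, 1, 1, 1};
    // Second source: an 8x2 matrix of signed 8-bit values, packed so that
    // column 0 occupies lanes 0-7 and column 1 occupies lanes 8-15.
    int8_t b_cols[16] = {1, 1, 1, 1, 1, 1, 1, 1,
                         2, 2, 2, 2, 2, 2, 2, 2};

    int8x16_t a   = vld1q_s8(a_rows);
    int8x16_t b   = vld1q_s8(b_cols);
    int32x4_t acc = vdupq_n_s32(0);   // 2x2 int32 accumulator, initially zero

    // acc += A(2x8) * B(8x2): each destination lane is an 8-way dot product.
    acc = vmmlaq_s32(acc, a, b);

    int32_t c[4];
    vst1q_s32(c, acc);
    printf("C = [[%d, %d], [%d, %d]]\n", c[0], c[1], c[2], c[3]);
    return 0;
}
```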
In the past few months, Arm software teams and our partners have optimized the int4 and int8 GEMM kernels implemented in llama.cpp leveraging the SMMLA instruction mentioned above. We have recently performed several experiments on the AliCloud Yitian710 cloud instance with the latest optimizations [3][4] to measure the performance of llama.cpp in different scenarios.
All experiments were carried out on an AliCloud ecs.g8y.16xlarge instance with 64 vCPUs (virtual CPUs) and 256 GB of memory. The models used were LLaMa3-8B and Qwen1.5-4B with int4 quantization.
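For readers unfamiliar with block-wise int4 quantization, the sketch below illustrates the general idea: a small group of weights shares one scale, and each weight is stored as a 4-bit value. The struct name, field layout, nibble packing order, and the use of a float scale (llama.cpp stores it in half precision) are simplifying assumptions, not the library's exact definition.

```c
// Illustrative sketch of a block-wise int4 quantization format: 32 weights
// share one scale, and each weight is a 4-bit quant with an implicit offset
// of 8. This is a simplified stand-in for llama.cpp's actual block layout.
#include <stdint.h>
#include <stdio.h>

#define BLOCK_SIZE 32

typedef struct {
    float   scale;                 // per-block scale (fp16 in the real format)
    uint8_t qs[BLOCK_SIZE / 2];    // 32 x 4-bit quants packed two per byte
} block_int4;

// Recover approximate float weights from one quantized block.
static void dequantize_block(const block_int4 *b, float *out) {
    for (int i = 0; i < BLOCK_SIZE / 2; ++i) {
        int lo = (b->qs[i] & 0x0F) - 8;   // low nibble, centered around zero
        int hi = (b->qs[i] >> 4)   - 8;   // high nibble, centered around zero
        out[2 * i]     = lo * b->scale;
        out[2 * i + 1] = hi * b->scale;
    }
}

int main(void) {
    block_int4 b = { .scale = 0.05f };
    for (int i = 0; i < BLOCK_SIZE / 2; ++i) b.qs[i] = 0x9A;  // quants 10 and 9
    float w[BLOCK_SIZE];
    dequantize_block(&b, w);
    printf("w[0] = %f, w[1] = %f\n", w[0], w[1]);  // ~0.10 and ~0.05
    return 0;
}
```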
Processing the prompt tokens is typically done in parallel and uses all the available cores, even for a single request (batch=1), so the prompt processing rate stays flat as batch size increases. The Arm optimizations here speed up the tokens processed per second by up to 2.7x.
Token generation is done in an auto-regressive manner, so the total generation time scales with the length of the output to be generated. The Arm optimizations show a more pronounced benefit at larger batch sizes, increasing throughput by up to 1.9x.
Latency of token generation is very important for interactive deployments of LLMs. A 100ms time-to-next-token latency is a key target metric, based on a typical human reading speed of 5-10 words per second. In the charts below, we see that the AliCloud Yitian710 cloud instance meets that 100ms latency requirement in both single and batched scenarios, making it a suitable deployment target for smaller LLMs. We used two recent models, LLaMa3-8B and Qwen1.5-4B, to show the latency for different sizes of smaller LLMs.
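To relate generation throughput to the 100ms time-to-next-token budget, a simple conversion helps. The sketch below uses made-up throughput figures, not our measured Yitian710 results, purely to show the arithmetic.

```c
// Convert token-generation throughput (tokens/s) into per-token latency (ms)
// and check it against a latency budget. The throughput values below are
// hypothetical placeholders, not measured results.
#include <stdio.h>

int main(void) {
    const double budget_ms = 100.0;                       // time-to-next-token target
    const double tokens_per_sec[] = {12.0, 25.0, 40.0};   // hypothetical rates

    for (int i = 0; i < 3; ++i) {
        double latency_ms = 1000.0 / tokens_per_sec[i];
        printf("%.1f tok/s -> %.1f ms/token (%s the %.0f ms budget)\n",
               tokens_per_sec[i], latency_ms,
               latency_ms <= budget_ms ? "meets" : "misses", budget_ms);
    }
    return 0;
}
```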
We also compared the performance of the LLaMa3-8B int4 quantized model on Yitian710 vs. other server CPUs on AliCloud [*].
* AliCloud Yitian710 with the optimizations in [3][4]; off-the-shelf llama.cpp implementation for Intel Ice Lake and Sapphire Rapids.
We found that AliCloud Yitian710 delivers up to 3.2x better performance for prompt processing and 2.2x better performance for token generation.
It is also important to note that the AliCloud Yitian710 platform is much more cost effective than Ice Lake and Sapphire Rapids, which is reflected in the lower pricing of AliCloud Yitian710 instances. This gives the Yitian710 a significant TCO advantage for LLM inferencing, delivering up to 3 times more tokens/$, a compelling benefit for users looking to start small and scale in their LLM adoption journey.
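The tokens/$ comparison follows directly from throughput and instance pricing. The sketch below shows the calculation with hypothetical throughput and hourly prices; the actual AliCloud list prices and our measured rates are not reproduced here.

```c
// Tokens-per-dollar sketch: tokens/$ = throughput (tokens/s) * 3600 / hourly price.
// Both the throughput and price numbers below are hypothetical placeholders,
// not actual AliCloud prices or measured results.
#include <stdio.h>

int main(void) {
    struct {
        const char *name;
        double tokens_per_sec;   // hypothetical generation throughput
        double price_per_hour;   // hypothetical on-demand hourly price ($)
    } insts[] = {
        {"Arm instance (hypothetical)", 30.0, 2.0},
        {"x86 instance (hypothetical)", 20.0, 4.0},
    };

    for (int i = 0; i < 2; ++i) {
        double tokens_per_dollar =
            insts[i].tokens_per_sec * 3600.0 / insts[i].price_per_hour;
        printf("%-30s %.0f tokens/$\n", insts[i].name, tokens_per_dollar);
    }
    return 0;
}
```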
For developers looking to deploy smaller, focused Large Language Models (LLMs) within their applications, server CPUs provide a flexible, cost-effective and streamlined deployment process. Arm has incorporated several pivotal enhancements to substantially boost the performance of LLMs. These improvements allow Arm Neoverse-based server processors, such as the AliCloud Yitian710, to deliver best-in-class LLM performance over other server CPUs. Additionally, they help to reduce the barriers to entry for LLM adoption, making it more accessible to a broader spectrum of application developers.