This blog is co-authored by Willen Yang and Tianyu Li
Artificial Intelligence (AI) has been making waves across industries, with its exponential growth prominently represented by the advent of Large Language Models (LLMs). These models have revolutionized the way we interact with technology, offering unprecedented capabilities in natural language understanding and generation. While GPUs have been instrumental in training generative AI models, for inference there are more options beyond GPUs and accelerators. CPUs, which have long been used in traditional AI and machine learning (ML) use cases, can handle a wide range of tasks and are more flexible in terms of deployment, making them an attractive option for businesses and developers looking to integrate LLMs into their products and services. In this blog post, we will explore the capabilities of Arm Neoverse N2 based Alibaba Yitian710 CPUs running industry-standard Large Language Models (LLMs), such as LLaMa3 [1] and Qwen1.5 [2], with flexibility and scalability. Additionally, the blog presents a comparative analysis that showcases the key advantages of the Yitian710 CPUs over other CPU-based server platforms.
GEMM (General Matrix Multiplication) is a fundamental operation used extensively in deep learning computations, including those within LLMs. It multiplies two input matrices to produce one output matrix. The Armv8.6-A architecture adds the SMMLA instruction, which multiplies the 2x8 matrix of signed 8-bit integer values in the first source vector by the 8x2 matrix of signed 8-bit integer values in the second source vector. The resulting 2x2 32-bit integer matrix product is then added to the 32-bit integer matrix accumulator in the destination vector. This is equivalent to performing an 8-way dot product per destination element. The SMMLA instruction is supported by the Arm Neoverse N2-based Alibaba Yitian710 CPUs.
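The effect of SMMLA can be illustrated with its ACLE intrinsic, vmmlaq_s32. The sketch below is a minimal, self-contained example, not the actual llama.cpp kernel; the data layout comments and compile flags are assumptions for illustration.

```c
// Minimal sketch of the SMMLA operation via the ACLE intrinsic vmmlaq_s32.
// Requires the I8MM extension (e.g. compile with: gcc -march=armv8.6-a+i8mm).
// Not the actual llama.cpp GEMM kernel; values and layout are illustrative.
#include <arm_neon.h>
#include <stdio.h>

int main(void) {
    // First source: a 2x8 matrix of signed 8-bit values packed row-major
    // into one 128-bit register (row 0 in lanes 0-7, row 1 in lanes 8-15).
    int8_t a_rows[16] = {1, 2, 3, 4, 5, 6, 7, 8,
                         1, 1, 1, 1, 1, 1, 1, 1};
    // Second source: an 8x2 matrix of signed 8-bit values, packed so that
    // column 0 occupies lanes 0-7 and column 1 occupies lanes 8-15.
    int8_t b_cols[16] = {1, 1, 1, 1, 1, 1, 1, 1,
                         2, 2, 2, 2, 2, 2, 2, 2};

    int8x16_t a   = vld1q_s8(a_rows);
    int8x16_t b   = vld1q_s8(b_cols);
    int32x4_t acc = vdupq_n_s32(0);   // 2x2 int32 accumulator, initially zero

    // acc += A(2x8) * B(8x2): each destination lane is an 8-way dot product.
    acc = vmmlaq_s32(acc, a, b);

    int32_t c[4];
    vst1q_s32(c, acc);
    printf("C = [[%d, %d], [%d, %d]]\n", c[0], c[1], c[2], c[3]);
    return 0;
}
```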
In the past few months, Arm software teams and our partners have optimized the int4 and int8 GEMM kernels implemented in llama.cpp leveraging the SMMLA instruction mentioned above. We have recently performed several experiments on the AliCloud Yitian710 cloud instance with the latest optimizations [3][4] to measure the performance of llama.cpp in different scenarios.
All experiments were carried out on an AliCloud ecs.g8y.16xlarge instance with 64 vCPUs (virtual CPUs) and 256 GB of memory. The models used were LLaMa3-8B and Qwen1.5-4B with int4 quantization.
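For readers unfamiliar with block-wise int4 quantization, the sketch below illustrates the general idea: a small group of weights shares one scale, and each weight is stored as a 4-bit value. The struct name, field layout, nibble packing order, and the use of a float scale (llama.cpp stores it in half precision) are simplifying assumptions, not the library's exact definition.

```c
// Illustrative sketch of a block-wise int4 quantization format: 32 weights
// share one scale, and each weight is a 4-bit quant with an implicit offset
// of 8. This is a simplified stand-in for llama.cpp's actual block layout.
#include <stdint.h>
#include <stdio.h>

#define BLOCK_SIZE 32

typedef struct {
    float   scale;                 // per-block scale (fp16 in the real format)
    uint8_t qs[BLOCK_SIZE / 2];    // 32 x 4-bit quants packed two per byte
} block_int4;

// Recover approximate float weights from one quantized block.
static void dequantize_block(const block_int4 *b, float *out) {
    for (int i = 0; i < BLOCK_SIZE / 2; ++i) {
        int lo = (b->qs[i] & 0x0F) - 8;   // low nibble, centered around zero
        int hi = (b->qs[i] >> 4)   - 8;   // high nibble, centered around zero
        out[2 * i]     = lo * b->scale;
        out[2 * i + 1] = hi * b->scale;
    }
}

int main(void) {
    block_int4 b = { .scale = 0.05f };
    for (int i = 0; i < BLOCK_SIZE / 2; ++i) b.qs[i] = 0x9A;  // quants 10 and 9
    float w[BLOCK_SIZE];
    dequantize_block(&b, w);
    printf("w[0] = %f, w[1] = %f\n", w[0], w[1]);  // ~0.10 and ~0.05
    return 0;
}
```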
Processing the prompt tokens is typically done in parallel and uses all the available cores, even for a single request (batch=1), so the prompt processing rate stays flat as batch size increases. The Arm optimizations here speed up the tokens processed per second by up to 2.7x.
Token generation is done in an auto-regressive manner, so the total generation time scales with the length of the output to be generated. The Arm optimizations show a more pronounced benefit at larger batch sizes, increasing throughput by up to 1.9x.
Latency of token generation is very important for interactive deployments of LLMs. A 100ms time-to-next-token latency is a key target metric, based on a typical human reading speed of 5-10 words per second. In the charts below, we see that the AliCloud Yitian710 cloud instance meets that 100ms latency requirement in both single and batched scenarios, making it a suitable deployment target for smaller LLMs. We used two recent models, LLaMa3-8B and Qwen1.5-4B, to show the latency for different sizes of smaller LLMs.
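To relate generation throughput to the 100ms time-to-next-token budget, a simple conversion helps. The sketch below uses made-up throughput figures, not our measured Yitian710 results, purely to show the arithmetic.

```c
// Convert token-generation throughput (tokens/s) into per-token latency (ms)
// and check it against a latency budget. The throughput values below are
// hypothetical placeholders, not measured results.
#include <stdio.h>

int main(void) {
    const double budget_ms = 100.0;                       // time-to-next-token target
    const double tokens_per_sec[] = {12.0, 25.0, 40.0};   // hypothetical rates

    for (int i = 0; i < 3; ++i) {
        double latency_ms = 1000.0 / tokens_per_sec[i];
        printf("%.1f tok/s -> %.1f ms/token (%s the %.0f ms budget)\n",
               tokens_per_sec[i], latency_ms,
               latency_ms <= budget_ms ? "meets" : "misses", budget_ms);
    }
    return 0;
}
```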
We also compared the performance of the LLaMa3-8B int4 quantized model on Yitian710 vs. other server CPUs on AliCloud [*].
* AliCloud Yitian710 with the optimizations in [3][4]; off-the-shelf llama.cpp implementation for Intel Ice Lake and Sapphire Rapids.
We found that AliCloud Yitian710 delivers up to 3.2x better performance for prompt processing and 2.2x better performance for token generation.
It is also important to note that the AliCloud Yitian710 platform is much more cost effective than Ice Lake and Sapphire Rapids, which is reflected in the lower pricing of AliCloud Yitian710 instances. This gives the Yitian710 a significant TCO advantage for LLM inferencing, delivering up to 3 times more tokens/$, a compelling benefit for users looking to start small and scale in their LLM adoption journey.
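The tokens/$ comparison follows directly from throughput and instance pricing. The sketch below shows the calculation with hypothetical throughput and hourly prices; the actual AliCloud list prices and our measured rates are not reproduced here.

```c
// Tokens-per-dollar sketch: tokens/$ = throughput (tokens/s) * 3600 / hourly price.
// Both the throughput and price numbers below are hypothetical placeholders,
// not actual AliCloud prices or measured results.
#include <stdio.h>

int main(void) {
    struct {
        const char *name;
        double tokens_per_sec;   // hypothetical generation throughput
        double price_per_hour;   // hypothetical on-demand hourly price ($)
    } insts[] = {
        {"Arm instance (hypothetical)", 30.0, 2.0},
        {"x86 instance (hypothetical)", 20.0, 4.0},
    };

    for (int i = 0; i < 2; ++i) {
        double tokens_per_dollar =
            insts[i].tokens_per_sec * 3600.0 / insts[i].price_per_hour;
        printf("%-30s %.0f tokens/$\n", insts[i].name, tokens_per_dollar);
    }
    return 0;
}
```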
For developers looking to deploy smaller, focused Large Language Models (LLMs) within their applications, server CPUs provide a flexible, cost-effective and streamlined deployment process. Arm has incorporated several pivotal enhancements to substantially boost the performance of LLMs. These improvements allow Arm Neoverse-based server processors, such as the AliCloud Yitian710, to deliver best-in-class LLM performance over other server CPUs. Additionally, they help to reduce the barriers to entry for LLM adoption, making it more accessible to a broader spectrum of application developers.