Generative AI has captured the attention of the tech industry over the past year, and developers everywhere are looking for ways to deploy LLMs in their applications on both cloud and edge servers. The default platform of choice for these deployments has been GPUs and accelerators, which provide the best performance. This blog post describes how Arm Neoverse V1-based AWS Graviton3 CPUs can run industry-standard LLMs such as Llama 3 [1] and Phi-3 [2] flexibly and at scale, and showcases their key advantages over other CPU-based server platforms.
To demonstrate the capability of Arm-based server CPUs for LLM inference, Arm software teams and our partners optimized the int4 and int8 kernels in llama.cpp to take advantage of newer Arm instructions such as the int8 matrix-multiply (SMMLA) and dot-product (SDOT) instructions available on Neoverse cores [3][4]. We conducted several experiments on the AWS Graviton3 platform to measure and isolate the performance impact in different scenarios.
All experiments were carried out on an AWS c7g.16xlarge instance with 64 vCPUs (virtual CPUs) and 128 GB of memory. The model used was Llama 3 8B with int4 quantization.
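As a rough illustration of the setup (not the exact harness behind the figures below), the model can be loaded and exercised through the llama-cpp-python bindings. The GGUF file name is a placeholder for a locally quantized int4 (Q4_0) Llama 3 8B model.

```python
# Minimal sketch: load an int4 (Q4_0) Llama 3 8B GGUF with llama-cpp-python
# and run one completion using all 64 vCPUs of the instance.
from llama_cpp import Llama

llm = Llama(
    model_path="llama-3-8b-instruct.Q4_0.gguf",  # placeholder: quantize/download separately
    n_ctx=2048,          # context window used for the experiments
    n_threads=64,        # threads for token generation
    n_threads_batch=64,  # threads for (batched) prompt processing
)

out = llm("Explain what an Arm Neoverse V1 core is.", max_tokens=128)
print(out["choices"][0]["text"])
print(out["usage"])  # prompt_tokens / completion_tokens actually processed
```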
Prompt processing is done in parallel and uses all available cores, even for a single request (batch size = 1). The Arm optimizations speed up prompt tokens processed per second by up to 2.5x, with only minor additional gains from larger batch sizes.
Figure 1. Optimization Uplift for Prompt Processing
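One simple way to approximate the prompt-processing (prefill) rate is to time a completion that generates only a single token, so the elapsed time is dominated by prefill. A minimal sketch, using the same hypothetical llama-cpp-python setup and placeholder model file as above:

```python
import time
from llama_cpp import Llama

llm = Llama(model_path="llama-3-8b-instruct.Q4_0.gguf",  # placeholder path
            n_ctx=2048, n_threads=64, n_threads_batch=64)

prompt = " ".join(["benchmark"] * 512)   # synthetic long prompt

t0 = time.perf_counter()
out = llm(prompt, max_tokens=1)          # elapsed time is dominated by prefill
elapsed = time.perf_counter() - t0

n_prompt = out["usage"]["prompt_tokens"]
print(f"prefill throughput: {n_prompt / elapsed:.1f} prompt tokens/s")
```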
Token generation is auto-regressive and is highly sensitive to the length of the output to be generated. The Arm optimizations help most at larger batch sizes, increasing throughput by up to 2x.
Figure 2. Optimization Uplift for Token Generation
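On the generation side, streaming the completion and timing the gap between consecutive tokens gives the time-to-next-token and the single-stream decode throughput. A sketch under the same assumptions as above (batched throughput would additionally require issuing several requests in parallel):

```python
import time
from llama_cpp import Llama

llm = Llama(model_path="llama-3-8b-instruct.Q4_0.gguf",  # placeholder path
            n_ctx=2048, n_threads=64)

# Stream the completion and record the arrival time of each generated token.
stamps = []
for _ in llm("Write a short story about a CPU.", max_tokens=64, stream=True):
    stamps.append(time.perf_counter())

# Average gap between consecutive tokens; the first token is excluded because
# its latency also includes prompt processing.
gaps = [b - a for a, b in zip(stamps, stamps[1:])]
print(f"avg time-to-next-token: {1000 * sum(gaps) / len(gaps):.1f} ms")
print(f"decode throughput: {len(gaps) / sum(gaps):.1f} tokens/s")
```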
Token-generation latency is critical for interactive LLM deployments. A time-to-next-token latency of 100 ms is a key target metric, derived from a typical human reading speed of 5-10 words per second. In the charts below, we see that AWS Graviton3 meets that 100 ms latency requirement in both single and batched scenarios, making it a suitable deployment target for LLMs.
We used two recent models, Llama 3 8B and Phi-3-mini (3.8B), to show the latency for different sizes of smaller LLMs.
Figure 3. Time-to-next-token latency for AWS Graviton3
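The reading-speed argument behind the 100 ms target can be made explicit with a quick back-of-the-envelope calculation (the 0.75 words-per-token figure is a common rule of thumb for English text, not a measured value):

```python
# Back-of-the-envelope check of the 100 ms time-to-next-token target.
ms_per_token = 100
words_per_token = 0.75                                   # rule-of-thumb ratio

tokens_per_second = 1000 / ms_per_token                  # 10 tokens/s
words_per_second = tokens_per_second * words_per_token   # ~7.5 words/s

print(f"{tokens_per_second:.0f} tokens/s ~= {words_per_second:.1f} words/s")
# This sits within the 5-10 words-per-second reading-speed band cited above.
```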
Even previous-generation Arm server platforms like AWS Graviton2 (launched in 2019) can run the latest LLMs of up to 8B parameters and still meet the 100 ms latency requirement in both single and batched scenarios.
Figure 4. Time-to-next-token latency for AWS Graviton2
We also compared the performance of the Llama 3 8B int4-quantized model on AWS Graviton3 against other current-generation server CPUs available on AWS.
We found that AWS Graviton3 delivers up to 3x better performance for both prompt processing and token generation.
Figure 5. Prompt processing comparison
Figure 6. Token generation comparison
It is also important to note that AWS Graviton3 CPUs are much more cost-effective than 4th-generation x86 CPUs, which is reflected in the lower pricing of Graviton3 instances. Given the already high compute requirements of LLMs, total cost of ownership (TCO) measured in tokens per dollar (tokens/$) is a key factor driving LLM adoption in the data center.
AWS Graviton3 has a significant advantage here, delivering up to 3x more tokens/$. This is not only class-leading among server CPUs, it also offers a compelling path for users who want to start small and scale up as their LLM adoption grows.
Figure 7. TCO comparison for LLM Inferencing
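Tokens per dollar follows directly from measured generation throughput and the instance's on-demand price. A minimal sketch of the calculation; both inputs are illustrative placeholders to be replaced with your own measurements and the current price from the AWS pricing page, not the numbers behind Figure 7:

```python
# Sketch of the tokens/$ calculation used for the TCO comparison.
tokens_per_second = 100.0   # illustrative placeholder: measured aggregate throughput
usd_per_hour = 2.0          # illustrative placeholder: instance on-demand price

tokens_per_dollar = tokens_per_second * 3600 / usd_per_hour
print(f"{tokens_per_dollar:,.0f} tokens per dollar")
```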
Server CPUs give developers a flexible, cost-effective and quick way to start deploying smaller, focused LLMs in their applications. Arm has added key features that significantly improve LLM performance, enabling Arm Neoverse-based server processors like AWS Graviton3 to deliver best-in-class LLM performance among server CPUs while lowering the barrier to entry for a much wider set of application developers.