Generative AI has captured the attention of the tech industry over the past year, and developers everywhere are looking for ways to deploy LLMs in their applications on both cloud and edge servers. The default platform of choice for these deployments has been GPUs and accelerators, which provide the best performance. This blog post describes how Arm Neoverse V1-based AWS Graviton3 CPUs can run industry-standard LLMs such as Llama 3 [1] and Phi-3 [2] flexibly and at scale, and showcases their key advantages over other CPU-based server platforms.
To demonstrate the capability of Arm-based server CPUs for LLM inference, Arm software teams and our partners optimized the int4 and int8 kernels in llama.cpp to take advantage of newer Arm instructions such as the int8 matrix-multiply (SMMLA) and dot-product (SDOT) instructions available on Neoverse cores [3][4]. We conducted several experiments on the AWS Graviton3 platform to measure and isolate the performance impact in different scenarios.
All experiments were carried out on an AWS c7g.16xlarge instance with 64 vCPUs (virtual CPUs) and 128 GB of memory. The model used was Llama 3 8B with int4 quantization.
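As a rough illustration of the setup (not the exact harness behind the figures below), the model can be loaded and exercised through the llama-cpp-python bindings. The GGUF file name is a placeholder for a locally quantized int4 (Q4_0) Llama 3 8B model.

```python
# Minimal sketch: load an int4 (Q4_0) Llama 3 8B GGUF with llama-cpp-python
# and run one completion using all 64 vCPUs of the instance.
from llama_cpp import Llama

llm = Llama(
    model_path="llama-3-8b-instruct.Q4_0.gguf",  # placeholder: quantize/download separately
    n_ctx=2048,          # context window used for the experiments
    n_threads=64,        # threads for token generation
    n_threads_batch=64,  # threads for (batched) prompt processing
)

out = llm("Explain what an Arm Neoverse V1 core is.", max_tokens=128)
print(out["choices"][0]["text"])
print(out["usage"])  # prompt_tokens / completion_tokens actually processed
```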
Prompt processing is done in parallel and uses all available cores, even for a single request (batch size = 1). The Arm optimizations speed up prompt tokens processed per second by up to 2.5x, with only minor additional gains from larger batch sizes.
Figure 1. Optimization Uplift for Prompt Processing
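One simple way to approximate the prompt-processing (prefill) rate is to time a completion that generates only a single token, so the elapsed time is dominated by prefill. A minimal sketch, using the same hypothetical llama-cpp-python setup and placeholder model file as above:

```python
import time
from llama_cpp import Llama

llm = Llama(model_path="llama-3-8b-instruct.Q4_0.gguf",  # placeholder path
            n_ctx=2048, n_threads=64, n_threads_batch=64)

prompt = " ".join(["benchmark"] * 512)   # synthetic long prompt

t0 = time.perf_counter()
out = llm(prompt, max_tokens=1)          # elapsed time is dominated by prefill
elapsed = time.perf_counter() - t0

n_prompt = out["usage"]["prompt_tokens"]
print(f"prefill throughput: {n_prompt / elapsed:.1f} prompt tokens/s")
```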
Token generation is auto-regressive and is highly sensitive to the length of the output to be generated. The Arm optimizations help most at larger batch sizes, increasing throughput by up to 2x.
Figure 2. Optimization Uplift for Token Generation
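On the generation side, streaming the completion and timing the gap between consecutive tokens gives the time-to-next-token and the single-stream decode throughput. A sketch under the same assumptions as above (batched throughput would additionally require issuing several requests in parallel):

```python
import time
from llama_cpp import Llama

llm = Llama(model_path="llama-3-8b-instruct.Q4_0.gguf",  # placeholder path
            n_ctx=2048, n_threads=64)

# Stream the completion and record the arrival time of each generated token.
stamps = []
for _ in llm("Write a short story about a CPU.", max_tokens=64, stream=True):
    stamps.append(time.perf_counter())

# Average gap between consecutive tokens; the first token is excluded because
# its latency also includes prompt processing.
gaps = [b - a for a, b in zip(stamps, stamps[1:])]
print(f"avg time-to-next-token: {1000 * sum(gaps) / len(gaps):.1f} ms")
print(f"decode throughput: {len(gaps) / sum(gaps):.1f} tokens/s")
```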
Token-generation latency is critical for interactive LLM deployments. A time-to-next-token latency of 100 ms is a key target metric, derived from a typical human reading speed of 5-10 words per second. In the charts below, we see that AWS Graviton3 meets that 100 ms latency requirement in both single and batched scenarios, making it a suitable deployment target for LLMs.
We used two recent models, Llama 3 8B and Phi-3-mini (3.8B), to show the latency for different sizes of smaller LLMs.
Figure 3. Time-to-next-token latency for AWS Graviton3
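The reading-speed argument behind the 100 ms target can be made explicit with a quick back-of-the-envelope calculation (the 0.75 words-per-token figure is a common rule of thumb for English text, not a measured value):

```python
# Back-of-the-envelope check of the 100 ms time-to-next-token target.
ms_per_token = 100
words_per_token = 0.75                                   # rule-of-thumb ratio

tokens_per_second = 1000 / ms_per_token                  # 10 tokens/s
words_per_second = tokens_per_second * words_per_token   # ~7.5 words/s

print(f"{tokens_per_second:.0f} tokens/s ~= {words_per_second:.1f} words/s")
# This sits within the 5-10 words-per-second reading-speed band cited above.
```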
Even previous-generation Arm server platforms like AWS Graviton2 (launched in 2019) can run the latest LLMs of up to 8B parameters and still meet the 100 ms latency requirement in both single and batched scenarios.
Figure 4. Time-to-next-token latency for AWS Graviton2
We also compared the performance of the Llama 3 8B int4-quantized model on AWS Graviton3 against other current-generation server CPUs available on AWS.
We found that AWS Graviton3 delivers up to 3x better performance for both prompt processing and token generation.
Figure 5. Prompt processing comparison
Figure 6. Token generation comparison
It is also important to note that AWS Graviton3 CPUs are much more cost-effective than 4th-generation x86 CPUs, which is reflected in the lower pricing of Graviton3 instances. Given the already high compute requirements of LLMs, total cost of ownership (TCO) measured in tokens per dollar (tokens/$) is a key factor driving LLM adoption in the data center.
AWS Graviton3 has a significant advantage here, delivering up to 3x more tokens/$. This is not only class-leading among server CPUs, it also offers a compelling path for users who want to start small and scale up as their LLM adoption grows.
Figure 7. TCO comparison for LLM Inferencing
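Tokens per dollar follows directly from measured generation throughput and the instance's on-demand price. A minimal sketch of the calculation; both inputs are illustrative placeholders to be replaced with your own measurements and the current price from the AWS pricing page, not the numbers behind Figure 7:

```python
# Sketch of the tokens/$ calculation used for the TCO comparison.
tokens_per_second = 100.0   # illustrative placeholder: measured aggregate throughput
usd_per_hour = 2.0          # illustrative placeholder: instance on-demand price

tokens_per_dollar = tokens_per_second * 3600 / usd_per_hour
print(f"{tokens_per_dollar:,.0f} tokens per dollar")
```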
Server CPUs give developers a flexible, cost-effective and quick way to start deploying smaller, focused LLMs in their applications. Arm has added key features that significantly improve LLM performance, enabling Arm Neoverse-based server processors like AWS Graviton3 to deliver best-in-class LLM performance among server CPUs while lowering the barrier to entry for a much wider set of application developers.