Best-in-class LLM performance on Arm Neoverse V1 based AWS Graviton3 CPUs

Ravi Malhotra
May 22, 2024
3 minute read time.

Generative AI has captured the attention of the tech industry over the past year, and everyone is finding ways to deploy LLMs into their applications on both cloud and edge servers. The default platform of choice for these deployments has been GPUs and accelerators, which provide the best performance. This blog post describes the capabilities of Arm Neoverse V1-based AWS Graviton3 CPUs in running industry-standard LLMs such as Llama 3 [1] and Phi-3 [2] flexibly and at scale, and showcases their key advantages over other CPU-based server platforms.

Performance of LLMs on AWS Graviton3

To demonstrate the capability of Arm-based server CPUs for LLM inferencing, Arm software teams and our partners optimized the int4 and int8 kernels in llama.cpp to take advantage of newer instructions available on Neoverse cores, such as the int8 matrix-multiply (MMLA) instructions [3][4]. We then ran several experiments on the AWS Graviton3 platform to measure and isolate the performance impact in different scenarios.

All experiments were carried out on an AWS c7g.16xlarge instance with 64 vCPUs (virtual CPUs) and 128 GB of memory. The model used was Llama 3 8B with int4 quantization.
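
As an illustrative sketch (not the exact harness used for these results), a comparable setup can be reproduced with the llama-cpp-python bindings to llama.cpp. The model filename below is hypothetical; any int4 (Q4_0) GGUF build of Llama 3 8B would do.

    # Sketch: load an int4-quantized Llama 3 8B model via llama-cpp-python.
    # Assumes `pip install llama-cpp-python` and a local GGUF file;
    # the filename "llama-3-8b.Q4_0.gguf" is a placeholder.
    from llama_cpp import Llama

    llm = Llama(
        model_path="llama-3-8b.Q4_0.gguf",  # hypothetical local file
        n_ctx=2048,      # context window
        n_threads=64,    # one thread per vCPU on c7g.16xlarge
        verbose=False,
    )

    out = llm("The capital of France is", max_tokens=16)
    print(out["choices"][0]["text"])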

Prompt processing

Processing the prompt tokens is done in parallel and uses all available cores, even for a single request (batch=1). The Arm optimizations speed up prompt processing by up to 2.5x in tokens processed per second, with minor additional gains from larger batch sizes.

Figure 1. Optimization Uplift for Prompt Processing
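
One way to approximate this measurement from Python (a sketch, not the exact methodology behind Figure 1) is to time the evaluation of a long prompt while generating only a single token, reusing the llm object from the earlier snippet:

    import time

    # Sketch: approximate prompt-processing throughput (prompt tokens/s).
    prompt = "Summarize the following text. " * 40  # a long-ish prompt
    n_prompt_tokens = len(llm.tokenize(prompt.encode("utf-8")))

    start = time.perf_counter()
    llm(prompt, max_tokens=1)  # runtime is dominated by prompt evaluation
    elapsed = time.perf_counter() - start

    print(f"~{n_prompt_tokens / elapsed:.1f} prompt tokens/s")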

Token generation

Token generation is done in an auto-regressive manner and is highly sensitive to the length of the output to be generated. The Arm optimizations help most at larger batch sizes, increasing throughput by up to 2x.

Figure 2. Optimization Uplift for Token Generation
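
Generation throughput can be approximated the same way (again an illustrative sketch, reusing the llm object): time a call that produces a fixed number of tokens and divide by the elapsed time.

    import time

    # Sketch: approximate token-generation throughput (generated tokens/s).
    start = time.perf_counter()
    out = llm("Write a short story about a robot:", max_tokens=128)
    elapsed = time.perf_counter() - start

    n_generated = out["usage"]["completion_tokens"]  # tokens actually produced
    print(f"~{n_generated / elapsed:.1f} generated tokens/s")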

Latency

Latency of token generation is critical for interactive deployments of LLMs. A time-to-next-token latency of 100ms is a key target metric, based on a typical human reading speed of 5-10 words per second. The charts below show that AWS Graviton3 meets the 100ms latency target in both single and batched scenarios, making it a suitable deployment target for LLMs.

We used two recent models, Llama 3 8B and Phi-3-mini (3.8B), to show the latency of smaller LLMs at different sizes.

Figure 3. Time-to-next-token latency for AWS Graviton3
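
Time-to-next-token can be approximated client-side by streaming the completion and recording the gap between successive chunks; the sketch below (reusing the llm object from earlier) illustrates the idea.

    import time

    # Sketch: measure average time-to-next-token via streaming.
    timestamps = []
    for _ in llm("Explain how CPUs execute instructions:",
                 max_tokens=64, stream=True):
        timestamps.append(time.perf_counter())  # one entry per token chunk

    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    avg_ms = 1000 * sum(gaps) / len(gaps)
    print(f"average time-to-next-token: ~{avg_ms:.1f} ms (target: <100 ms)")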

Even previous-generation Arm server platforms like AWS Graviton2 (launched in 2019) can run the latest LLMs of up to 8B parameters and still meet the 100ms latency target in both single and batched scenarios.

Figure 4. Time-to-next-token latency for AWS Graviton2

Performance comparisons

We also compared the performance of the Llama 3 8B int4-quantized model on AWS Graviton3 against other latest-generation server CPUs available on AWS:

  • AWS Graviton3: c7g.16xlarge, 64 vCPUs, 128 GB memory, $2.31/hr
  • 4th Gen Intel Xeon: c7i.16xlarge, 64 vCPUs, 128 GB memory, $2.86/hr
  • 4th Gen AMD EPYC: c7a.16xlarge, 64 vCPUs (SMT off), 128 GB memory, $3.28/hr

We found that AWS Graviton3 delivers up to 3x better performance, both for prompt processing and for token generation.

Figure 5. Prompt processing comparison

Figure 6. Token generation comparison

AWS Graviton3 CPUs are also much more cost-effective than 4th-generation x86 CPUs, which is reflected in the lower pricing of Graviton3 instances. Given the already high compute requirements of LLMs, TCO measured in tokens/$ is a key factor driving LLM adoption in the data center.

AWS Graviton3 has a significant advantage here, delivering up to 3x higher tokens/$. This is not only class-leading among CPUs, it is also a compelling advantage for users looking to start small and scale up on their LLM adoption journey.
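
To make the tokens/$ arithmetic concrete, here is a small worked sketch. The throughput figures are illustrative placeholders, not measured results; only the hourly prices come from the instance list above.

    # Sketch: tokens per dollar = (tokens/s) * 3600 / ($/hr).
    # Throughputs are PLACEHOLDERS for illustration, not measured values.
    instances = {
        "c7g.16xlarge (Graviton3)": (25.0, 2.31),  # (tokens/s, $/hr)
        "c7i.16xlarge (Xeon)":      (10.0, 2.86),
        "c7a.16xlarge (EPYC)":      (10.0, 3.28),
    }

    for name, (tok_s, price) in instances.items():
        print(f"{name}: ~{tok_s * 3600 / price:,.0f} tokens/$")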

Figure 7. TCO comparison for LLM Inferencing

Conclusion

Server CPUs provide a flexible, cost-effective, and quick starting point for developers looking to deploy smaller, focused LLMs in their applications. Arm has added key features to the Neoverse architecture that significantly improve LLM performance. These enable Arm Neoverse-based server processors like AWS Graviton3 to deliver best-in-class LLM performance among server CPUs, while lowering the barrier to entry for a much wider set of application developers.

References

  [1] https://ai.meta.com/blog/meta-llama-3/
  [2] https://azure.microsoft.com/en-us/blog/introducing-phi-3-redefining-whats-possible-with-slms/
  [3] https://github.com/ggerganov/llama.cpp/pull/5780
  [4] https://github.com/ggerganov/llama.cpp/pull/4966