Running Llama 3 70B on the AWS Graviton4 CPU with Human Readable Performance

Na Li
October 21, 2024
4 minute read time.
This blog post was co-authored by Na Li and Masoud Koleini.

Amazon Web Services (AWS) introduced its latest generation of custom-designed Arm-based CPUs, the AWS Graviton4 processors, which became generally available on July 9, 2024. These state-of-the-art processors are built on the Arm Neoverse V2 core, which implements the 64-bit Arm instruction set architecture, positioning Graviton4 as a highly efficient and powerful solution for a variety of cloud-based applications [1].

In our analysis, we assessed the inference performance of large language models on AWS EC2 instances powered by Graviton4 processors, specifically the C8g instance types. We ran models ranging from 3.8 billion to 70 billion parameters [2-5], quantized with the Q4_0_4_8 format optimized for Arm kernels, and used llama.cpp for benchmarking [6]. Additionally, we ran the same models on EC2 instances with the previous-generation Graviton3 processors for comparison.
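For readers who want to reproduce a similar measurement, the sketch below shows a minimal throughput check using the llama-cpp-python bindings rather than the llama.cpp CLI used for the published numbers; the model path, prompt, and thread count are placeholder assumptions, not the exact benchmark configuration.

# Minimal throughput sketch using the llama-cpp-python bindings.
# Assumes a locally downloaded GGUF file, e.g. the Q4_0-quantized
# Llama 3 8B model from reference [4]; path and settings are placeholders.
import time
from llama_cpp import Llama

llm = Llama(
    model_path="Meta-Llama-3-8B.Q4_0.gguf",  # hypothetical local path
    n_threads=64,  # e.g. one thread per vCPU on a 16xlarge instance
    n_ctx=2048,
    verbose=False,
)

prompt = "Explain why quantization makes LLM inference practical on CPUs."
start = time.perf_counter()
out = llm(prompt, max_tokens=128)
elapsed = time.perf_counter() - start

n_generated = out["usage"]["completion_tokens"]
print(f"{n_generated} tokens in {elapsed:.2f}s -> {n_generated / elapsed:.1f} tokens/s")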

The 70B Llama 3 model, known for its computational intensity, was successfully deployed on CPUs using quantization, achieving a token generation rate of 5-10 tokens per second, faster than human reading speed. Graviton4 instances provided the flexibility to support models up to 70 billion parameters, a capacity that Graviton3 instances could not accommodate.

Llama 3 70B performs faster than human readability on AWS Graviton4

The AWS Graviton4 processor is capable of executing larger language models than its predecessor, Graviton3. To assess how the Graviton4 processor handles LLMs of different parameter sizes, we deployed three models (Llama 3 70B, Phi3-Mini 3.8B, and Llama 3 8B) on the Graviton4 C8g.16xlarge instance and measured their inference performance. The primary performance metric was the latency of next-token generation, as illustrated in Figure 1. Although the Llama 3 70B model exhibited higher latency than the smaller models, it still achieved a human-readability level of 5-10 tokens per second with a batch size of 1, closely meeting the target service-level agreement (SLA) of 100 ms for next-token generation latency.

Figure 1: Time-to-next-token latency when running the Llama 3 70B, Phi3-Mini 3.8B, and Llama 3 8B models on a C8g.16xlarge instance. The batch size simulates the scenario where one or multiple users call the model simultaneously.
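To make the metric concrete, the following sketch records the gap between consecutive streamed tokens; an average gap of 100 ms corresponds to 10 tokens per second. It assumes the llama-cpp-python bindings and a placeholder model path, not the harness behind Figure 1.

# Next-token latency sketch: stream tokens and time the gaps between them.
# llama-cpp-python and the model path are assumptions for illustration.
import time
from llama_cpp import Llama

llm = Llama(model_path="Meta-Llama-3-8B.Q4_0.gguf", n_threads=64, verbose=False)

last = time.perf_counter()
gaps = []
for _ in llm("Summarize the benefits of Arm-based servers.",
             max_tokens=64, stream=True):
    now = time.perf_counter()
    gaps.append(now - last)  # time since the previous token arrived
    last = now

gaps = gaps[1:]  # the first gap includes prompt encoding, so drop it
avg = sum(gaps) / len(gaps)
print(f"average next-token latency: {avg * 1000:.0f} ms ({1 / avg:.1f} tokens/s)")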

According to Meta [7], the Llama 3 70B model excels at tasks that demand advanced logical reasoning (see the examples in Image 1), although its next-token generation latency is higher than that of the Llama 3 8B model (Figure 1). For applications where low-latency responses are critical and complex reasoning is not, the Llama 3 8B model is a suitable option. Conversely, if the application tolerates higher latency but requires high-level reasoning or creative capabilities, the Llama 3 70B model is an appropriate choice.

Demonstration of LLM logic reasoning

Image 1: The top row shows an example where both Llama 3 8B and Llama 3 70B produce a good answer to a knowledge-based question, while the bottom row shows an example where only Llama 3 70B produces a correct answer, because the solution requires logical reasoning.

Supporting different language models with Graviton3 and Graviton4 processors

To assess the performance of AWS Graviton processors in running large language models (LLMs), we deployed models ranging from 3.8 billion to 70 billion parameters on the Graviton3 (C7g.16xlarge) and Graviton4 (C8g.16xlarge) instances and measured their inference capabilities.

As detailed in Table 1, both Graviton3- and Graviton4-based instances are capable of supporting models up to 27 billion parameters, including Phi3-Mini 3.8B, Llama 3 8B, and Gemma 2 27B. However, only Graviton4 is able to run the largest model evaluated, Llama 3 70B.

Support of various LLMs on Graviton3 and Graviton4

Table 1: Support of various language models with Graviton3 and Graviton4 processors.
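A back-of-the-envelope size estimate helps explain the capacity picture: a 4-bit GGUF file occupies roughly parameters × bits-per-weight / 8 bytes. The sketch below uses an assumed average of ~4.8 bits per weight (Q4-style formats store block scales alongside the 4-bit weights), so the results are approximations, not measured file sizes.

# Rough GGUF size estimate for the quantized models in Table 1.
# 4.8 bits/weight is an assumed average for Q4-style formats.
def approx_model_gb(n_params: float, bits_per_weight: float = 4.8) -> float:
    return n_params * bits_per_weight / 8 / 1e9

for name, params in [
    ("Phi3-Mini 3.8B", 3.8e9),
    ("Llama 3 8B", 8e9),
    ("Gemma 2 27B", 27e9),
    ("Llama 3 70B", 70e9),
]:
    print(f"{name}: ~{approx_model_gb(params):.0f} GB")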

Performance improvement from Graviton3 to Graviton4 processors

To evaluate performance enhancements, we deployed the Llama 3 model (8 billion parameters) on both the Graviton3 (C7g.16xlarge) and Graviton4 (C8g.16xlarge) instances. Performance was first assessed on prompt encoding, which measures the speed at which user inputs are processed and interpreted by the language model, as illustrated in Figure 2. Graviton4 consistently outperformed Graviton3 in prompt encoding across all tested user batch sizes, achieving a 14-26% improvement (Figure 2, right axis).

Prompt Encoding Comparison on Llama 3 8B

Figure 2: Performance improvement of prompt encoding running the Llama 3 8B model from Graviton3 (C7g.16xlarge) to Graviton4 (C8g.16xlarge).

Token generation, which measures the speed at which the language model produces output text, also demonstrated significant gains when running Llama 3 8B, as depicted in Figure 3. The performance curve revealed a marked improvement for all tested user batch sizes, with Graviton4 showing the most substantial efficiency gains at smaller batch sizes, for a 5-50% improvement overall (Figure 3, right axis).

Token Generation Comparison on Llama 3 8B

Figure 3: Performance improvement of token generation running the Llama 3 8B model from Graviton3 (C7g.16xlarge) to Graviton4 (C8g.16xlarge).
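The two metrics can be teased apart in a single streamed run: the delay before the first token is dominated by prompt encoding, while the spacing of the remaining tokens reflects token generation speed. The sketch below again assumes llama-cpp-python and placeholder inputs rather than the setup behind Figures 2 and 3.

# Separate prompt encoding from token generation in one streamed call.
# llama-cpp-python, the model path, and the prompt are illustrative assumptions.
import time
from llama_cpp import Llama

llm = Llama(model_path="Meta-Llama-3-8B.Q4_0.gguf", n_threads=64, verbose=False)

t0 = time.perf_counter()
stamps = []
for _ in llm("List three considerations when sizing a CPU inference fleet.",
             max_tokens=64, stream=True):
    stamps.append(time.perf_counter())

ttft = stamps[0] - t0  # time to first token, dominated by prompt encoding
gen_rate = (len(stamps) - 1) / (stamps[-1] - stamps[0])  # steady-state speed
print(f"time to first token: {ttft:.2f} s, generation: {gen_rate:.1f} tokens/s")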

Conclusion

Running Llama 3 70B on an AWS Graviton4 C8g.16xlarge instance exceeds human reading speed, generating up to 10 tokens per second. This enhanced performance enables Graviton4 to handle a broader range of generative AI tasks than Graviton3, including those requiring advanced reasoning. When running the Llama 3 8B model, Graviton4 demonstrates a 14-26% improvement in prompt encoding and a 5-50% boost in token generation over Graviton3. To explore this performance further, you can try the demo available in the learning path: https://learn.arm.com/learning-paths/servers-and-cloud-computing/llama-cpu/_demo.

References:

  1. https://aws.amazon.com/ec2/graviton/ 
  2. https://huggingface.co/NousResearch/Meta-Llama-3-70B-GGUF/resolve/main/Meta-Llama-3-70B-Q4_K_M.gguf
  3. https://huggingface.co/bartowski/gemma-2-27b-it-GGUF/resolve/main/gemma-2-27b-it-Q4_K_S.gguf 
  4. https://huggingface.co/QuantFactory/Meta-Llama-3-8B-GGUF/resolve/main/Meta-Llama-3-8B.Q4_0.gguf 
  5. https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf/resolve/main/Phi-3-mini-4k-instruct-q4.gguf
  6. https://github.com/ggerganov/llama.cpp (Release: b3440)
  7. https://ai.meta.com/blog/meta-llama-3/