Distributed Generative AI Inference on Arm

Waheed Brown
August 18, 2025
4 minute read time.

Generative AI is becoming more efficient, and large language models (LLMs) are shrinking in size. This creates new opportunities to run LLMs on power-efficient hardware. For example, cloud services can now run AI inference on Arm-based CPUs.

What is distributed AI inference?

AI inference happens when an AI application processes a user’s prompt. With LLMs, users commonly enter text into a chatbot's prompt window, press "send", and then this request is processed on an AI server. AI inference does not need to run on a single machine or virtual machine (VM). To scale inference beyond a single machine's memory (RAM or VRAM), LLM weights and computations are often distributed across many machines.

How are LLM Weights and Computations Distributed?

LLM weights and computations are often distributed using a client-server model. One machine is appointed as the main node (client). The remaining machines function as worker nodes (servers). Each worker loads a shard of the model and participates in parallel computation.

Using an AI framework like llama.cpp, LLM weights and computations can be distributed across machines using RPC:

# A llama.cpp Worker node listens for inference requests from the Main node.
rpc-server -p 50052 -H 0.0.0.0 -t 64
# The llama.cpp Main node initiates distributed LLM inference
llama-cli -m model.gguf -p "Tell me a joke" -n 128 --rpc "$worker_ips" -ngl 99

Explanation of rpc-server parameters:

  • -p 50052 The listening TCP port on the Worker.
  • -H 0.0.0.0 The address the Worker's RPC server binds to, not the Main node's IP; 0.0.0.0 listens on all network interfaces, so connections from any IP address are accepted (see the sketch after this list).
  • -t 64 CPU thread count.
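
Because -H only sets where the Worker listens, on a trusted private network you can bind the RPC server to a single interface instead of all of them. A minimal sketch, where 10.0.0.5 is a hypothetical private address assigned to the Worker:

# Hypothetical: listen only on the Worker's private interface (assumed to be
# 10.0.0.5) instead of on all interfaces (0.0.0.0).
rpc-server -p 50052 -H 10.0.0.5 -t 64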

Explanation of llama-cli parameters:

  • -m model.gguf Specifies the quantized LLaMA model file (GGUF format) to load.
  • -p "Tell me a joke" The prompt passed to the model for generation.
  • -n 128 Maximum number of tokens to generate in the output.
  • --rpc "$worker_ips" A comma-separated list of Worker addresses (host:port pairs); an example follows this list.
  • -ngl 99 The number of model layers to offload to a GPU.
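
For illustration, $worker_ips is an ordinary shell variable holding the Workers' addresses. A minimal sketch, assuming two hypothetical Workers at 10.0.0.5 and 10.0.0.6, each running rpc-server on port 50052:

# Hypothetical Worker addresses; replace with your own Workers' IPs and ports.
worker_ips="10.0.0.5:50052,10.0.0.6:50052"
# The Main node shards the model across the listed Workers and generates text.
llama-cli -m model.gguf -p "Tell me a joke" -n 128 --rpc "$worker_ips" -ngl 99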

On CPU-only Arm cloud machines, distributed inference runs entirely on CPUs. This makes a worker node's thread count, -t 64, the key parameter. Set the number of threads to match the number of CPU cores on each worker node. Note that on CPU-only Arm machines the GPU delegation parameter, -ngl 99, is ignored.
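
One way to follow this rule without hard-coding a thread count is to query the core count when the Worker starts. A minimal sketch, assuming a Linux Worker where the standard nproc utility is available:

# Start the Worker's RPC server with one thread per available CPU core.
rpc-server -p 50052 -H 0.0.0.0 -t "$(nproc)"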

What Arm Cloud Machines are Available for Distributed Inference?

All three major cloud providers have Arm machines that are suitable for distributed inference:

Amazon Web Services
  • AWS Graviton 2 CPU VMs: C6g, C6gn; G5g (NVIDIA T4G Tensor Core GPUs); I4g, Im4gn, Is4gen; M6g; R6g; T4g; X2gd
  • AWS Graviton 3 CPU VMs: C7g, C7gn; HPC7g; M7g; R7g
  • AWS Graviton 4 CPU VMs: C8g, C8gn; I8g; M8g; R8g; X8g
  • NVIDIA Grace Arm CPU VMs: P6e (NVIDIA GB200 GPUs)

Google Cloud
  • Google Axion CPU VMs: C4A; Tau T2A
  • NVIDIA Grace Arm CPU VMs: A4x (NVIDIA GB200 GPUs)

Microsoft Azure
  • Azure Cobalt 100 CPU VMs: Dplsv6, Dpldsv6; Dpsv6, Dpdsv6; Epsv6, Epdsv6
  • NVIDIA Grace Arm CPU VMs: ND GB200-v6
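
Once Arm VMs like those listed above are provisioned, the workflow is the same as the earlier commands: start rpc-server on every Worker, then run llama-cli from the Main node. Below is a minimal sketch, assuming two hypothetical Workers at the private addresses 10.0.0.5 and 10.0.0.6, SSH access from the Main node, and llama.cpp built on each machine; adjust addresses, ports, and paths for your environment.

# Hypothetical Worker addresses; replace with your own VMs' private IPs.
workers="10.0.0.5 10.0.0.6"
# Start an RPC server on each Worker in the background, one thread per core.
for host in $workers; do
  ssh "$host" 'nohup rpc-server -p 50052 -H 0.0.0.0 -t "$(nproc)" > rpc.log 2>&1 &'
done
# From the Main node, run distributed inference across both Workers.
llama-cli -m model.gguf -p "Tell me a joke" -n 128 \
  --rpc "10.0.0.5:50052,10.0.0.6:50052" -ngl 99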

What is Next?

To learn more about inference using llama.cpp on Arm, visit our Arm Learning Path.

Arm Learning Path

  • https://learn.arm.com/learning-paths/servers-and-cloud-computing/distributed-inference-with-llama-cpp/