Distributed Generative AI Inference on Arm

Waheed Brown
August 18, 2025
4 minute read time.

Generative AI is becoming more efficient, and large language models (LLMs) are shrinking in size. This creates new opportunities to run LLMs on more efficient hardware. For example, cloud services can now run AI inference on Arm-based CPUs.

What is distributed AI inference?

AI inference happens when an AI application processes a user’s prompt. With LLMs, users commonly enter text into a chatbot's prompt window, press "send", and then this request is processed on an AI server. AI inference does not need to run on a single machine or virtual machine (VM). To scale inference beyond a single machine's memory (RAM or VRAM), LLM weights and computations are often distributed across many machines.
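As a rough illustration of why a single machine may not be enough, here is a back-of-the-envelope sizing sketch. The model size, quantization width, and node count below are assumptions chosen for illustration, not figures from this post:

# Approximate weight memory for a quantized LLM, split across worker nodes.
PARAMS=70000000000     # assumed 70B-parameter model
BITS_PER_WEIGHT=5      # assumed ~Q4/Q5 quantization, rounded to a whole bit count
NODES=4                # assumed number of worker nodes sharing the model
WEIGHT_BYTES=$(( PARAMS * BITS_PER_WEIGHT / 8 ))
echo "Total weights:  $(( WEIGHT_BYTES / 1024 / 1024 / 1024 )) GiB"
echo "Per-node share: $(( WEIGHT_BYTES / NODES / 1024 / 1024 / 1024 )) GiB"

With these assumptions, the weights alone come to roughly 40 GiB in total, or about 10 GiB per node, before accounting for the KV cache and runtime overhead.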

How are LLM Weights and Computations Distributed?

LLM weights and computations are often distributed using a client-server model. One machine is appointed as the main node (client). The remaining machines function as worker nodes (servers). Each worker loads a shard of the model and participates in parallel computation.

Using an AI framework like llama.cpp, LLM weights and computations can be distributed across machines using RPC:

# A llama.cpp Worker node listens for inference requests from the Main node.
rpc-server -p 50052 -H 0.0.0.0 -t 64
# The llama.cpp Main node initiates distributed LLM inference.
llama-cli -m model.gguf -p "Tell me a joke" -n 128 --rpc "$worker_ips" -ngl 99

Explanation of rpc-server parameters:

  • -p 50052 The listening TCP port on the Worker.
  • -H 0.0.0.0 The address the Worker binds to and listens on; 0.0.0.0 means all network interfaces, so the Main node can connect from any address.
  • -t 64 CPU thread count.

Explanation of llama-cli parameters:

  • -m model.gguf Specifies the quantized LLaMA model file (GGUF format) to load.
  • -p "Tell me a joke" The prompt passed to the model for generation.
  • -n 128 Maximum number of tokens to generate in the output.
  • --rpc "$worker_ips" A comma-separated list of Worker addresses (IP:port) that the Main node distributes inference across; see the sketch after this list for one way to set $worker_ips.
  • -ngl 99 The number of the LLM's neural network layers to offload to a GPU.
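A minimal sketch of how $worker_ips might be populated on the Main node, assuming two Worker nodes already running rpc-server at the placeholder addresses 10.0.0.11 and 10.0.0.12 (substitute your own Worker IPs):

# Comma-separated list of Worker addresses in IP:port form (placeholders).
worker_ips="10.0.0.11:50052,10.0.0.12:50052"
# The Main node fans inference out to both Workers.
llama-cli -m model.gguf -p "Tell me a joke" -n 128 --rpc "$worker_ips" -ngl 99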

On CPU-only Arm cloud machines, distributed inference runs entirely on CPUs. This makes a worker node's thread count, -t 64, the key parameter. Set the number of threads to match the number of CPU cores on each worker node. Note that on CPU-only Arm machines the GPU delegation parameter, -ngl 99, is ignored.
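One way to match the thread count to each Worker's cores, assuming a Linux system where nproc is available, is to let the Worker detect it at launch:

# Start the Worker with one thread per available CPU core.
rpc-server -p 50052 -H 0.0.0.0 -t "$(nproc)"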

What Arm Cloud Machines are Available for Distributed Inference?

All three major cloud providers have Arm machines that are suitable for distributed inference:

Cloud Provider: Amazon Web Services
Arm Cloud Machine Types: AWS Graviton CPU VMs; NVIDIA Grace Arm CPU VMs
Example Instances:
AWS Graviton 2 VMs:
  • C6g, C6gn
  • G5g (NVIDIA T4G Tensor Core GPUs)
  • I4g, Im4gn, Is4gen
  • M6g
  • R6g
  • T4g
  • X2gd
AWS Graviton 3 VMs:
  • C7g, C7gn
  • HPC7g
  • M7g
  • R7g
AWS Graviton 4 VMs:
  • C8g, C8gn
  • I8g
  • M8g
  • R8g
  • X8g
NVIDIA Grace Arm CPU VMs:
  • P6e (NVIDIA GB200 GPUs)

Cloud Provider: Google Cloud
Arm Cloud Machine Types: Google Axion CPU VMs; NVIDIA Grace Arm CPU VMs
Example Instances:
Google Axion CPU VMs:
  • C4A
  • Tau T2A
NVIDIA Grace Arm CPU VMs:
  • A4x (NVIDIA GB200 GPUs)

Cloud Provider: Microsoft Azure
Arm Cloud Machine Types: Azure Cobalt 100 CPU VMs; NVIDIA Grace Arm CPU VMs
Example Instances:
Azure Cobalt 100 CPU VMs:
  • Dplsv6, Dpldsv6
  • Dpsv6, Dpdsv6
  • Epsv6, Epdsv6
NVIDIA Grace Arm CPU VMs:
  • ND GB200-v6

What is Next?

To learn more about inference using llama.cpp on Arm, visit our Arm Learning Path.

Arm Learning Path

  • https://learn.arm.com/learning-paths/servers-and-cloud-computing/distributed-inference-with-llama-cpp/