Generative AI is becoming more efficient, and large language models (LLMs) are shrinking in size. This creates new opportunities to run LLMs on efficient hardware: for example, cloud services can now run AI inference on Arm-based CPUs.
AI inference happens when an AI application processes a user's prompt. With LLMs, users commonly type text into a chatbot's prompt window, press "send", and the request is then processed on an AI server. Inference does not need to run on a single machine or virtual machine (VM): to scale beyond a single machine's memory (RAM or VRAM), LLM weights and computations are often distributed across many machines.
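For a rough sense of scale, a 70-billion-parameter model quantized to 4 bits needs about 70 × 10⁹ × 0.5 bytes ≈ 35 GB for its weights alone, which can exceed the memory available to a single commodity VM.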
This distribution typically follows a client-server model. One machine is designated as the main node (client), and the remaining machines act as worker nodes (servers). Each worker loads a shard of the model and participates in the parallel computation.
With an AI framework such as llama.cpp, LLM weights and computations can be distributed across machines over RPC:
# A llama.cpp Worker node listens for inference requests from the Main node
rpc-server -p 50052 -H 0.0.0.0 -t 64
# The llama.cpp Main node initiates distributed LLM inference
llama-cli -m model.gguf -p "Tell me a joke" -n 128 --rpc "$worker_ips" -ngl 99
Explanation of rpc-server parameters:
rpc-server: the llama.cpp RPC server binary that runs on each worker node and exposes its compute over the network.
-p 50052: the TCP port the worker listens on for requests from the Main node.
-H 0.0.0.0: the host address to bind to; 0.0.0.0 accepts connections on all network interfaces.
-t 64: the number of CPU threads the worker uses for computation.
Explanation of llama-cli parameters:
llama-cli: the llama.cpp command-line client that runs on the Main node and drives the inference.
-m model.gguf: the path to the model weights in GGUF format.
-p "Tell me a joke": the text prompt to process.
-n 128: the maximum number of tokens to generate.
--rpc "$worker_ips": a comma-separated list of worker addresses in host:port form (see the sketch after this list).
-ngl 99: the number of model layers to offload to a GPU backend.
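As a minimal sketch, the worker list can be assembled in the shell before launching llama-cli. The IP addresses below are placeholders; substitute the addresses and port of your own rpc-server instances.
# Hypothetical worker addresses pointing at two running rpc-server instances
worker_ips="192.0.2.10:50052,192.0.2.11:50052"
# Start distributed inference from the Main node using those workers
llama-cli -m model.gguf -p "Tell me a joke" -n 128 --rpc "$worker_ips" -ngl 99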
On CPU-only Arm cloud machines, distributed inference runs entirely on CPUs. This makes a worker node's thread count, -t 64, the key parameter. Set the number of threads to match the number of CPU cores on each worker node. Note that on CPU-only Arm machines the GPU delegation parameter, -ngl 99, is ignored.
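One way to match the thread count to the core count, sketched below, is to query it with nproc when starting each worker; this assumes a Linux image where nproc (from GNU coreutils) is available.
# Start the worker with one thread per available CPU core
rpc-server -p 50052 -H 0.0.0.0 -t "$(nproc)"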
All three major cloud providers offer Arm-based machines suitable for distributed inference, such as AWS Graviton, Google Axion, and Azure Cobalt instances.
To learn more about inference using llama.cpp on Arm, visit our Arm Learning Path.