Generative AI is becoming more efficient, and large language models (LLMs) are shrinking in size. This creates new opportunities to run LLMs on efficient hardware: for example, cloud services can now run AI inference on Arm-based CPUs.
AI inference happens when an AI application processes a user's prompt. With LLMs, users commonly type text into a chatbot's prompt window, press "send", and the request is then processed on an AI server. Inference does not need to run on a single machine or virtual machine (VM): to scale beyond a single machine's memory (RAM or VRAM), LLM weights and computations are often distributed across many machines.
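For a rough sense of scale, a 70-billion-parameter model quantized to 4 bits needs about 70 × 10⁹ × 0.5 bytes ≈ 35 GB for its weights alone, which can exceed the memory available to a single commodity VM.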
This distribution typically follows a client-server model. One machine is designated as the main node (client), and the remaining machines act as worker nodes (servers). Each worker loads a shard of the model and participates in the parallel computation.
With an AI framework such as llama.cpp, LLM weights and computations can be distributed across machines over RPC:
# A llama.cpp Worker node listens for inference requests from the Main node
rpc-server -p 50052 -H 0.0.0.0 -t 64
# The llama.cpp Main node initiates distributed LLM inference
llama-cli -m model.gguf -p "Tell me a joke" -n 128 --rpc "$worker_ips" -ngl 99
Explanation of rpc-server parameters:
rpc-server: the llama.cpp RPC server binary that runs on each worker node and exposes its compute over the network.
-p 50052: the TCP port the worker listens on for requests from the Main node.
-H 0.0.0.0: the host address to bind to; 0.0.0.0 accepts connections on all network interfaces.
-t 64: the number of CPU threads the worker uses for computation.
Explanation of llama-cli parameters:
llama-cli: the llama.cpp command-line client that runs on the Main node and drives the inference.
-m model.gguf: the path to the model weights in GGUF format.
-p "Tell me a joke": the text prompt to process.
-n 128: the maximum number of tokens to generate.
--rpc "$worker_ips": a comma-separated list of worker addresses in host:port form (see the sketch after this list).
-ngl 99: the number of model layers to offload to a GPU backend.
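As a minimal sketch, the worker list can be assembled in the shell before launching llama-cli. The IP addresses below are placeholders; substitute the addresses and port of your own rpc-server instances.
# Hypothetical worker addresses pointing at two running rpc-server instances
worker_ips="192.0.2.10:50052,192.0.2.11:50052"
# Start distributed inference from the Main node using those workers
llama-cli -m model.gguf -p "Tell me a joke" -n 128 --rpc "$worker_ips" -ngl 99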
On CPU-only Arm cloud machines, distributed inference runs entirely on CPUs. This makes a worker node's thread count, -t 64, the key parameter. Set the number of threads to match the number of CPU cores on each worker node. Note that on CPU-only Arm machines the GPU delegation parameter, -ngl 99, is ignored.
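One way to match the thread count to the core count, sketched below, is to query it with nproc when starting each worker; this assumes a Linux image where nproc (from GNU coreutils) is available.
# Start the worker with one thread per available CPU core
rpc-server -p 50052 -H 0.0.0.0 -t "$(nproc)"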
All three major cloud providers offer Arm-based machines suitable for distributed inference, such as AWS Graviton, Google Axion, and Azure Cobalt instances.
To learn more about inference using llama.cpp on Arm, visit our Arm Learning Path.