AI is being adopted rapidly at the edge, which makes it increasingly important to deploy machine learning models on Arm edge devices. Arm-based processors are common in embedded systems because of their low power consumption and efficiency. This tutorial shows you how to deploy PyTorch models on Arm edge devices such as the Raspberry Pi or the NVIDIA Jetson Nano.
Before you begin, make sure you have the following:

- A development machine with Python and PyTorch installed
- An Arm edge device (such as a Raspberry Pi or NVIDIA Jetson Nano) running Linux with Python available
- Network access to the device, for example over SSH, so you can copy files to it
On your development machine, load a pre-trained model and put it in evaluation mode:

```python
import torch
import torchvision.models as models

# Load a pre-trained ResNet-18
# (on torchvision >= 0.13, prefer weights=models.ResNet18_Weights.DEFAULT)
model = models.resnet18(pretrained=True)
model.eval()
```
Convert the model to TorchScript so it can run on the device without the original Python class definition:

```python
# Script the model and save it as a self-contained TorchScript file
scripted_model = torch.jit.script(model)
torch.jit.save(scripted_model, "resnet18_scripted.pt")
```
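Scripting handles models that contain Python control flow. For a purely feed-forward model like ResNet-18, tracing with a representative input is a common alternative; a minimal sketch, where the input shape matches the preprocessing used later in this tutorial:

```python
# Trace the model with a dummy input of the expected shape (1, 3, 224, 224)
example_input = torch.rand(1, 3, 224, 224)
traced_model = torch.jit.trace(model, example_input)
torch.jit.save(traced_model, "resnet18_traced.pt")
```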
On the Arm device, install PyTorch and torchvision:

```bash
pip install torch torchvision
```
Then verify the installation:

```python
import torch

print(torch.__version__)
print(torch.cuda.is_available())  # True on CUDA-capable devices such as the Jetson Nano
```
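To confirm that the Python interpreter you are running is an Arm build, you can also check the machine architecture; this small sketch uses only the standard library:

```python
import platform

# Expect "aarch64" on 64-bit Arm Linux (e.g., 64-bit Raspberry Pi OS or Jetson)
print(platform.machine())
```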
Copy the scripted model from your development machine to the device:

```bash
scp resnet18_scripted.pt user@device_ip:/path/to/destination
```
On the device, load the TorchScript model and run inference on a test image:

```python
import torch
from PIL import Image
from torchvision import transforms

# Load the TorchScript model
model = torch.jit.load("resnet18_scripted.pt")
model.eval()

# Preprocess an input image with the standard ImageNet transforms
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
img = Image.open("test_image.jpg")
img_tensor = preprocess(img).unsqueeze(0)  # Add batch dimension

# Perform inference
with torch.no_grad():
    output = model(img_tensor)
print("Predicted class:", output.argmax(1).item())
```
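The script prints a raw ImageNet class index. To get a human-readable label, you can map the index against a local copy of the ImageNet class list; the file name `imagenet_classes.txt` below is an assumption, so substitute whatever label file you use:

```python
# Assumes imagenet_classes.txt (hypothetical local file) contains the
# 1000 ImageNet labels, one per line, in index order
with open("imagenet_classes.txt") as f:
    labels = [line.strip() for line in f]

class_idx = output.argmax(1).item()
print("Predicted label:", labels[class_idx])
```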
To reduce model size and speed up CPU inference, apply dynamic quantization. Note that `quantize_dynamic` returns a regular `nn.Module`, so it must be scripted before it can be saved with `torch.jit.save`. For ResNet-18 this quantizes only the final `Linear` layer; the convolutions stay in float32:

```python
from torch.quantization import quantize_dynamic

# Quantize Linear layers to int8; conv layers are untouched
quantized_model = quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# Script the quantized model so it can be saved as TorchScript
torch.jit.save(torch.jit.script(quantized_model), "resnet18_quantized.pt")
```
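A quick way to see the effect of quantization is to compare file sizes on disk; a minimal sketch, assuming both files were saved as shown above:

```python
import os

# Compare on-disk size of the original and quantized TorchScript files
for path in ("resnet18_scripted.pt", "resnet18_quantized.pt"):
    print(path, round(os.path.getsize(path) / 1e6, 1), "MB")
```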
Benchmark inference latency on the device. A single warm-up run before timing gives more stable numbers:

```python
import time

# Warm up once so one-time initialization doesn't skew the measurement
with torch.no_grad():
    model(img_tensor)

start_time = time.time()
with torch.no_grad():
    for _ in range(100):
        output = model(img_tensor)
end_time = time.time()
print("Average inference time:", (end_time - start_time) / 100, "seconds")
```
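On multi-core Arm CPUs, the number of intra-op threads PyTorch uses can noticeably affect latency, and matching it to the physical core count is a reasonable starting point. A minimal sketch; the core count of 4 is an assumption, so adjust it for your device:

```python
# Pin PyTorch's intra-op thread pool to the physical core count
# (4 is an assumption; a Raspberry Pi 4, for example, has 4 cores)
torch.set_num_threads(4)
print("Using", torch.get_num_threads(), "threads")
```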
To make the deployment reproducible across devices, you can package the model and inference script in a container. Example Dockerfile:
```dockerfile
FROM python:3.8-slim

RUN pip install torch torchvision pillow

COPY resnet18_scripted.pt /app/
COPY app.py /app/
WORKDIR /app

CMD ["python", "app.py"]
```
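The Dockerfile copies an `app.py` that is not shown above. A minimal sketch of what it might contain, essentially the inference script from earlier with the model file baked into the image:

```python
# app.py -- hypothetical entry point; reuses the inference code from above
import torch
from PIL import Image
from torchvision import transforms

model = torch.jit.load("resnet18_scripted.pt")
model.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

img_tensor = preprocess(Image.open("test_image.jpg")).unsqueeze(0)
with torch.no_grad():
    print("Predicted class:", model(img_tensor).argmax(1).item())
```

Build and run on the device. The image name `pytorch-arm-app` is an assumption, and since the Dockerfile does not copy a test image, one is mounted at runtime:

```bash
docker build -t pytorch-arm-app .
docker run --rm -v "$(pwd)/test_image.jpg:/app/test_image.jpg" pytorch-arm-app
```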
Deploying PyTorch models on Arm edge devices comes down to three things: optimizing the model (for example with TorchScript and quantization), preparing the software environment on the device, and validating performance on the target hardware. With these steps in place, you can deploy AI applications at the edge and run fast, efficient inference close to where the data is generated.