Machine Learning (ML) is undoubtedly moving to the edge. Many embedded devices need edge computing capability to provide more advanced services, such as voice recognition for smart speakers and face detection on surveillance cameras. Constructing a solid performance exploration methodology at an early stage is therefore essential when enabling a machine learning application.
In this blog I use a sample Convolutional Neural Network (CNN) as an example to describe how Arm development solutions can enable machine learning on a device. I compare the different NN implementations across four factors: 1) execution time, 2) NN code size, 3) CPU loading, and 4) memory access usage.
The table above is a blank template showing how I will use these four factors to analyze each neural network inference implementation. Execution time and code size are clearly important for application development on resource-constrained devices, and I will show how Arm's development tools help optimize both. Memory access usage is another crucial factor for AI performance at the edge: there is typically heavy data flow between input and output, plus a huge amount of fetching of trained model parameters (e.g. convolution kernel filters, fully-connected weights) at each layer of the calculation. I will provide an example of how a profiling tool can visualize this.
In this article, I use 28x28-pixel grayscale images of handwritten digits (the MNIST dataset) as an example to show how Arm's development tools can help the user develop and analyze performance. I chose MNIST because it is simple, mature, and easy to understand. I highly recommend reading this page before proceeding; I will also make multiple references to it below.
Example convolutional neural network
The example below shows how the 28x28-pixel input is expanded into 32(*) 28x28 feature maps and then down-sampled to 32 14x14 maps in layer 1; a similar procedure is repeated in layer 2. Two fully-connected layers then follow, producing 10 final values that represent the probability of each digit 0, 1, 2, …, 9.
Following this blog, it is easy to use Python to construct a Convolutional Neural Network (CNN) in Keras. Just two Conv2D layers and two max-pooling layers are enough to illustrate the concept.
My Jupyter Notebook model
After a couple of training runs on my laptop (a happy four-hour coffee break), I achieved a model with 98% accuracy, which is more than good enough to provide the inference parameters for the Arm embedded device that follows.
Validation loss rate
To build a platform for exploring machine learning applications on Arm's latest CPU architecture technology, I use the Armv8.2 LITTLE core Cortex-A55 as my target CPU; you could also choose a Cortex-A75 or any other core with the Armv8.2 extensions instead. I use virtual prototyping to make software development and system architecture exploration easier before silicon is available. Arm Fast Models are my preferred answer here; read this blog for the details.
Since the neural network is already well tuned for this application, continuing to use Python on the embedded device to construct it is not a good idea. It is more natural to implement those layers in C on the device to avoid additional software overhead. An efficient development environment is essential for implementing a neural network from scratch, and a visual profiling tool is another requirement, to give me the whole picture of the application workload. Arm DS-5 offers a complete toolchain plus Streamline, which is exactly what I need.
First of all, I add a debug connection to hook onto the target Cortex-A55 Fixed Virtual Platform (FVP). I use a bare-metal configuration to keep everything simple. You can treat the virtual platform as real silicon; all of the following work runs on it.
DS-5 Debug Configuration
Another benefit of DS-5 is scripting. In my case, an initialization Python script loads the test image into 0x80150000 and the model parameters into 0x80100000 in device memory at power-on reset, which saves effort in my firmware implementation.
DS5 Debug Initialization
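On the firmware side, those preloaded regions can simply be referenced as fixed addresses. A minimal sketch follows; only the two addresses come from the script above, while the names and pointer types are my own assumptions:

#include <stdint.h>

/* Addresses match the DS-5 power-on-reset script above; names and types are illustrative only. */
#define TEST_IMAGE_BASE   ((const uint8_t *)0x80150000)   /* 28x28 grayscale test image */
#define MODEL_PARAM_BASE  ((const float   *)0x80100000)   /* trained weights and biases */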
I chose the DS-5 PMU example project as my working base; you can check out the final GitHub project here.
Before writing the neural network functions, I found that my Japanese colleague Ryuji Tanaka had already implemented a couple of nice APIs on his GitHub, so I reference his code in this project. Thanks, Ryuji-san.
Three major APIs are:
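Judging from the call sites in the pseudo code below, their shapes are roughly as follows. This is only a sketch; the exact signatures live in the referenced GitHub project, and layer_t is my placeholder name for the layer descriptor type:

/* Rough prototypes inferred from the call sites below; see the GitHub project for the real ones. */
void convolution(layer_t *lay, const float *input, float *output, const float *parameters);
void max_pooling(layer_t *lay, const float *input, float *output);
void fully_connected(layer_t *lay, const float *input, float *output, const float *parameters);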
Using those APIs to construct the same well-tuned network, the pseudo code looks like this:
mnist_cnn_eval()
{
    // Pre-process
    convolution(&lay, layer0, layer1, layer0_parameter);
    max_pooling(&lay, layer1, layer2);
    convolution(&lay, layer2, layer3, layer2_parameter);
    max_pooling(&lay, layer3, layer4);
    fully_connected(&lay, layer4, layer5, layer5_parameter);
    fully_connected(&lay, layer5, layer6, layer6_parameter);
    // Post-process
}
PMU function calls are used to monitor the application during execution; the instr count is the number of executed instructions.
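As a rough sketch of what that monitoring looks like in C, the counter can be read directly from the Armv8-A PMU system registers. The helper below assumes event counter 0 has already been programmed for event 0x08 (INST_RETIRED) and enabled, which the DS-5 PMU example code takes care of; the function names here are mine, not the example project's API:

#include <stdint.h>
#include <stdio.h>

void mnist_cnn_eval(void);   /* the inference routine from the pseudo code above */

/* Read Armv8-A PMU event counter 0 (assumed to be programmed for INST_RETIRED, event 0x08). */
static inline uint64_t read_instr_retired(void)
{
    uint64_t value;
    __asm__ volatile("mrs %0, pmevcntr0_el0" : "=r"(value));
    return value;
}

void run_one_inference(void)
{
    uint64_t start = read_instr_retired();
    mnist_cnn_eval();                                   /* run one inference */
    uint64_t count = read_instr_retired() - start;
    printf("Instr count is %llu\n", (unsigned long long)count);
}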
Let's load the first test image, image_8.jpg, and try it.
Looks good: the image is recognized correctly.
DS-5 debug console
---------------------------------------
Inf selected image [8] from CPU: 0
prob [ -16, -30, -6, -7, -18, -8, -21, -24, 17, -20,], avg/std: -13/10
Conv_mode: 1
selected image [8] from CPU: 0, inference: 8, [Pass]
Instr count is 13146336
Cnt 0 is 0
Cnt 1 is 109812
Cnt 2 is 3416674
Cnt 3 is 13146336
Cnt 4 is 13146336
DS-5 Console
Then I use DS-5 scripting to run the MNIST validation dataset and confirm that my C implementation produces the same results as the original Python API.
Pretty much done, right?
Unfortunately not: the PMU instruction counter shows that the application takes 13 million instructions to process a single image, which means the device can only handle about 7.6 images per second(**) for this sample network (100 million instructions per second ÷ 13.1 million instructions per image ≈ 7.6).
To explore the detailed CPU behavior of the application, I need to apply Streamline to my DS-5 project.
Following this blog to configure my project on the virtual platform, Streamline can now provide detailed information for observing the CPU's behavior.
Under the 100 MIPS CPU model assumption, Streamline shows a total execution time of around 133 ms. The user can switch between time units to zoom in and out and examine different time periods. This powerful tool can also cross-reference the source code; click here for more detail.
Streamline Timeline
One interesting finding is that a nested-loop function consumes 96.7% of the CPU load during the simulation. Experience tells me I should first try tuning the compiler optimization options.
Streamline Functions Usage
Bingo!
Simply adjusting the Arm Compiler 6 optimization level cuts CPU time by 45%, removing roughly 6 million instructions.
DS-5 debug console
---------------------------------------
Inf selected image [8] from CPU: 0
prob [ -16, -30, -6, -7, -18, -8, -21, -24, 17, -20,], avg/std: -13/10
Conv_mode: 1
selected image [8] from CPU: 0, inference: 8, [Pass]
Instr count is 7166001
Cnt 0 is 0
Cnt 1 is 110598
Cnt 2 is 1724950
Cnt 3 is 7166001
Cnt 4 is 7166001
Second Streamline Timeline
Looking at the function utilization again, convolution() still occupies 87% of the CPU resources, so to reduce execution time further I must focus on improving the convolution() C implementation.
Second Streamline Functions Usage
Streamline not only provides CPU instruction information, but also collects the average CPU load/store count per sample, giving the user a rough number for any selected period. Using this, I can roughly calculate the total loads/stores for the whole NN operation, as sketched below.
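The estimate itself is trivial: multiply the per-sample average by the number of samples that cover the inference region. A minimal sketch, with inputs to be read off the Streamline capture (no measured values are hard-coded here):

/* Rough estimate: total accesses ≈ average count per sample × samples in the selected region. */
static double estimate_total_accesses(double avg_per_sample, double samples_in_region)
{
    return avg_per_sample * samples_in_region;
}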
Those numbers are much bigger than I expected, so I should review my C code to reduce them.
By tracing the assembly code in DS-5, I found a better structure: splitting the original convolution() into convolution_conv2(), which walks the output positions, and a new API, convolution_filter2(), which computes a single output position.
// pseudo code
void convolution_filter2()
{
    // Compute one output position (stride_row, stride_col) across all output channels
    // Load parameters
    for (current_filter_row = 0; current_filter_row < filter_rows; current_filter_row++) {
        for (current_filter_col = 0; current_filter_col < filter_cols; current_filter_col++) {
            for (in_ch = 0; in_ch < input_channel; in_ch++) {
                current_input = ((float*)inputs)[
                    ((stride_row + current_filter_row) * input_columns * input_channel) +
                    ((stride_col + current_filter_col) * input_channel) +
                    in_ch];
                for (out_ch = 0; out_ch < output_channel; out_ch++) {
                    current_weight = ((float*)weights)[
                        (current_filter_row * filter_cols * input_channel * output_channel) +
                        (current_filter_col * input_channel * output_channel) +
                        (in_ch * output_channel) +
                        out_ch];
                    current_result = current_input * current_weight;
                    ((float*)outputs)[
                        (stride_row * output_columns * output_channel) +
                        (stride_col * output_channel) +
                        out_ch] += current_result;
                }
            }
        }
    }
    // Add the bias and apply the activation for each output channel
    for (out_ch = 0; out_ch < output_channel; out_ch++) {
        current_bias = ((float*)biases)[out_ch];
        kernel_output_addr = (stride_row * output_columns * output_channel) +
                             (stride_col * output_channel) +
                             out_ch;
        kernel_result = ((float*)outputs)[kernel_output_addr];
        kernel_result += current_bias;
        if (relu_activation) {
            kernel_result = relu(kernel_result);
        }
        ((float*)outputs)[kernel_output_addr] = kernel_result;
    }
}

int convolution_conv2()
{
    // Initial
    // Preload parameters
    // Pre-processing
    for (stride_row = 0; stride_row < lay->output_rows; stride_row++) {
        for (stride_col = 0; stride_col < lay->output_columns; stride_col++) {
            convolution_filter2(stride_row, stride_col, ...);
        }
    }
}
Let's do the simulation again.
Nice: the new code saves another 3 million instructions, making it 3.3 times faster than the original.
DS-5 debug console
---------------------------------------
Inf selected image [8] from CPU: 0
prob [ -16, -30, -6, -7, -18, -8, -21, -24, 17, -20,], avg/std: -13/10
Conv_mode: 2
selected image [8] from CPU: 0, inference: 8, [Pass]
Instr count is 3952637
Cnt 0 is 0
Cnt 1 is 110440
Cnt 2 is 629588
Cnt 3 is 3952637
Cnt 4 is 3952637
Third Streamline Timeline
According to the current Streamline report, it looks like there is little room left to improve the new API at the C level. (***)
Calculating the load/store sample counts again:
Streamline CPU Load/Store average count
Now we can build a table comparing the three versions of the NN device implementation on total executed instructions, the CPU loading of the critical function call, and the CPU memory access sampling count.
After several rounds of coding, I realized that this project had almost reached the limits of software performance. Another approach is to design a parallel matrix multiplier accelerator to replace the convolution_filter2() API.
I have an NN accelerator concept that offloads some of the work from the CPU: it can directly access the input data and model parameters and write into the output buffer. The CPU only has to program the input_row and input_col registers to kick off the parallel 5x5 matrix multiplier. That is a straightforward idea, but the problem is that I have no RTL design to confirm it against. Can the same virtual platform environment help here?
Accelerator Model Concept
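Before the driver pseudo code below, here is how the CPU side of that concept might be declared in C. The register names match the pseudo code; the base address and offsets are purely my assumptions for this concept model:

#include <stdint.h>

/* Hypothetical memory-mapped register block for the 5x5 matrix multiplier accelerator. */
#define CONV3_ENG_BASE   0x90000000UL                                    /* assumed base address  */
#define CONV3_Eng_ROW    ((volatile uint32_t *)(CONV3_ENG_BASE + 0x00))  /* input_row             */
#define CONV3_Eng_COL    ((volatile uint32_t *)(CONV3_ENG_BASE + 0x04))  /* input_col             */
#define CONV3_Eng_EN     ((volatile uint32_t *)(CONV3_ENG_BASE + 0x08))  /* start trigger         */
#define CONV3_Eng_READY  ((volatile uint32_t *)(CONV3_ENG_BASE + 0x0C))  /* computation done flag */
#define CONV3_Eng_Z      ((volatile float    *)(CONV3_ENG_BASE + 0x10))  /* accumulated result    */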
// pseudo code
float convolution_filter3()
{
    // Pre-load
    *CONV3_Eng_ROW = stride_row;      // tell the accelerator which input window to use
    *CONV3_Eng_COL = stride_col;
    *CONV3_Eng_EN  = 1;               // kick off the 5x5 matrix multiplier
    while (!*CONV3_Eng_READY)
        ;                             // wait until the result is ready
    return *CONV3_Eng_Z;
}

void convolution_conv3()
{
    // Initial
    for (stride_row = 0; stride_row < lay->output_rows; stride_row++) {
        for (stride_col = 0; stride_col < lay->output_columns; stride_col++) {
            output[stride_row][stride_col] = convolution_filter3(stride_row, stride_col);
        }
    }
}
I use SystemC/TLM to implement an approximately-timed behavioral model of the idea and integrate it into the Fast Model. By referencing this blog and adding timing annotation to the register reads/writes, I now have a software/hardware co-simulation environment for estimating system-wide performance.
Using this virtual platform, I can easily explore whole-system performance across different software implementations and adjust the NN timing model.
For example, assuming a 4-cycle read latency and a 2-cycle write latency for the 5x5 matrix multiplier accelerator, the total instruction count drops to 0.87 million instructions.
DS-5 debug console
---------------------------------------
Inf selected image [8] from CPU: 0
prob [ -16, -30, -6, -7, -18, -8, -21, -24, 17, -20,], avg/std: -13/10
Conv_mode: 3
selected image [8] from CPU: 0, inference: 8, [Pass]
Instr count is 875071
Cnt 0 is 0
Cnt 1 is 71896
Cnt 2 is 225004
Cnt 3 is 875071
Cnt 4 is 875071
DS5 console
Fourth Streamline Timeline
Fourth Streamline Functions Usage
In this case, I used a sample NN application to demonstrate how to bring machine learning inference to an Arm device. The use case shows that DS-5 and Fast Models are excellent tools for developing and profiling software algorithms on any Arm CPU. I will continue to post more useful cases about new CPU architecture explorations for machine learning; the next post will focus on NEON and the Armv8.4 dot product instructions (also available on Cortex-A75 and Cortex-A55).
Please refer to developer.arm.com for more information on Arm development tools, and don't hesitate to comment below if you have any further questions.
THE MNIST DATABASE of handwritten digits
Keras tutorial – build a convolutional neural network in 11 lines
Keras example MNIST_CNN.py on GitHub
Keras Tutorial: The Ultimate Beginner’s Guide to Deep Learning in Python
DS-5 Development Studio Overview
Streamline Performance Analyzer overview in DS-5
Modelling Arm-Based SoCs and Subsystems with Fast Models
Odin Shen GitHub repository - ArmMLVP_MNIST
*: In my Jupyter notebook, I use 16 output channels instead of 32, because I did not see any improvement from the larger output channel count on this dataset.
**: The reported rate assumes 100 MIPS CPU performance.
***: Actually, I have used assembly code with NEON and the Armv8.2 architecture extensions to improve this further; I will cover that in another blog.
For Chinese readers, please enjoy the Simplified Chinese version here.