Using Arm development solutions to bring on-device Machine Learning inference to the embedded world

There is no doubt that Machine Learning (ML) is moving to the edge. Many embedded devices need edge computing capability to provide more advanced services, such as voice recognition for smart speakers and face detection on surveillance cameras. Building a solid performance exploration methodology at an early stage is therefore essential when enabling a machine learning application.

In this blog I use a sample Convolutional Neural Network (CNN) as an example to describe how Arm development solutions can enable Machine Learning (ML) on the device. I compare different NN implementations using four metrics: 1) execution time, 2) NN code size, 3) CPU loading of the convolution function, and 4) memory access usage.

NN Device Inference Performance Table Template
 Implementation | Execution Time (instr)         | NN code size (bytes)  | Convolution function usage | CPU Load/Store
 NN#1           | (total inference instructions) | (NN device code size) | (CPU loading)              | (CPU load/store sample count)

The table above is a blank template showing how I will use these four factors to analyze each Neural Network inference implementation. Execution time and code size are clearly two important factors for application development on resource-constrained devices, and I will show how Arm development tools can help optimize them. Memory access usage is another crucial factor for AI performance at the edge: there is typically heavy data flow between layer inputs and outputs, and a huge amount of fetching of trained model parameters (e.g. convolution kernel filters and fully-connected weights) in each layer's calculation. I will show how to use a profiling tool to visualize this behavior.

Let's start with a sample CNN model for MNIST

In this article, I choose the 28x28-pixel grayscale images of handwritten digits from the MNIST dataset as an example to show how Arm development tools can help the user develop and analyze performance. I chose MNIST because it is simple, mature, and easy to understand. I highly recommend reading this page before proceeding, as I will refer to it several times below.

Example convolutional neural network

The example above shows that the 28x28-pixel input generates 32(*) 28x28 feature maps, which are then downsampled to 32 14x14 feature maps in layer 1; a similar procedure is repeated in layer 2. This is followed by two fully-connected layers that produce the final 10 values, representing the probability of each digit 0, 1, 2, ..., 9.

Following this blog, it is easy to construct the Convolutional Neural Network (CNN) in Keras with Python. Two conv2D layers and two max_pooling layers are enough to illustrate the concept.

My Jupyter Notebook model

After a couple of training runs on my laptop (a happy four-hour coffee break), I achieved a model with 98% accuracy, which provides a perfectly good set of inference parameters for the Arm embedded device that follows.

Validation loss rate

To build a platform for exploring machine learning applications on Arm's new CPU architecture technology, I use the latest Armv8.2 LITTLE core, the Cortex-A55, as my target CPU; you could also choose the Cortex-A75 or any other core with the Armv8.2 extensions instead. I used virtual prototyping to make software development and system architecture exploration easier before silicon is available. Arm Fast Models are my preferred answer here; read this blog for the details.

Since the neural network has already been well tuned for this application, continuing to use Python on the embedded device to construct it is not a good idea. It is more natural to implement those layers in C on the embedded device to avoid additional software overhead. An efficient development environment is essential for implementing a neural network from scratch, and a visual profiling tool is another requirement, to give me the whole picture of the application workload. Arm DS-5 offers a complete toolchain plus Streamline, which is exactly what I need.

First of all, I add a debug connection to hook up the target Cortex-A55 Fixed Virtual Platform (FVP). I use a bare-metal configuration to keep everything simple. You can treat the virtual platform as real silicon; all of the following work runs on it.

DS-5 Debug Configuration

Another benefit of DS-5 is scripting. In my case, an initialization Python script loads the test image into device memory at 0x80150000 and the model parameters at 0x80100000 at power-on reset, saving effort in my firmware implementation.

DS-5 Debug Initialization

First Implementation

I chose the DS-5 PMU example project as my working base; you can check out the final GitHub project here.

Before writing the neural network functions, I found that my Japanese colleague Ryuji Tanaka had already implemented a couple of nice APIs on his GitHub, so I reference that code for this project. Thanks, Ryuji-san.

The three major APIs are:

 API             | Description
 convolution     | Creates a convolution kernel that is convolved with the layer input to produce a tensor of outputs
 max_pooling     | Reduces the number of parameters and the amount of computation in the network
 fully_connected | Connects to all activations in the previous layer, computed as a matrix multiplication followed by a bias offset

Using these APIs to construct the same well-tuned network, the pseudo code looks like:

void mnist_cnn_eval() {
    // Pre-process
    convolution(&lay, layer0, layer1, layer0_parameter);
    max_pooling(&lay, layer1, layer2);
    convolution(&lay, layer2, layer3, layer2_parameter);
    max_pooling(&lay, layer3, layer4);
    fully_connected(&lay, layer4, layer5, layer5_parameter);
    fully_connected(&lay, layer5, layer6, layer6_parameter);
    // Post-process
}
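
For orientation, the layerN arguments above are simply flat float buffers sized to the feature-map shapes described earlier. The sketch below is illustrative only: the second-convolution channel count and the hidden fully-connected width are my assumptions for this example (and, as noted in the footnote, my notebook actually uses 16 output channels rather than 32).

// Illustrative buffer sizes only; the channel count after layer2 and the hidden
// fully-connected width are assumptions, not the project's exact values.
static float layer0[28 * 28 * 1];     // input image
static float layer1[28 * 28 * 32];    // after first convolution
static float layer2[14 * 14 * 32];    // after first max pooling
static float layer3[14 * 14 * 32];    // after second convolution (assumed 32 channels)
static float layer4[ 7 *  7 * 32];    // after second max pooling
static float layer5[128];             // hidden fully-connected layer (assumed width)
static float layer6[10];              // output scores for digits 0-9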

I use the PMU function calls to monitor the application during execution; the instruction count (instr count) is the number of executed instructions.
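
For reference, the sketch below shows one way to count retired instructions and CPU cycles directly through the Armv8-A PMU system registers on a bare-metal AArch64 target. It is a minimal illustration, not the DS-5 PMU example project's own API; event number 0x08 is the architectural INST_RETIRED event.

#include <stdint.h>

static inline void pmu_start(void) {
    __asm__ volatile("msr pmselr_el0, %0"     :: "r"(0UL));               // select event counter 0
    __asm__ volatile("msr pmxevtyper_el0, %0" :: "r"(0x08UL));            // event 0x08 = INST_RETIRED
    __asm__ volatile("msr pmcntenset_el0, %0" :: "r"((1UL << 31) | 1UL)); // enable cycle counter + counter 0
    __asm__ volatile("msr pmcr_el0, %0"       :: "r"(0x7UL));             // E=1, reset event and cycle counters
    __asm__ volatile("isb");
}

static inline uint64_t pmu_read_instr(void) {
    uint64_t v;
    __asm__ volatile("mrs %0, pmxevcntr_el0" : "=r"(v));                  // counter 0, selected above
    return v;
}

static inline uint64_t pmu_read_cycles(void) {
    uint64_t v;
    __asm__ volatile("mrs %0, pmccntr_el0" : "=r"(v));
    return v;
}

Calling pmu_start() before mnist_cnn_eval() and pmu_read_instr() after it returns gives the kind of instruction count shown in the console output below.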

Let's load the first image, image_8.jpg (a handwritten 8), to test.

Looks good, the image is detected correctly.

DS-5 debug console
---------------------------------------
Inf selected image [8] from CPU: 0
    prob [ -16, -30, -6, -7, -18, -8, -21, -24, 17, -20,], avg/std: -13/10
    Conv_mode: 1	selected image [8] from CPU: 0, inference: 8, 		[Pass]
		Instr count is 13146336
			 Cnt 0 is 0
			 Cnt 1 is 109812
			 Cnt 2 is 3416674
			 Cnt 3 is 13146336
			 Cnt 4 is 13146336


Then I use DS-5 scripting to run through the MNIST validation dataset and confirm that my C implementation produces the same results as the original Python API.

Pretty much done, right?

Unfortunately not. The PMU instruction counter shows the application takes 13 million instructions to process a single image, which means the device can only classify about 7.6 images per second(**) for this sample network (100 MIPS / 13.1 million instructions per image ≈ 7.6 images/s).

Streamline Profiling

To explore the detailed CPU behavior of the application, I apply Streamline to my DS-5 project.

After following this blog to configure my project on the virtual platform, Streamline can provide the detailed information needed to observe the CPU behavior.

Under the 100 MIPS CPU model assumption, Streamline shows a total execution time of around 133 ms. The user can change the time units to zoom in and out and inspect different time periods. This powerful tool can also cross-reference source code; click here for more detail.

Streamline Timeline

One interesting thing I found is that a nested-loop function consumes 96.7% of the CPU load during the simulation. Experience tells me I should first try tuning the compiler optimization options.

Streamline Functions Usage

NN Device Inference Performance Table
 Implementation | Execution Time (instr) | NN code size (bytes) | Convolution function usage | CPU Load/Store
 O1             | 13146K                 | 49K                  | 96.7%                      | 1187 / 547

Change Compiler Option

Bingo!

Raising the Arm Compiler 6 optimization level from -O1 to -O3 removes about 6 million instructions, cutting CPU time by 45%.

DS-5 debug console
---------------------------------------
Inf selected image [8] from CPU: 0
    prob [ -16, -30, -6, -7, -18, -8, -21, -24, 17, -20,], avg/std: -13/10
    Conv_mode: 1	selected image [8] from CPU: 0, inference: 8, 		[Pass]
		Instr count is 7166001
			 Cnt 0 is 0
			 Cnt 1 is 110598
			 Cnt 2 is 1724950
			 Cnt 3 is 7166001
			 Cnt 4 is 7166001


Second Streamline Timeline

Looking at the function utilization again, the new convolution() still occupies 87% of the CPU. To reduce the execution time further, I must focus on improving the convolution() C implementation.

Second Streamline Functions Usage

NN Device Inference Performance Table #2
 Implementation | Execution Time (instr) | NN code size (bytes) | Convolution function usage | CPU Load/Store
 O1             | 13146K                 | 49K                  | 96.7%                      | 1187 / 547
 O3             | 7166K                  | 31K                  | 86.99%                     | 608 / 224

Re-design API

Streamline not only provides CPU instruction information, it also collects the average CPU load/store counts during each sampling period, giving the user a rough number for any selected period. Using this, I can roughly estimate the total loads/stores for the whole NN operation, as calculated below.

  • Load: 8.33 K/s * 73ms = 608
  • Store: 3.08 K/s * 73ms = 224

Streamline excerpt displaying loads and stores

These numbers are much bigger than I expected, so I should review my C code to reduce them.

By tracing the assembly code in DS-5, I found a better structure: separate the original convolution() into convolution_conv2() and a new API, convolution_filter2().

 API                   | Description
 convolution           | Creates a convolution kernel that is convolved with the layer input to produce a tensor of outputs
 convolution_conv2()   | Redesigned convolution layer; dispatches each 2D input element calculation into convolution_filter2()
 convolution_filter2() | Filter kernel convolution implementation
 max_pooling           | Reduces the number of parameters and the amount of computation in the network
 fully_connected       | Connects to all activations in the previous layer, computed as a matrix multiplication followed by a bias offset

//pseudo code 
void convolution_filter2() {
    // Load parameters
    for (current_filter_row = 0; current_filter_row < filter_rows; current_filter_row++) {
        for (current_filter_col = 0; current_filter_col < filter_cols; current_filter_col++) {
            for (in_ch = 0; in_ch < input_channel; in_ch++) {
                current_input = ((float*)inputs)[  ((stride_row + current_filter_row) * input_columns * input_channel)
                                                 + ((stride_col + current_filter_col) * input_channel)
                                                 + in_ch];
                for (out_ch = 0; out_ch < output_channel; out_ch++) {
                    current_weight = ((float*)weights)[  (current_filter_row * filter_cols * input_channel * output_channel)
                                                       + (current_filter_col * input_channel * output_channel)
                                                       + (in_ch              * output_channel)
                                                       + out_ch];
                    current_result = current_input * current_weight;
                    ((float*)outputs)[  (stride_row * output_columns * output_channel)
                                      + (stride_col * output_channel)
                                      + out_ch]
                    += current_result;
                }
            }
        }
    }
    // Add bias and apply activation for this output position
    for (out_ch = 0; out_ch < output_channel; out_ch++) {
        current_bias = ((float*)biases)[out_ch];
        kernel_output_addr = (stride_row * output_columns * output_channel) + (stride_col * output_channel) + out_ch;
        kernel_result = ((float*)outputs)[kernel_output_addr];
        kernel_result += current_bias;
        if (relu_activation) {
            kernel_result = relu(kernel_result);
        }
        ((float*)outputs)[kernel_output_addr] = kernel_result;
    }
}

void convolution_conv2() {
    // Initialize
    // Preload parameters
    // Pre-processing
    for (stride_row = 0; stride_row < lay->output_rows; stride_row++) {
        for (stride_col = 0; stride_col < lay->output_columns; stride_col++) {
            convolution_filter2(stride_row, stride_col, ...);
        }
    }
}

Let's run the simulation again.

Nice, the new code saves another 3 million instructions, making it 3.3 times faster than the initial implementation.

DS-5 debug console
---------------------------------------
Inf selected image [8] from CPU: 0
    prob [ -16, -30, -6, -7, -18, -8, -21, -24, 17, -20,], avg/std: -13/10
    Conv_mode: 2	selected image [8] from CPU: 0, inference: 8, 		[Pass]
		Instr count is 3952637
			 Cnt 0 is 0
			 Cnt 1 is 110440
			 Cnt 2 is 629588
			 Cnt 3 is 3952637
			 Cnt 4 is 3952637


Third Streamline Timeline

According to the current Streamline report, there is most likely no more room to improve the new API at the C level. (***)

Third Streamline Functions Usage

Calculating the load/store sample counts again:

Streamline CPU load/store average count

  • Load: 6.12 K/s * 41ms = 250.9
  • Store: 1.87 K/s * 41ms = 76.6

Now we can build a table comparing the three versions of the NN device implementation by total executed instructions, the CPU loading of the critical function call, and the CPU memory access sample counts.

NN Device Inference Performance Table #3
 Implementation | Execution Time (instr) | NN code size (bytes) | Convolution function usage | CPU Load/Store
 O1             | 13146K                 | 49K                  | 96.7%                      | 1187 / 547
 O3             | 7166K                  | 31K                  | 86.99%                     | 608 / 224
 O3 w/conv2     | 3952K                  | 28K                  | 77.16%                     | 250 / 76

NN Accelerator

After several rounds of coding, I realized this project had almost reached its software performance limit. Another approach is to design a parallel matrix-multiplier accelerator to replace the convolution_filter2() API.

My NN accelerator concept offloads some of the work from the CPU: it can directly access the input data and model parameters and write into the output buffer. The CPU programs the input_row and input_col registers to trigger the parallel 5x5 matrix multiplier. It is a straightforward idea, but the problem is that I have no RTL design to validate it against. Can the same virtual platform environment help here?

Accelerator Model Concept
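
To drive this concept from C, the accelerator appears to the firmware as a small block of memory-mapped registers. The definitions below sketch what the pseudo code that follows assumes; the base address and register offsets are purely illustrative, since no real device exists yet.

#include <stdint.h>

/* Hypothetical register block for the accelerator concept; base address and
   offsets are assumptions for illustration only. */
#define CONV3_ENG_BASE   0x90000000UL
#define CONV3_REG(off)   ((volatile uint32_t *)(CONV3_ENG_BASE + (off)))

#define CONV3_Eng_ROW    CONV3_REG(0x00)                              /* input stride row */
#define CONV3_Eng_COL    CONV3_REG(0x04)                              /* input stride column */
#define CONV3_Eng_EN     CONV3_REG(0x08)                              /* write 1 to start the 5x5 multiply */
#define CONV3_Eng_READY  CONV3_REG(0x0C)                              /* reads non-zero when the result is valid */
#define CONV3_Eng_Z      ((volatile float *)(CONV3_ENG_BASE + 0x10))  /* accumulated result */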

//pseudo code 
float convolution_filter3() {
    // Pre-load
    *CONV3_Eng_ROW = stride_row;      // program the input row
    *CONV3_Eng_COL = stride_col;      // program the input column
    *CONV3_Eng_EN = 1;                // kick off the 5x5 matrix multiplier
    while (!*CONV3_Eng_READY)
        ;                             // poll until the result is valid
    return *CONV3_Eng_Z;
}

void convolution_conv3() {
    // Initialize
    for (stride_row = 0; stride_row < lay->output_rows; stride_row++) {
        for (stride_col = 0; stride_col < lay->output_columns; stride_col++) {
            output[stride_row][stride_col] =
                convolution_filter3(stride_row, stride_col);
        }
    }
}

I use SystemC/TLM to implement an approximately-timed behavioral model of this idea and integrate it into the Fast Models system. Following this blog and adding timing annotation to the register reads and writes, I now have a software/hardware co-simulation environment to estimate system-wide performance.
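
As a rough idea of what that looks like, the sketch below is a minimal SystemC/TLM-2.0 target that models the accelerator's register interface and annotates each access with the assumed latencies (4 cycles per read, 2 cycles per write, at an assumed 100 MHz clock). It is not the actual Fast Models integration code, just the shape of the timing-annotated model; the register decode and the 5x5 multiply itself are left as stubs.

#include <cstdint>
#include <systemc.h>
#include <tlm.h>
#include <tlm_utils/simple_target_socket.h>

// Sketch of a loosely-timed register model for the accelerator concept.
struct Conv3Engine : sc_core::sc_module {
    tlm_utils::simple_target_socket<Conv3Engine> socket;

    SC_CTOR(Conv3Engine) : socket("socket") {
        socket.register_b_transport(this, &Conv3Engine::b_transport);
    }

    // Every register access arrives here; the delay annotation is what lets the
    // virtual platform estimate system-wide timing under different assumptions.
    void b_transport(tlm::tlm_generic_payload& trans, sc_core::sc_time& delay) {
        const sc_core::sc_time cycle(10, sc_core::SC_NS);   // assumed 100 MHz accelerator clock
        uint32_t* data = reinterpret_cast<uint32_t*>(trans.get_data_ptr());
        uint64_t  addr = trans.get_address();

        if (trans.get_command() == tlm::TLM_READ_COMMAND) {
            delay += cycle * 4;                             // assumed 4-cycle read latency
            *data = read_reg(addr);                         // e.g. READY flag or result Z
        } else if (trans.get_command() == tlm::TLM_WRITE_COMMAND) {
            delay += cycle * 2;                             // assumed 2-cycle write latency
            write_reg(addr, *data);                         // e.g. ROW, COL or EN
        }
        trans.set_response_status(tlm::TLM_OK_RESPONSE);
    }

private:
    // Stubs: decode the register offset and run the 5x5 matrix multiply on EN.
    uint32_t read_reg(uint64_t /*addr*/)                       { return 0; }
    void     write_reg(uint64_t /*addr*/, uint32_t /*value*/)  {}
};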

Using this virtual platform, I can easily explore whole-system performance with different software implementations and different NN timing-model settings.

For example, assuming a 4-cycle read latency and a 2-cycle write latency for the 5x5 matrix-multiplier accelerator, the total instruction count drops to 0.87 million instructions.

DS-5 debug console
---------------------------------------
Inf selected image [8] from CPU: 0
    prob [ -16, -30, -6, -7, -18, -8, -21, -24, 17, -20,], avg/std: -13/10
    Conv_mode: 3	selected image [8] from CPU: 0, inference: 8, 		[Pass]
		Instr count is 875071
			 Cnt 0 is 0
			 Cnt 1 is 71896
			 Cnt 2 is 225004
			 Cnt 3 is 875071
			 Cnt 4 is 875071


Fourth Streamline Timeline

Fourth Streamline Functions Usage

NN Device Inference Performance Table #4
 Implementation | Execution Time (instr) | NN code size (bytes) | Convolution function usage | CPU Load/Store
 O1             | 13146K                 | 49K                  | 96.7%                      | 1187 / 547
 O3             | 7166K                  | 31K                  | 86.99%                     | 608 / 224
 O3 w/conv2     | 3952K                  | 28K                  | 77.16%                     | 250 / 76
 O3 w/NNacc     | 875K                   | 24K                  | 50%                        | 106 / 23

CPU loading reduction rate on different NN

Conclusion

In this case study, I used a simple NN application to demonstrate how to bring machine learning inference to an Arm device. The use case shows that DS-5 and Fast Models are excellent tools to help people develop and profile software algorithms on any Arm CPU. I will continue to post more useful cases about new CPU architectural explorations for Machine Learning; the next post will focus on NEON and the Armv8.4 dot-product instructions (also available on Cortex-A75 and Cortex-A55).

Please refer to developer.arm.com for more information on Arm Development Tools, and don't hesitate to comment below if you have any further questions.

Reference

MNIST:

THE MNIST DATABASE of handwritten digits

Keras tutorial – build a convolutional neural network in 11 lines

Keras example MNIST_CNN.py on GitHub

Keras Tutorial: The Ultimate Beginner’s Guide to Deep Learning in Python

Arm tools:

DS-5 Development Studio Overview

Streamline Performance Analyzer overview in DS-5

Modelling Arm-Based SoCs and Subsystems with Fast Models

GitHub:

Odin Shen GitHub repository - ArmMLVP_MNIST

*: In my Jupyter notebook, I use 16 output channels instead of 32 because I did not see any improvement from the larger channel count on this dataset.

**: The reported rate assumes 100 MIPS CPU performance.

***: Actually, I have used assembly code with NEON and the Armv8.2 architecture extensions to improve this further; I will cover that in another blog.
