Machine Learning (ML) is undoubtedly moving to the edge. Many embedded devices need edge computing capability to provide more advanced services, such as voice recognition for smart speakers and face detection on surveillance cameras. Constructing a solid performance exploration methodology at an early stage is therefore essential when enabling a machine learning application.
In this blog I use a sample Convolutional Neural Network (CNN) as an example to describe how Arm development solutions can enable machine learning on a device. I compare the different NN implementations across four factors: 1) execution time, 2) NN code size, 3) CPU loading, and 4) memory access usage.
The table above is a blank template showing how I will use these four factors to analyze each neural network inference implementation. Execution time and code size are clearly important for application development on resource-constrained devices, and I will show how Arm's development tools help optimize both. Memory access usage is another crucial factor for AI performance at the edge: there is typically heavy data flow between input and output, plus a huge amount of fetching of trained model parameters (e.g. convolution kernel filters, fully-connected weights) at each layer of the calculation. I will provide an example of how a profiling tool can visualize this.
In this article, I use 28x28-pixel grayscale images of handwritten digits (the MNIST dataset) as an example to show how Arm's development tools can help the user develop and analyze performance. I chose MNIST because it is simple, mature, and easy to understand. I highly recommend reading this page before proceeding; I will also make multiple references to it below.
Example convolutional neural network
The example below shows how the 28x28-pixel input is expanded into 32(*) 28x28 feature maps and then down-sampled to 32 14x14 maps in layer 1; a similar procedure is repeated in layer 2. Two fully-connected layers then follow, producing 10 final values that represent the probability of each digit 0, 1, 2, …, 9.
Following this blog, it is easy to use Python to construct a Convolutional Neural Network (CNN) in Keras. Just two Conv2D layers and two max-pooling layers are enough to illustrate the concept.
My Jupyter Notebook model
After a couple of training runs on my laptop (a happy four-hour coffee break), I achieved a model with 98% accuracy, which is more than good enough to provide the inference parameters for the Arm embedded device that follows.
Validation loss rate
To build a platform for exploring machine learning applications on Arm's latest CPU architecture technology, I use the Armv8.2 LITTLE core Cortex-A55 as my target CPU; you could also choose a Cortex-A75 or any other core with the Armv8.2 extensions instead. I use virtual prototyping to make software development and system architecture exploration easier before silicon is available. Arm Fast Models are my preferred answer here; read this blog for the details.
Since the neural network is already well tuned for this application, continuing to use Python on the embedded device to construct it is not a good idea. It is more natural to implement those layers in C on the device to avoid additional software overhead. An efficient development environment is essential for implementing a neural network from scratch, and a visual profiling tool is another requirement, to give me the whole picture of the application workload. Arm DS-5 offers a complete toolchain plus Streamline, which is exactly what I need.
First of all, I add a debug connection to hook onto the target Cortex-A55 Fixed Virtual Platform (FVP). I use a bare-metal configuration to keep everything simple. You can treat the virtual platform as real silicon; all of the following work runs on it.
DS-5 Debug Configuration
Another benefit of DS-5 is scripting. In my case, an initialization Python script loads the test image into 0x80150000 and the model parameters into 0x80100000 in device memory at power-on reset, which saves effort in my firmware implementation.
DS5 Debug Initialization
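On the firmware side, those preloaded regions can simply be referenced as fixed addresses. A minimal sketch follows; only the two addresses come from the script above, while the names and pointer types are my own assumptions:

#include <stdint.h>

/* Addresses match the DS-5 power-on-reset script above; names and types are illustrative only. */
#define TEST_IMAGE_BASE   ((const uint8_t *)0x80150000)   /* 28x28 grayscale test image */
#define MODEL_PARAM_BASE  ((const float   *)0x80100000)   /* trained weights and biases */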
I chose the DS-5 PMU example project as my working base; you can check out the final GitHub project here.
Before writing the neural network functions, I found that my Japanese colleague Ryuji Tanaka had already implemented a couple of nice APIs on his GitHub, so I reference his code in this project. Thanks, Ryuji-san.
Three major APIs are:
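Judging from the call sites in the pseudo code below, their shapes are roughly as follows. This is only a sketch; the exact signatures live in the referenced GitHub project, and layer_t is my placeholder name for the layer descriptor type:

/* Rough prototypes inferred from the call sites below; see the GitHub project for the real ones. */
void convolution(layer_t *lay, const float *input, float *output, const float *parameters);
void max_pooling(layer_t *lay, const float *input, float *output);
void fully_connected(layer_t *lay, const float *input, float *output, const float *parameters);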
Using those APIs to construct the same well-tuned network, the pseudo code looks like this:
mnist_cnn_eval()
{
    // Pre-process
    convolution(&lay, layer0, layer1, layer0_parameter);
    max_pooling(&lay, layer1, layer2);
    convolution(&lay, layer2, layer3, layer2_parameter);
    max_pooling(&lay, layer3, layer4);
    fully_connected(&lay, layer4, layer5, layer5_parameter);
    fully_connected(&lay, layer5, layer6, layer6_parameter);
    // Post-process
}
PMU function calls are used to monitor the application during execution; the instr count is the number of executed instructions.
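As a rough sketch of what that monitoring looks like in C, the counter can be read directly from the Armv8-A PMU system registers. The helper below assumes event counter 0 has already been programmed for event 0x08 (INST_RETIRED) and enabled, which the DS-5 PMU example code takes care of; the function names here are mine, not the example project's API:

#include <stdint.h>
#include <stdio.h>

void mnist_cnn_eval(void);   /* the inference routine from the pseudo code above */

/* Read Armv8-A PMU event counter 0 (assumed to be programmed for INST_RETIRED, event 0x08). */
static inline uint64_t read_instr_retired(void)
{
    uint64_t value;
    __asm__ volatile("mrs %0, pmevcntr0_el0" : "=r"(value));
    return value;
}

void run_one_inference(void)
{
    uint64_t start = read_instr_retired();
    mnist_cnn_eval();                                   /* run one inference */
    uint64_t count = read_instr_retired() - start;
    printf("Instr count is %llu\n", (unsigned long long)count);
}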
Let's load the first test image, image_8.jpg, and try it.
Looks good: the image is recognized correctly.
DS-5 debug console
---------------------------------------
Inf selected image [8] from CPU: 0
prob [ -16, -30, -6, -7, -18, -8, -21, -24, 17, -20,], avg/std: -13/10
Conv_mode: 1
selected image [8] from CPU: 0, inference: 8, [Pass]
Instr count is 13146336
Cnt 0 is 0
Cnt 1 is 109812
Cnt 2 is 3416674
Cnt 3 is 13146336
Cnt 4 is 13146336
DS-5 Console
Then I use DS-5 scripting to run the MNIST validation dataset and confirm that my C implementation produces the same results as the original Python API.
Pretty much done, right?
Unfortunately not: the PMU instruction counter shows that the application takes 13 million instructions to process a single image, which means the device can only handle about 7.6 images per second(**) for this sample network (100 million instructions per second ÷ 13.1 million instructions per image ≈ 7.6).
To explore the detailed CPU behavior of the application, I need to apply Streamline to my DS-5 project.
Following this blog to configure my project on the virtual platform, Streamline can now provide detailed information for observing the CPU's behavior.
Under the 100 MIPS CPU model assumption, Streamline shows a total execution time of around 133 ms. The user can switch between time units to zoom in and out and examine different time periods. This powerful tool can also cross-reference the source code; click here for more detail.
Streamline Timeline
One interesting finding is that a nested-loop function consumes 96.7% of the CPU load during the simulation. Experience tells me I should first try tuning the compiler optimization options.
Streamline Functions Usage
Bingo!
Simply adjusting the Arm Compiler 6 optimization level cuts CPU time by 45%, removing roughly 6 million instructions.
DS-5 debug console
---------------------------------------
Inf selected image [8] from CPU: 0
prob [ -16, -30, -6, -7, -18, -8, -21, -24, 17, -20,], avg/std: -13/10
Conv_mode: 1
selected image [8] from CPU: 0, inference: 8, [Pass]
Instr count is 7166001
Cnt 0 is 0
Cnt 1 is 110598
Cnt 2 is 1724950
Cnt 3 is 7166001
Cnt 4 is 7166001
Second Streamline Timeline
Looking at the function utilization again, convolution() still occupies 87% of the CPU resources, so to reduce execution time further I must focus on improving the convolution() C implementation.
Second Streamline Functions Usage
Streamline not only provides CPU instruction information, but also collects the average CPU load/store count per sample, giving the user a rough number for any selected period. Using this, I can roughly calculate the total loads/stores for the whole NN operation, as sketched below.
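The estimate itself is trivial: multiply the per-sample average by the number of samples that cover the inference region. A minimal sketch, with inputs to be read off the Streamline capture (no measured values are hard-coded here):

/* Rough estimate: total accesses ≈ average count per sample × samples in the selected region. */
static double estimate_total_accesses(double avg_per_sample, double samples_in_region)
{
    return avg_per_sample * samples_in_region;
}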
Those numbers are much bigger than I expected, so I should review my C code to reduce them.
By tracing the assembly code in DS-5, I found a better structure: splitting the original convolution() into convolution_conv2(), which walks the output positions, and a new API, convolution_filter2(), which computes a single output position.
// pseudo code
void convolution_filter2()
{
    // Compute one output position (stride_row, stride_col) across all output channels
    // Load parameters
    for (current_filter_row = 0; current_filter_row < filter_rows; current_filter_row++) {
        for (current_filter_col = 0; current_filter_col < filter_cols; current_filter_col++) {
            for (in_ch = 0; in_ch < input_channel; in_ch++) {
                current_input = ((float*)inputs)[
                    ((stride_row + current_filter_row) * input_columns * input_channel) +
                    ((stride_col + current_filter_col) * input_channel) +
                    in_ch];
                for (out_ch = 0; out_ch < output_channel; out_ch++) {
                    current_weight = ((float*)weights)[
                        (current_filter_row * filter_cols * input_channel * output_channel) +
                        (current_filter_col * input_channel * output_channel) +
                        (in_ch * output_channel) +
                        out_ch];
                    current_result = current_input * current_weight;
                    ((float*)outputs)[
                        (stride_row * output_columns * output_channel) +
                        (stride_col * output_channel) +
                        out_ch] += current_result;
                }
            }
        }
    }
    // Add the bias and apply the activation for each output channel
    for (out_ch = 0; out_ch < output_channel; out_ch++) {
        current_bias = ((float*)biases)[out_ch];
        kernel_output_addr = (stride_row * output_columns * output_channel) +
                             (stride_col * output_channel) +
                             out_ch;
        kernel_result = ((float*)outputs)[kernel_output_addr];
        kernel_result += current_bias;
        if (relu_activation) {
            kernel_result = relu(kernel_result);
        }
        ((float*)outputs)[kernel_output_addr] = kernel_result;
    }
}

int convolution_conv2()
{
    // Initial
    // Preload parameters
    // Pre-processing
    for (stride_row = 0; stride_row < lay->output_rows; stride_row++) {
        for (stride_col = 0; stride_col < lay->output_columns; stride_col++) {
            convolution_filter2(stride_row, stride_col, ...);
        }
    }
}
Let's do the simulation again.
Nice: the new code saves another 3 million instructions, making it 3.3 times faster than the original.
DS-5 debug console
---------------------------------------
Inf selected image [8] from CPU: 0
prob [ -16, -30, -6, -7, -18, -8, -21, -24, 17, -20,], avg/std: -13/10
Conv_mode: 2
selected image [8] from CPU: 0, inference: 8, [Pass]
Instr count is 3952637
Cnt 0 is 0
Cnt 1 is 110440
Cnt 2 is 629588
Cnt 3 is 3952637
Cnt 4 is 3952637
Third Streamline Timeline
According to the current Streamline report, it looks like there is little room left to improve the new API at the C level. (***)
Calculating the load/store sample counts again:
Streamline CPU Load/Store average count
Now we can build a table comparing the three versions of the NN device implementation on total executed instructions, the CPU loading of the critical function call, and the CPU memory access sampling count.
After several rounds of coding, I realized that this project had almost reached the limits of software performance. Another approach is to design a parallel matrix multiplier accelerator to replace the convolution_filter2() API.
I have an NN accelerator concept that offloads some of the work from the CPU: it can directly access the input data and model parameters and write into the output buffer. The CPU only has to program the input_row and input_col registers to kick off the parallel 5x5 matrix multiplier. That is a straightforward idea, but the problem is that I have no RTL design to confirm it against. Can the same virtual platform environment help here?
Accelerator Model Concept
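Before the driver pseudo code below, here is how the CPU side of that concept might be declared in C. The register names match the pseudo code; the base address and offsets are purely my assumptions for this concept model:

#include <stdint.h>

/* Hypothetical memory-mapped register block for the 5x5 matrix multiplier accelerator. */
#define CONV3_ENG_BASE   0x90000000UL                                    /* assumed base address  */
#define CONV3_Eng_ROW    ((volatile uint32_t *)(CONV3_ENG_BASE + 0x00))  /* input_row             */
#define CONV3_Eng_COL    ((volatile uint32_t *)(CONV3_ENG_BASE + 0x04))  /* input_col             */
#define CONV3_Eng_EN     ((volatile uint32_t *)(CONV3_ENG_BASE + 0x08))  /* start trigger         */
#define CONV3_Eng_READY  ((volatile uint32_t *)(CONV3_ENG_BASE + 0x0C))  /* computation done flag */
#define CONV3_Eng_Z      ((volatile float    *)(CONV3_ENG_BASE + 0x10))  /* accumulated result    */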
// pseudo code
float convolution_filter3()
{
    // Pre-load
    *CONV3_Eng_ROW = stride_row;      // tell the accelerator which input window to use
    *CONV3_Eng_COL = stride_col;
    *CONV3_Eng_EN  = 1;               // kick off the 5x5 matrix multiplier
    while (!*CONV3_Eng_READY)
        ;                             // wait until the result is ready
    return *CONV3_Eng_Z;
}

void convolution_conv3()
{
    // Initial
    for (stride_row = 0; stride_row < lay->output_rows; stride_row++) {
        for (stride_col = 0; stride_col < lay->output_columns; stride_col++) {
            output[stride_row][stride_col] = convolution_filter3(stride_row, stride_col);
        }
    }
}
I use SystemC/TLM to implement an approximately-timed behavioral model of the idea and integrate it into the Fast Model. By referencing this blog and adding timing annotation to the register reads/writes, I now have a software/hardware co-simulation environment for estimating system-wide performance.
Using this virtual platform, I can easily explore whole-system performance across different software implementations and adjust the NN timing model.
For example, assuming a 4-cycle read latency and a 2-cycle write latency for the 5x5 matrix multiplier accelerator, the total instruction count drops to 0.87 million instructions.
DS-5 debug console
---------------------------------------
Inf selected image [8] from CPU: 0
prob [ -16, -30, -6, -7, -18, -8, -21, -24, 17, -20,], avg/std: -13/10
Conv_mode: 3
selected image [8] from CPU: 0, inference: 8, [Pass]
Instr count is 875071
Cnt 0 is 0
Cnt 1 is 71896
Cnt 2 is 225004
Cnt 3 is 875071
Cnt 4 is 875071
DS5 console
Fourth Streamline Timeline
Fourth Streamline Functions Usage
In this case, I used a sample NN application to demonstrate how to bring machine learning inference to an Arm device. The use case shows that DS-5 and Fast Models are excellent tools for developing and profiling software algorithms on any Arm CPU. I will continue to post more useful cases about new CPU architecture explorations for machine learning; the next post will focus on NEON and the Armv8.4 dot product instructions (also available on Cortex-A75 and Cortex-A55).
Please refer to developer.arm.com for more information on Arm development tools, and don't hesitate to comment below if you have any further questions.
THE MNIST DATABASE of handwritten digits
Keras tutorial – build a convolutional neural network in 11 lines
Keras example MNIST_CNN.py on GitHub
Keras Tutorial: The Ultimate Beginner’s Guide to Deep Learning in Python
DS-5 Development Studio Overview
Streamline Performance Analyzer overview in DS-5
Modelling Arm-Based SoCs and Subsystems with Fast Models
Odin Shen GitHub repository - ArmMLVP_MNIST
*: In my Jupyter notebook, I use 16 output channels instead of 32, because I did not see any improvement from the larger output channel count on this dataset.
**: The reported rate assumes 100 MIPS CPU performance.
***: Actually, I have used assembly code with NEON and the Armv8.2 architecture extensions to improve this further; I will cover that in another blog.
For Chinese readers, please enjoy the Simplified Chinese version here.