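With the rise of the Internet of Things, demand for low-power, real-time intelligence on end nodes keeps growing. Neural networks are a popular direction in artificial intelligence, and there is already plenty of introductory material online, so I will not repeat it here. Traditional neural networks need large amounts of compute. Inference needs far fewer resources than training, but the cost is still substantial. So how can fast neural network inference be implemented on a relatively weak microcontroller? It takes heavy optimization of existing neural network models, and CMSIS-NN is a very good attempt in this direction.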
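CMSIS stands for Cortex Microcontroller Software Interface Standard. Its goal is to address software incompatibility across the microcontroller ecosystem: operating systems on microcontrollers are highly fragmented, software is hard to reuse, and a lot of wheels get reinvented. The overall CMSIS framework is shown below. By introducing a few minimal abstraction-layer APIs, it isolates applications and middleware from the OS without hurting system performance, and it adds support for the mainstream debug tools DS-5, Keil, and IAR.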
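CMSIS-NN is a recent addition to the CMSIS family. It greatly eases the burden of optimizing neural network software on microcontrollers. Its overall block diagram is shown below. CMSIS-NN gets its speedup by optimizing the key functions a neural network needs, for example by replacing floating-point arithmetic with fixed-point (8/16-bit) arithmetic and by using lookup tables to avoid computing activation functions directly.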
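To make the fixed-point idea concrete, here is a minimal sketch in plain C of an 8-bit (Q7) dot product with a bias shift and an output shift. It is not the CMSIS-NN implementation: the helper names quantize_q7, saturate_q7 and dot_q7 are made up for illustration, and the shift parameters only mirror the role that constants such as CONV1_BIAS_LSHIFT and CONV1_OUT_RSHIFT appear to play in the example source.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: convert a float to Q7 with `frac_bits` fractional bits,
 * rounding half away from zero and saturating to the int8_t range. */
static int8_t quantize_q7(float x, int frac_bits)
{
    assert(frac_bits >= 0 && frac_bits <= 7);
    float scaled = x * (float)(1 << frac_bits);
    scaled += (scaled >= 0.0f) ? 0.5f : -0.5f;
    if (scaled >= 127.0f) return 127;
    if (scaled <= -128.0f) return -128;
    return (int8_t)scaled;   /* truncation after the +/-0.5 offset rounds */
}

/* Hypothetical helper: clamp a 32-bit accumulator to the int8_t range. */
static int8_t saturate_q7(int32_t v)
{
    if (v > 127) return 127;
    if (v < -128) return -128;
    return (int8_t)v;
}

/* Hypothetical helper: integer-only dot product.
 * The bias is scaled up by 2^bias_lshift to match the product format,
 * and the 32-bit sum is rounded and scaled back down by 2^out_rshift. */
static int8_t dot_q7(const int8_t *x, const int8_t *w, int n,
                     int8_t bias, int bias_lshift, int out_rshift)
{
    assert(n >= 0 && bias_lshift >= 0 && out_rshift >= 0);
    /* Multiply instead of shifting so a negative bias is well defined. */
    int32_t acc = (int32_t)bias * (int32_t)(1 << bias_lshift);
    if (out_rshift > 0) {
        acc += (int32_t)1 << (out_rshift - 1);   /* rounding offset */
    }
    for (int i = 0; i < n; i++) {
        acc += (int32_t)x[i] * (int32_t)w[i];    /* Q7 * Q7 -> Q14 */
    }
    /* Assumes an arithmetic right shift for negative values, which is what
     * common compilers for Arm and x86 provide. */
    return saturate_q7(acc >> out_rshift);
}
```

With 7 fractional bits, 0.5 becomes 64 and -0.25 becomes -32. The dot product of {0.5, -0.25} with {0.5, 0.5} accumulates 2048 in Q14, and an output shift of 7 brings that back to 16 in Q7, that is 0.125, without a single floating-point operation at inference time.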
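Adding a neural network to a program with CMSIS-NN is also easy: you only need to call the corresponding APIs. Since MDK currently supports only Windows, everything below was tested on Windows 10.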
First, upgrade CMSIS to 5.2.1. The current official release is 5.2.0, so CMSIS 5.2.1 has to be built by hand. The build process is as follows:
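Next, locate C:\XXXXXX\CMSIS_5\CMSIS\NN\Examples\ARM\arm_nn_examples\cifar10\arm_nnexamples_cifar10.uvprojx and double-click it to open it in MDK. Build and run it, and the CMSIS-NN example project will run in the simulator. The result is shown below.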
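Below is the source of the example. The input is a 32x32 RGB image, which is run through a trained neural network with three convolutional layers for inference. The class with the highest output probability is the prediction; according to the output above, the image should be class 2.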
#include <stdio.h>                                /* printf */
#include "arm_nnexamples_cifar10_parameter.h"     /* layer dimensions and shift constants (CONV1_IM_DIM, ...) */
#include "arm_nnexamples_cifar10_weights.h"
#include "arm_nnfunctions.h"
#include "arm_nnexamples_cifar10_inputs.h"

#ifdef _RTE_
#include "RTE_Components.h"
#ifdef RTE_Compiler_EventRecorder
#include "EventRecorder.h"
#endif
#endif

// include the input and weights
static q7_t conv1_wt[CONV1_IM_CH * CONV1_KER_DIM * CONV1_KER_DIM * CONV1_OUT_CH] = CONV1_WT;
static q7_t conv1_bias[CONV1_OUT_CH] = CONV1_BIAS;

static q7_t conv2_wt[CONV2_IM_CH * CONV2_KER_DIM * CONV2_KER_DIM * CONV2_OUT_CH] = CONV2_WT;
static q7_t conv2_bias[CONV2_OUT_CH] = CONV2_BIAS;

static q7_t conv3_wt[CONV3_IM_CH * CONV3_KER_DIM * CONV3_KER_DIM * CONV3_OUT_CH] = CONV3_WT;
static q7_t conv3_bias[CONV3_OUT_CH] = CONV3_BIAS;

static q7_t ip1_wt[IP1_DIM * IP1_OUT] = IP1_WT;
static q7_t ip1_bias[IP1_OUT] = IP1_BIAS;

q7_t input_data[CONV1_IM_CH * CONV1_IM_DIM * CONV1_IM_DIM] = IMG_DATA;
q7_t output_data[IP1_OUT];

// vector buffer: max(im2col buffer, average pool buffer, fully connected buffer)
q7_t col_buffer[2 * 5 * 5 * 32 * 2];

// holds two intermediate feature maps that the layers ping-pong between
q7_t scratch_buffer[32 * 32 * 10 * 4];

int main()
{
#ifdef RTE_Compiler_EventRecorder
    EventRecorderInitialize (EventRecordAll, 1);  // initialize and start Event Recorder
#endif

    printf("start execution\n");

    /* start the execution */
    q7_t *img_buffer1 = scratch_buffer;
    q7_t *img_buffer2 = img_buffer1 + 32 * 32 * 32;

    // conv1: input_data -> img_buffer1
    arm_convolve_HWC_q7_RGB(input_data, CONV1_IM_DIM, CONV1_IM_CH, conv1_wt, CONV1_OUT_CH, CONV1_KER_DIM,
                            CONV1_PADDING, CONV1_STRIDE, conv1_bias, CONV1_BIAS_LSHIFT, CONV1_OUT_RSHIFT,
                            img_buffer1, CONV1_OUT_DIM, (q15_t *) col_buffer, NULL);

    arm_relu_q7(img_buffer1, CONV1_OUT_DIM * CONV1_OUT_DIM * CONV1_OUT_CH);

    // pool1: img_buffer1 -> img_buffer2
    arm_maxpool_q7_HWC(img_buffer1, CONV1_OUT_DIM, CONV1_OUT_CH, POOL1_KER_DIM,
                       POOL1_PADDING, POOL1_STRIDE, POOL1_OUT_DIM, NULL, img_buffer2);

    // conv2: img_buffer2 -> img_buffer1
    arm_convolve_HWC_q7_fast(img_buffer2, CONV2_IM_DIM, CONV2_IM_CH, conv2_wt, CONV2_OUT_CH, CONV2_KER_DIM,
                             CONV2_PADDING, CONV2_STRIDE, conv2_bias, CONV2_BIAS_LSHIFT, CONV2_OUT_RSHIFT,
                             img_buffer1, CONV2_OUT_DIM, (q15_t *) col_buffer, NULL);

    arm_relu_q7(img_buffer1, CONV2_OUT_DIM * CONV2_OUT_DIM * CONV2_OUT_CH);

    // pool2: img_buffer1 -> img_buffer2
    arm_avepool_q7_HWC(img_buffer1, CONV2_OUT_DIM, CONV2_OUT_CH, POOL2_KER_DIM,
                       POOL2_PADDING, POOL2_STRIDE, POOL2_OUT_DIM, col_buffer, img_buffer2);

    // conv3: img_buffer2 -> img_buffer1
    arm_convolve_HWC_q7_fast(img_buffer2, CONV3_IM_DIM, CONV3_IM_CH, conv3_wt, CONV3_OUT_CH, CONV3_KER_DIM,
                             CONV3_PADDING, CONV3_STRIDE, conv3_bias, CONV3_BIAS_LSHIFT, CONV3_OUT_RSHIFT,
                             img_buffer1, CONV3_OUT_DIM, (q15_t *) col_buffer, NULL);

    arm_relu_q7(img_buffer1, CONV3_OUT_DIM * CONV3_OUT_DIM * CONV3_OUT_CH);

    // pool3: img_buffer1 -> img_buffer2
    arm_avepool_q7_HWC(img_buffer1, CONV3_OUT_DIM, CONV3_OUT_CH, POOL3_KER_DIM,
                       POOL3_PADDING, POOL3_STRIDE, POOL3_OUT_DIM, col_buffer, img_buffer2);

    // fully connected layer: img_buffer2 -> output_data (img_buffer1 is reused as scratch space)
#ifdef IP_X4
    arm_fully_connected_q7_opt(img_buffer2, ip1_wt, IP1_DIM, IP1_OUT, IP1_BIAS_LSHIFT, IP1_OUT_RSHIFT,
                               ip1_bias, output_data, (q15_t *) img_buffer1);
#else
    arm_fully_connected_q7(img_buffer2, ip1_wt, IP1_DIM, IP1_OUT, IP1_BIAS_LSHIFT, IP1_OUT_RSHIFT,
                           ip1_bias, output_data, (q15_t *) img_buffer1);
#endif

    // softmax in place, then print the score of each of the 10 classes
    arm_softmax_q7(output_data, 10, output_data);

    for (int i = 0; i < 10; i++)
    {
        printf("%d: %d\n", i, output_data[i]);
    }

    return 0;
}
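The example prints all ten scores and leaves picking the winner to the reader. A small helper like the one below could do that step in code. This is only a sketch: argmax_q7 is a hypothetical name, not a CMSIS-NN function, and it assumes the scores are signed 8-bit values, as q7_t is in the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of CMSIS-NN: return the index of the largest
 * score. On ties, the lowest index wins. */
static int argmax_q7(const int8_t *scores, int n)
{
    assert(scores != NULL && n > 0);
    int best = 0;
    for (int i = 1; i < n; i++) {
        if (scores[i] > scores[best]) {
            best = i;
        }
    }
    return best;
}
```

In the example it would be called after arm_softmax_q7, for instance as argmax_q7(output_data, 10), and the returned index is the predicted class.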
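I have no hardware at hand, so the CIFAR-10 example was only run in the simulator and I cannot estimate its performance myself. Colleagues have measured it on real hardware, however, and the performance is as follows.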
That is a quick start with CMSIS-NN. I hope you can use it in your upcoming projects. Suggestions for CMSIS-NN are welcome, and patches even more so :)
When I run the bat file, I get an error saying it is incompatible with 64-bit Windows. How can I fix this?
Then just run it in compatibility mode.