Deep learning is a rapidly evolving field in modern artificial intelligence (AI) and signal processing, with diverse applications including object detection, human identification, speech recognition, and gesture tracking. It is creating tremendous opportunities for innovation and business growth across the entire ARM ecosystem and in all market segments.
ARM released the ARM Compute Library (ACL) in mid-March, targeting a variety of use cases, including image processing, computer vision, and deep learning, by providing a comprehensive collection of software functions optimized for the ARM Cortex CPU and ARM Mali GPU architectures. It is available free of charge under a permissive MIT open-source license. Open AI Lab, an initiative by ARM Accelerator, promotes and supports adoption of the ARM Compute Library among Chinese companies committed to embedded deep learning and AI devices and applications.
PerceptIn, a Chinese start-up incubated in ARM Accelerator that is developing a next-generation robotics platform, recently shared a case study of building an embedded deep learning inference engine with the ARM Compute Library. PerceptIn was building an Internet-of-Things product with deep learning inference capabilities on an ARM SoC containing four ARMv7 cores running at 1 GHz and 512 MB of RAM; at its peak the SoC consumes about 3 W of power and costs only about four dollars. Their first (and most obvious) approach was to port an existing deep learning framework to the system. They chose TensorFlow, since their study showed it delivered the best performance on ARM-Linux platforms. Getting TensorFlow running on their system took weeks of intensive effort, which led them to consider building a deep learning inference engine from scratch instead. However, that idea raised two concerns: the development cost of writing the low-level building blocks themselves, and whether a hand-built engine could match the performance of an established framework.
The ARM Compute Library came just in time. PerceptIn investigated the functions in ACL and found that its seven kinds of functions for Convolutional Neural Networks (CNNs), namely Activation, Convolution, Fully Connected, Locally Connected, Normalization, Pooling, and Soft-Max, were exactly what they needed to build an inference engine. They didn't waste a minute and went ahead to implement SqueezeNet using the ACL CNN functions.
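To make concrete what two of these building blocks compute, here is a plain-C++ reference sketch of a convolution followed by a ReLU activation. This is illustrative only, not ACL code; ACL's NEON-optimized functions compute the same results far faster.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Naive reference for two of the CNN building blocks ACL provides
// (Convolution and Activation). Single-channel, row-major storage.

// 2D convolution of an H x W image with a K x K kernel, "valid"
// padding: the output is (H-K+1) x (W-K+1).
std::vector<float> conv2d_valid(const std::vector<float>& img, int h, int w,
                                const std::vector<float>& ker, int k) {
    const int oh = h - k + 1, ow = w - k + 1;
    std::vector<float> out(static_cast<std::size_t>(oh) * ow, 0.0f);
    for (int y = 0; y < oh; ++y)
        for (int x = 0; x < ow; ++x) {
            float acc = 0.0f;
            for (int ky = 0; ky < k; ++ky)
                for (int kx = 0; kx < k; ++kx)
                    acc += img[(y + ky) * w + (x + kx)] * ker[ky * k + kx];
            out[y * ow + x] = acc;
        }
    return out;
}

// ReLU activation: max(0, x), element-wise.
void relu_inplace(std::vector<float>& v) {
    for (float& x : v)
        if (x < 0.0f) x = 0.0f;
}
```

For example, convolving a 3x3 image with a 2x2 difference kernel `{1, 0, 0, -1}` yields negative values that the ReLU then clamps to zero.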
In the construction of SqueezeNet, the ARM Compute Library provided most of the functions and building blocks PerceptIn needed. A few layer types, such as concat and global pooling, were not yet supported by ACL and had to be built from scratch, but ACL still enabled PerceptIn to complete the SqueezeNet implementation in only a few days of work.
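In a naive (non-NEON) form, those two missing layer types are straightforward. A minimal sketch of what a from-scratch concat and global average pooling might look like, for CHW tensors stored as flat float arrays (illustrative only, not PerceptIn's implementation):

```cpp
#include <cassert>
#include <vector>

// Channel-wise concat: stack tensors along the channel axis. For CHW
// tensors sharing the same H x W plane size, this is just appending
// the flat buffers in order.
std::vector<float> concat_channels(const std::vector<std::vector<float>>& inputs) {
    std::vector<float> out;
    for (const auto& t : inputs)
        out.insert(out.end(), t.begin(), t.end());
    return out;
}

// Global average pooling: reduce each H x W plane to its mean, turning
// a C x H x W tensor into C values; `plane` is H * W.
std::vector<float> global_avg_pool(const std::vector<float>& in, int c, int plane) {
    std::vector<float> out(c, 0.0f);
    for (int ch = 0; ch < c; ++ch) {
        float sum = 0.0f;
        for (int i = 0; i < plane; ++i)
            sum += in[ch * plane + i];
        out[ch] = sum / plane;
    }
    return out;
}
```

Neither layer involves much arithmetic, which is consistent with these hand-written pieces costing only days rather than weeks of work.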
The ARM Compute Library also addressed the performance concern very well. PerceptIn compared the performance of ACL against that of TensorFlow. To ensure a fair comparison, the ARM NEON-enabled building blocks in ACL were chosen when building the SqueezeNet engine, and ARM NEON vector computation optimization was enabled in TensorFlow as well. As shown in Figure 1, SqueezeNet on TensorFlow, running on PerceptIn's quad-core ARMv7 @ 1 GHz platform, took 420 ms on average to process a 227x227 RGB image. The SqueezeNet engine built on ACL took only 320 ms to process the same image, a reduction of roughly 24% in processing time. To better understand the source of the gain, they went one step further and divided the processing time into two groups: the first covering convolution, ReLU, and concatenate, the second covering pooling and soft-max. The breakdown shows that the ACL-based SqueezeNet engine outperforms TensorFlow by 23% in group 1 and by 110% in group 2. As for resource utilization, TensorFlow averaged 75% CPU usage and about 9 MB of memory, while ACL averaged 90% CPU usage and about 10 MB of memory. The comparison shows that the ARM Compute Library brings better NEON optimization and lower performance overhead.
Figure 1: TensorFlow vs. ACL
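A per-group breakdown like this can be collected by wrapping each group of layer calls in wall-clock timers. A minimal sketch of such instrumentation (hypothetical, not PerceptIn's actual measurement harness):

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Accumulates wall-clock time per named group of layers, e.g.
// "conv/relu/concat" vs. "pooling/soft-max". Hypothetical sketch.
struct GroupTimer {
    std::vector<std::pair<std::string, double>> totals;  // group -> total ms

    // Run `fn` and add its elapsed milliseconds to `group`'s total.
    void run(const std::string& group, const std::function<void()>& fn) {
        auto t0 = std::chrono::steady_clock::now();
        fn();
        auto t1 = std::chrono::steady_clock::now();
        double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
        for (auto& p : totals)
            if (p.first == group) { p.second += ms; return; }
        totals.emplace_back(group, ms);
    }
};
```

Each layer invocation in the inference loop would be wrapped in `timer.run("group1", ...)` or `timer.run("group2", ...)`, and the accumulated totals give the per-group times behind a chart like Figure 1.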
In this case study, PerceptIn shared their experience of building a deep learning inference engine on an embedded platform using the ARM Compute Library. ACL helped PerceptIn save significant development cost while achieving better performance. We welcome more developers to share their experiences and achievements using ACL in their embedded deep learning and AI products.
Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis, A., Dean, J., Devin, M. and Ghemawat, S., 2016. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467.
Iandola, F.N., Han, S., Moskewicz, M.W., Ashraf, K., Dally, W.J. and Keutzer, K., 2016. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360.