The Arm ML Research Lab explores cutting-edge techniques and state-of-the-art algorithms. One of our primary research thrusts is investigating ways to bring more machine learning applications to Arm's products and to make existing applications more efficient. Machine learning algorithms, and neural networks (NNs) in particular, are increasingly deployed on very constrained devices. Popular applications include visual and speech interfaces in smart-home devices, predictive maintenance for commercial and industrial machines, and health monitoring in wearable devices. However, due to the energy, power, storage, and compute limitations of these devices, they are frequently limited to simplistic tasks, while more sophisticated requests are off-loaded to a more capable device or to a server. Often, even off-loading is not practical because of the additional latency and energy costs, severely limiting the intelligence available on constrained devices.
Two of our projects at Arm, however, have significantly advanced the state of the art in reducing the complexity and size of neural networks, bringing us closer to running more complex tasks on such devices.
The size of typical RNN layers can prohibit deployment of these networks on IoT devices, or degrade their execution efficiency on devices with small caches. An LSTM-based keyword spotting network, for example, has a model size of 243KB, which exceeds the amount of SRAM available in many microcontroller-based systems and even exceeds the total storage capacity of some. Thus, there is a need for a compression technique that can drastically compress RNN layers without sacrificing task accuracy.
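To make the storage pressure concrete, here is a rough parameter count for a single LSTM layer (the dimensions below are illustrative, not the exact keyword spotting configuration):

```python
# Rough LSTM model-size estimate (illustrative dimensions only).
# An LSTM layer has 4 gates, each with an input projection,
# a recurrent projection, and a bias vector.
def lstm_param_count(input_dim, hidden_dim):
    return 4 * (input_dim * hidden_dim + hidden_dim * hidden_dim + hidden_dim)

input_dim, hidden_dim = 40, 128           # assumed feature and state sizes
params = lstm_param_count(input_dim, hidden_dim)
size_kb = params * 4 / 1024               # assuming 4 bytes per fp32 weight
print(f"{params} parameters ~= {size_kb:.0f} KB")   # ~338 KB for these dims
```

Even this modest configuration comfortably exceeds a 32KB or 64KB SRAM budget, and the 4*hidden*hidden recurrent weights dominate the total, which is why compressing the recurrent matrices matters most.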
Research on neural network compression can be roughly categorized under four topics: Pruning, Structured Matrix-based Techniques, Tensor Decomposition, and Quantization (see 'References' for examples). However, our results show that these popular compression techniques can lead to significant losses in accuracy when attempting to compress networks by a factor of 15 or more. Additionally, a compression technique should not sacrifice run-time during inference, as some of these applications can have hard real-time deadlines. In this work, we provide an alternative to the traditional approaches that can achieve these objectives: Kronecker product RNNs.
Given an m×n matrix B and a p×q matrix C, their Kronecker product A = B ⊗ C is the mp×nq block matrix whose (i, j) block is b_ij·C, where b_ij is the (i, j) entry of B; in other words, each entry of B scales a full copy of C. B and C are referred to as the Kronecker factors of A. We use Kronecker factors to replace the recurrent weight matrices of RNNs. Prior work on this topic has used a chain of 2x2 Kronecker factors to reach the desired weight matrix size. This, however, leads to a dramatic increase in run-time.
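As a rough illustration of the storage savings, here is a minimal numpy sketch (the factor shapes below are arbitrary):

```python
import numpy as np

# Two small Kronecker factors stand in for one large recurrent weight matrix.
# The shapes here are arbitrary; in practice they would be chosen from the
# factorization of the weight matrix dimensions.
B = np.random.randn(8, 16)
C = np.random.randn(16, 8)

A = np.kron(B, C)           # the full 128 x 128 matrix the factors represent

stored = B.size + C.size    # parameters actually kept: 256
full = A.size               # parameters of the uncompressed matrix: 16384
print(A.shape, f"compression = {full / stored:.0f}x")   # (128, 128), 64x
```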
Our technique restricts the number of Kronecker factors to two and uses the prime factors of the weight matrix dimensions to make the factor matrices as large as possible, thereby reducing the amount of computation required at inference. Results on an LSTM-based keyword spotting workload are shown below, where this technique achieved a 25x reduction in model size:
Here, the Kronecker technique is compared against other state-of-the-art compression techniques and against an alternative version of the baseline built with fewer parameters. All of these points assume a 25x compression of the model. While the Kronecker technique is slower than the alternatives, it is still faster than the baseline and preserves the most accuracy.
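The reason two large factors keep inference fast is that the full recurrent weight matrix never needs to be materialized: the identity (B ⊗ C)·x = vec(C·X·Bᵀ), where X is x reshaped (column-major) into a matrix, lets the multiply run directly on the small factors, at roughly O(pqn + pnm) cost instead of O(mp·nq). A minimal numpy sketch with illustrative shapes:

```python
import numpy as np

B = np.random.randn(8, 16)     # m x n factor
C = np.random.randn(16, 8)     # p x q factor
x = np.random.randn(16 * 8)    # hidden-state vector of length n * q

# Naive: materialize the full 128 x 128 matrix, then multiply.
y_full = np.kron(B, C) @ x

# Kronecker identity: (B kron C) x = vec(C @ X @ B.T),
# where X is x reshaped so that x is the column-major vec of X.
X = x.reshape(B.shape[1], C.shape[1]).T    # shape (q, n)
y_fast = (C @ X @ B.T).T.reshape(-1)       # column-major vec of the result

print(np.allclose(y_full, y_fast))         # True
```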
One restriction of the Kronecker technique is that the amount of compression cannot be “user-driven” as it depends on the dimensions of the factor matrices. We propose an additional “hybrid” Kronecker technique in our work, which gives a model designer flexibility in the amount of compression they want for a target model accuracy.
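As a sketch of the general idea behind such a hybrid layer (hypothetical shapes and split point, not the exact formulation from our paper): keep a small slice of the weight matrix dense and represent the remaining rows as a Kronecker product, so that the size of the dense slice becomes the knob trading parameters against accuracy.

```python
import numpy as np

# Hypothetical hybrid split: the first k rows of a 128 x 128 recurrent matrix
# stay dense, the remaining rows come from a Kronecker product of two small
# factors. Increasing k raises the parameter count (less compression) and,
# typically, the achievable accuracy.
k = 16
W_dense = np.random.randn(k, 128)        # unconstrained rows
B = np.random.randn(7, 16)               # (128 - k) = 112 = 7 * 16 rows
C = np.random.randn(16, 8)
W_kron = np.kron(B, C)                   # 112 x 128 block

W = np.vstack([W_dense, W_kron])         # full 128 x 128 weight matrix
params = W_dense.size + B.size + C.size
print(W.shape, params, f"vs dense {128 * 128}")   # 2288 vs 16384 parameters
```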
The Kronecker and Hybrid Kronecker techniques can be used as drop-in replacements for most RNN layers, and can be used selectively based on the target accuracy, model size and execution time. Our experiments show compression ratios from 10x to 38x using these techniques with minimal accuracy loss, enabling more and richer applications to run on constrained platforms.
The Kronecker version of the keyword spotting application has a model size of only 16kB, compared to the original 244kB. We were also able to reduce a human activity recognition model from 1.5MB to just 75kB – a 28x reduction. A USPS digit-recognition model that was already small at 8kB shrinks to under 2kB. Making complex networks smaller, and small networks smaller still, enables new applications to run on Arm microcontrollers and enables those that already do to run on smaller, more power-efficient ones.
Pruning and matrix decomposition are popular ways to reduce model size, but certain neural architecture design choices can have a significant impact as well. This is certainly the case with the depth-wise separable convolution popularized by the MobileNet family of networks. The authors of the keyword-spotting study that introduced the DS-CNN network also found that using depth-wise separable convolutions was the best way to get the most accurate, fastest, and smallest keyword-spotting model compared to several other architectures.
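To see where the savings come from, a depth-wise separable convolution replaces one large filter bank with a per-channel spatial filter followed by a 1x1 pointwise convolution that mixes channels. A rough parameter count with arbitrary layer sizes (illustrative only, biases ignored):

```python
# Parameter count of a standard conv vs. a depth-wise separable conv
# (arbitrary layer sizes for illustration).
k, c_in, c_out = 3, 64, 64          # 3x3 kernel, 64 -> 64 channels

standard = k * k * c_in * c_out     # one 3x3 filter per (input, output) pair
depthwise = k * k * c_in            # one 3x3 filter per input channel
pointwise = 1 * 1 * c_in * c_out    # 1x1 conv mixes channels
separable = depthwise + pointwise

print(standard, separable, f"{standard / separable:.1f}x fewer parameters")
# 36864 4672 ~7.9x
```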
But can we go further?
Two recent techniques – Bonsai Trees and StrassenNets – have shown impressive results in reducing model size and computation. We applied each of these techniques individually to the most efficient DS-CNN depth-wise separable network, as used in our Ternary Hybrid Neural-tree Network paper.
What we initially found was somewhat discouraging. Using StrassenNets resulted in a dramatic reduction in model size, thanks to its use of ternary weights, but led to a significant increase in the amount of computation required. Bonsai Trees had the opposite effect – a reduction in execution time but an increase in model size.
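To see why ternary weights shrink a model so dramatically: each weight takes one of three values {-s, 0, +s} and can therefore be stored in 2 bits (plus a per-layer scale) instead of 32. The snippet below is a generic threshold-based ternarization for illustration only, not the StrassenNets training procedure itself:

```python
import numpy as np

def ternarize(w, threshold_ratio=0.05):
    """Map weights to {-s, 0, +s}: a generic magnitude-threshold scheme,
    shown for illustration, not the StrassenNets training procedure."""
    t = threshold_ratio * np.max(np.abs(w))        # zero out small weights
    mask = np.abs(w) > t
    scale = np.abs(w[mask]).mean() if mask.any() else 0.0
    return scale * np.sign(w) * mask

w = np.random.randn(64, 64).astype(np.float32)
w_t = ternarize(w)

fp32_bytes = w.size * 4
ternary_bytes = w.size * 2 / 8 + 4                 # 2 bits per weight + one fp32 scale
print(f"{fp32_bytes} B -> {ternary_bytes:.0f} B "
      f"({fp32_bytes / ternary_bytes:.0f}x smaller), "
      f"quantization error = {np.abs(w - w_t).mean():.3f}")
```

The catch is that the ternary factorization trades multiplications for a larger number of additions, which is the source of the computation increase we observed.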
We observed that different parts of the DS-CNN network responded differently to these two techniques. This led us to the HybridNet network architecture, shown below, that combines both of these ideas.
We preserved much of the original neural network in the feature-extraction part of DS-CNN and optimized it with the StrassenNets technique. We then optimized the classification layers using Bonsai decision trees.
This new network architecture led to a 30% reduction in the overall memory footprint of the network and a 12% reduction in the amount of computation, while reducing model accuracy by only 0.3%.
HybridNet illustrates how combining several known techniques in new ways can reach new levels of optimization. More details on this work can be found in our SysML 2019 paper, 'Ternary Hybrid Neural-tree Networks for Highly Constrained IoT Applications'. The Arm ML Research group is continuing to improve on these techniques and looking for ways to reduce the overheads even further.
The Cortex-M4F is frequently used in chips with 32kB or less of data RAM. Reducing the memory footprint of keyword spotting and other similar workloads will allow us to deploy machine learning on hundreds more SoCs and innumerable more IoT devices.
Our Kronecker and HybridNet work illustrates the two-pronged approach we are taking to TinyML at Arm, and the significant optimization gains achieved so far. So, what’s next? Our team is focusing both on uncovering new algorithms for aggressively optimizing ML models and on combining a variety of these algorithms to best address the diverse landscape of neural networks. We’re excited to drive these improvements forward and pave the way for continued innovation across the Arm ML ecosystem.
This blog post was authored by the Arm ML Research Lab.
References

Pruning
Soravit Changpinyo, Mark Sandler, and Andrey Zhmoginov. The power of sparsity in convolutional neural networks. CoRR, abs/1702.06257, 2017.
Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. International Conference on Learning Representations (ICLR), 2016.
Sharan Narang, Eric Undersander, and Gregory F. Diamos. Block-sparse recurrent neural networks. CoRR, abs/1711.02782, 2017.
Michael Zhu and Suyog Gupta. To prune, or not to prune: exploring the efficacy of pruning for model compression. arXiv preprint arXiv:1710.01878, October 2017.
Structured Matrix-based Techniques
Caiwen Ding, Ao Ren, Geng Yuan, Xiaolong Ma, Jiayu Li, Ning Liu, Bo Yuan, and Yanzhi Wang. Structured weight matrices-based hardware accelerators in deep neural networks: FPGAs and ASICs. In Proceedings of the 2018 on Great Lakes Symposium on VLSI, GLSVLSI ’18, pages 353–358, New York, NY, USA, 2018. ACM.
Vikas Sindhwani, Tara Sainath, and Sanjiv Kumar. Structured transforms for small-footprint deep learning. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 3088–3096. Curran Associates, Inc., 2015.
Tensor Decomposition
V. Lebedev, Y. Ganin, M. Rakhuba, I. Oseledets, and V. Lempitsky. Speeding-up Convolutional Neural Networks Using Fine-tuned CP-Decomposition. arXiv e-prints, December 2014.
Giuseppe Giovanni Calvi, Ahmad Moniri, Mahmoud Mahfouz, Zeyang Yu, Qibin Zhao, and Danilo P. Mandic. Tucker tensor layer in fully connected neural networks. CoRR, abs/1903.06133, 2019.
Andros Tjandra, Sakriani Sakti, and Satoshi Nakamura. Compressing recurrent neural network with tensor train. In Neural Networks (IJCNN), 2017 International Joint Conference on, pages 4451–4458. IEEE, 2017.
Ting Chen, Ji Lin, Tian Lin, Song Han, Chong Wang, and Denny Zhou. Adaptive mixture of low-rank factorizations for compact neural modeling. Advances in neural information processing systems (CDNNRIA workshop), 2018.
Artem M. Grachev, Dmitry I. Ignatov, and Andrey V. Savchenko. Neural networks compression for language modeling. In B. Uma Shankar, Kuntal Ghosh, Deba Prasad Mandal, Shubhra Sankar Ray, David Zhang, and Sankar K. Pal, editors, Pattern Recognition and Machine Intelligence, pages 351–357, Cham, 2017. Springer International Publishing.
Quantization
Qinyao He, He Wen, Shuchang Zhou, Yuxin Wu, Cong Yao, Xinyu Zhou, and Yuheng Zou. Effective quantization methods for recurrent neural networks. CoRR, abs/1611.10176, 2016.
Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. J. Mach. Learn. Res., 18(1):6869–6898, January 2017.
Zhouhan Lin, Matthieu Courbariaux, Roland Memisevic, and Yoshua Bengio. Neural networks with few multiplications. arXiv e-prints, abs/1510.03009, October 2015.
Vincent Vanhoucke, Andrew Senior, and Mark Z. Mao. Improving the speed of neural networks on CPUs. In Deep Learning and Unsupervised Feature Learning Workshop, NIPS 2011, 2011.
Chenzhuo Zhu, Song Han, Huizi Mao, and William J. Dally. Trained ternary quantization. CoRR, abs/1612.01064, 2016.
Other References
Oleksii Kuchaiev and Boris Ginsburg. Factorization tricks for LSTM networks. CoRR, abs/1703.10722, 2017.
Cijo Jose, Moustapha Cissé, and François Fleuret. Kronecker recurrent units. CoRR, abs/1705.10142, 2017.