Researchers have studied neural network compression for quite some time. However, the need for always-on compute has led to a recent trend towards executing these applications on even smaller IoT devices. These devices can have total system memory of 1MB to as low as few kilobytes [3]. Yet, the neural networks that are traditionally used to run these applications can be huge, and fitting them on TinyML devices requires significant compression. For example, to efficiently run Long Short-Term Memory (LSTM) layers of size 25 MB [5] on devices with 1MB cache requires a minimum of 25x compression. Running the Recurrent Neural Network (RNN)-based key-word spotting application in [6] on 2KB caches can require 25-38x compression.
Traditional techniques, pruning & low-rank matrix factorization (LMF), achieve these high compression rates at the cost of significant accuracy loss [1]. This is because such high compression rates lead to matrices in the compressed network with poor characteristics (rank-loss, ill-conditioning). As a result, we need to rethink the way we design these architectures to achieve these compression factors. We at Arm have been working on one such structure based on Kronecker Products [1] [2]. This work complements the prior work by Google [4] and Microsoft [3] in this domain. The preliminary results have been exciting – we are able to compress IoT workloads by 16-38x and a large language model by 25x using variations of Kronecker Products. Our techniques outperform the accuracy that is achieved by traditional compression techniques and previous state-of-the-art by a large margin.
In this post, we walk through our work in developing efficient architectures for resource constrained devices. We first introduce Kronecker Products and talk about the best methodology to apply Kronecker product compression to sequence-based neural networks for IoT domain. Next, we talk about how to use Kronecker Products to compress larger models. This requires a slight tweak in the Kronecker Product Architecture. We call this tweak ‘doping’ and the resultant technique as Doped Kronecker Product. Finally, we discuss the best learning methodology to train Doped Kronecker Product. This includes over-coming what we refer to as co-matrix adaptation by using a stochastic regularization scheme that we call co-matrix row dropout regularization.
Kronecker product, denoted by ⊗, is an operation on two matrices of arbitrary size resulting in a bigger matrix. If A is an m × n matrix and B is a p × q matrix, then the Kronecker product, C, of A and B is given by
where C is a mp × nq block matrix.
We call A and B as the Kronecker Factors (KFs) of C. C can have multiple such Kronecker factors, that is, C can be expressed as A_{1}⊗A_{2}⊗…⊗A_{n.}
Now that we understand what Kronecker Products looks like, lets deep dive into how inference works when a matrix is expressed as a KP of multiple smaller KFs. Using techniques from Linear Algebra, we can avoid reconstructing the larger matrix during inference, this saves computation, that is. If,
where y is a (mp x 1) vector and x is a (nq x 1) input vector, then
Here, the matrix() function is a transpose and reshape operation and vec() function is a transpose and vectorization function. When the number of KF is > 3, this formula Is applied recursively.
Our first work explored the following questions
Our research indicates that the number of such matrices should be limited to two [1]. This is because larger number of KFs lead to vanishing gradient issues during training and issues that are related to slower inference run-time. Further, [1] also provides the algorithm for maximum compression while maximizing the rank of the matrix after compression. Our results on IoT based sequential networks were promising. KP can compress these networks by 16-38x while achieving 9.6% better accuracy than the pruned networks and 4.5% better accuracy than LMF compressed networks. We evaluate this network on 5 benchmarks spanning 3 IoT based applications. Further, KP compressed networks achieve 1.27-1.73x better runtime than the baseline network. Some of these results are presented in the following table. If you find these results exciting, we encourage you to read [1] to dive deeper into this compression technique.
Compression Technique
Benchmark Name
Attribute
Baseline
Small Baseline
Pruning
LMF
KP
HAR1
Compression Factor
1x
20x
29x
28x
30x
Accuracy (%)
91.9
88.9
82.9
89.9
91.2
Runtime (ms)
470
30
98
64
157
KWS
16x
24x
21x
25x
92.5
89.7
84.9
89.1
26.8
1.9
5.9
4.1
17.5
We tried extending KP to large language models (LM) from [5]. However, this led to almost 35% accuracy loss. This loss came at 338x compression. Traditional compression techniques like pruning and LMF can introduce more parameters into the network to get better accuracy that is, decrease the compression factor to achieve better accuracy. Pruning does this by decreasing the sparsity of the network, while LMF does this by increasing the rank of the matrix. There is no obvious analogy to this in the Kronecker World. In [1] we show one such technique, Hybrid KP. However, this technique achieves iso-accuracy for the LM benchmark at around 5x compression, reducing the compressive capabilities of KP. The next immediate issue that we decided to tackle was to explore a method to inject parameters into a KP compressed network, without sacrificing the compressive capabilities of KP.
To understand why KP compression does not scale to larger network, we focus on how gradient flows through a KP compressed network during backpropagation. When we focus on the gradient flow, we realize that parameters in the KP space need additional degrees of freedom (See Figure 1 a,b). Inspired by Robust PCA techniques, we introduce a sparse overlay matrix to provide these additional degrees of freedom. We end-up combining different compression techniques, sparsity, and KP. We call this compression technique as doped Kronecker product (See Figure 1c).
Figure 1: (a) Example of a Kronecker Product of two matrices. (b) Issues with back-propagation through a matrix expressed as a KP of two smaller matrices. (c) Shows how doping solves the issues written in (b).
Training DKP networks is not trivial. While training these networks, we ran into what we call co-matrix adaptation. The two matrices (KP compressed matrix and the sparse matrix) co-adapt to each other. This happens during the initial phase of training when the overlay matrix is dense, and pruning has not begun. This leads to lost capacity that is, additional parameters do not lead to accuracy gain. To overcome this issue, we introduce stochasticity into the availability of these networks. This stochasticity was inspired by concepts of dropout and stochastic depth. In brevity, we do not discuss the technical details of these learning methodology and defer to the published paper [2]. We encourage the reader to read this paper to understand the intuitions and experiments that created the ideal training methodology for compression using doped Kronecker products.
We compress the LSTM layers of size 25MB in [5] using DKP and compare it with other state-of-the-art compression techniques. The results are indicated in the following table.
Test Perplexity
82.04
4-bit quantization [7]
8x
83.84
3-bit quantization [8]
10.67x
83.14
Tensor Train Decomposition [9]
1.67x
168.64
Weight Distortion with Pruning [10]
10x
84.64
Low-rank matrix factorization
114.29
HMD [11]
105.43
HKD [1]
99.88
Magnitude Pruning
85.14
Our work, Doped KP [2]
83.24
We showed two methods that can achieve significant compression and help deploy applications on tiny devices. These methods, which are derived from linear algebra, changed the network structure, and data-flow of a neural network architecture. This helped us achieve high compression factors while outperforming other traditional compression techniques and other state-of-the-art methods.
^{1}Compressing RNNs for IoT devices by 15-38x using Kronecker Products https://arxiv.org/abs/1906.02876^{2}Compressing Language Models using Doped Kronecker Products https://arxiv.org/abs/2001.08896^{3}Resource-efficient Machine Learning in 2 KB RAM for the Internet of Things https://github.com/Microsoft/EdgeML/wiki/Bonsai ^{4}ProjectionNet: Learning Efficient On-Device Deep Networks Using Neural Projections https://arxiv.org/abs/1708.00630^{5}Recurrent Neural Network Regularization https://arxiv.org/abs/1409.2329^{6}Hello Edge: Keyword Spotting on Microcontrollers https://arxiv.org/abs/1711.07128^{7}Weighted-Entropy-based Quantization for Deep Neural Networks https://ieeexplore.ieee.org/document/8100244^{8}Retraining-Based Iterative Weight Quantization for Deep Neural Networks https://arxiv.org/pdf/1805.11233.pdf^{9}Compression of Recurrent Neural Networks for Efficient Language Modeling https://www.sciencedirect.com/science/article/abs/pii/S1568494619301851^{10}DeepTwist: Learning Model Compression via Occasional Weight Distortion https://openreview.net/forum?id=HJzLdjR9FX^{11}Run-Time Efficient RNN Compression for Inference on Edge Devices https://arxiv.org/abs/1906.04886