TinyML Applications Require New Network Architectures

Urmish Thakker
February 13, 2020
7 minute read time.

Why TinyML?

Researchers have studied neural network compression for quite some time. More recently, the need for always-on compute has driven a trend towards executing neural network applications on ever smaller IoT devices. These devices can have total system memory ranging from 1 MB down to a few kilobytes [3]. Yet the neural networks traditionally used to run these applications can be huge, and fitting them on TinyML devices requires significant compression. For example, efficiently running the Long Short-Term Memory (LSTM) layers of size 25 MB from [5] on a device with a 1 MB cache requires a minimum of 25x compression. Running the Recurrent Neural Network (RNN)-based keyword spotting application in [6] on a 2 KB cache can require 25-38x compression.

Why should we rethink network architectures?

Traditional techniques such as pruning and low-rank matrix factorization (LMF) achieve these high compression rates only at the cost of significant accuracy loss [1]. This is because such aggressive compression leaves the matrices in the compressed network with poor characteristics (rank loss, ill-conditioning). As a result, we need to rethink the way we design these architectures to achieve these compression factors. At Arm, we have been working on one such structure based on Kronecker Products [1] [2]. This work complements prior work by Google [4] and Microsoft [3] in this domain. The preliminary results have been exciting: we are able to compress IoT workloads by 16-38x and a large language model by 25x using variations of Kronecker Products, and our techniques outperform both traditional compression techniques and the previous state of the art in accuracy by a large margin.

Overview

In this post, we walk through our work on developing efficient architectures for resource-constrained devices. We first introduce Kronecker Products and discuss the best methodology for applying Kronecker Product compression to sequence-based neural networks in the IoT domain. Next, we show how to use Kronecker Products to compress larger models. This requires a slight tweak to the Kronecker Product architecture; we call this tweak 'doping' and the resulting technique the Doped Kronecker Product. Finally, we discuss the best learning methodology for training Doped Kronecker Products, which includes overcoming what we refer to as co-matrix adaptation using a stochastic regularization scheme that we call co-matrix row dropout regularization.

Kronecker Products (KP)

The Kronecker product, denoted by ⊗, is an operation on two matrices of arbitrary size that results in a larger matrix. If A is an m × n matrix and B is a p × q matrix, then the Kronecker product C of A and B is given by
C = A ⊗ B

where C is an mp × nq block matrix:

C = \begin{bmatrix} a_{11}B & \cdots & a_{1n}B \\ \vdots & \ddots & \vdots \\ a_{m1}B & \cdots & a_{mn}B \end{bmatrix}

We call A and B the Kronecker Factors (KFs) of C. C can have multiple such Kronecker Factors; that is, C can be expressed as A1 ⊗ A2 ⊗ … ⊗ An.
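
To make the definition concrete, the following NumPy sketch (with illustrative sizes, not taken from any of the benchmarks in this post) shows the shape of a Kronecker product and the parameter saving from storing the two KFs instead of the full matrix.

```python
import numpy as np

m, n = 3, 4   # A is m x n
p, q = 2, 5   # B is p x q

A = np.random.randn(m, n)
B = np.random.randn(p, q)

# C = A kron B is an (m*p) x (n*q) block matrix whose (i, j) block is A[i, j] * B.
C = np.kron(A, B)
assert C.shape == (m * p, n * q)

# Storing the factors keeps m*n + p*q values instead of (m*p) * (n*q).
print("Parameters in C:", C.size, "vs. in A and B:", A.size + B.size)
```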

Now that we understand what Kronecker Products look like, let's dive into how inference works when a matrix is expressed as a KP of multiple smaller KFs. Using standard linear algebra identities, we can avoid reconstructing the larger matrix during inference, which saves computation. If

y = (A ⊗ B) x

where y is an mp × 1 vector and x is an nq × 1 input vector, then

y = (A ⊗ B) x = vec(B · matrix(x) · Aᵀ)

Here, matrix(·) reshapes the vector x into a q × n matrix and vec(·) flattens the resulting p × m matrix back into a vector (the exact reshape and transpose conventions depend on the chosen storage order). When the number of KFs is more than two, this formula is applied recursively.
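
The sketch below (NumPy, illustrative sizes) checks this identity numerically: the product with the large matrix is computed using only the two small factors. Note that with NumPy's default row-major reshape the identity takes the equivalent form y = vec(A · matrix(x) · Bᵀ).

```python
import numpy as np

m, n, p, q = 3, 4, 2, 5
A = np.random.randn(m, n)   # first Kronecker factor
B = np.random.randn(p, q)   # second Kronecker factor
x = np.random.randn(n * q)  # input vector

# Naive path: materialize the (m*p) x (n*q) matrix, then multiply.
y_naive = np.kron(A, B) @ x

# Efficient path: reshape x into a small matrix (the matrix() step), multiply by
# the two factors, and flatten back into a vector (the vec() step).
X = x.reshape(n, q)                 # row-major reshape of x
y_fast = (A @ X @ B.T).reshape(-1)  # row-major flattening of the result

assert np.allclose(y_naive, y_fast)
print("Identity verified; the large matrix is never built on the fast path.")
```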

Best Practices to compress networks using Kronecker Products

Our first work explored the following questions:

  1. Into how many Kronecker Factors should we decompose a matrix?
  2. What is the impact on inference run-time using this decomposition technique?
  3. What are the compression factors achieved after compression using KP?
  4. How does the accuracy of these compressed networks compare against traditional compression techniques?

Our research indicates that the number of such factors should be limited to two [1], because a larger number of KFs leads to vanishing-gradient issues during training and slower inference run-time. Further, [1] provides an algorithm that achieves maximum compression while maximizing the rank of the compressed matrix. Our results on IoT-oriented sequential networks were promising: KP can compress these networks by 16-38x while achieving 9.6% better accuracy than pruned networks and 4.5% better accuracy than LMF-compressed networks. We evaluated these networks on 5 benchmarks spanning 3 IoT applications. Further, KP-compressed networks achieve 1.27-1.73x better runtime than the baseline network. Some of these results are presented in the following table. If you find these results exciting, we encourage you to read [1] to dive deeper into this compression technique.

 

 

| Benchmark | Attribute | Baseline | Small Baseline | Pruning | LMF | KP |
|---|---|---|---|---|---|---|
| HAR1 | Compression Factor | 1x | 20x | 29x | 28x | 30x |
| HAR1 | Accuracy (%) | 91.9 | 88.9 | 82.9 | 89.9 | 91.2 |
| HAR1 | Runtime (ms) | 470 | 30 | 98 | 64 | 157 |
| KWS | Compression Factor | 1x | 16x | 24x | 21x | 25x |
| KWS | Accuracy (%) | 92.5 | 89.7 | 84.9 | 89.1 | 91.2 |
| KWS | Runtime (ms) | 26.8 | 1.9 | 5.9 | 4.1 | 17.5 |
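
As a back-of-the-envelope illustration of where per-matrix compression factors of this magnitude come from, the sketch below (illustrative sizes, not the actual benchmark networks) counts parameters before and after replacing a weight matrix with two Kronecker Factors.

```python
def kp_compression_factor(m, n, p, q):
    """Compression from storing an (m*p) x (n*q) matrix as A (m x n) kron B (p x q)."""
    full_params = (m * p) * (n * q)
    kp_params = m * n + p * q
    return full_params / kp_params

# Example: a 256 x 256 weight matrix expressed as (16 x 16) kron (16 x 16)
# stores 512 values instead of 65,536.
print(kp_compression_factor(16, 16, 16, 16))  # 128.0
# Whole-network factors are typically lower than a single matrix's factor,
# since not every parameter in the network is compressed.
```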

Extending KP Compression to larger networks

We tried extending KP to the large language model (LM) from [5]. However, this led to almost 35% accuracy loss at 338x compression. Traditional compression techniques like pruning and LMF can introduce more parameters into the network to recover accuracy, that is, they can decrease the compression factor to achieve better accuracy: pruning does this by decreasing the sparsity of the network, while LMF does it by increasing the rank of the matrix. There is no obvious analogue to this in the Kronecker world. In [1] we show one such technique, Hybrid KP. However, this technique reaches iso-accuracy on the LM benchmark only at around 5x compression, reducing the compressive capabilities of KP. The next issue we decided to tackle was therefore how to inject parameters into a KP-compressed network without sacrificing the compressive capabilities of KP.

Doped Kronecker Products (DKP)

To understand why KP compression does not scale to larger networks, we examine how gradients flow through a KP-compressed network during backpropagation. The gradient flow reveals that parameters in the KP space need additional degrees of freedom (see Figure 1 a, b). Inspired by Robust PCA techniques, we introduce a sparse overlay matrix to provide these additional degrees of freedom. In effect, we combine two compression techniques, sparsity and KP. We call this compression technique the doped Kronecker product (see Figure 1c).


Figure 1: (a) Example of a Kronecker Product of two matrices. (b) Issues with back-propagation through a matrix expressed as a KP of two smaller matrices. (c) How doping addresses the issues described in (b).
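
The sketch below is a minimal PyTorch illustration of the doping idea, not the authors' implementation: the effective weight is the Kronecker product of two small factors plus a sparse overlay matrix. For simplicity the sparsity pattern here is a fixed random mask chosen up front, whereas [2] prunes the overlay during training; the layer name, sizes, and overlay density are assumptions for illustration.

```python
import torch
import torch.nn as nn

class DopedKroneckerLinear(nn.Module):
    """Hypothetical layer: W = (A kron B) + sparse overlay S (the 'dope')."""

    def __init__(self, m, n, p, q, overlay_density=0.01):
        super().__init__()
        self.A = nn.Parameter(0.1 * torch.randn(m, n))    # first Kronecker factor
        self.B = nn.Parameter(0.1 * torch.randn(p, q))    # second Kronecker factor
        self.S = nn.Parameter(torch.zeros(m * p, n * q))  # overlay providing extra degrees of freedom
        # Fixed binary mask keeping roughly `overlay_density` of the overlay entries.
        self.register_buffer("mask", (torch.rand(m * p, n * q) < overlay_density).float())

    def forward(self, x):
        # Effective weight; inference could instead use the reshape identity shown
        # earlier for the Kronecker term plus a sparse mat-vec for the overlay.
        W = torch.kron(self.A, self.B) + self.S * self.mask
        return x @ W.t()

layer = DopedKroneckerLinear(m=16, n=16, p=16, q=16)
out = layer(torch.randn(8, 16 * 16))  # batch of 8 inputs of length n*q
print(out.shape)                      # torch.Size([8, 256])
```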

Training DKP Networks

Training DKP networks is not trivial. While training these networks, we ran into what we call co-matrix adaptation: the two matrices (the KP-compressed matrix and the sparse overlay matrix) co-adapt to each other. This happens during the initial phase of training, when the overlay matrix is still dense and pruning has not begun, and it leads to lost capacity; that is, the additional parameters do not translate into accuracy gains. To overcome this issue, we introduce stochasticity into the availability of the overlay parameters, inspired by the concepts of dropout and stochastic depth. For brevity, we do not discuss the technical details of this learning methodology here and instead defer to the published paper [2]. We encourage the reader to read that paper to understand the intuitions and experiments behind the training methodology for compression using Doped Kronecker Products.
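
As a rough sketch of the row-dropout idea (the function name and its exact placement in training are assumptions; see [2] for the actual recipe), entire rows of the overlay matrix can be zeroed at random during training so that the Kronecker-product term cannot co-adapt to any particular overlay row.

```python
import torch

def co_matrix_row_dropout(overlay: torch.Tensor, drop_prob: float, training: bool) -> torch.Tensor:
    """Randomly zero out whole rows of the overlay matrix during training."""
    if not training or drop_prob == 0.0:
        return overlay
    keep = (torch.rand(overlay.shape[0], 1, device=overlay.device) >= drop_prob).float()
    # Rescale kept rows so the overlay's expected contribution stays unchanged.
    return overlay * keep / (1.0 - drop_prob)

# Inside the forward pass of the DopedKroneckerLinear sketch above, one could use:
#   W = torch.kron(self.A, self.B) + co_matrix_row_dropout(self.S * self.mask, 0.5, self.training)
```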

Compressing Large Language Model using DKP

We compress the 25 MB LSTM layers from [5] using DKP and compare the result with other state-of-the-art compression techniques. The results are shown in the following table.

| Compression Technique | Compression Factor | Test Perplexity |
|---|---|---|
| Baseline | 1x | 82.04 |
| 4-bit quantization [7] | 8x | 83.84 |
| 3-bit quantization [8] | 10.67x | 83.14 |
| Tensor Train Decomposition [9] | 1.67x | 168.64 |
| Weight Distortion with Pruning [10] | 10x | 84.64 |
| Low-rank matrix factorization | 20x | 114.29 |
| HMD [11] | 20x | 105.43 |
| HKD [1] | 20x | 99.88 |
| Magnitude Pruning | 20x | 85.14 |
| Our work, Doped KP [2] | 25x | 83.24 |

Conclusions

We presented two methods that achieve significant compression and help deploy applications on tiny devices. These methods, derived from linear algebra, change the network structure and data flow of a neural network architecture. This allows us to achieve high compression factors while outperforming both traditional compression techniques and other state-of-the-art methods.

References

[1] Compressing RNNs for IoT devices by 15-38x using Kronecker Products, https://arxiv.org/abs/1906.02876
[2] Compressing Language Models using Doped Kronecker Products, https://arxiv.org/abs/2001.08896
[3] Resource-efficient Machine Learning in 2 KB RAM for the Internet of Things, https://github.com/Microsoft/EdgeML/wiki/Bonsai
[4] ProjectionNet: Learning Efficient On-Device Deep Networks Using Neural Projections, https://arxiv.org/abs/1708.00630
[5] Recurrent Neural Network Regularization, https://arxiv.org/abs/1409.2329
[6] Hello Edge: Keyword Spotting on Microcontrollers, https://arxiv.org/abs/1711.07128
[7] Weighted-Entropy-based Quantization for Deep Neural Networks, https://ieeexplore.ieee.org/document/8100244
[8] Retraining-Based Iterative Weight Quantization for Deep Neural Networks, https://arxiv.org/pdf/1805.11233.pdf
[9] Compression of Recurrent Neural Networks for Efficient Language Modeling, https://www.sciencedirect.com/science/article/abs/pii/S1568494619301851
[10] DeepTwist: Learning Model Compression via Occasional Weight Distortion, https://openreview.net/forum?id=HJzLdjR9FX
[11] Run-Time Efficient RNN Compression for Inference on Edge Devices, https://arxiv.org/abs/1906.04886
