Researchers have studied neural network compression for quite some time. However, the need for always-on compute has led to a recent trend towards executing neural network applications on ever smaller IoT devices. These devices can have total system memory ranging from 1 MB down to a few kilobytes [3]. Yet, the neural networks traditionally used to run these applications can be huge, and fitting them on TinyML devices requires significant compression. For example, efficiently running the Long Short-Term Memory (LSTM) layers of size 25 MB in [5] on a device with a 1 MB cache requires a minimum of 25x compression. Running the Recurrent Neural Network (RNN)-based key-word spotting application in [6] on a 2 KB cache can require 25-38x compression.
Traditional techniques such as pruning and low-rank matrix factorization (LMF) achieve these high compression rates only at the cost of significant accuracy loss [1]. This is because such aggressive compression leads to matrices in the compressed network with poor characteristics (rank loss, ill-conditioning). As a result, we need to rethink how we design these architectures to reach such compression factors. We at Arm have been working on one such structure based on Kronecker Products [1][2]. This work complements prior work by Google [4] and Microsoft [3] in this domain. The preliminary results have been exciting: we are able to compress IoT workloads by 16-38x and a large language model by 25x using variations of Kronecker Products. Our techniques outperform the accuracy achieved by traditional compression techniques and the previous state-of-the-art by a large margin.
In this post, we walk through our work on developing efficient architectures for resource-constrained devices. We first introduce Kronecker Products and discuss the best methodology for applying Kronecker Product compression to sequence-based neural networks in the IoT domain. Next, we discuss how to use Kronecker Products to compress larger models. This requires a slight tweak to the Kronecker Product architecture; we call this tweak 'doping' and the resulting technique the Doped Kronecker Product. Finally, we discuss the best learning methodology for training Doped Kronecker Product networks. This includes overcoming what we refer to as co-matrix adaptation by using a stochastic regularization scheme that we call co-matrix row dropout regularization.
The Kronecker product, denoted by ⊗, is an operation on two matrices of arbitrary size that produces a larger matrix. If A is an m × n matrix and B is a p × q matrix, then the Kronecker product C of A and B is the mp × nq block matrix obtained by multiplying every element of A with the entire matrix B, as written out below.
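In block form (this is the standard definition):

```latex
C = A \otimes B =
\begin{bmatrix}
a_{11} B & a_{12} B & \cdots & a_{1n} B \\
a_{21} B & a_{22} B & \cdots & a_{2n} B \\
\vdots   & \vdots   & \ddots & \vdots   \\
a_{m1} B & a_{m2} B & \cdots & a_{mn} B
\end{bmatrix}
```

For example, with two small 2 × 2 factors:

```latex
\begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}
\otimes
\begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}
=
\begin{bmatrix}
0 & 1 & 0 & 2 \\
1 & 0 & 2 & 0 \\
0 & 3 & 0 & 4 \\
3 & 0 & 4 & 0
\end{bmatrix}
```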
We call A and B the Kronecker Factors (KFs) of C. C can have multiple such Kronecker factors; that is, C can be expressed as A1 ⊗ A2 ⊗ … ⊗ An.
Now that we understand what a Kronecker Product looks like, let's dive into how inference works when a matrix is expressed as a KP of multiple smaller KFs. Using techniques from linear algebra, we can avoid reconstructing the larger matrix during inference, which saves computation. If

y = C x = (A ⊗ B) x,

where y is an (mp × 1) output vector and x is an (nq × 1) input vector, then

y = vec(B · matrix(x) · Aᵀ).

Here, the matrix() function is a transpose and reshape operation and the vec() function is a transpose and vectorization operation. When a matrix has more than two KFs, this formula is applied recursively.
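To make this concrete, here is a minimal NumPy sketch of the identity above. This is not the code from [1], just an illustration; it assumes column-major (Fortran-order) reshape and vectorization conventions, which is why the explicit transposes appear:

```python
import numpy as np

def kron_matvec(A, B, x):
    """Compute y = (A kron B) @ x without forming the full Kronecker product.

    Uses the identity (A ⊗ B) vec(X) = vec(B X A^T), where vec() stacks
    columns and X is the q x n matrix whose column-major flattening is x.
    Cost: O(pqn + pnm) multiplies instead of O(mp * nq) for the full matrix.
    """
    m, n = A.shape
    p, q = B.shape
    X = x.reshape(n, q).T        # q x n matrix with column-major vec(X) == x
    Y = B @ X @ A.T              # p x m intermediate result
    return Y.T.reshape(-1)       # column-major vec(Y): an (m*p,) output vector

# Sanity check against an explicitly materialized Kronecker product
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 6))   # m x n Kronecker factor
B = rng.standard_normal((5, 3))   # p x q Kronecker factor
x = rng.standard_normal(6 * 3)    # nq-dimensional input
assert np.allclose(kron_matvec(A, B, x), np.kron(A, B) @ x)
```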
Our first work explored two questions: how many Kronecker factors should a matrix be decomposed into, and how should the dimensions of those factors be chosen?
Our research indicates that the number of such matrices should be limited to two [1]. This is because a larger number of KFs leads to vanishing-gradient issues during training and to slower inference run-times. Further, [1] also provides an algorithm for achieving maximum compression while maximizing the rank of the compressed matrix. Our results on IoT-based sequential networks were promising: KP can compress these networks by 16-38x while achieving 9.6% better accuracy than pruned networks and 4.5% better accuracy than LMF-compressed networks. We evaluated these networks on 5 benchmarks spanning 3 IoT applications. Further, KP-compressed networks achieve 1.27-1.73x better runtime than the baseline network. Some of these results are presented in the following table. If you find these results exciting, we encourage you to read [1] to dive deeper into this compression technique.
| Benchmark Name | Attribute | Baseline | Small Baseline | Pruning | LMF | KP |
|---|---|---|---|---|---|---|
| HAR1 | Compression Factor | 1x | 20x | 29x | 28x | 30x |
| HAR1 | Accuracy (%) | 91.9 | 88.9 | 82.9 | 89.9 | 91.2 |
| HAR1 | Runtime (ms) | 470 | 30 | 98 | 64 | 157 |
| KWS | Compression Factor | 1x | 16x | 24x | 21x | 25x |
| KWS | Accuracy (%) | 92.5 | 89.7 | 84.9 | 89.1 | – |
| KWS | Runtime (ms) | 26.8 | 1.9 | 5.9 | 4.1 | 17.5 |
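As a side note on where such compression factors come from: storing a weight matrix of shape (mp) × (nq) as two KFs of shapes m × n and p × q replaces mn·pq parameters with only mn + pq. The shapes below are made up purely to illustrate the arithmetic; they are not the ones used in [1]:

```python
# Hypothetical shapes, chosen only to illustrate the arithmetic.
m, n = 16, 16   # shape of Kronecker factor A
p, q = 16, 16   # shape of Kronecker factor B

full_params = (m * p) * (n * q)   # 256 x 256 matrix    -> 65,536 parameters
kp_params = m * n + p * q         # two 16 x 16 factors ->    512 parameters

print(f"Compression factor: {full_params / kp_params:.0f}x")  # 128x
```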
We tried extending KP to the large language model (LM) from [5]. However, this led to an almost 35% accuracy loss, at a compression factor of 338x. Traditional compression techniques like pruning and LMF can add parameters back into the network to recover accuracy; that is, they trade a lower compression factor for better accuracy. Pruning does this by decreasing the sparsity of the network, while LMF does this by increasing the rank of the factorized matrices. There is no obvious analog to this in the Kronecker world. In [1] we show one such technique, Hybrid KP. However, Hybrid KP achieves iso-accuracy on the LM benchmark only at around 5x compression, sacrificing much of the compressive capability of KP. The next issue we decided to tackle was therefore to find a way to inject parameters into a KP-compressed network without sacrificing the compressive capabilities of KP.
To understand why KP compression does not scale to larger networks, we focus on how gradients flow through a KP-compressed network during backpropagation. When we examine the gradient flow, we realize that parameters in the KP space need additional degrees of freedom (see Figure 1a, b). Inspired by robust PCA techniques, we introduce a sparse overlay matrix to provide these additional degrees of freedom. In effect, we combine two different compression techniques, sparsity and KP. We call this compression technique the doped Kronecker product (DKP) (see Figure 1c).
Figure 1: (a) Example of a Kronecker Product of two matrices. (b) Issues with back-propagation through a matrix expressed as a KP of two smaller matrices. (c) How doping resolves the issues described in (b).
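As a rough illustration of this structure (a minimal sketch under our own assumptions about shapes and sparse storage, not the implementation from [2]), a doped-KP layer expresses its weight matrix as W = A ⊗ B + S, where S is the sparse overlay matrix holding the extra "doped" parameters:

```python
import numpy as np

def doped_kp_matvec(A, B, s_vals, s_rows, s_cols, x):
    """Forward pass of a doped-KP weight: y = (A kron B + S) @ x.

    A, B are the dense Kronecker factors; the sparse overlay S is stored in
    coordinate (COO) form as (s_vals, s_rows, s_cols).
    """
    m, n = A.shape
    p, q = B.shape
    # Kronecker-product part, computed without materializing A kron B
    X = x.reshape(n, q).T
    y = (B @ X @ A.T).T.reshape(-1)
    # Sparse "doping" part: accumulate S @ x using only the nonzero entries
    np.add.at(y, s_rows, s_vals * x[s_cols])
    return y

# Tiny sanity check against the dense formulation
rng = np.random.default_rng(1)
A, B = rng.standard_normal((8, 8)), rng.standard_normal((8, 8))
x = rng.standard_normal(64)
s_rows, s_cols = np.array([0, 5, 63]), np.array([3, 60, 7])
s_vals = rng.standard_normal(3)
dense = np.kron(A, B)
dense[s_rows, s_cols] += s_vals
assert np.allclose(doped_kp_matvec(A, B, s_vals, s_rows, s_cols, x), dense @ x)
```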
Training DKP networks is not trivial. While training these networks, we ran into what we call co-matrix adaptation: the two matrices (the KP-compressed matrix and the sparse matrix) co-adapt to each other. This happens during the initial phase of training, when the overlay matrix is still dense and pruning has not yet begun. It leads to lost capacity; that is, the additional parameters do not translate into accuracy gains. To overcome this issue, we introduce stochasticity into the availability of these matrices during training, inspired by the concepts of dropout and stochastic depth. For brevity, we do not discuss the technical details of this learning methodology here and defer to the published paper [2]. We encourage the reader to read that paper to understand the intuitions and experiments behind the training methodology for compression using doped Kronecker products.
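To give a flavour of what such a stochastic scheme can look like, here is a highly simplified sketch of row-level dropout applied to the contributions of the two co-matrices. This is our own illustrative approximation of the idea; the actual co-matrix row dropout regularization scheme is specified in [2]:

```python
import numpy as np

def co_matrix_row_dropout(y_kp, y_sparse, drop_prob, rng, training=True):
    """Combine the outputs of the two co-matrices with per-row stochasticity.

    y_kp     : contribution of the KP-compressed matrix, (A kron B) @ x
    y_sparse : contribution of the sparse overlay matrix, S @ x
    During training, each co-matrix's contribution to each output row is
    independently dropped with probability drop_prob, so the two matrices
    cannot simply rely on each other; inverted-dropout scaling keeps the
    expected output unchanged. At inference, both contributions are kept.
    """
    if not training:
        return y_kp + y_sparse
    keep_kp = rng.random(y_kp.shape) > drop_prob   # per-row mask for KP part
    keep_sp = rng.random(y_sparse.shape) > drop_prob
    return (keep_kp * y_kp + keep_sp * y_sparse) / (1.0 - drop_prob)
```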
We compress the 25 MB LSTM layers of [5] using DKP and compare the result with other state-of-the-art compression techniques. The results are shown in the following table.
| Compression Technique | Compression Factor | Test Perplexity |
|---|---|---|
| Baseline | 1x | 82.04 |
| 4-bit quantization [7] | 8x | 83.84 |
| 3-bit quantization [8] | 10.67x | 83.14 |
| Tensor Train Decomposition [9] | 1.67x | 168.64 |
| Weight Distortion with Pruning [10] | 10x | 84.64 |
| Low-rank matrix factorization | – | 114.29 |
| HMD [11] | – | 105.43 |
| HKD [1] | – | 99.88 |
| Magnitude Pruning | – | 85.14 |
| Our work, Doped KP [2] | 25x | 83.24 |
We have shown two methods that achieve significant compression and help deploy applications on tiny devices. These methods, derived from linear algebra, change the structure and data-flow of the neural network architecture. This helped us achieve high compression factors while outperforming both traditional compression techniques and other state-of-the-art methods.
[1] Compressing RNNs for IoT devices by 15-38x using Kronecker Products. https://arxiv.org/abs/1906.02876
[2] Compressing Language Models using Doped Kronecker Products. https://arxiv.org/abs/2001.08896
[3] Resource-efficient Machine Learning in 2 KB RAM for the Internet of Things. https://github.com/Microsoft/EdgeML/wiki/Bonsai
[4] ProjectionNet: Learning Efficient On-Device Deep Networks Using Neural Projections. https://arxiv.org/abs/1708.00630
[5] Recurrent Neural Network Regularization. https://arxiv.org/abs/1409.2329
[6] Hello Edge: Keyword Spotting on Microcontrollers. https://arxiv.org/abs/1711.07128
[7] Weighted-Entropy-based Quantization for Deep Neural Networks. https://ieeexplore.ieee.org/document/8100244
[8] Retraining-Based Iterative Weight Quantization for Deep Neural Networks. https://arxiv.org/pdf/1805.11233.pdf
[9] Compression of Recurrent Neural Networks for Efficient Language Modeling. https://www.sciencedirect.com/science/article/abs/pii/S1568494619301851
[10] DeepTwist: Learning Model Compression via Occasional Weight Distortion. https://openreview.net/forum?id=HJzLdjR9FX
[11] Run-Time Efficient RNN Compression for Inference on Edge Devices. https://arxiv.org/abs/1906.04886