Much of the progress in Machine Learning (ML) over the past few years has been achieved using deep neural networks. To achieve state-of-the-art accuracy, researchers have designed ever larger networks, which require significant memory and computational power to deliver their stated performance. For example, ResNet50 has roughly 25 million parameters and takes about 100MB of memory to store on device. Other networks, such as VGG16, require around 500MB.
On a desktop computer, the platform of choice for neural network designers and researchers, this is not a problem. However, when you want to deploy your network to a mobile device and run inference at anything close to real time, these networks are often simply too large to deploy directly. One solution might be to run the network in the cloud and deliver the results to your device. However, this approach has its own limitations: it requires a reliable network connection, adds round-trip latency, raises privacy concerns when user data leaves the device, and incurs cloud compute costs.
One of the simplest methods to reduce the size and memory footprint of a model is pruning. Pruning involves removing connections, or convolution filters, from a network to reduce its size and complexity. Analogous to the pruning of trees or shrubs to remove dead or overgrown branches, neural network pruning aims to remove parts of the network that are redundant and contribute the least to the final predictions.
In use cases where you want to spend your hardware resources wisely, pruning is an effective tool to modify your network. This ensures that you achieve the best accuracy under the given constraints (for example, latency, memory footprint, compute power, and so on). This technical blog describes some techniques to optimize your convolutional neural network (CNN) for a given use case and target IP. By utilizing these techniques, it is possible to take a large off-the-shelf network and deploy it on your Arm-based mobile device without having to spend the time and effort to create a custom architecture.
The simplest and most common method to prune a neural network is weight pruning. Weight pruning involves removing individual connections between neural network layers to increase the sparsity (the number of zeroed weights), thereby reducing the number of non-zero parameters. For example, to increase the sparsity of the model in figure 1 to 50 percent, you must remove 9 of the 18 connections.
Figure 1: Weight Pruning
There are two main “hyperparameters” to understand when deciding how to prune the weights of your neural network. The first is the pruning criterion, or metric: the rule that looks at all the connections and weights and decides which ones to remove. The most common example is magnitude-based pruning, where the weights with the lowest absolute values are removed.
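As an illustration, here is a minimal NumPy sketch (not tied to any particular framework) of how a magnitude-based criterion selects which weights to zero out:

```python
import numpy as np

def magnitude_prune(weights, target_sparsity):
    """Zero out the lowest-magnitude weights until the target sparsity is reached."""
    flat = np.abs(weights).flatten()
    k = int(target_sparsity * flat.size)          # number of weights to remove
    threshold = np.sort(flat)[k]                  # magnitude below which weights are dropped
    mask = np.abs(weights) >= threshold
    return weights * mask, mask

# Example: prune 50 percent of an 18-connection layer, as in figure 1.
weights = np.random.randn(3, 6)
pruned_weights, mask = magnitude_prune(weights, target_sparsity=0.5)
print(f"Sparsity: {1.0 - mask.mean():.2f}")       # prints 0.50
```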
The second decision is what pruning schedule to use. The simplest schedule is a one-shot approach, where you go from your starting sparsity (that is, 0 percent) to your target sparsity (50 percent in the example shown in figure 1) in one go. While this method is the simplest, it has been shown that better results can be achieved with an iterative schedule.
In an iterative schedule, you prune a few weights at each iteration. You then perform a short-term fine-tune (retraining the pruned network using only a small subset of the training data) at each step. This is repeated until you reach your target sparsity. The fine-tune enables the model to recover its accuracy and place more importance on weights that are most useful for the final prediction.
Figure 2: Weight Pruning Algorithm
Weight pruning is simple to implement, and many ML frameworks offer straightforward APIs for it. One example is the TensorFlow Model Optimization Toolkit. This API is built on top of Keras and allows you to apply weight pruning in just a few lines of code.
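Here is a minimal sketch of what that looks like with the TensorFlow Model Optimization Toolkit. The model, step count, and (commented-out) training data are placeholders; MobileNetV2 is used only as a convenient stand-in for your own network:

```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Iteratively increase sparsity from 0 to 50 percent over the fine-tuning run.
pruning_schedule = tfmot.sparsity.keras.PolynomialDecay(
    initial_sparsity=0.0,
    final_sparsity=0.5,
    begin_step=0,
    end_step=1000)            # placeholder: total number of fine-tuning steps

model = tf.keras.applications.MobileNetV2(weights=None)   # stand-in for your own model
pruned_model = tfmot.sparsity.keras.prune_low_magnitude(
    model, pruning_schedule=pruning_schedule)

pruned_model.compile(optimizer='adam',
                     loss='sparse_categorical_crossentropy',
                     metrics=['accuracy'])

# UpdatePruningStep applies the pruning masks as training progresses.
callbacks = [tfmot.sparsity.keras.UpdatePruningStep()]
# pruned_model.fit(train_images, train_labels, epochs=2, callbacks=callbacks)

# Remove the pruning wrappers before exporting the model.
final_model = tfmot.sparsity.keras.strip_pruning(pruned_model)
```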
Rather than test weight pruning on the standard image classification use case, we take a look at something more interesting – background segmentation. The network in question is a U-Net with a MobileNetV2 encoder and transpose convolutions as the decoder. The model has been trained on a dataset of human portraits, where the aim is to separate the foreground from the background. This is effectively a binary segmentation problem, where each pixel is classified as either foreground or background.
Applying the weight pruning API to the U-Net with a target sparsity of 40 percent, the accuracy drops from 95.71 percent to 91.65 percent. The compression ratio (after exporting the models to TFLite, then zipping) is roughly 38 percent. In other words, the model now takes up 38 percent less space on your device, with only a small drop in accuracy.
Baseline Model Accuracy: 0.9571
Pruned Model Accuracy: 0.9165
Size of gzipped original TFlite model: 3613521 bytes
Size of gzipped pruned TFlite model: 2243745 bytes
Model compression rate (after pruning): 37.91%
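For reference, a compression figure like the one above can be measured by converting each model to TFLite and comparing the gzip-compressed sizes. Here is a minimal sketch; the baseline_model and final_model variables are placeholders, with the pruning wrappers already stripped from the pruned model:

```python
import gzip
import tensorflow as tf

def gzipped_tflite_size(keras_model):
    """Convert a Keras model to TFLite and return its gzip-compressed size in bytes."""
    tflite_bytes = tf.lite.TFLiteConverter.from_keras_model(keras_model).convert()
    return len(gzip.compress(tflite_bytes))

# baseline_size = gzipped_tflite_size(baseline_model)
# pruned_size = gzipped_tflite_size(final_model)
# print(f"Model compression rate: {1 - pruned_size / baseline_size:.2%}")
```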
These results highlight the key advantage of weight pruning: compression tools can take advantage of the sparse weight matrices to compress the model more efficiently. At inference time, however, there is no change to the performance of the network. As yet, mobile CPUs and GPUs cannot easily take advantage of the sparse representation of weights, so the model executes exactly as if it had not been pruned (that is, a multiplication is still performed for every zeroed weight). Therefore, if you want to accelerate the performance of your deep neural network, another method is required.
Structured pruning, also known as channel pruning, is another pruning technique. It involves removing whole filters from convolutional layers, or whole neurons from fully connected layers.
Unlike weight pruning, this alters the structure of the model and, simply put, makes it smaller. As a result, all mobile CPUs and GPUs can take advantage of the smaller model and run inference faster.
As shown in figure 3, a whole neuron has been removed from the model, along with its weights. Like weight pruning, this reduces the size of the model on disk. Unlike weight pruning, however, the model structure itself is now smaller, which means there are fewer multiply-accumulate operations (MACs). Therefore, the time taken to run inference is reduced, and the memory bandwidth required is lower.
Figure 3: Structured Pruning
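To see why fewer MACs follow directly from removing filters, consider a quick back-of-the-envelope calculation for a convolution layer (the dimensions below are made up purely for illustration):

```python
def conv_macs(out_h, out_w, kernel, in_channels, out_channels):
    """Multiply-accumulate count of a standard convolution layer."""
    return out_h * out_w * kernel * kernel * in_channels * out_channels

# A 3x3 convolution producing a 56x56 feature map, with 64 input and 128 output channels.
full = conv_macs(56, 56, 3, 64, 128)

# Structured pruning removes 32 of the 128 filters from this layer...
pruned = conv_macs(56, 56, 3, 64, 96)
# ...and the next layer now only sees 96 input channels instead of 128,
# so its MAC count shrinks as well.

print(f"MACs saved in this layer: {1 - pruned / full:.0%}")   # 25%
```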
The general algorithm flow is very similar to that of weight pruning. The only difference is that whole neurons are removed in the case of a fully connected layer, and whole filters are removed in the case of a convolutional layer.
Figure 4: Structured Pruning Algorithm
Figure 4 shows the clear benefits of structured pruning over weight pruning. So why is this method not better known? When you talk to developers and engineers about pruning, it is often assumed that you mean weight pruning. One likely explanation, explored below, is that structured pruning is harder to implement: removing a filter changes the shape of downstream layers, and off-the-shelf framework support is far more limited than for weight pruning.
So, structured pruning: how does it work in practice? In the example below, we see how structured pruning works and why it is more challenging than weight pruning. Consider the simple network shown in figure 5. It contains two convolution operations, one with six filters and one with three.
Figure 5: Simple Convolutional Neural Network
If your metric chooses to remove the second filter, because it is deemed the least important, then the corresponding channel in the output is also removed (highlighted in red in figure 6). This means that the number of input channels to the second convolution has been reduced.
Figure 6: Example of Structured Pruning to remove one filter.
This is quite a simple example, but you can see how removing a filter in one layer will affect the input of downstream layers.
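The bookkeeping can be seen directly on the weight tensors. Below is a NumPy sketch using the same six- and three-filter layers as figure 5, assuming a Keras-style kernel layout of (height, width, input channels, output channels):

```python
import numpy as np

conv1 = np.random.randn(3, 3, 3, 6)   # first convolution: 6 filters
conv2 = np.random.randn(3, 3, 6, 3)   # second convolution: expects 6 input channels

filter_to_remove = 1                  # the second filter, as in figure 6

# Remove the filter from CONV1...
conv1_pruned = np.delete(conv1, filter_to_remove, axis=3)
# ...and the corresponding input channel from CONV2.
conv2_pruned = np.delete(conv2, filter_to_remove, axis=2)

print(conv1_pruned.shape)   # (3, 3, 3, 5)
print(conv2_pruned.shape)   # (3, 3, 5, 3)
```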
Let us have a look at something more complicated.
Figure 7: More Complicated Example of Structured Pruning
Let us say that your metric has looked at all the convolution layers in figure 7 and found that the filter highlighted in red is the least important, so you remove it. However, if you do this and try to rebuild your model, you run into an issue: the element-wise addition requires the same number of channels from both of its inputs.
To overcome this issue, you could choose to prune the corresponding filter in CONV2 (highlighted in blue). However, your metric has not accounted for this filter, so removing it could severely impact the accuracy of your model. As deep neural networks increasingly rely on residual connections to improve training, this problem hinders the effectiveness of structured pruning.
However, this issue only arises when pruning layers that feed directly into an element-wise addition. Therefore, it is possible to simply mark these layers as non-prunable and work with the remaining layers. As long as enough prunable layers remain, you can still prune effectively.
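One way to find those layers in a Keras functional model is to walk the model configuration and flag any layer whose output feeds an element-wise Add. This is only a sketch: MobileNetV2 is used here purely as a convenient example of a network with residual connections, and the inbound-node format assumed below is the one used by the TF2 Keras serializer, which may differ between versions:

```python
import tensorflow as tf

model = tf.keras.applications.MobileNetV2(weights=None)
config = model.get_config()

# Collect the names of layers whose outputs feed an element-wise Add layer,
# so they can be marked as non-prunable and residual branches keep matching shapes.
non_prunable = set()
for layer_cfg in config['layers']:
    if layer_cfg['class_name'] == 'Add':
        for node in layer_cfg['inbound_nodes']:
            for inbound in node:      # inbound = [layer_name, node_index, tensor_index, kwargs]
                non_prunable.add(inbound[0])

print(f"{len(non_prunable)} layers feed residual additions and are left unpruned")
```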
As with weight pruning, there are many different metrics for ranking the importance of filters. Let us have a look at two of the simpler ones, L1-norm and Average Percentage of Zeros (APoZ) [1], to get a flavor of how such metrics are implemented.
The L1-norm is one of the simplest metrics: it is effectively just the mean of the absolute values of the weights in the filter. The idea is that the lower these values, the less the filter contributes to the final prediction.
APoZ [1] is a slightly more complicated metric. It involves measuring the output of the ReLU activation function for each convolution filter: you calculate the percentage of outputs that are zero and then average this over several different inputs. After running for a few iterations, you build up a picture of how often each filter is activated. If a filter has a high APoZ (that is, its output is more often zero), it does not contribute much to the final prediction and can be discarded.
Figure 8: L1-Norm and APoZ for Filters from a Layer of Mobilenet. You can see here that their values are not distributed evenly, and you can use this to determine which filters are the most/least important.
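Both metrics are straightforward to compute. Here is a small NumPy sketch for a single convolution layer, with random values standing in for real weights and ReLU activations:

```python
import numpy as np

def l1_norm_per_filter(kernel):
    """Mean absolute weight of each filter; kernel shape is (kh, kw, in_ch, out_ch)."""
    return np.abs(kernel).mean(axis=(0, 1, 2))

def apoz_per_filter(relu_outputs):
    """Average Percentage of Zeros per filter; outputs shape is (batch, h, w, out_ch)."""
    return (relu_outputs == 0).mean(axis=(0, 1, 2))

kernel = np.random.randn(3, 3, 16, 32)                          # stand-in weights
activations = np.maximum(np.random.randn(8, 28, 28, 32), 0)     # stand-in ReLU outputs

# Filters with the lowest L1-norm, or the highest APoZ, are candidates for removal.
print(np.argsort(l1_norm_per_filter(kernel))[:4])      # least important by L1-norm
print(np.argsort(apoz_per_filter(activations))[-4:])   # least important by APoZ (highest APoZ)
```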
The following figures show the accuracy of two well-known image classification networks after structured pruning has been applied.
Figure 9: Structured Pruning of InceptionV3, Trained on CIFAR10.
Figure 10: Structured Pruning of MobileNetV1, Trained on CIFAR10
As you can see, when applying structured pruning you can find parts of the network that are redundant and can be pruned away with minimal impact on accuracy. For example, with the InceptionV3 network, you can prune away roughly 40 percent of the network with only a 0.2 percent drop in accuracy.
Using the TFLite benchmark tool and running the network pruned to 40 percent sparsity on a Samsung Galaxy S7, the inference time drops from 76ms to 43ms, and the peak memory footprint drops from 10.5MB to 7.8MB. This highlights the clear advantage of structured pruning over weight pruning – improved performance at runtime.
With more tuning and a less aggressive pruning schedule, it would be possible to achieve even higher accuracy. But is there a method that lets us prune more efficiently?
The examples shown in figures 9 and 10 use a very simplistic approach to structured pruning: the metrics are quite basic, and no information about the underlying hardware is used to make the pruning decisions.
The metrics might be great at finding the most redundant filters, but it is simply assumed that removing those filters reduces latency; no thought is given to how much the latency is actually reduced. In this paper [2], the authors find that reducing the number of MACs, which happens as filters are removed, does not always decrease latency. Therefore, removing filters purely to reduce the MAC count is not always the optimal solution.
So, can we do better? Is there a method available that actually looks at the underlying hardware when making pruning decisions?
Well, one example is NetAdapt [2], from Google and MIT. Instead of relying on so-called indirect metrics (the number of filters, weights, or MACs), this method uses direct metrics, obtained through on-device measurements. These measurements are then used to determine which filters to remove.
The basic aim is to remove the filters that give the largest measured reduction in latency while causing the smallest drop in accuracy.
By doing so, it is possible to pick the right filters to remove far more accurately. And because you are taking on-device measurements, the resulting network is tailor-made to run fastest on your device.
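To make the idea concrete, here is a highly simplified, hypothetical sketch of one NetAdapt-style iteration. The latency lookup table, layer names, and the random "accuracy" values are all stand-ins: in practice, the lookup table is built from on-device measurements, and each proposal is short fine-tuned and evaluated on real data:

```python
import random

# Hypothetical per-layer latency lookup table (milliseconds per filter count),
# standing in for a table built from on-device measurements.
latency_lut = {('conv1', f): 0.05 * f for f in range(1, 65)}
latency_lut.update({('conv2', f): 0.05 * f for f in range(1, 65)})
latency_lut.update({('conv3', f): 0.08 * f for f in range(1, 129)})

layers = {'conv1': 64, 'conv2': 64, 'conv3': 128}   # current filter counts

def layer_latency(name, filters):
    return latency_lut[(name, filters)]

def netadapt_step(layers, latency_saving):
    """Prune each candidate layer enough to save `latency_saving` ms, keep the best proposal."""
    best_proposal, best_accuracy = None, -1.0
    for name, filters in layers.items():
        target = layer_latency(name, filters) - latency_saving
        new_filters = filters
        while new_filters > 1 and layer_latency(name, new_filters) > target:
            new_filters -= 1
        proposal = dict(layers, **{name: new_filters})
        accuracy = random.random()    # placeholder for a short fine-tune + evaluation
        if accuracy > best_accuracy:
            best_proposal, best_accuracy = proposal, accuracy
    return best_proposal

# Repeat, tightening the latency budget each iteration, until the overall target is met.
layers = netadapt_step(layers, latency_saving=0.5)
print(layers)
```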
Additionally, latency does not have to be the metric that guides pruning. In theory, you can use any metric, provided you can measure it. For example, if there were a way to measure the energy consumption of a convolution, you could aim to remove the filters that save the most energy while causing the smallest drop in accuracy.
In this way, you can prune with energy as a target, and find the subnetwork that consumes the smallest amount of energy for your target platform.
While measuring energy consumption directly is fairly complex, hardware counters can easily be used to measure the number of instructions required to complete a convolution. While not a like-for-like comparison, fewer instructions will generally mean less power and faster execution time.
HWCPipe is a piece of Arm software that gives you easy access to hardware counters on an Arm-powered device. The counters available include the number of CPU and GPU instructions, among many others (Arm has published separate blogs that explain GPU hardware counters in more detail).
By combining HWCPipe with the Arm Compute Library, it is easy to create an application for your mobile device that measures a convolution layer under different parameters (for example, kernel size and number of filters). These measurements can then be used with the NetAdapt algorithm.
So, how does this perform in practice? Applying the NetAdapt algorithm to a small MobileNetV1 (alpha = 0.5) trained on CIFAR10 gives the results shown in figure 11. You can see that NetAdapt significantly outperforms the simpler pruning techniques mentioned previously. Using instructions as the metric, rather than latency, gives a further slight improvement.
Figure 11: Pruning a Small MobileNetV1 with Different Structured Pruning Techniques. All measurements were taken on a mobile CPU (Cortex-A53). The multipliers correspond to changing the alpha value, which reduces the number of filters in each convolution.
Figure 12 shows what happens when applying this approach to the more interesting segmentation use case (using the instruction-count hardware counters instead of latency).
Figure 12: Pruning U-net with Different Structured Pruning Techniques. All measurements were taken on a mobile CPU (Cortex-A53)
As you can see, for this background segmentation use case, NetAdapt performs very well. You can get a 3x speedup for almost no drop in accuracy.
Pruning is a very active area of research, and for good reason. It is far more efficient for developers to prune a pre-existing state-of-the-art network to meet their platform's constraints than to spend the effort designing a custom architecture. Algorithms such as NetAdapt use empirical measurements to produce models that are tailor-made for both the use case and the platform, providing a way to spend your hardware resources more carefully.
But this is just one of many pruning algorithms. Others, such as Adversarial Neural Pruning [3], combine the concept of adversarial training with traditional pruning techniques. Self-Adaptive Network Pruning reduces the computational cost of a convolutional neural network with a Saliency-and-Pruning module that predicts saliency, or importance, scores for each convolutional layer.
And this is just the tip of the iceberg. Pruning is a hot topic in neural network optimization, and many more methods exist [4]. Combined with other techniques, such as quantization, pruning can make large networks suitable for deployment on a mobile device.
[CTAToken URL = "https://developer.arm.com/ip-products/processors/machine-learning" target="_blank" text="Learn more about delivering advanced ML" class ="green"]
1) Network Trimming: A Data-Driven Neuron Pruning Approach towards Efficient Deep Architectures - Hengyuan Hu, Rui Peng, Yu-Wing Tai, Chi-Keung Tang [link]
2) NetAdapt: Platform-Aware Neural Network Adaptation for Mobile Applications - Tien-Ju Yang, Andrew Howard, Bo Chen, Xiao Zhang, Alec Go, Mark Sandler, Vivienne Sze, Hartwig Adam [link]
3) Adversarial Neural Pruning with Latent Vulnerability Suppression – Divyam Madaan, Jinwoo Shin, Sung Ju Hwang [link]
4) What is the State of Neural Network Pruning? - Davis Blalock, Jose Javier Gonzalez Ortiz, Jonathan Frankle, John Guttag [link]