Much of the progress in Machine Learning (ML) over the past few years has been achieved using deep neural networks. To achieve state-of-the-art accuracy, researchers have designed ever larger networks, which require significant memory and computational power to deliver their stated performance. For example, ResNet50 has roughly 25 million parameters and takes about 100MB of memory to store on device. Other networks, such as VGG16, require around 500MB.
On a desktop computer, the platform of choice for neural network designers and researchers, this is not a problem. However, when you want to deploy your network to a mobile device and run inference at anything close to real time, these networks are often simply too large to deploy directly. One solution might be to run the network in the cloud and deliver the results to your device. However, this approach has its own limitations: it requires a reliable network connection, adds round-trip latency, raises privacy concerns when user data leaves the device, and incurs cloud compute costs.
One of the simplest methods to reduce the size and memory footprint of a model is pruning. Pruning involves removing connections, or convolution filters, from a network to reduce its size and complexity. Analogous to the pruning of trees or shrubs to remove dead or overgrown branches, neural network pruning aims to remove parts of the network that are redundant and contribute the least to the final predictions.
In use cases where you want to spend your hardware resources wisely, pruning is an effective tool to modify your network. This ensures that you achieve the best accuracy under the given constraints (for example, latency, memory footprint, compute power, and so on). This technical blog describes some techniques to optimize your convolutional neural network (CNN) for a given use case and target IP. By utilizing these techniques, it is possible to take a large off-the-shelf network and deploy it on your Arm-based mobile device without having to spend the time and effort to create a custom architecture.
The simplest and most common method to prune a neural network is weight pruning. Weight pruning involves removing individual connections between neural network layers to increase the sparsity (the number of zeroed weights), thereby reducing the number of non-zero parameters. For example, to increase the sparsity of the model in figure 1 to 50 percent, you must remove 9 of the 18 connections.
Figure 1: Weight Pruning
There are two main “hyperparameters” to understand when deciding how to prune the weights of your neural network. The first is the pruning criterion, or metric: the rule that looks at all the connections and weights and decides which ones to remove. The most common example is magnitude-based pruning, where the weights with the lowest absolute values are removed.
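As an illustration, here is a minimal NumPy sketch (not tied to any particular framework) of how a magnitude-based criterion selects which weights to zero out:

```python
import numpy as np

def magnitude_prune(weights, target_sparsity):
    """Zero out the lowest-magnitude weights until the target sparsity is reached."""
    flat = np.abs(weights).flatten()
    k = int(target_sparsity * flat.size)          # number of weights to remove
    threshold = np.sort(flat)[k]                  # magnitude below which weights are dropped
    mask = np.abs(weights) >= threshold
    return weights * mask, mask

# Example: prune 50 percent of an 18-connection layer, as in figure 1.
weights = np.random.randn(3, 6)
pruned_weights, mask = magnitude_prune(weights, target_sparsity=0.5)
print(f"Sparsity: {1.0 - mask.mean():.2f}")       # prints 0.50
```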
The second decision is what pruning schedule to use. The simplest schedule is a one-shot approach, where you go from your starting sparsity (that is, 0 percent) to your target sparsity (50 percent in the example shown in figure 1) in one go. While this method is the simplest, it has been shown that better results can be achieved with an iterative schedule.
In an iterative schedule, you prune a few weights at each iteration. You then perform a short-term fine-tune (retraining the pruned network using only a small subset of the training data) at each step. This is repeated until you reach your target sparsity. The fine-tune enables the model to recover its accuracy and place more importance on weights that are most useful for the final prediction.
Figure 2: Weight Pruning Algorithm
Weight pruning is simple to implement, and many ML frameworks offer straightforward APIs for it. One example is the TensorFlow Model Optimization Toolkit. This API is built on top of Keras and allows you to apply weight pruning in just a few lines of code.
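Here is a minimal sketch of what that looks like with the TensorFlow Model Optimization Toolkit. The model, step count, and (commented-out) training data are placeholders; MobileNetV2 is used only as a convenient stand-in for your own network:

```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Iteratively increase sparsity from 0 to 50 percent over the fine-tuning run.
pruning_schedule = tfmot.sparsity.keras.PolynomialDecay(
    initial_sparsity=0.0,
    final_sparsity=0.5,
    begin_step=0,
    end_step=1000)            # placeholder: total number of fine-tuning steps

model = tf.keras.applications.MobileNetV2(weights=None)   # stand-in for your own model
pruned_model = tfmot.sparsity.keras.prune_low_magnitude(
    model, pruning_schedule=pruning_schedule)

pruned_model.compile(optimizer='adam',
                     loss='sparse_categorical_crossentropy',
                     metrics=['accuracy'])

# UpdatePruningStep applies the pruning masks as training progresses.
callbacks = [tfmot.sparsity.keras.UpdatePruningStep()]
# pruned_model.fit(train_images, train_labels, epochs=2, callbacks=callbacks)

# Remove the pruning wrappers before exporting the model.
final_model = tfmot.sparsity.keras.strip_pruning(pruned_model)
```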
Rather than test weight pruning on the standard image classification use case, we take a look at something more interesting – background segmentation. The network in question is a U-Net with a MobileNetV2 encoder and transpose convolutions as the decoder. The model has been trained on a dataset of human portraits, where the aim is to separate the foreground from the background. This is effectively a binary segmentation problem, where each pixel is classified as either foreground or background.
Applying the weight pruning API to the U-Net with a target sparsity of 40 percent, the accuracy drops from 95.71 percent to 91.65 percent. The compression ratio (after exporting the models to TFLite, then zipping) is roughly 38 percent. In other words, the model now takes up 38 percent less space on your device, with only a small drop in accuracy.
Baseline Model Accuracy: 0.9571
Pruned Model Accuracy: 0.9165
Size of gzipped original TFlite model: 3613521 bytes
Size of gzipped pruned TFlite model: 2243745 bytes
Model compression rate (after pruning): 37.91%
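For reference, a compression figure like the one above can be measured by converting each model to TFLite and comparing the gzip-compressed sizes. Here is a minimal sketch; the baseline_model and final_model variables are placeholders, with the pruning wrappers already stripped from the pruned model:

```python
import gzip
import tensorflow as tf

def gzipped_tflite_size(keras_model):
    """Convert a Keras model to TFLite and return its gzip-compressed size in bytes."""
    tflite_bytes = tf.lite.TFLiteConverter.from_keras_model(keras_model).convert()
    return len(gzip.compress(tflite_bytes))

# baseline_size = gzipped_tflite_size(baseline_model)
# pruned_size = gzipped_tflite_size(final_model)
# print(f"Model compression rate: {1 - pruned_size / baseline_size:.2%}")
```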
These results highlight the key advantage of weight pruning: compression tools can take advantage of the sparse weight matrices to compress the model more efficiently. At inference time, however, there is no change to the performance of the network. As yet, mobile CPUs and GPUs cannot easily take advantage of the sparse representation of weights, so the model executes exactly as if it had not been pruned (that is, a multiplication is still performed for every zeroed weight). Therefore, if you want to accelerate the performance of your deep neural network, another method is required.
Structured pruning, also known as channel pruning, is another pruning technique. It involves removing whole filters from convolutional layers, or whole neurons from fully connected layers.
Unlike weight pruning, this alters the structure of the model and, simply put, makes it smaller. As a result, all mobile CPUs and GPUs can take advantage of the smaller model and run inference faster.
As shown in figure 3, a whole neuron has been removed from the model, along with its weights. Like weight pruning, this reduces the size of the model on disk. Unlike weight pruning, however, the model structure itself is now smaller, which means there are fewer multiply-accumulate operations (MACs). Therefore, the time taken to run inference is reduced, and the memory bandwidth required is lower.
Figure 3: Structured Pruning
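To see why fewer MACs follow directly from removing filters, consider a quick back-of-the-envelope calculation for a convolution layer (the dimensions below are made up purely for illustration):

```python
def conv_macs(out_h, out_w, kernel, in_channels, out_channels):
    """Multiply-accumulate count of a standard convolution layer."""
    return out_h * out_w * kernel * kernel * in_channels * out_channels

# A 3x3 convolution producing a 56x56 feature map, with 64 input and 128 output channels.
full = conv_macs(56, 56, 3, 64, 128)

# Structured pruning removes 32 of the 128 filters from this layer...
pruned = conv_macs(56, 56, 3, 64, 96)
# ...and the next layer now only sees 96 input channels instead of 128,
# so its MAC count shrinks as well.

print(f"MACs saved in this layer: {1 - pruned / full:.0%}")   # 25%
```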
The general algorithm flow is very similar to that of weight pruning. The only difference is that whole neurons are removed in the case of a fully connected layer, and whole filters are removed in the case of a convolutional layer.
Figure 4: Structured Pruning Algorithm
Figure 4 shows the clear benefits of structured pruning over weight pruning. So why is this method not better known? When you talk to developers and engineers about pruning, it is often assumed that you mean weight pruning. One likely explanation, explored below, is that structured pruning is harder to implement: removing a filter changes the shape of downstream layers, and off-the-shelf framework support is far more limited than for weight pruning.
So, structured pruning: how does it work in practice? In the example below, we see how structured pruning works and why it is more challenging than weight pruning. Consider the simple network shown in figure 5. It contains two convolution operations, one with six filters and one with three.
Figure 5: Simple Convolutional Neural Network
If your metric chooses to remove the second filter, because it is deemed the least important, then the corresponding channel in the output is also removed (highlighted in red in figure 6). This means that the number of input channels to the second convolution has been reduced.
Figure 6: Example of Structured Pruning to remove one filter.
This is quite a simple example, but you can see how removing a filter in one layer will affect the input of downstream layers.
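The bookkeeping can be seen directly on the weight tensors. Below is a NumPy sketch using the same six- and three-filter layers as figure 5, assuming a Keras-style kernel layout of (height, width, input channels, output channels):

```python
import numpy as np

conv1 = np.random.randn(3, 3, 3, 6)   # first convolution: 6 filters
conv2 = np.random.randn(3, 3, 6, 3)   # second convolution: expects 6 input channels

filter_to_remove = 1                  # the second filter, as in figure 6

# Remove the filter from CONV1...
conv1_pruned = np.delete(conv1, filter_to_remove, axis=3)
# ...and the corresponding input channel from CONV2.
conv2_pruned = np.delete(conv2, filter_to_remove, axis=2)

print(conv1_pruned.shape)   # (3, 3, 3, 5)
print(conv2_pruned.shape)   # (3, 3, 5, 3)
```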
Let us have a look at something more complicated.
Figure 7: More Complicated Example of Structured Pruning
Let us say that your metric has looked at all the convolution layers in figure 7 and found that the filter highlighted in red is the least important, so you remove it. However, if you do this and try to rebuild your model, you run into an issue: the element-wise addition requires the same number of channels from both of its inputs.
To overcome this issue, you could choose to prune the corresponding filter in CONV2 (highlighted in blue). However, your metric has not accounted for this filter, so removing it could severely impact the accuracy of your model. As deep neural networks increasingly rely on residual connections to improve training, this problem hinders the effectiveness of structured pruning.
However, this issue only arises when pruning layers that feed directly into an element-wise addition. Therefore, it is possible to simply mark these layers as non-prunable and work with the remaining layers. As long as enough prunable layers remain, you can still prune effectively.
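One way to find those layers in a Keras functional model is to walk the model configuration and flag any layer whose output feeds an element-wise Add. This is only a sketch: MobileNetV2 is used here purely as a convenient example of a network with residual connections, and the inbound-node format assumed below is the one used by the TF2 Keras serializer, which may differ between versions:

```python
import tensorflow as tf

model = tf.keras.applications.MobileNetV2(weights=None)
config = model.get_config()

# Collect the names of layers whose outputs feed an element-wise Add layer,
# so they can be marked as non-prunable and residual branches keep matching shapes.
non_prunable = set()
for layer_cfg in config['layers']:
    if layer_cfg['class_name'] == 'Add':
        for node in layer_cfg['inbound_nodes']:
            for inbound in node:      # inbound = [layer_name, node_index, tensor_index, kwargs]
                non_prunable.add(inbound[0])

print(f"{len(non_prunable)} layers feed residual additions and are left unpruned")
```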
As with weight pruning, there are many different metrics for ranking the importance of filters. Let us have a look at two of the simpler ones, L1-norm and Average Percentage of Zeros (APoZ) [1], to get a flavor of how such metrics are implemented.
The L1-norm is one of the simplest metrics: it is effectively just the mean of the absolute values of the weights in the filter. The idea is that the lower these values, the less the filter contributes to the final prediction.
APoZ [1] is a slightly more complicated metric. It involves measuring the output of the ReLU activation function for each convolution filter: you calculate the percentage of outputs that are zero and then average this over several different inputs. After running for a few iterations, you build up a picture of how often each filter is activated. If a filter has a high APoZ (that is, its output is more often zero), it does not contribute much to the final prediction and can be discarded.
Figure 8: L1-Norm and APoZ for Filters from a Layer of Mobilenet. You can see here that their values are not distributed evenly, and you can use this to determine which filters are the most/least important.
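Both metrics are straightforward to compute. Here is a small NumPy sketch for a single convolution layer, with random values standing in for real weights and ReLU activations:

```python
import numpy as np

def l1_norm_per_filter(kernel):
    """Mean absolute weight of each filter; kernel shape is (kh, kw, in_ch, out_ch)."""
    return np.abs(kernel).mean(axis=(0, 1, 2))

def apoz_per_filter(relu_outputs):
    """Average Percentage of Zeros per filter; outputs shape is (batch, h, w, out_ch)."""
    return (relu_outputs == 0).mean(axis=(0, 1, 2))

kernel = np.random.randn(3, 3, 16, 32)                          # stand-in weights
activations = np.maximum(np.random.randn(8, 28, 28, 32), 0)     # stand-in ReLU outputs

# Filters with the lowest L1-norm, or the highest APoZ, are candidates for removal.
print(np.argsort(l1_norm_per_filter(kernel))[:4])      # least important by L1-norm
print(np.argsort(apoz_per_filter(activations))[-4:])   # least important by APoZ (highest APoZ)
```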
The following figures show the accuracy of two well-known image classification networks after structured pruning has been applied.
Figure 9: Structured Pruning of InceptionV3, Trained on CIFAR10.
Figure 10: Structured Pruning of MobileNetV1, Trained on CIFAR10
As you can see, when applying structured pruning you can find parts of the network that are redundant and can be pruned away with minimal impact on accuracy. For example, with the InceptionV3 network, you can prune away roughly 40 percent of the network with only a 0.2 percent drop in accuracy.
Using the TFLite benchmark tool and running the network pruned to 40 percent sparsity on a Samsung Galaxy S7, the inference time drops from 76ms to 43ms, and the peak memory footprint drops from 10.5MB to 7.8MB. This highlights the clear advantage of structured pruning over weight pruning – improved performance at runtime.
With more tuning and a less aggressive pruning schedule, it would be possible to achieve even higher accuracy. But is there a method that lets us prune more efficiently?
The examples shown in figures 9 and 10 use a very simplistic approach to structured pruning: the metrics are quite basic, and no information about the underlying hardware is used to make the pruning decisions.
The metrics might be great at finding the most redundant filters, but it is simply assumed that removing those filters reduces latency; no thought is given to how much the latency is actually reduced. In this paper [2], the authors find that reducing the number of MACs, which happens as filters are removed, does not always decrease latency. Therefore, removing filters purely to reduce the MAC count is not always the optimal solution.
So, can we do better? Is there a method available that actually looks at the underlying hardware when making pruning decisions?
Well, one example is NetAdapt [2], from Google and MIT. Instead of relying on so-called indirect metrics (the number of filters, weights, or MACs), this method uses direct metrics, obtained through on-device measurements. These measurements are then used to determine which filters to remove.
The basic aim is to remove the filters that give the largest measured reduction in latency while causing the smallest drop in accuracy.
By doing so, it is possible to pick the right filters to remove far more accurately. And because you are taking on-device measurements, the resulting network is tailor-made to run fastest on your device.
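To make the idea concrete, here is a highly simplified, hypothetical sketch of one NetAdapt-style iteration. The latency lookup table, layer names, and the random "accuracy" values are all stand-ins: in practice, the lookup table is built from on-device measurements, and each proposal is short fine-tuned and evaluated on real data:

```python
import random

# Hypothetical per-layer latency lookup table (milliseconds per filter count),
# standing in for a table built from on-device measurements.
latency_lut = {('conv1', f): 0.05 * f for f in range(1, 65)}
latency_lut.update({('conv2', f): 0.05 * f for f in range(1, 65)})
latency_lut.update({('conv3', f): 0.08 * f for f in range(1, 129)})

layers = {'conv1': 64, 'conv2': 64, 'conv3': 128}   # current filter counts

def layer_latency(name, filters):
    return latency_lut[(name, filters)]

def netadapt_step(layers, latency_saving):
    """Prune each candidate layer enough to save `latency_saving` ms, keep the best proposal."""
    best_proposal, best_accuracy = None, -1.0
    for name, filters in layers.items():
        target = layer_latency(name, filters) - latency_saving
        new_filters = filters
        while new_filters > 1 and layer_latency(name, new_filters) > target:
            new_filters -= 1
        proposal = dict(layers, **{name: new_filters})
        accuracy = random.random()    # placeholder for a short fine-tune + evaluation
        if accuracy > best_accuracy:
            best_proposal, best_accuracy = proposal, accuracy
    return best_proposal

# Repeat, tightening the latency budget each iteration, until the overall target is met.
layers = netadapt_step(layers, latency_saving=0.5)
print(layers)
```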
Additionally, latency does not have to be the metric that guides pruning. In theory, you can use any metric, provided you can measure it. For example, if there were a way to measure the energy consumption of a convolution, you could aim to remove the filters that save the most energy while causing the smallest drop in accuracy.
In this way, you can prune with energy as a target, and find the subnetwork that consumes the smallest amount of energy for your target platform.
While measuring energy consumption directly is fairly complex, hardware counters can easily be used to measure the number of instructions required to complete a convolution. While not a like-for-like comparison, fewer instructions will generally mean less power and faster execution time.
HWCPipe is a piece of Arm software that gives you easy access to hardware counters on an Arm-powered device. The counters available include the number of CPU and GPU instructions, among many others (Arm has published separate blogs that explain GPU hardware counters in more detail).
By combining HWCPipe with the Arm Compute Library, it is easy to create an application for your mobile device that measures a convolution layer under different parameters (for example, kernel size and number of filters). These measurements can then be used with the NetAdapt algorithm.
So, how does this perform in practice? Applying the NetAdapt algorithm to a small MobileNetV1 (alpha = 0.5) trained on CIFAR10 gives the results shown in figure 11. You can see that NetAdapt significantly outperforms the simpler pruning techniques mentioned previously. Using instructions as the metric, rather than latency, gives a further slight improvement.
Figure 11: Pruning a Small MobileNetV1 with Different Structured Pruning Techniques. All measurements were taken on a mobile CPU (Cortex-A53). The multipliers correspond to changing the alpha value, which reduces the number of filters in each convolution.
Figure 12 shows what happens when applying this approach to the more interesting segmentation use case (using the instruction-count hardware counters instead of latency).
Figure 12: Pruning U-net with Different Structured Pruning Techniques. All measurements were taken on a mobile CPU (Cortex-A53)
As you can see, for this background segmentation use case, NetAdapt performs very well. You can get a 3x speedup for almost no drop in accuracy.
Pruning is a very active area of research, and for good reason. It is far more efficient for developers to prune a pre-existing state-of-the-art network to meet their platform's constraints than to spend the effort designing a custom architecture. Algorithms such as NetAdapt use empirical measurements to produce models that are tailor-made for both the use case and the platform, providing a way to spend your hardware resources more carefully.
But this is just one of many pruning algorithms. Others, such as Adversarial Neural Pruning [3], combine the concept of adversarial training with traditional pruning techniques. Self-Adaptive Network Pruning reduces the computational cost of a convolutional neural network with a Saliency-and-Pruning module that predicts saliency, or importance, scores for each convolutional layer.
And this is just the tip of the iceberg. Pruning is a hot topic in neural network optimization, and many more methods exist [4]. Combined with other techniques, such as quantization, pruning can make large networks suitable for deployment on a mobile device.
[CTAToken URL = "https://developer.arm.com/ip-products/processors/machine-learning" target="_blank" text="Learn more about delivering advanced ML" class ="green"]
1) Network Trimming: A Data-Driven Neuron Pruning Approach towards Efficient Deep Architectures - Hengyuan Hu, Rui Peng, Yu-Wing Tai, Chi-Keung Tang [link]
2) NetAdapt: Platform-Aware Neural Network Adaptation for Mobile Applications - Tien-Ju Yang, Andrew Howard, Bo Chen, Xiao Zhang, Alec Go, Mark Sandler, Vivienne Sze, Hartwig Adam [link]
3) Adversarial Neural Pruning with Latent Vulnerability Suppression – Divyam Madaan, Jinwoo Shin, Sung Ju Hwang [link]
4) What is the State of Neural Network Pruning? - Davis Blalock, Jose Javier Gonzalez Ortiz, Jonathan Frankle, John Guttag [link]