Recurrent Neural Networks (RNNs) are an important class of algorithms. They are used in tasks where the order of the input conveys information, for example, natural language processing (NLP) and time-series data. Increasingly, these networks are being deployed on resource-constrained devices with limited cache and compute resources. At Arm Research’s ML Lab, we have been exploring ways to efficiently deploy RNNs on these constrained devices. Techniques for efficient execution of neural networks (NNs) range from faster run-time libraries [1] to compression of NNs [2, 3].
Compression techniques reduce the algorithmic and parameter complexity of an NN, which leads to fewer computations and a smaller DRAM footprint. However, these techniques generally require retraining the original model, which can make them expensive and hard to adopt: retraining requires significant expertise in NN hyperparameter tuning as well as GPU resources. This requirement can also limit the adoption of open-source or third-party NNs.
Faster run-time libraries provide efficient execution of an NN through better cache and CPU utilization, but do not reduce its computational and parameter complexity. We have developed a technique that bridges the gap between these two classes of techniques: we demonstrate faster execution of an RNN model by reducing the number of RNN computations, without retraining the original RNN model. This work was published at the SenSysML workshop, held in New York in November 2019.
RNNs consist of multiple fully connected layers with non-linearities that summarize an input sequence into a continuous vector. The input sequence is processed one element at a time; each element is referred to as a time-step. In NLP, each time-step corresponds to a word in the sentence being processed. The same fully connected layers are executed at every time-step of the input sequence. Recent work has shown that executing these fully connected layers over multiple time-steps can lead to poor cache behavior [4].
One valid question to ask is whether an RNN needs to execute all of these time-steps to arrive at the correct answer. By skipping some of these time-steps, we skip the execution of the fully connected layers. This saves computation and can also reduce traffic to DRAM or the cache [4]. Fully connected layers execute matrix-vector computations, which are generally memory bound. By reducing the number of times these layers are executed, we can potentially accelerate an RNN application by easing the memory bottleneck. We can also reduce energy consumption in models where the layer weights are read from DRAM at every time-step [4].
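To make the cost concrete, here is a minimal NumPy sketch (a hypothetical illustration, not the code from the paper) of a vanilla RNN cell processed one time-step at a time. The two matrix-vector products inside `rnn_step` are the memory-bound work that is avoided for every skipped step.

```python
# Minimal sketch (hypothetical, NumPy only) of per-time-step RNN execution.
import numpy as np

hidden_size, input_size = 128, 300                       # illustrative dimensions
W_x = np.random.randn(hidden_size, input_size) * 0.01    # input weights
W_h = np.random.randn(hidden_size, hidden_size) * 0.01   # recurrent weights
b = np.zeros(hidden_size)

def rnn_step(h, x):
    """One fully connected update of the hidden memory for a single time-step."""
    return np.tanh(W_x @ x + W_h @ h + b)

def run_rnn(inputs, skip_mask=None):
    """Process a sequence; optionally skip the steps flagged in skip_mask."""
    h = np.zeros(hidden_size)
    for t, x in enumerate(inputs):
        if skip_mask is not None and skip_mask[t]:
            continue                                      # skipped step: no matrix-vector work
        h = rnn_step(h, x)
    return h                                              # summary vector fed to the classifier

sequence = [np.random.randn(input_size) for _ in range(20)]
summary = run_rnn(sequence)
```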
The paper presents several use-cases where this observation holds; here, let us take the example of a sentiment analysis task. The goal is to detect whether a sentence expresses a positive or a negative emotion. For a sentence like “The movie was good”, looking at the last word suffices to understand that the statement expresses a positive sentiment. However, not all sentences can be classified that easily. Language is complex and we express sentiments using various devices – sarcasm, multi-polarity, word ambiguity, and negations. This makes identifying which time-steps are important a challenging task.
Previous work [5, 6] has tried to exploit this opportunity by learning a small NN in parallel with a larger RNN network. The small NN determines whether to feed a time-step to the larger RNN model or to skip it. However, the smaller NN model and the larger RNN model are trained simultaneously, learning from each other’s mistakes and compensating for them. While this is a valid and effective method for skipping RNN time-steps, it requires retraining the original large RNN model. In our latest work, we improve on previous findings and push the state-of-the-art by dropping this assumption, asking ourselves:
“Can we skip RNN time-steps without retraining the RNN model?”
Before we delved deeper into this problem, we wanted to understand whether there was an obvious answer to this question. Coming back to the sentiment analysis problem, research into how humans use language to communicate suggests several seemingly obvious solutions. We evaluated two of them:
Stop words are words that do not express a sentiment on their own, such as “the”, “an”, and “a”. The third row of Table 1 shows the result of removing stop words on top of RNNs trained on the SST and IMDB datasets. The second solution is to focus on the start and end of a sentence: we generally express emotions at the start and end of a sentence and use the middle to provide supporting evidence for those sentiments. Concat(n) captures this intuition by keeping only the first n and last n words of a sentence. Rows 4-6 of Table 1 show the results of using this technique on top of RNNs trained on the SST and IMDB datasets. Both solutions achieve high skip rates, but at a significant accuracy cost. These experiments gave us enough evidence that the answer is not obvious, and that we needed to explore a different solution to this problem.
Table 1: Baseline accuracy on the SST and IMDB validation datasets using pretrained RNNs only, pretrained RNNs with stop-word removal, and pretrained RNNs using the first n and last n words from each sequence as inputs (section 4.2). Concat(n) refers to concatenating the first n and last n words of an input sequence.
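For illustration, the two baselines in Table 1 can be written as simple pre-processing functions. This is a hypothetical sketch: the stop-word list below is a tiny subset for demonstration, and the tokenization used in the actual experiments may differ.

```python
# Hypothetical sketch of the two baseline input-reduction heuristics.
STOP_WORDS = {"the", "a", "an", "is", "was", "of", "to", "and"}   # illustrative subset only

def remove_stop_words(tokens):
    """Drop tokens that carry little sentiment on their own."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

def concat_n(tokens, n):
    """Keep only the first n and last n tokens of the sequence (Concat(n))."""
    if len(tokens) <= 2 * n:
        return tokens
    return tokens[:n] + tokens[-n:]

tokens = "the movie started slowly but the ending was genuinely good".split()
print(remove_stop_words(tokens))  # ['movie', 'started', 'slowly', 'but', 'ending', 'genuinely', 'good']
print(concat_n(tokens, 2))        # ['the', 'movie', 'genuinely', 'good']
```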
Figure 1: We add a predictor logic in front of the RNN application to filter inputs to this application. The predictor logic looks at the current input in the sequence and the memory vector of the RNN layers to determine the importance of that input.
To develop a suitable solution, we focused on encapsulating an RNN layer with a predictor logic (Figure 1). This predictor logic can be an NN layer or any classical ML technique (for example, decision trees or linear regression). The predictor logic determines which elements of an input sequence need to be fed to the RNN layer. The predictor is based on the hypothesis that elements of the input sequence that do not significantly alter the memory of this layer do not contribute to the final prediction of the RNN application. The predictor takes in the current hidden memory of the RNN layer along with the input element that would be fed to it, and determines whether this element leads to any 'significant change' in the layer's current hidden memory. We used two distance metrics to measure 'change' – the L2 norm and the cosine distance. The notion of 'significant change' is determined by observing the distribution of the L2 norm and cosine distance metrics when the training data is fed to the pretrained RNN model, and using that distribution to fix a threshold. Finally, we created a small dataset to train the predictor logic without retraining the original RNN application.
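The sketch below is a hypothetical illustration of this idea, not the paper's implementation. It reuses `rnn_step` and `hidden_size` from the earlier RNN sketch: the pretrained RNN is run over the training data, the L2 norm and cosine distance between the hidden memory before and after each step are recorded, and skip/keep labels are derived from a percentile threshold on those distances (only the L2 norm is thresholded here for brevity). The resulting (features, labels) pairs can then train any small classifier as the predictor.

```python
# Hypothetical sketch: building a training set for the predictor logic.
# Assumes rnn_step() and hidden_size from the earlier RNN sketch.
import numpy as np

def change_metrics(h_before, h_after, eps=1e-8):
    """L2 and cosine distance between the hidden memory before and after a step."""
    l2 = np.linalg.norm(h_after - h_before)
    cos = 1.0 - np.dot(h_before, h_after) / (
        np.linalg.norm(h_before) * np.linalg.norm(h_after) + eps)
    return l2, cos

def build_predictor_dataset(sequences, skip_fraction=0.45):
    """Run the pretrained RNN over training sequences (no retraining) and
    label each (hidden state, input) pair as skip (1) or keep (0)."""
    features, l2_scores = [], []
    for seq in sequences:
        h = np.zeros(hidden_size)
        for x in seq:
            h_next = rnn_step(h, x)
            l2, _ = change_metrics(h, h_next)
            features.append(np.concatenate([h, x]))   # predictor sees (memory, input)
            l2_scores.append(l2)
            h = h_next
    # Steps whose change falls below the chosen percentile are marked skippable.
    threshold = np.percentile(l2_scores, skip_fraction * 100)
    labels = (np.array(l2_scores) < threshold).astype(int)
    return np.array(features), labels
```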
This methodology has led to remarkable success. We demonstrate that, in ideal conditions and without retraining the original model, we can train a predictor to skip 45% of the time-steps for the SST dataset and 70% of the time-steps for the IMDB dataset without impacting model accuracy. For a more realistic predictor, we implement the predictor logic using a small NN (0.025x the size of the original RNN model). Using this realistic predictor, we can reduce the multiply-accumulate operations by more than 25% with at most a 0.3% loss in accuracy on the SST dataset.
This paper won the Best Paper Award at the SenSysML Workshop, co-located with SenSys. If you find this work interesting, we encourage you to read the paper we published here.
[1] Enabling deep learning at the IoT Edge. Liangzhen Lai, Naveen Suda. https://dl.acm.org/citation.cfm?id=3243473
[2] Pushing the limits of RNN compression. Urmish Thakker, Igor Fedorov, Jesse Beu, Dibakar Gope, Chu Zhou, Ganesh Dasika, Matthew Mattina. https://arxiv.org/abs/1910.02558
[3] Ternary Hybrid Neural-Tree Networks for Highly Constrained IoT Applications. Dibakar Gope, Ganesh Dasika, Matthew Mattina. https://arxiv.org/abs/1903.01531
[4] Measuring scheduling efficiency of RNNs for NLP applications. Urmish Thakker, Ganesh Dasika, Jesse Beu, Matthew Mattina. https://arxiv.org/abs/1904.03302
[5] Skip RNN: Learning to Skip State Updates in Recurrent Neural Networks. Victor Campos, Brendan Jou, Xavier Giro-i-Nieto, Jordi Torres, Shih-Fu Chang. https://arxiv.org/abs/1708.06834
[6] Neural Speed Reading with Structural-Jump-LSTM. Christian Hansen, Casper Hansen, Stephen Alstrup, Jakob Grue Simonsen, Christina Lioma. https://arxiv.org/abs/1711.02085