Increasingly, intelligent applications are using neural networks at their core to deliver new functionality to users. These applications include language understanding and translation, image recognition, and object tracking and localization. Furthermore, to reduce latency and improve privacy, there is increasing pressure to move these applications out of centralized datacenters and onto embedded devices. While neural networks have demonstrated state-of-the-art accuracy for the types of applications listed above, they require significant memory for storing the multitude of parameters (i.e., neural network weights and activations) needed to deliver this high accuracy. A key obstacle to executing these applications on embedded devices is the relative scarcity of memory on such devices.
At Arm’s ML Research Lab, we’ve been exploring different techniques for reducing the memory requirements of advanced neural networks. One of these techniques is quantization, in which the neural network weights and activations are stored in a lower bit-width format, thereby reducing the overall storage requirements. A common approach is to quantize the 32-bit IEEE FP32 representation to an 8-bit integer representation, reducing storage requirements by a factor of 4. Quantization can be performed during neural network training by using the 8-bit integer representation during the *forward pass* execution of the neural network, while performing the gradient update using the FP32 representation. This approach requires differentiating quantization functions whose derivatives are equal to zero almost everywhere. To avoid this “vanishing gradient” problem, the straight-through estimator (STE) is commonly used: the STE simply replaces the quantization function with the identity function during backpropagation.
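To make this concrete, here is a minimal PyTorch-style sketch of the common STE approach described above, not code from our paper: the forward pass uses an 8-bit fake-quantized weight, while the backward pass treats the quantizer as the identity via the usual `detach()` trick. The function name and the symmetric, per-tensor quantization scheme are assumptions made for illustration.

```python
import torch

def quantize_ste(w: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    """Fake-quantize w to a signed integer grid; gradients pass straight through.

    Symmetric, per-tensor quantization is assumed here for simplicity.
    """
    qmax = 2 ** (num_bits - 1) - 1                 # e.g. 127 for 8 bits
    scale = w.abs().max().clamp(min=1e-8) / qmax   # map max |w| onto the integer range
    w_q = torch.clamp(torch.round(w / scale), -qmax, qmax) * scale

    # STE: the forward pass sees w_q, but the (w_q - w) term is detached from the
    # autograd graph, so the backward pass sees the identity function.
    return w + (w_q - w).detach()
```

In quantization-aware training, a fake-quantizer like this is applied to the weights (and activations) during the forward pass, while the optimizer continues to update the underlying FP32 copies.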
We have developed an alternative training recipe for quantizing networks without using the STE. Our method, Alpha-Blending, avoids the STE approximation by replacing the quantized weight in the loss function with an affine combination of the quantized weight w_q and the corresponding full-precision weight w, using non-trainable scalar coefficients α and 1−α. During training, α is gradually increased from 0 to 1, gradient updates reach the weights only through the full-precision term (1−α)w of the affine combination, and the model is thereby converted from full precision to low precision progressively. Our results with MobileNet v1 on ImageNet show that Alpha-Blending performs particularly well at very low bitwidths: with weights quantized to 4 bits and activations quantized to 8 bits, Alpha-Blending achieves 68.7% top-1 accuracy, only 2.2% below full FP32 precision. This encouraging result suggests that significant reductions in memory footprint are possible while retaining high accuracy, enabling future neural network-based applications on embedded devices and significantly broadening the scope of the tasks these devices can perform.
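Based on the description above, the core idea can be sketched as follows. This is a minimal, hypothetical rendering rather than the authors' implementation: the weight seen by the loss is the blend αw_q + (1−α)w, the quantized term is detached so that gradients reach w only through the (1−α)w term, and α follows a non-trainable schedule from 0 to 1. The linear schedule, the 4-bit symmetric quantizer, and the use of `detach()` to realize the gradient path are assumptions for illustration.

```python
import torch

def alpha_blend_weight(w: torch.Tensor, alpha: float, num_bits: int = 4) -> torch.Tensor:
    """Affine combination of the quantized and full-precision weight.

    Gradients flow to w only through the (1 - alpha) * w term, because the
    quantized term is detached; no straight-through approximation is needed.
    """
    qmax = 2 ** (num_bits - 1) - 1
    scale = w.abs().max().clamp(min=1e-8) / qmax
    w_q = torch.clamp(torch.round(w / scale), -qmax, qmax) * scale
    return alpha * w_q.detach() + (1.0 - alpha) * w

def alpha_schedule(step: int, total_steps: int) -> float:
    """Non-trainable coefficient ramped from 0 to 1 over training (linear ramp assumed)."""
    return min(1.0, step / total_steps)
```

At α = 0 the network trains entirely in full precision; as α approaches 1 the forward pass depends only on the quantized weights, so the converged model can be deployed with the low-precision weights alone.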
You can read the full paper below, and lead author Zhi-Gang Liu will be presenting this work at the 2019 International Joint Conference on Artificial Intelligence (IJCAI) in Macao, China.
Read the Paper