In recent years, deep learning has been widely adopted across industry, in areas such as computer vision, natural language processing, and recommender systems. The exponential growth in the number of deep learning model parameters, together with business demand for ever more complex models, pushes cloud vendors to reduce compute costs and improve computational efficiency. This is especially true for deep learning inference, which is the focus of our optimization work. Against this background, Alibaba Cloud unveiled Yitian 710, a new Arm server chip built on a 5 nm process. Yitian 710 is based on Arm Neoverse and supports the latest Armv9 instruction set, which includes extension instructions such as Int8 MatMul and BFloat16 (BF16), giving it a performance advantage in high-performance computing.
In this blog post, we focus on Alibaba Elastic Compute Service (ECS) instances powered by Yitian 710 to test and compare deep learning inference performance.
We select four common inference scenarios, covering image classification and recognition, object detection, natural language processing, and recommendation systems. The representative models are ResNet-50 for image classification, SSD for object detection, BERT for natural language processing, and DIN for recommendation.
ResNet-50, SSD, and BERT are all taken from the MLPerf Inference benchmark suite; DIN is the click-through rate prediction model proposed by Alibaba.
We tested on two Alibaba Cloud ECS instance types: g8m, powered by Yitian 710 (Arm Neoverse), and g7, powered by Ice Lake (3rd Generation Intel Xeon Scalable processors). Both instances were configured with 8 vCPUs.
We use TensorFlow v2.10.0 and PyTorch v1.12.1.
On Arm devices, TensorFlow supports two backends; we use the oneDNN backend. oneDNN is an open-source deep learning performance library that can integrate with the Arm Compute Library (ACL) to achieve higher performance on Arm-based devices.
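For reference, the snippet below is a minimal sketch of how the oneDNN backend can be switched on in TensorFlow. TF_ENABLE_ONEDNN_OPTS is TensorFlow's standard switch for oneDNN optimizations; the example assumes an aarch64 TensorFlow build that bundles oneDNN and ACL.

```python
import os

# Enable oneDNN-backed kernels; must be set before TensorFlow is imported.
os.environ["TF_ENABLE_ONEDNN_OPTS"] = "1"

import numpy as np
import tensorflow as tf

# A small matmul to confirm the installation runs; on an aarch64 build with
# oneDNN + ACL, this dispatches to ACL-optimized kernels.
a = tf.constant(np.random.rand(1024, 1024), dtype=tf.float32)
b = tf.constant(np.random.rand(1024, 1024), dtype=tf.float32)
print(tf.matmul(a, b).shape)
```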
Currently, the oneDNN backend is still experimental in PyTorch, so PyTorch uses its default OpenBLAS backend; we introduce OpenBLAS in more detail below.
BFloat16 (BF16) is a floating-point format with the same number of exponent bits as single-precision floating point (IEEE FP32) but only 7 fraction bits, so BF16 covers the same representation range as FP32 at lower precision. BF16 is well suited for deep learning because the reduced precision usually does not noticeably hurt model prediction accuracy, while the 16-bit format halves the memory footprint and speeds up computation. With the new BF16 instructions, g8m dramatically improves deep learning inference performance and outperforms g7 in several scenarios. In addition, thanks to Yitian 710, g8m offers up to a 30% price advantage over g7.
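To make the range/precision trade-off concrete, the following sketch emulates BF16 in NumPy by keeping only the top 16 bits of each FP32 value. It only illustrates the number format and is unrelated to the hardware BF16 instructions.

```python
import numpy as np

def to_bf16(x: np.ndarray) -> np.ndarray:
    """Emulate BF16 by keeping only the top 16 bits of each FP32 value."""
    bits = x.astype(np.float32).view(np.uint32)
    return (bits & 0xFFFF0000).view(np.float32)

x = np.array([3.141592653589793, 1e38, 1e-38], dtype=np.float32)
print(x)           # full FP32 values
print(to_bf16(x))  # same range, but only ~2-3 significant decimal digits
```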
Figures 1-4 show the results for the ResNet-50, SSD, BERT, and DIN models, respectively. The blue bars show the direct performance comparison and the orange bars the price-performance comparison. As shown in Figure 1, on ResNet-50 the g8m achieves 1.43x the performance of the g7 and 2.05x better price-performance.
As noted above, the oneDNN backend is still experimental in PyTorch, so we use the default OpenBLAS backend for the PyTorch tests. OpenBLAS is a widely used open-source linear algebra library, to which we added an optimized BF16 matrix multiplication implementation for Arm Neoverse.
Matrix multiplication is central to deep learning: common building blocks such as fully connected layers and convolutional layers are ultimately lowered to matrix multiplications, so matrix multiplication performance largely determines deep learning inference performance.
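As a simplified illustration of this lowering, the sketch below expresses a small 2D convolution as a single matrix multiplication via the classic im2col transform (single channel, no padding, stride 1; real frameworks use more elaborate layouts and blocking).

```python
import numpy as np

def conv2d_as_matmul(image: np.ndarray, kernels: np.ndarray) -> np.ndarray:
    """Lower a 2D convolution (no padding, stride 1) to one matrix multiplication.

    image:   (H, W)           single-channel input
    kernels: (num_out, k, k)  square filters
    returns: (num_out, H-k+1, W-k+1)
    """
    num_out, k, _ = kernels.shape
    out_h, out_w = image.shape[0] - k + 1, image.shape[1] - k + 1

    # im2col: unfold every k x k patch of the input into a column.
    cols = np.stack([
        image[i:i + k, j:j + k].reshape(-1)
        for i in range(out_h) for j in range(out_w)
    ], axis=1)                                  # (k*k, out_h*out_w)

    # The convolution itself is now a plain GEMM.
    out = kernels.reshape(num_out, -1) @ cols   # (num_out, out_h*out_w)
    return out.reshape(num_out, out_h, out_w)

image = np.random.rand(8, 8).astype(np.float32)
kernels = np.random.rand(4, 3, 3).astype(np.float32)
print(conv2d_as_matmul(image, kernels).shape)   # (4, 6, 6)
```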
OpenBLAS serves as a BLAS backend for NumPy, PyTorch, and other frameworks. In our investigation, we found that the library did not support Yitian's BF16 extension instructions. After engaging with the community, we implemented matrix multiplication for the BF16 data format using the BFMMLA instruction supported by Yitian 710; performance improves significantly, as shown in Figure 5. This implementation has been contributed upstream and is included in the latest OpenBLAS release, version 0.3.21.
Figure 5: Matrix multiplication performance comparison in OpenBLAS. The matrices involved are 1000x1000.
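As a rough way to probe this from Python, the sketch below times 1000x1000 FP32 and BF16 matrix multiplications through PyTorch. Whether the BF16 path actually reaches OpenBLAS's new SBGEMM kernel depends on how PyTorch was built, so treat this as a sanity check rather than a reproduction of Figure 5.

```python
import time
import torch

# Reports the build configuration, including the BLAS library the build links
# against (it should list OpenBLAS on the aarch64 builds described in this post).
print(torch.__config__.show())

n = 1000  # same matrix size as in Figure 5
a32, b32 = torch.rand(n, n), torch.rand(n, n)
a16, b16 = a32.bfloat16(), b32.bfloat16()

def bench(fn, iters=50):
    fn()                                   # warm-up
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters

print(f"FP32 matmul: {bench(lambda: a32 @ b32) * 1e3:.2f} ms")
print(f"BF16 matmul: {bench(lambda: a16 @ b16) * 1e3:.2f} ms")
```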
Since OpenBLAS is PyTorch's default backend on Arm, its matrix multiplication optimization carries over to deep learning models implemented in PyTorch. We take VGG-19 as an example, a model in which convolution accounts for a large share of the computation. During inference, all convolutional operators are converted into matrix multiplications, and OpenBLAS is called to perform the computation.
Figure 6: VGG-19 Inference performance comparison.
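As a rough way to run this comparison yourself, the sketch below times single-image VGG-19 inference in PyTorch, assuming a recent torchvision is installed; absolute latencies will of course depend on the instance and build.

```python
import time
import torch
from torchvision.models import vgg19

# Random weights are sufficient for a latency measurement.
model = vgg19(weights=None).eval()
x = torch.rand(1, 3, 224, 224)

with torch.no_grad():
    model(x)                                # warm-up
    start = time.perf_counter()
    for _ in range(10):
        model(x)
    avg_ms = (time.perf_counter() - start) / 10 * 1e3
print(f"VGG-19 average inference latency: {avg_ms:.1f} ms")
```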
This blog shows that on Alibaba ECS g8m instances, inference performance for several deep learning models is higher than on equal-sized g7 instances. This advantage stems mainly from the new Armv9 instructions and the continuously improving software support (oneDNN, ACL, and OpenBLAS). The Alibaba Cloud compiler team has contributed some of these software optimizations, and we will continue to focus on software and hardware optimization in this area to improve the competitiveness of Arm instances for ML/AI.