
Deep learning inference performance on the Yitian 710

Honglin Zhu
December 19, 2022
4 minute read time.

In recent years, deep learning has been widely adopted across industry, in areas such as vision, natural language processing, and recommender systems. The exponential growth in the number of deep learning model parameters, together with business demand for increasingly complex models, requires cloud vendors to reduce compute costs and improve computational efficiency. This is especially true for deep learning inference, which has become our focus for optimization. Against this background, Alibaba Cloud unveiled a new Arm server chip, the Yitian 710, built on a 5nm process. The Yitian 710 is based on Arm Neoverse and supports the latest Armv9 instruction set, which includes extended instructions such as Int8 matrix multiplication and BFloat16 (BF16), giving it a performance advantage in high-performance computing.

In this blog post, we use Alibaba Cloud Elastic Compute Service (ECS) instances powered by the Yitian 710 to test and compare deep learning inference performance.

Workloads

We select four common inference scenarios, covering image classification and recognition, object detection, natural language processing, and recommendation systems. The representative models used are as follows:

Area           | Task                          | Model
Vision         | Image classification          | Resnet50-v1.5 and VGG19
Vision         | Object detection              | SSD-Resnet34
Language       | Natural language processing   | BERT-Large
Recommendation | Click-through rate prediction | DIN

Resnet, SSD, and BERT are all from the MLPerf Inference Benchmark project. DIN is a click-through rate prediction model proposed by Alibaba.

Platforms

Instances

We tested on two Alibaba Cloud ECS instance types: the g8m, powered by the Yitian 710 (Arm Neoverse), and the g7, powered by Ice Lake (3rd Generation Intel Xeon Scalable processors). Both instances were tested with 8 vCPUs.

Deep learning framework

We use TensorFlow v2.10.0 and PyTorch v1.12.1.

On Arm devices, TensorFlow supports two backends, and we use the oneDNN backend. oneDNN is an open-source deep learning library that can integrate with the Arm Compute Library (ACL) to achieve higher performance on Arm-based devices.
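For reference, the oneDNN path is typically enabled through environment variables set before TensorFlow is imported. The sketch below uses the commonly documented switches for aarch64 TensorFlow builds of this generation; the exact variable names can vary between TensorFlow and oneDNN versions, so treat this as an assumption rather than the precise configuration used in our tests.

```python
# Minimal sketch: turning on the oneDNN (+ ACL) backend for an aarch64
# TensorFlow build. Both variables must be set before TensorFlow is imported.
import os

# Enable oneDNN-optimized kernels.
os.environ["TF_ENABLE_ONEDNN_OPTS"] = "1"

# Allow oneDNN to use BF16 "fast math" where the hardware supports it
# (e.g. BFMMLA on the Yitian 710). Newer oneDNN versions spell this
# ONEDNN_DEFAULT_FPMATH_MODE.
os.environ["DNNL_DEFAULT_FPMATH_MODE"] = "BF16"

import tensorflow as tf
print(tf.__version__)
```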

The oneDNN backend is still experimental on PyTorch, so we use the default OpenBLAS backend with PyTorch; we introduce OpenBLAS later.

BFloat16

BFloat16 (BF16) is a floating-point format with the same number of exponent bits as single-precision floating point (IEEE FP32), but only 7 fraction bits. BF16 therefore has the same representable range as FP32, but lower precision. BF16 is well suited to deep learning because the reduced precision usually does not significantly reduce the prediction accuracy of the model, while the 16-bit format saves memory and speeds up computation. With the new BF16 instructions, g8m dramatically improves deep learning inference performance and achieves better results than g7 in several scenarios. In addition, thanks to the Yitian 710, g8m has up to a 30% price advantage over g7.
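To make the format concrete, the small sketch below (using NumPy, purely for illustration) truncates an FP32 value to BF16 precision by keeping the top 16 bits of its bit pattern; real hardware conversion typically rounds rather than truncates.

```python
# Illustration: BF16 keeps the sign bit, all 8 FP32 exponent bits, and the
# top 7 fraction bits, so it covers the same range as FP32 at lower precision.
import numpy as np

def fp32_to_bf16_trunc(x):
    """Truncate an FP32 value to BF16 precision (hardware usually rounds)."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

x = np.float32(3.1415927)
print(x, "->", fp32_to_bf16_trunc(x))   # 3.1415927 -> 3.140625
```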

TensorFlow performance

Figures 1-4 show the results for the Resnet50, SSD, BERT, and DIN models respectively. The blue bars show a direct performance comparison and the orange bars show a price-performance comparison. As shown in Figure 1, on Resnet50 the g8m performs 1.43x better than the g7 and achieves 2.05x better price-performance, which is consistent with the up-to-30% lower price noted above (1.43 / 0.70 ≈ 2.0).

Figure 1: Inference performance of Resnet50-v1.5 on g8m and g7.

Here, batch size is 32 and the test image size is 224 * 224.
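For reference, a throughput measurement along these lines could look like the sketch below. It uses tf.keras.applications.ResNet50 with random weights and random input purely as a stand-in for the MLPerf ResNet50-v1.5 setup, so absolute numbers will differ from the published results.

```python
# Sketch: measuring ResNet50 inference throughput at batch size 32, 224x224.
import time
import numpy as np
import tensorflow as tf

model = tf.keras.applications.ResNet50(weights=None)        # random weights: timing only
batch = np.random.rand(32, 224, 224, 3).astype(np.float32)  # dummy input images

model.predict(batch, verbose=0)                              # warm-up run

runs = 20
start = time.time()
for _ in range(runs):
    model.predict(batch, verbose=0)
elapsed = time.time() - start
print(f"throughput: {runs * 32 / elapsed:.1f} images/sec")
```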

Figure 2: Inference performance of SSD on g8m and g7.

Batch size is 1 and the test image size is 1200 * 1200.

Figure 3: BERT Inference performance comparison.

Figure 4: DIN Inference performance comparison.

PyTorch performance comparison

The oneDNN backend is still experimental on PyTorch, so we use the default OpenBLAS backend. OpenBLAS is a widely used open-source linear algebra library, and we added an optimized implementation of BF16 matrix multiplication for Arm Neoverse.
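As a quick check of which BLAS library a given PyTorch build links against, the build configuration can be printed; on the aarch64 pip wheels of this generation it typically reports OpenBLAS (an assumption about the specific build, not something shown in the original measurements).

```python
# Sketch: inspecting the BLAS/LAPACK libraries a PyTorch build was compiled against.
import torch
print(torch.__config__.show())   # look for the BLAS_INFO / LAPACK_INFO lines
```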

OpenBLAS BFloat16 matrix multiplication optimization

Matrix multiplication is central to deep learning. For example, the fully connected and convolutional layers commonly used in deep learning models are eventually converted into matrix multiplications. Therefore, matrix multiplication performance largely determines deep learning inference performance.

OpenBLAS is a widely used library that serves as a backend for NumPy, PyTorch, and others. In our investigation, we found that the library did not support Yitian 710's BF16 extension instructions. After engaging with the community, we decided to implement matrix multiplication for the BF16 data format using the BFMMLA instruction supported by the Yitian 710. Performance is significantly improved, as shown in Figure 5. This implementation has been contributed back to the community and is included in the latest OpenBLAS release, version 0.3.21.

Figure 5: Matrix multiplication performance comparison of OpenBLAS. The matrices involved have 1000 rows and 1000 columns.
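A matching FP32 baseline measurement can be sketched in NumPy, assuming the NumPy build links against OpenBLAS (the usual pip-wheel configuration). The BF16 BFMMLA path is exercised through OpenBLAS's BF16 GEMM routine, which NumPy does not expose, so this sketch only illustrates how the FP32 side of the comparison can be timed.

```python
# Sketch: timing a 1000x1000 FP32 matrix multiplication through NumPy/OpenBLAS.
import time
import numpy as np

a = np.random.rand(1000, 1000).astype(np.float32)
b = np.random.rand(1000, 1000).astype(np.float32)

a @ b                                     # warm-up
runs = 50
start = time.time()
for _ in range(runs):
    a @ b
elapsed = time.time() - start
print(f"{runs * 2 * 1000**3 / elapsed / 1e9:.1f} GFLOP/s")   # 2*N^3 FLOPs per multiply
```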

PyTorch CNN performance

As the default backend of PyTorch, OpenBLAS passes its matrix multiplication optimization through to the deep learning models implemented in PyTorch. We take VGG-19 as an example, a model in which convolution accounts for a high share of the computation. During inference, all convolutional operators are converted to matrix multiplications, and OpenBLAS is called to complete the computation.

Figure 6: VGG-19 Inference performance comparison.
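For reference, a minimal VGG-19 inference timing sketch could look like the following; it assumes torchvision is installed and accepts the weights argument (torchvision 0.13 or later), and uses random weights since only timing matters here.

```python
# Sketch: timing single-image VGG-19 inference in PyTorch. Convolutions are
# lowered to matrix multiplications handled by the BLAS backend (OpenBLAS here).
import time
import torch
import torchvision

model = torchvision.models.vgg19(weights=None).eval()
x = torch.randn(1, 3, 224, 224)

with torch.no_grad():
    model(x)                              # warm-up
    runs = 10
    start = time.time()
    for _ in range(runs):
        model(x)
    print(f"{(time.time() - start) / runs * 1000:.1f} ms per inference")
```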

Conclusion

This blog shows that on Alibaba Cloud ECS g8m instances, the inference performance of several deep learning models is higher than on g7 instances of equal size. This higher performance is mainly due to the new Armv9 instructions and continuously improving software support (oneDNN, ACL, and OpenBLAS), to which the Alibaba Cloud compiler team has contributed some of the optimizations. We continue to focus on software and hardware optimization in this area to improve the competitiveness of Arm instances for ML/AI.
