
Deep learning inference performance on the Yitian 710

Honglin Zhu
December 19, 2022
4 minute read time.

In recent years, deep learning has been widely adopted across industry, in areas such as vision, natural language processing, and recommender systems. The exponential growth in the number of deep learning model parameters, together with business demand for increasingly complex models, requires cloud vendors to reduce compute costs and improve computational efficiency. This is especially true for deep learning inference, which has become our focus for optimization. Against this background, Alibaba Cloud unveiled a new Arm server chip, the Yitian 710, built on a 5nm process. The Yitian 710 is based on Arm Neoverse and supports the latest Armv9 instruction set, which includes extended instructions such as Int8 matrix multiplication and BFloat16 (BF16), giving it a performance advantage in high-performance computing.

In this blog post, we use Alibaba Cloud Elastic Compute Service (ECS) instances powered by the Yitian 710 to test and compare deep learning inference performance.

Workloads

We select four common inference scenarios, covering image classification and recognition, object detection, natural language processing, and recommendation systems. The representative models used are as follows:

Area           | Task                          | Model
Vision         | Image classification          | Resnet50-v1.5 and VGG19
Vision         | Object detection              | SSD-Resnet34
Language       | Natural language processing   | BERT-Large
Recommendation | Click-through rate prediction | DIN

Resnet, SSD, and BERT are all from the MLPerf Inference Benchmark project. DIN is a click-through rate prediction model proposed by Alibaba.

Platforms

Instances

We tested on two Alibaba Cloud ECS instance types: the g8m, powered by the Yitian 710 (Arm Neoverse), and the g7, powered by Ice Lake (3rd Generation Intel Xeon Scalable processors). Both instances were tested with 8 vCPUs.

Deep learning framework

We use TensorFlow v2.10.0 and PyTorch v1.12.1.

On Arm devices, TensorFlow supports two backends, and we use the oneDNN backend. oneDNN is an open-source deep learning library that can integrate with the Arm Compute Library (ACL) to achieve higher performance on Arm-based devices.
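For reference, the oneDNN path is typically enabled through environment variables set before TensorFlow is imported. The sketch below uses the commonly documented switches for aarch64 TensorFlow builds of this generation; the exact variable names can vary between TensorFlow and oneDNN versions, so treat this as an assumption rather than the precise configuration used in our tests.

```python
# Minimal sketch: turning on the oneDNN (+ ACL) backend for an aarch64
# TensorFlow build. Both variables must be set before TensorFlow is imported.
import os

# Enable oneDNN-optimized kernels.
os.environ["TF_ENABLE_ONEDNN_OPTS"] = "1"

# Allow oneDNN to use BF16 "fast math" where the hardware supports it
# (e.g. BFMMLA on the Yitian 710). Newer oneDNN versions spell this
# ONEDNN_DEFAULT_FPMATH_MODE.
os.environ["DNNL_DEFAULT_FPMATH_MODE"] = "BF16"

import tensorflow as tf
print(tf.__version__)
```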

The oneDNN backend is still experimental on PyTorch, so we use the default OpenBLAS backend with PyTorch; we introduce OpenBLAS later.

BFloat16

BFloat16 (BF16) is a floating-point format with the same number of exponent bits as single-precision floating point (IEEE FP32), but only 7 fraction bits. BF16 therefore has the same representable range as FP32, but lower precision. BF16 is well suited to deep learning because the reduced precision usually does not significantly reduce the prediction accuracy of the model, while the 16-bit format saves memory and speeds up computation. With the new BF16 instructions, g8m dramatically improves deep learning inference performance and achieves better results than g7 in several scenarios. In addition, thanks to the Yitian 710, g8m has up to a 30% price advantage over g7.
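To make the format concrete, the small sketch below (using NumPy, purely for illustration) truncates an FP32 value to BF16 precision by keeping the top 16 bits of its bit pattern; real hardware conversion typically rounds rather than truncates.

```python
# Illustration: BF16 keeps the sign bit, all 8 FP32 exponent bits, and the
# top 7 fraction bits, so it covers the same range as FP32 at lower precision.
import numpy as np

def fp32_to_bf16_trunc(x):
    """Truncate an FP32 value to BF16 precision (hardware usually rounds)."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

x = np.float32(3.1415927)
print(x, "->", fp32_to_bf16_trunc(x))   # 3.1415927 -> 3.140625
```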

TensorFlow performance

Figures 1-4 show the results for the Resnet50, SSD, BERT, and DIN models respectively. The blue bars show a direct performance comparison and the orange bars show a price-performance comparison. As shown in Figure 1, on Resnet50 the g8m performs 1.43x better than the g7 and achieves 2.05x better price-performance, which is consistent with the up-to-30% lower price noted above (1.43 / 0.70 ≈ 2.0).

Figure 1: Inference performance of Resnet50-v1.5 on g8m and g7.

Here, batch size is 32 and the test image size is 224 * 224.
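For reference, a throughput measurement along these lines could look like the sketch below. It uses tf.keras.applications.ResNet50 with random weights and random input purely as a stand-in for the MLPerf ResNet50-v1.5 setup, so absolute numbers will differ from the published results.

```python
# Sketch: measuring ResNet50 inference throughput at batch size 32, 224x224.
import time
import numpy as np
import tensorflow as tf

model = tf.keras.applications.ResNet50(weights=None)        # random weights: timing only
batch = np.random.rand(32, 224, 224, 3).astype(np.float32)  # dummy input images

model.predict(batch, verbose=0)                              # warm-up run

runs = 20
start = time.time()
for _ in range(runs):
    model.predict(batch, verbose=0)
elapsed = time.time() - start
print(f"throughput: {runs * 32 / elapsed:.1f} images/sec")
```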

Figure 2: Inference performance of SSD on g8m and g7.

Batch size is 1 and the test image size is 1200 * 1200.

Figure 3: BERT Inference performance comparison.

Figure 4: DIN Inference performance comparison.

PyTorch performance comparison

The oneDNN backend is still experimental on PyTorch, so we use the default OpenBLAS backend. OpenBLAS is a widely used open-source linear algebra library, and we added an optimized implementation of BF16 matrix multiplication for Arm Neoverse.
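As a quick check of which BLAS library a given PyTorch build links against, the build configuration can be printed; on the aarch64 pip wheels of this generation it typically reports OpenBLAS (an assumption about the specific build, not something shown in the original measurements).

```python
# Sketch: inspecting the BLAS/LAPACK libraries a PyTorch build was compiled against.
import torch
print(torch.__config__.show())   # look for the BLAS_INFO / LAPACK_INFO lines
```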

OpenBLAS BFloat16 matrix multiplication optimization

Matrix multiplication is central to deep learning. For example, the fully connected and convolutional layers commonly used in deep learning models are eventually converted into matrix multiplications. Therefore, matrix multiplication performance largely determines deep learning inference performance.

OpenBLAS is a widely used library that serves as a backend for NumPy, PyTorch, and others. In our investigation, we found that the library did not support Yitian 710's BF16 extension instructions. After engaging with the community, we decided to implement matrix multiplication for the BF16 data format using the BFMMLA instruction supported by the Yitian 710. Performance is significantly improved, as shown in Figure 5. This implementation has been contributed back to the community and is included in the latest OpenBLAS release, version 0.3.21.

Figure 5: Matrix multiplication performance comparison of OpenBLAS. The matrices involved have 1000 rows and 1000 columns.
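A matching FP32 baseline measurement can be sketched in NumPy, assuming the NumPy build links against OpenBLAS (the usual pip-wheel configuration). The BF16 BFMMLA path is exercised through OpenBLAS's BF16 GEMM routine, which NumPy does not expose, so this sketch only illustrates how the FP32 side of the comparison can be timed.

```python
# Sketch: timing a 1000x1000 FP32 matrix multiplication through NumPy/OpenBLAS.
import time
import numpy as np

a = np.random.rand(1000, 1000).astype(np.float32)
b = np.random.rand(1000, 1000).astype(np.float32)

a @ b                                     # warm-up
runs = 50
start = time.time()
for _ in range(runs):
    a @ b
elapsed = time.time() - start
print(f"{runs * 2 * 1000**3 / elapsed / 1e9:.1f} GFLOP/s")   # 2*N^3 FLOPs per multiply
```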

PyTorch CNN performance

As the default backend of PyTorch, OpenBLAS passes its matrix multiplication optimization through to the deep learning models implemented in PyTorch. We take VGG-19 as an example, a model in which convolution accounts for a high share of the computation. During inference, all convolutional operators are converted to matrix multiplications, and OpenBLAS is called to complete the computation.

Figure 6: VGG-19 Inference performance comparison.
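For reference, a minimal VGG-19 inference timing sketch could look like the following; it assumes torchvision is installed and accepts the weights argument (torchvision 0.13 or later), and uses random weights since only timing matters here.

```python
# Sketch: timing single-image VGG-19 inference in PyTorch. Convolutions are
# lowered to matrix multiplications handled by the BLAS backend (OpenBLAS here).
import time
import torch
import torchvision

model = torchvision.models.vgg19(weights=None).eval()
x = torch.randn(1, 3, 224, 224)

with torch.no_grad():
    model(x)                              # warm-up
    runs = 10
    start = time.time()
    for _ in range(runs):
        model(x)
    print(f"{(time.time() - start) / runs * 1000:.1f} ms per inference")
```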

Conclusion

This blog shows that on Alibaba Cloud ECS g8m instances, the inference performance of several deep learning models is higher than on g7 instances of equal size. This higher performance is mainly due to the new Armv9 instructions and continuously improving software support (oneDNN, ACL, and OpenBLAS), to which the Alibaba Cloud compiler team has contributed some of the optimizations. We continue to focus on software and hardware optimization in this area to improve the competitiveness of Arm instances for ML/AI.
