Deep learning inference performance on the Yitian 710

Honglin Zhu
December 19, 2022
4 minute read time.

In recent years, deep learning has been widely adopted across industry, in areas such as vision, natural language processing, and recommender systems. The exponential growth in deep learning model parameter counts and the business demand for increasingly complex models push cloud vendors to reduce compute costs and improve computational efficiency. This is especially true for deep learning inference, which has become our focus for optimization. Against this backdrop, Alibaba Cloud unveiled its new Arm server chip, the Yitian 710, built on a 5nm process. The Yitian 710 is based on Arm Neoverse and supports the latest Armv9 instruction set, which includes extended instructions such as Int8 MatMul and BFloat16 (BF16), giving it a performance advantage in high-performance computing.

In this blog post, we use Alibaba Elastic Compute Service (ECS) instances powered by the Yitian 710 to test and compare deep learning inference performance.

Workloads

We select four common inference scenarios, covering image classification and recognition, object detection, natural language processing, and recommendation systems. The representative models used are as follows:

Area            Task                           Model
Vision          Image classification           Resnet50-v1.5 and VGG19
Vision          Object detection               SSD-Resnet34
Language        Natural language processing    BERT-Large
Recommendation  Click-through rate prediction  DIN

Resnet, SSD, and BERT all come from the MLPerf Inference benchmark project. DIN is a click-through rate prediction model proposed by Alibaba.

Platforms

Instances

We tested on two Alibaba ECS instance types: the g8m, powered by the Yitian 710 (Arm Neoverse), and the g7, powered by Ice Lake (3rd generation Intel Xeon Scalable processors). Both instance types were tested with 8 vCPUs.

Deep learning frameworks

We use TensorFlow v2.10.0 and PyTorch v1.12.1.

On Arm devices, TensorFlow supports two backends; we use the OneDNN backend. OneDNN is an open-source deep learning performance library that can integrate with the Arm Compute Library (ACL) to achieve higher performance on Arm-based devices.
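As a minimal sketch (our own, not from the original post), the following shows one way to request and confirm the OneDNN backend. Whether it is enabled by default depends on the TensorFlow build, and the TF_ENABLE_ONEDNN_OPTS switch must be set before TensorFlow is imported:

import os

# Must be set before importing TensorFlow; a no-op on builds where
# oneDNN is already enabled by default.
os.environ["TF_ENABLE_ONEDNN_OPTS"] = "1"

import tensorflow as tf

# Builds with oneDNN active typically log a message mentioning oneDNN
# at import time.
print(tf.__version__)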

Currently, the OneDNN backend is still experimental on PyTorch, so we use the default OpenBLAS backend for the PyTorch framework; we introduce OpenBLAS later in this post.

BFloat16

BFloat16 (BF16) is a floating-point format with the same number of exponent bits as single-precision floating point (IEEE FP32) but only 7 fraction bits. BF16 therefore covers the same representable range as FP32, at lower precision. This trade-off suits deep learning well: the reduced precision usually does not significantly hurt a model's prediction accuracy, while the 16-bit format halves storage and speeds up computation. With the new BF16 instructions, the g8m dramatically improves deep learning inference performance and outperforms the g7 in several scenarios. In addition, thanks to the Yitian 710, the g8m has up to a 30% price advantage over the g7.
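To make the format concrete, here is a small self-contained Python illustration (our own sketch): truncating the low 16 bits of an FP32 value yields a round-toward-zero BF16 conversion, preserving magnitude while dropping precision.

import struct

def fp32_to_bf16_bits(x: float) -> int:
    # Keep sign (1 bit) + exponent (8 bits) + fraction (7 bits).
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    return bits >> 16

def bf16_bits_to_fp32(b: int) -> float:
    # Re-expand by zero-filling the discarded 16 fraction bits.
    return struct.unpack('<f', struct.pack('<I', b << 16))[0]

x = 3.1415926
print(bf16_bits_to_fp32(fp32_to_bf16_bits(x)))  # ~3.140625: same range, less precision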

TensorFlow performance

Figures 1-4 show the results for the Resnet50, SSD, BERT, and DIN models respectively. The blue bars show the direct performance comparison and the orange bars show the price-performance comparison. As shown in Figure 1, on Resnet50 the g8m performs 1.43x better than the g7 and achieves 2.05x better price-performance, consistent with the up to 30% lower price noted above (1.43 / 0.7 ≈ 2.05).

Figure 1: Inference performance of Resnet50-v1.5 on g8m and g7.

Here, the batch size is 32 and the test image size is 224 x 224.
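For a rough reproduction, the following timing sketch uses the same batch size and image size. It substitutes the stock Keras ResNet50 for the MLPerf Resnet50-v1.5 model and harness, so absolute numbers will differ:

import time
import numpy as np
import tensorflow as tf

model = tf.keras.applications.ResNet50(weights=None)
x = np.random.rand(32, 224, 224, 3).astype(np.float32)

model.predict(x, verbose=0)  # warm-up
runs = 10
start = time.perf_counter()
for _ in range(runs):
    model.predict(x, verbose=0)
elapsed = time.perf_counter() - start
print(f"{runs * 32 / elapsed:.1f} images/sec")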

Figure 2: Inference performance of SSD on g8m and g7.

Here, the batch size is 1 and the test image size is 1200 x 1200.

Figure 3: BERT inference performance comparison.

Figure 4: DIN inference performance comparison.

PyTorch performance comparison

The OneDNN backend is still experimental on PyTorch, so we use the default OpenBLAS backend. OpenBLAS is a widely used open-source linear algebra library. We added an optimized implementation of BF16 matrix multiplication for Arm Neoverse.

OpenBLAS BFloat16 matrix multiplication optimization

Matrix multiplication is central to deep learning: building blocks commonly used in deep learning models, such as fully connected layers and convolutional layers, are eventually lowered to matrix multiplications. The performance of matrix multiplication therefore largely determines deep learning inference performance, as the sketch below makes concrete.
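The following snippet (our own illustration) lowers a small convolution to a single matrix multiplication via im2col (torch.nn.functional.unfold) and checks the result against PyTorch's built-in conv2d:

import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 8, 8)    # input: N, C, H, W
w = torch.randn(16, 3, 3, 3)   # kernel: out_channels, C, kH, kW

cols = F.unfold(x, kernel_size=3)   # (1, 3*3*3, 36) matrix of patches
out = w.view(16, -1) @ cols         # the convolution as one matmul
out = out.view(1, 16, 6, 6)         # fold back to N, C_out, H_out, W_out

print(torch.allclose(out, F.conv2d(x, w), atol=1e-5))  # True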

OpenBLAS is a widely used library that serves as a backend for NumPy, PyTorch, and others. In our investigation, we found that the library did not support Yitian's BF16 extension instructions. After engaging with the community, we decided to implement BF16 matrix multiplication using the BFMMLA instruction supported by the Yitian 710. As shown in Figure 5, performance improves significantly. This implementation has been contributed upstream and is included in OpenBLAS version 0.3.21.
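For intuition, here is a NumPy model of the BFMMLA primitive as we understand it (a reference sketch of the semantics, not the actual OpenBLAS kernel): each 128-bit source register holds a 2x4 tile of BF16 values, and the instruction accumulates the product of one tile with the transpose of the other into a 2x2 FP32 tile. Real kernels tile large matrices over this primitive.

import numpy as np

def bf16(a):
    # Model BF16 storage by truncating the low 16 mantissa bits of FP32.
    bits = a.astype(np.float32).view(np.uint32)
    return ((bits >> 16) << 16).view(np.float32)

def bfmmla(c, a, b):
    # c: 2x2 FP32 accumulator; a, b: 2x4 BF16 tiles. Returns c + a @ b.T,
    # with products accumulated in FP32 as the hardware does.
    return c + bf16(a) @ bf16(b).T

c = np.zeros((2, 2), dtype=np.float32)
a = np.random.rand(2, 4).astype(np.float32)
b = np.random.rand(2, 4).astype(np.float32)
print(bfmmla(c, a, b))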

Figure 5: Matrix multiplication performance comparison of OpenBLAS. The matrices involved have 1000 rows and 1000 columns.

PyTorch CNN performance

Because OpenBLAS is the default backend of PyTorch on Arm, its matrix multiplication optimization carries over to deep learning models implemented in PyTorch. We take VGG-19 as an example, a model dominated by convolutional computation. During inference, all of its convolutional operators are converted to matrix multiplications, and OpenBLAS is called to perform the computation, as the measurement sketch below shows.
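A hedged way to check and measure this on a given instance (assuming torchvision is installed): print the PyTorch build configuration, which lists the BLAS backend the build uses, then time VGG-19 inference.

import time
import torch
import torchvision

print(torch.__config__.show())   # build info includes the BLAS backend

model = torchvision.models.vgg19(weights=None).eval()
x = torch.randn(1, 3, 224, 224)

with torch.no_grad():
    model(x)  # warm-up
    start = time.perf_counter()
    for _ in range(10):
        model(x)
    print(f"{(time.perf_counter() - start) / 10 * 1000:.1f} ms per inference")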

Figure 6: VGG-19 inference performance comparison.

Conclusion

This blog shows that on the Alibaba ECS g8m instance, the inference performance of several deep learning models is higher than on the g7 for equal-sized instances. This advantage comes mainly from the new Armv9 instructions and the continually improving software support (OneDNN, ACL, and OpenBLAS). The Alibaba Cloud compiler team has contributed some of these software optimizations, and we continue to focus on software and hardware optimization in this area to improve the competitiveness of Arm instances in ML/AI.
