Kleidi Technology Delivers Best Price-Performance for ASR on Arm Neoverse N2

September 16, 2024

5 minute read time.

This blog post was co-authored by Willen Yang and Fred Jin

Automatic Speech Recognition (ASR) technology has permeated many aspects of modern life. It is used in a wide range of applications, from voice assistants and transcription services to call center analytics and speech-to-text translation, enhancing user experiences across industries.

With recent advancements in machine learning and deep learning, ASR technology has reached a new level of sophistication. Now, ASR software can understand a wide array of accents, dialects, and speaking styles with high accuracy. FunASR is an advanced open-source ASR toolkit developed by Alibaba DAMO Academy. It provides a comprehensive set of tools and models for developing and deploying ASR systems.

FunASR supports both CPU and GPU compute. While GPUs offer superior performance for training deep learning models, CPUs are more prevalent in edge and datacenter servers and can be more suitable for model inference. Running ASR inference on CPUs enables deployment in scenarios where GPU acceleration is not feasible due to cost, power constraints, or lack of availability.

The Arm Neoverse N2 is a high-performance CPU designed for cloud and edge computing. It supports a wide range of cloud workloads, including AI and ML, with added AI capabilities such as SVE2 (Scalable Vector Extension 2), the Bfloat16 (BF16) data format, and MMLA:

SVE2 lets developers operate on larger data vectors, improving parallelism and execution efficiency, which matters for the heavy mathematical computation in both the training and inference phases of AI models.
Bfloat16 is a floating-point format designed for AI and machine learning applications. BF16 offers the same dynamic range as a 32-bit floating-point number but uses only 16 bits of storage, which reduces model size and increases computational efficiency. It carries fewer mantissa bits than FP32, a loss of precision that most AI workloads tolerate well.
MMLA (matrix multiply-accumulate) is an architecture feature introduced in Armv8.6-A. It accelerates GEMM (General Matrix Multiplication), a core operation in machine learning that multiplies two input matrices to produce one output matrix.
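To make the BF16 and GEMM descriptions above concrete, here is a minimal pure-Python sketch. It rounds FP32 values to BF16 by keeping the top 16 bits, then runs a small matrix multiply whose inputs are BF16 and whose accumulator is wider. The function names `to_bf16` and `gemm_bf16` are illustrative only, and the Python float accumulator merely stands in for the FP32 accumulators used by the hardware; this is not how ACL or the MMLA instructions are implemented.

```python
import struct


def to_bf16(x: float) -> float:
    """Round a finite value in FP32 range to the nearest bfloat16 value.

    BF16 keeps the sign bit, all 8 exponent bits, and the top 7 mantissa
    bits of the FP32 encoding, so the dynamic range is unchanged.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest, ties to even, on the 16 low bits being dropped.
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def gemm_bf16(a, b):
    """Compute C = A @ B with BF16-rounded inputs and a wider accumulator."""
    rows, inner, cols = len(a), len(b), len(b[0])
    c = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = 0.0  # stands in for the FP32 accumulator
            for p in range(inner):
                acc += to_bf16(a[i][p]) * to_bf16(b[p][j])
            c[i][j] = acc
    return c


if __name__ == "__main__":
    # A value near the top of the FP32 range survives the conversion,
    # while a small increment next to 1.0 is rounded away.
    print(to_bf16(3.0e38), to_bf16(1.0 + 2 ** -10))
    print(gemm_bf16([[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]))
```

Keeping the accumulator wider than the inputs is what lets BF16 kernels trade input precision for speed without the rounding error growing with every added product.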

Arm recently announced Arm Kleidi technology, a collection of developer enablement technologies designed to enhance AI performance on Arm platforms, including Arm Neoverse. Kleidi's impact spans frameworks, highly optimized libraries, and the AI ISV ecosystem.

In this blog post, we share the steps to deploy FunASR inference on the Arm Neoverse N2-based Alibaba Yitian 710 platform, along with our benchmarking methodology. With Arm Kleidi technologies enabled, we then present a comparative analysis that highlights the price-performance advantages of running FunASR inference on Yitian 710 CPUs over other CPU- and GPU-based platforms.

Benchmarking setup

Software versions:

Ubuntu 22.04 (64-bit)
PyTorch v2.3.0
FunASR 0.8.8 (pip install funasr==0.8.8)
ModelScope 1.10.0 (pip install modelscope==1.10.0)

Model: speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch

Make sure PyTorch and the associated Python libraries are installed on the system^[1]. If you are running on an Arm platform, Arm also provides a PyTorch Docker image^[2] on Docker Hub for quick evaluation.
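Before installing anything, it can help to confirm that the CPU advertises the features discussed earlier. The sketch below parses the `Features` line that Linux exposes in `/proc/cpuinfo` on arm64 and reports which of `sve2` and `bf16` are missing. The helper names are hypothetical, and the check assumes a Linux system; it is only a convenience and not part of the original benchmarking flow.

```python
import os


def cpu_features(cpuinfo_text: str) -> set:
    """Return the feature flags from the first Features/flags line found."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() in ("features", "flags"):
            return set(value.split())
    return set()


def missing_features(cpuinfo_text: str, required=("sve2", "bf16")) -> list:
    """List the required flags that the CPU does not report."""
    present = cpu_features(cpuinfo_text)
    return [flag for flag in required if flag not in present]


if __name__ == "__main__":
    path = "/proc/cpuinfo"  # Linux only; skipped elsewhere
    if os.path.exists(path):
        with open(path) as handle:
            print("missing features:", missing_features(handle.read()))
```

An empty list means both flags are reported; the int8 matrix-multiply feature appears separately as `i8mm` and can be added to `required` if needed.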

1. Initialize the environment and import the required dependencies:

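Set the environment variables in the shell before starting Python. OMP_NUM_THREADS=16 matches the 16 vCPUs of the instances used here, and DNNL_VERBOSE=1 makes oneDNN log the primitives it executes:

$ export OMP_NUM_THREADS=16
$ export DNNL_VERBOSE=1

Then import the dependencies in Python:

import os
import random

import numpy as np
import torch
import torch.autograd.profiler as profiler
from funasr.export.models import get_model
from funasr.tasks.asr import ASRTaskParaformer as ASRTask
from modelscope.hub.snapshot_download import snapshot_download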

2. Download and configure the model:

Paraformer is an advanced ASR model developed by Alibaba DAMO Academy as part of the FunASR open-source project. It is designed to improve the robustness and efficiency of end-to-end speech recognition systems. It builds on the Transformer architecture and introduces several innovations that improve performance for speech recognition. For benchmarking, we use the FunASR Paraformer model from the ModelScope community^[3].

# Download the model from ModelScope into the current directory.
model_dir = snapshot_download('damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch', cache_dir='./', revision=None)

# Set the random seeds to 0 for reproducibility.
random.seed(0)
np.random.seed(0)
torch.random.manual_seed(0)

# Build the model on CPU from the downloaded config, weights, and CMVN files.
model, asr_train_args = ASRTask.build_model_from_file(
    os.path.join(model_dir, 'config.yaml'),
    os.path.join(model_dir, 'model.pb'),
    os.path.join(model_dir, 'am.mvn'),
    'cpu',
)
model = get_model(model, dict(feats_dim=560, onnx=False, model_name="model"))

3. Run with profiler to get the model inference result:

The inference runs 10 iterations to get the average results.

batch = 64
seq_len = 93
dim = 560

# Random input features with shape (batch, frames, feature dimension).
speech = torch.randn((batch, seq_len, dim))
speech_lengths = torch.tensor([seq_len for _ in range(batch)], dtype=torch.int32)

with torch.no_grad():
    # Profile 10 forward passes, then print per-operator CPU time.
    with profiler.profile(with_stack=True, profile_memory=False, record_shapes=True) as prof:
        for _ in range(10):
            model(speech, speech_lengths)
    print(prof.key_averages(group_by_input_shape=True).table(sort_by='self_cpu_time_total', row_limit=200))
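The profiler table breaks time down per operator. For a single end-to-end number, a wall-clock loop is a common complement. The sketch below averages latency over 10 iterations, as in the step above, and converts it to throughput. The helper names and the warm-up runs are assumptions added for illustration and are not part of the original methodology.

```python
import statistics
import time


def time_inference(fn, iterations=10, warmup=2):
    """Call fn repeatedly and return (mean latency in seconds, all samples)."""
    for _ in range(warmup):
        fn()  # warm-up runs are discarded (an assumption, not in the post)
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples), samples


def throughput(batch_size, mean_latency_s):
    """Inputs processed per second for a given batch size and latency."""
    return batch_size / mean_latency_s


if __name__ == "__main__":
    # Stand-in workload; with the model loaded as above one would pass
    # lambda: model(speech, speech_lengths) inside torch.no_grad().
    mean_s, _ = time_inference(lambda: sum(range(100_000)))
    print(f"mean latency: {mean_s:.6f} s")
```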

Inference speedup with bfloat16 Fast Math Kernels

As part of Arm Kleidi technologies, Arm Compute Library (ACL) provides optimized bfloat16 General Matrix Multiplication (GEMM) kernels that use the bfloat16 MMLA instructions. These instructions are available on Arm Neoverse N2 CPUs, and the kernels have been integrated into PyTorch through the oneDNN backend since the PyTorch 2.0 release. The fast math GEMM kernels in ACL can substantially improve on-CPU inference performance.

To enable the fast math GEMM kernels, set the following environment variable before running inference:

$ export DNNL_DEFAULT_FPMATH_MODE=BF16

With the bfloat16 fast math kernels enabled on the Neoverse N2-based Yitian 710 platform, we measured around 2.3x better performance than with the default FP32 kernels.
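One way to compare the two modes is to run the same benchmark script twice in child processes, once with the variable unset and once with it set to BF16. The sketch below only builds the environments and computes the ratio. The helper names and the script name `bench.py` are hypothetical, and the latencies in the example are placeholders, not measured results.

```python
import os

FPMATH_VAR = "DNNL_DEFAULT_FPMATH_MODE"


def fpmath_env(mode=None, base=None):
    """Return a copy of the environment with the fast math mode set or cleared."""
    env = dict(os.environ if base is None else base)
    if mode is None:
        env.pop(FPMATH_VAR, None)  # default FP32 kernels
    else:
        env[FPMATH_VAR] = mode  # for example "BF16"
    return env


def speedup(baseline_latency_s, optimized_latency_s):
    """How many times faster the optimized run is than the baseline."""
    return baseline_latency_s / optimized_latency_s


if __name__ == "__main__":
    # Each environment would be passed to a child process, for example:
    #   subprocess.run([sys.executable, "bench.py"], env=fpmath_env("BF16"))
    # where bench.py is a hypothetical script wrapping the steps above.
    print(fpmath_env("BF16", base={}))
    print(speedup(2.0, 1.0))  # placeholder latencies, not measurements
```

Launching a fresh process per mode avoids relying on the variable being read after the library has already initialized.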

Figure: Paraformer inference

Performance comparisons

We also compared the performance of the FunASR Paraformer model on Yitian 710 (Arm Neoverse N2) against other same-tier cloud instances on AliCloud^[*].

Arm Neoverse N2 (Yitian 710): ecs.c8y.4xlarge (16 vCPU + 32GB)
4th Gen Intel Xeon “Sapphire Rapids”: ecs.c8i.4xlarge (16 vCPU + 32GB)
4th Gen AMD EPYC “Genoa”: ecs.c8a.4xlarge (16 vCPU + 32GB)

* The AliCloud Yitian 710 instance uses the armswdev/pytorch-arm-neoverse:r24.07-torch-2.3.0-onednn-acl Docker image^[2]; the Intel Sapphire Rapids and AMD Genoa instances use official PyTorch v2.3.0.

We found that the Arm Neoverse N2-based Yitian 710 delivers up to 2.4 times better inference performance for the Paraformer ASR model with BF16 fast math kernels.

Figure: FunASR inference latency with BF16

In real-world deployments, cost is a major consideration that directly affects how widely AI technologies are adopted. To give an overall TCO (Total Cost of Ownership) view of ASR inference across CPU and GPU platforms, we also added the NVIDIA A10 GPU to the comparison. Thanks to the performance and power efficiency of Arm Neoverse N2, the AliCloud Yitian 710 platform is more cost-effective than same-tier x86 instances and GPU platforms, which is reflected in the lower pricing of AliCloud Yitian 710 instances shown below.

Platform	Instance type	Pricing (RMB per Hour)
Arm Neoverse N2 (Yitian 710)	ecs.c8y.4xlarge	2.135466
4th Gen Intel Xeon “Sapphire Rapids”	ecs.c8i.4xlarge	3.261591
4th Gen AMD EPYC “Genoa”	ecs.c8a.4xlarge	3.0976
NVIDIA A10	ecs.gn7i-c16g1.4xlarge	10.0934

The benchmarking results show that AliCloud Yitian 710 has a significant TCO advantage for ASR inference deployment, providing up to 3.5 times better price-performance than same-tier x86 and GPU platforms.
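The price-performance comparison comes down to simple arithmetic: throughput divided by hourly price. The sketch below uses the instance prices from the table above. The function names are illustrative, and because the measured throughput values are not listed in this post, the example only prints how much more each instance costs per hour than the Yitian 710 instance.

```python
# Hourly prices in RMB, copied from the pricing table above.
PRICE_RMB_PER_HOUR = {
    "ecs.c8y.4xlarge": 2.135466,
    "ecs.c8i.4xlarge": 3.261591,
    "ecs.c8a.4xlarge": 3.0976,
    "ecs.gn7i-c16g1.4xlarge": 10.0934,
}


def throughput_per_rmb(samples_per_second, price_rmb_per_hour):
    """Inputs processed per RMB spent, given sustained throughput."""
    return samples_per_second * 3600.0 / price_rmb_per_hour


def relative_price_performance(perf_ratio, price, baseline_price):
    """Price-performance of a platform relative to a baseline.

    perf_ratio is the platform's throughput divided by the baseline's.
    """
    return perf_ratio * baseline_price / price


if __name__ == "__main__":
    yitian = PRICE_RMB_PER_HOUR["ecs.c8y.4xlarge"]
    for name, price in PRICE_RMB_PER_HOUR.items():
        # With perf_ratio=1.0 this is the pure price ratio versus Yitian 710.
        ratio = relative_price_performance(1.0, yitian, price)
        print(f"{name}: {ratio:.2f}x the Yitian 710 hourly price")
```

Multiplying the price ratio by a measured performance ratio gives the relative price-performance for a given pair of platforms.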

Figure: FunASR inference throughput per price

Conclusion

The Arm Neoverse N2-based Alibaba Yitian 710, with ML-specific features such as the bfloat16 MMLA extension, delivers outstanding inference performance for FunASR Paraformer models with Arm Kleidi technologies. Developers building ASR applications on Alibaba Yitian 710 can achieve the best price-performance among the platforms compared here.

Reference
