This blog was co-authored by Willen Yang and Fred Jin
Automatic Speech Recognition (ASR) technology has permeated many aspects of modern life. It is used in a wide range of applications, from voice assistants and transcription services to call center analytics and speech-to-text translation, offering innovative solutions and enhancing user experiences across industries.
With recent advancements in machine learning and deep learning, ASR technology has reached a new level of sophistication. Now, ASR software can understand a wide array of accents, dialects, and speaking styles with high accuracy. FunASR is an advanced open-source ASR toolkit developed by Alibaba DAMO Academy. It provides a comprehensive set of tools and models for developing and deploying ASR systems.
FunASR supports both CPU and GPU compute. While GPUs offer superior performance for training deep learning models, CPUs are more prevalent in edge and datacenter servers and can be more suitable for model inference. Running ASR inference on CPUs enables deployment in scenarios where GPU acceleration is not feasible due to cost, power constraints, or lack of availability.
The Arm Neoverse N2 is a high-performance CPU designed for cloud and edge computing. It supports a wide range of cloud workloads, including AI and ML, with added AI capabilities such as SVE2 (Scalable Vector Extension 2), the Bfloat16 (BF16) data format, and MMLA (matrix multiply-accumulate) instructions.
Arm recently announced Arm Kleidi technology, a collection of developer enablement technologies designed to enhance AI performance on Arm platforms, including Arm Neoverse. Kleidi technology spans frameworks, highly optimized libraries, and the AI ISV ecosystem.
In this blog post, we walk through how to deploy FunASR inference on the Arm Neoverse N2-based Alibaba Yitian 710 platform and describe our benchmarking methodology. We then present a comparative analysis, with Arm Kleidi technologies enabled, that highlights the key price-performance advantages of running FunASR inference on Yitian 710 CPUs over other CPU and GPU-based platforms.
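Before benchmarking, it can be useful to confirm that the target machine actually exposes these features. The sketch below is a minimal check for Linux only. It assumes the kernel reports the features under the flag names `sve2`, `bf16`, and `i8mm` on the `Features` line of `/proc/cpuinfo`. The helper names `parse_cpu_features` and `missing_flags` are illustrative and not part of any library.

```python
from pathlib import Path

# Assumed Linux flag names for SVE2, BF16 and int8 matrix multiply support.
REQUIRED_FLAGS = {"sve2", "bf16", "i8mm"}


def parse_cpu_features(cpuinfo_text):
    """Collect the flags from every 'Features' line of /proc/cpuinfo text."""
    features = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "features":
            features.update(value.split())
    return features


def missing_flags(cpuinfo_text, required=REQUIRED_FLAGS):
    """Return the required flags that the CPU does not report."""
    return set(required) - parse_cpu_features(cpuinfo_text)


if __name__ == "__main__":
    path = Path("/proc/cpuinfo")
    if path.exists():
        missing = missing_flags(path.read_text())
        print("missing flags:", sorted(missing) if missing else "none")
    else:
        print("/proc/cpuinfo not found; this check only applies to Linux")
```

An empty result suggests the BF16 fast math path described later in this post should be usable on the machine.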
Software Version:
Model: speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch
Please make sure PyTorch and the associated Python libraries are installed on the system[1]. If you are running on an Arm platform, Arm also provides a PyTorch Docker image[2] on Docker Hub for quick evaluation.
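A quick way to verify the prerequisites is to check that the top-level packages imported by the benchmark script can be found. This is a minimal sketch using only the standard library. The helper name `missing_packages` is illustrative, and the package list is taken from the imports used in this post.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level packages from `names` that cannot be located."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Top-level packages imported by the benchmark script in this post.
    required = ["torch", "numpy", "funasr", "modelscope"]
    missing = missing_packages(required)
    print("missing packages:", ", ".join(missing) if missing else "none")
```

This only checks that the packages are present. It does not validate their versions.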
$ export OMP_NUM_THREADS=16
$ export DNNL_VERBOSE=1

import torch
import torch.autograd.profiler as profiler
import os
import random
import numpy as np
from funasr.tasks.asr import ASRTaskParaformer as ASRTask
from funasr.export.models import get_model
from modelscope.hub.snapshot_download import snapshot_download
Paraformer is an advanced automatic speech recognition (ASR) model developed by Alibaba DAMO Academy under the FunASR open-source project. The model is designed to improve the robustness and efficiency of end-to-end speech recognition systems. It builds on the Transformer architecture and introduces several innovations to improve its performance in speech recognition. For benchmarking, we use the FunASR Paraformer model from the ModelScope community[3].
model_dir = snapshot_download('damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch', cache_dir='./', revision=None)

# Set the random seeds to 0 for reproducibility
random.seed(0)
np.random.seed(0)
torch.random.manual_seed(0)

# Build the model on CPU from the downloaded config, weights, and am.mvn file
model, asr_train_args = ASRTask.build_model_from_file(
    os.path.join(model_dir, 'config.yaml'),
    os.path.join(model_dir, 'model.pb'),
    os.path.join(model_dir, 'am.mvn'),
    'cpu')
model = get_model(model, dict(feats_dim=560, onnx=False, model_name="model"))
The inference runs 10 iterations to get the average results.
batch = 64
seq_len = 93
dim = 560
speech = torch.randn((batch, seq_len, dim))
speech_lengths = torch.tensor([seq_len for _ in range(batch)], dtype=torch.int32)
with torch.no_grad():
    with profiler.profile(with_stack=True, profile_memory=False, record_shapes=True) as prof:
        for _ in range(10):
            model(speech, speech_lengths)
print(prof.key_averages(group_by_input_shape=True).table(sort_by='self_cpu_time_total', row_limit=200))
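The profiler output is broken down by operator. For a single end-to-end figure, a plain wall-clock timing loop can complement it. The sketch below is one possible way to do that and is not the measurement code used for the results in this post. The helper names `average_latency` and `throughput` and the warm-up step are assumptions added for illustration.

```python
import time


def average_latency(fn, iterations=10, warmup=2):
    """Return the mean wall-clock time in seconds of `fn()` over `iterations` runs."""
    for _ in range(warmup):  # untimed runs to absorb one-off setup costs
        fn()
    start = time.perf_counter()
    for _ in range(iterations):
        fn()
    return (time.perf_counter() - start) / iterations


def throughput(batch_size, latency_s):
    """Samples processed per second for one batch of `batch_size` inputs."""
    return batch_size / latency_s


if __name__ == "__main__":
    # Stand-in workload; in the benchmark this would wrap the model call.
    latency = average_latency(lambda: sum(range(100_000)))
    print(f"avg latency: {latency * 1e3:.3f} ms, "
          f"throughput at batch 64: {throughput(64, latency):.1f} samples/s")
```

In the benchmark itself, `fn` would be something like `lambda: model(speech, speech_lengths)`, called inside `torch.no_grad()`.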
As part of Arm Kleidi technologies, Arm Compute Library (ACL) provides optimized bfloat16 General Matrix Multiplication (GEMM) kernels that use the bfloat16 MMLA instructions. These instructions are available in Arm Neoverse N2 CPUs, and the kernels have been integrated into PyTorch through the oneDNN backend since the PyTorch 2.0 release. The fast math GEMM kernels in ACL can substantially improve on-CPU inference performance.
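To give a feel for what the bfloat16 format keeps and drops, the sketch below round-trips a Python float through a bfloat16-like encoding. bfloat16 has the same sign and exponent layout as float32, with only the top 7 mantissa bits. For simplicity the sketch truncates the low 16 bits, whereas real hardware conversions typically round, so treat it as an illustration rather than a model of the ACL kernels. The helper name `to_bf16` is hypothetical.

```python
import struct


def to_bf16(x):
    """Round-trip a value through bfloat16 by keeping the top 16 bits of its float32 encoding."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    bits &= 0xFFFF0000  # drop the low 16 mantissa bits (truncation, not rounding)
    (value,) = struct.unpack("<f", struct.pack("<I", bits))
    return value


if __name__ == "__main__":
    for x in (1.0, 3.14159, 0.1):
        print(f"{x!r} -> {to_bf16(x)!r}")
```

The dynamic range matches float32, but only about two to three significant decimal digits survive. This is why the fast math mode is a trade-off that should be validated against accuracy requirements.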
To enable the fast math GEMM kernels, please set the following environment variable before you run the inference:
$ export DNNL_DEFAULT_FPMATH_MODE=BF16
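The same settings can also be applied from inside a Python script, as long as that happens before `torch` is imported. This sketch assumes the OpenMP and oneDNN runtimes read these variables when they initialize. The helper name `configure_onednn` is hypothetical, and the default thread count mirrors the `OMP_NUM_THREADS=16` setting used earlier in this post.

```python
import os


def configure_onednn(threads=16, fast_math=True, verbose=False):
    """Set the environment variables used in this post; call before importing torch."""
    settings = {"OMP_NUM_THREADS": str(threads)}
    if fast_math:
        settings["DNNL_DEFAULT_FPMATH_MODE"] = "BF16"
    if verbose:
        settings["DNNL_VERBOSE"] = "1"
    # Variables already set in the shell are left alone unless overridden here.
    os.environ.update(settings)
    return settings


if __name__ == "__main__":
    applied = configure_onednn()
    print(applied)
    # import torch  # import only after the variables above are in place
```

Running the benchmark once with `fast_math=False` and once with `fast_math=True` is a simple way to compare the FP32 and BF16 fast math paths on the same machine.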
We found that enabling the bfloat16 fast math kernels on the Neoverse N2-based Yitian 710 platform improves performance by around 2.3x compared with the default FP32 kernels.
We also compared the performance of the FunASR Paraformer model on Yitian 710 (Arm Neoverse N2) against other same-tier cloud instances on AliCloud [*].
* The AliCloud Yitian 710 instance uses the armswdev/pytorch-arm-neoverse:r24.07-torch-2.3.0-onednn-acl Docker image[2]. The Intel Sapphire Rapids and AMD Genoa instances use official PyTorch v2.3.0.
We found that the Arm Neoverse N2-based Yitian 710 delivers up to 2.4 times better inference performance for the Paraformer automatic speech recognition model with BF16 fast math kernels.
In real-world inference deployments, cost is a major consideration that significantly affects the practical implementation and adoption of AI technologies. To give an overall TCO (Total Cost of Ownership) view of ASR inference deployment across CPU and GPU platforms, we also added the NVIDIA A10 GPU to the comparison. With the industry-leading performance and power efficiency of Arm Neoverse N2, the AliCloud Yitian 710 platform is much more cost-effective than same-tier x86 instances and GPU platforms, which is reflected in the lower pricing of AliCloud Yitian 710 instances shown below.
| Platform | Instance type | Pricing (RMB per hour) |
|---|---|---|
| Arm Neoverse N2 (Yitian 710) | ecs.c8y.4xlarge | 2.135466 |
| 4th Gen Intel Xeon “Sapphire Rapids” | ecs.c8i.4xlarge | 3.261591 |
| 4th Gen AMD EPYC “Genoa” | ecs.c8a.4xlarge | 3.0976 |
| NVIDIA A10 | ecs.gn7i-c16g1.4xlarge | 10.0934 |
The benchmarking results show that AliCloud Yitian 710 has a significant TCO advantage for ASR inference deployment, providing up to 3.5 times better price-performance than same-tier x86 and GPU platforms.
The Arm Neoverse N2-based Alibaba Yitian 710, with ML-specific features such as the bfloat16 MMLA extension, delivers outstanding inference performance for FunASR Paraformer models with Arm Kleidi technologies. Developers building ASR applications on Alibaba Yitian 710 can achieve the best price-performance.
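The arithmetic behind a price-performance comparison can be sketched as follows. The hourly prices are copied from the pricing table in this post. The throughput and speedup arguments are left as inputs because per-platform throughput numbers are not listed here. The function names and the "work per RMB" definition are assumptions made for illustration and are not necessarily the exact metric used for the figures above.

```python
# Hourly prices in RMB, copied from the pricing table in this post.
PRICE_RMB_PER_HOUR = {
    "Yitian 710 (ecs.c8y.4xlarge)": 2.135466,
    "Sapphire Rapids (ecs.c8i.4xlarge)": 3.261591,
    "Genoa (ecs.c8a.4xlarge)": 3.0976,
    "NVIDIA A10 (ecs.gn7i-c16g1.4xlarge)": 10.0934,
}
BASELINE = "Yitian 710 (ecs.c8y.4xlarge)"


def price_ratio(platform, baseline=BASELINE):
    """How many times more the platform costs per hour than the baseline."""
    return PRICE_RMB_PER_HOUR[platform] / PRICE_RMB_PER_HOUR[baseline]


def price_performance(throughput, price_per_hour):
    """Work done per RMB: throughput (samples/s) times 3600 s, divided by hourly price."""
    return throughput * 3600.0 / price_per_hour


def price_performance_advantage(speedup, platform, baseline=BASELINE):
    """Baseline price-performance relative to `platform`, where `speedup` is the
    measured baseline throughput divided by the platform's throughput."""
    return speedup * price_ratio(platform, baseline)


if __name__ == "__main__":
    for name in PRICE_RMB_PER_HOUR:
        print(f"{name}: {price_ratio(name):.2f}x the Yitian 710 hourly price")
```

With measured throughput for each instance, `price_performance` gives a directly comparable "samples per RMB" figure, and `price_performance_advantage` combines a measured speedup with the price ratio.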