Delivering the best H.265 video experience on Arm Neoverse N2 Platform

October 29, 2024


This blog post was co-authored by Wei Chen, Principal Software Engineer, Arm.

With the advancement of 5G, online video has become one of the primary mediums through which people obtain information. According to Ericsson, at the end of 2023 video traffic accounted for 73 percent of all mobile data traffic^[1], and the viewing resolution of mobile internet terminals has quickly jumped from 360P and 480P to 720P, 1080P, and even 4K/8K ultra-high-definition video. Beyond the demand for higher resolution, immersive video experiences such as AR (Augmented Reality) and VR (Virtual Reality) are increasingly demanding in terms of frame rate and color space. Each increase in resolution, frame rate, or color space multiplies the amount of video data, making bandwidth resources even more valuable. In this context, video coding technology, as the cornerstone of ultra-high-definition video, has become increasingly important. Advanced encoding and decoding technology can significantly reduce the bandwidth required for video transmission, which not only saves costs but also greatly enhances the user's viewing experience.

Video Codec

To cope with increasing video traffic, video standardization organizations (such as MPEG, ITU-T, and ISO) have continuously advanced video coding technology, from MPEG-2, H.264/AVC, and H.265/HEVC to the latest H.266/VVC. Today, H.264/AVC is still a widely used codec standard: it provides high-quality video streams at lower data rates, making it an ideal choice for online and mobile video platforms. H.265/HEVC further improves compression efficiency, theoretically achieving twice the compression efficiency of H.264 at the same video quality. This means H.265/HEVC needs about half the bandwidth of H.264 to transmit the same high-definition or 4K video stream, a saving that is particularly important for those formats. The latest H.266/VVC standard improves compression efficiency by more than 40% relative to H.265/HEVC at the same video quality, which is of great significance for transmitting 8K and future ultra-high-resolution video.
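
As a rough illustration of what these ratios mean for bandwidth, the sketch below applies the approximate figures quoted above (about 50% bitrate saving for H.265 over H.264, and a further 40% for H.266 over H.265) to a hypothetical 4K H.264 stream of 16 Mbps. The starting bitrate is an assumption chosen only for illustration; real savings vary with content and encoder settings.

```python
# Approximate equal-quality bitrate reductions, as quoted in the text.
# These are rule-of-thumb figures; real savings depend on content and settings.
HEVC_SAVING_VS_AVC = 0.50  # H.265 needs roughly half the bitrate of H.264
VVC_SAVING_VS_HEVC = 0.40  # H.266 saves more than 40% vs H.265 (lower bound used)


def equivalent_bitrates(avc_mbps: float) -> dict[str, float]:
    """Estimate equal-quality H.265 and H.266 bitrates from an H.264 bitrate."""
    hevc = avc_mbps * (1 - HEVC_SAVING_VS_AVC)
    # Upper bound for H.266, because the quoted saving is "more than 40%".
    vvc = hevc * (1 - VVC_SAVING_VS_HEVC)
    return {"H.264": avc_mbps, "H.265": hevc, "H.266": vvc}


if __name__ == "__main__":
    # Hypothetical 4K H.264 stream at 16 Mbps (illustrative value only).
    for codec, mbps in equivalent_bitrates(16.0).items():
        print(f"{codec}: {mbps:.1f} Mbps")
```

With the hypothetical 16 Mbps input this prints 16.0, 8.0, and 4.8 Mbps; since the H.266 saving is quoted as "more than 40%", the H.266 figure is an upper bound.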

Video transcoding is the process of converting video streams into different formats by changing parameters such as codec settings, resolution, and bitrate to accommodate different network conditions and end-user devices. It is widely employed to improve the user experience and, most importantly, to reduce both storage and network bandwidth costs. Although dedicated video acceleration cards (such as ASICs) and GPUs have shown excellent performance in specific video transcoding tasks, general-purpose server CPUs are gradually becoming the preferred option for video transcoding, especially in video-on-demand and live streaming, thanks to their flexibility and higher cost-performance ratio. CPU-based transcoding deployments benefit directly from rapid innovation in video codec algorithms and from continuous software optimization in both open-source and in-house codec libraries, providing a cost-effective alternative to dedicated acceleration cards and GPUs.

In the past few years, Arm and its ecosystem partners have worked closely with open-source communities to improve video codec performance on Arm platforms using the Neon and SVE/SVE2 instruction sets supported by Arm Neoverse CPUs. In the recent x265 release (version 4.0)^[2], x265 has been optimized on Arm to deliver notable frames per second (FPS) improvements across different encoding presets, providing substantial performance boosts in multi-stream encoding scenarios.

x265 benchmarking

In this post, we benchmark the latest x265 release (version 4.0) to compare video transcoding performance on the Neoverse N2-based Yitian710 and same-tier x86 instances (Intel Ice Lake, Intel Sapphire Rapids, and AMD Genoa), using the following platform configurations:

AliCloud Yitian710: ecs.c8y.8xlarge, 32 vCPUs, 64 GB memory, RMB 4.270933/hr
Intel Ice Lake: ecs.c7.8xlarge, 32 vCPUs, 64 GB memory, RMB 6.523183/hr
Intel Sapphire Rapids: ecs.c8i.8xlarge, 32 vCPUs, 64 GB memory, RMB 6.523183/hr
AMD Genoa: ecs.c8a.8xlarge, 32 vCPUs, 64 GB memory, RMB 6.1952/hr
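
The post does not publish its benchmark harness. A minimal sketch of how a multi-stream test like this can be driven, assuming one independent x265 process per transcoding task and aggregate FPS defined as total frames encoded divided by wall-clock time, might look like the following. The input file name, preset, and frame count are hypothetical placeholders; `--input`, `--preset`, `--frames`, and `-o` are standard x265 command-line options.

```python
import subprocess
import time


def x265_command(task_id: int, frames: int) -> list[str]:
    """Build one x265 encode command line (file name and preset are hypothetical)."""
    return [
        "x265",
        "--input", "input_1080p.y4m",  # hypothetical Y4M source clip
        "--preset", "medium",          # preset choice is an assumption
        "--frames", str(frames),       # encode a fixed number of frames
        "-o", f"out_{task_id}.hevc",   # one output file per task
    ]


def run_parallel(build_cmd, tasks: int, frames: int) -> float:
    """Start `tasks` encoder processes together and return aggregate FPS.

    Aggregate FPS = total frames encoded across all tasks / wall-clock seconds.
    Assumes the source clip contains at least `frames` frames.
    """
    start = time.perf_counter()
    procs = [
        subprocess.Popen(build_cmd(i, frames),
                         stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        for i in range(tasks)
    ]
    for p in procs:
        if p.wait() != 0:  # fail loudly if any encode did not complete
            raise subprocess.CalledProcessError(p.returncode, p.args)
    elapsed = time.perf_counter() - start
    return tasks * frames / elapsed
```

Calling `run_parallel(x265_command, n, frames=300)` for n = 1, 2, 4, 8, and 16 sweeps the task count in the same way as the scaling comparison in Figure 1. Launching separate processes keeps each stream independent, which matches the multi-stream transcoding scenario rather than a single large encode.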

We started by benchmarking performance scaling on Yitian710 and the same-tier x86 instances. As Figure 1 illustrates, the Neoverse N2-based Yitian710 shows much better scaling linearity in aggregate frames per second (FPS) as the number of parallel transcoding tasks increases from 1 to 16. Yitian710 also achieves up to 2.2 times higher throughput than the same-tier x86 instances with 16 parallel transcoding tasks, as shown in Figure 2.
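
Two simple metrics capture what Figures 1 and 2 compare: scaling efficiency, the aggregate FPS at N tasks divided by N times the single-task FPS (1.0 means perfectly linear scaling), and the throughput ratio between two platforms at the same task count. The FPS values in this sketch are hypothetical placeholders, not the measured results behind the figures.

```python
def scaling_efficiency(fps_by_tasks: dict[int, float]) -> dict[int, float]:
    """Aggregate FPS at N tasks divided by N x single-task FPS (1.0 = linear)."""
    base = fps_by_tasks[1]
    return {n: fps / (n * base) for n, fps in sorted(fps_by_tasks.items())}


def throughput_ratio(fps_a: float, fps_b: float) -> float:
    """How many times more frames per second platform A delivers than B."""
    return fps_a / fps_b


if __name__ == "__main__":
    # Hypothetical aggregate FPS per task count, for illustration only.
    platform_a = {1: 10.0, 4: 39.0, 16: 150.0}
    platform_b = {1: 12.0, 4: 40.0, 16: 80.0}
    print("A efficiency:", scaling_efficiency(platform_a))
    print("B efficiency:", scaling_efficiency(platform_b))
    ratio = throughput_ratio(platform_a[16], platform_b[16])
    print(f"16-task throughput ratio: {ratio:.2f}x")
```

An aggregate-FPS curve that bends away from a straight line shows up here as efficiency falling well below 1.0 at higher task counts.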

Figure 1: x265 transcoding aggregate performance scaling by number of transcoding tasks

Figure 2: x265 transcoding performance comparison with 16 parallel transcoding tasks

The Neoverse N2-based Yitian710 platform is also more cost-effective than the Intel Ice Lake, Intel Sapphire Rapids, and AMD Genoa instances, as reflected in its lower hourly price listed above (RMB 4.270933/hr versus RMB 6.1952 to 6.523183/hr). Combined with its higher throughput, this gives AliCloud Yitian710 a significant Total Cost of Ownership (TCO) advantage for x265 transcoding, delivering up to 3.3 times more frames per RMB than the same-tier x86 instances, a compelling advantage for users deploying video transcoding services in the cloud.
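
Frames per RMB follows directly from aggregate throughput and hourly price: multiply FPS by 3,600 seconds and divide by the RMB-per-hour price. The sketch below uses the hourly prices listed in the platform configuration; the FPS value is a hypothetical placeholder. It also shows why the price-performance gap can exceed the throughput gap: the frames-per-RMB ratio between two instances equals their throughput ratio multiplied by their price ratio.

```python
# Hourly prices (RMB/hr) for the instances listed in the platform configuration.
PRICES_RMB_PER_HR = {
    "Yitian710": 4.270933,
    "Ice Lake": 6.523183,
    "Sapphire Rapids": 6.523183,
    "Genoa": 6.1952,
}
SECONDS_PER_HOUR = 3600


def frames_per_rmb(aggregate_fps: float, price_rmb_per_hr: float) -> float:
    """Frames encoded per RMB: frames per hour divided by price per hour."""
    return aggregate_fps * SECONDS_PER_HOUR / price_rmb_per_hr


def price_performance_ratio(fps_a: float, price_a: float,
                            fps_b: float, price_b: float) -> float:
    """Frames per RMB of A relative to B, i.e. (fps_a / fps_b) * (price_b / price_a)."""
    return frames_per_rmb(fps_a, price_a) / frames_per_rmb(fps_b, price_b)


if __name__ == "__main__":
    # Placeholder: assume every instance reaches the same aggregate FPS, so any
    # difference printed below comes from price alone. NOT measured results.
    placeholder_fps = 100.0
    arm_price = PRICES_RMB_PER_HR["Yitian710"]
    for name, price in PRICES_RMB_PER_HR.items():
        ratio = price_performance_ratio(placeholder_fps, arm_price,
                                        placeholder_fps, price)
        print(f"Yitian710 vs {name}: {ratio:.2f}x frames per RMB")
```

Under the equal-FPS placeholder, this prints 1.00x, 1.53x, 1.53x, and 1.45x: the price difference alone is worth about 1.53x against the two Intel instances and 1.45x against Genoa, and any throughput advantage multiplies on top of that.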

Figure 3: x265 transcoding price performance comparison with 16 parallel transcoding tasks

Conclusion

As high-resolution streaming becomes more prevalent, demand is increasing for advanced compression codecs such as H.265 in cloud-based video streaming services. However, this enhanced compression requires more computational power and results in greater energy consumption. At the system level, the Arm Neoverse N2-based Alibaba Yitian710 provides better scaling and up to 2.2x higher transcoding throughput, while offering up to 3.3x higher price performance compared to same-tier x86 instances (Intel Ice Lake, Intel Sapphire Rapids, and AMD Genoa). Visit the Arm Developer Hub^[3] to learn how to build and run x265 on Arm servers and benefit from the outstanding transcoding performance and power efficiency of Arm Neoverse.

References
