Arm-based cloud instances outperform x86 instances by up to 64% on VP9 encoding

November 6, 2023

Video streaming is a key cloud application that draws close attention across the industry. The COVID-19 global pandemic has had a clear impact on video industry trends and challenges. For instance, according to a research report from Bitmovin [1], live streaming at scale and low latency have become the top two areas for innovation, while video quality remains a top concern.

To adapt to this trend and meet new customer requirements such as lower latency, higher resolution, and higher bit depth, popular video codecs have evolved from x264 to x265, VP9, and AV1. VP9 in particular has been widely deployed in production, accounting for 17% of live-streaming and 10% of VoD encoding [1]. Compared with x264, VP9 is believed to provide better compression efficiency, enabling higher video quality at lower bit rates, which is particularly beneficial for streaming services and users with limited bandwidth [2].

To help customers deploy VP9 on increasingly popular Arm-based cloud servers, Arm and its partner VectorCamp contributed numerous open-source optimizations between 2021 and 2023 and achieved significant performance improvements. These improvements come mainly from Arm Neon technology [5], an advanced SIMD architecture extension for Arm processors.

Specifically, VectorCamp [8] has contributed Neon implementations of the high bit-depth functions, starting from all the FDCT/FHT functions for various sizes (4x4 up to 32x32), refactoring and simplifying the DCT code and its helper functions in the process. In addition, VectorCamp provided optimized versions of the temporal filter, color quantization, diamond-search SAD and non-SDOT implementations for high bit-depth variance, subtract block and SAD functions.

For standard bit-depth, Arm has contributed a series of Armv8.4-A DotProd Neon optimizations to accelerate the standard bit-depth convolution, sum-of-absolute-differences (SAD) and variance functions [9]. Arm has also improved the existing Armv8.0 Neon implementations of SAD, SAD4D, variance and sub-pixel variance, plus a host of helper functions for multi-vector reduction and matrix transposition. For Armv8.6-A, Arm contributed I8MM implementations of the convolution algorithms, and added new Neon code for the mean-squared-error (MSE), block error and intra-frame block predictor functions.
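The idea behind using a dot-product instruction for SAD can be sketched in a few lines of Python. This is a scalar model only, not the libvpx Neon code. It assumes a 16-byte vector whose absolute differences are multiplied by a vector of ones and accumulated into four 32-bit lanes, which is one common way to use such an instruction for reductions. The function names are hypothetical.

```python
def sad_scalar(src, ref):
    """Reference SAD: sum of |src[i] - ref[i]| over two equal-length pixel lists."""
    if len(src) != len(ref):
        raise ValueError("blocks must contain the same number of pixels")
    return sum(abs(s - r) for s, r in zip(src, ref))


def sad_dot_product_model(src, ref, lanes=4):
    """Model a SAD reduction built on a 4-way 8-bit dot product.

    Assumption for illustration: each step handles 4 * lanes bytes, and every
    32-bit lane accumulates the dot product of four absolute differences with
    four ones, which equals the sum of those four differences.
    """
    step = 4 * lanes
    if len(src) != len(ref) or len(src) % step != 0:
        raise ValueError("blocks must match and be a multiple of the vector width")
    ones = [1, 1, 1, 1]
    acc = [0] * lanes  # one wide accumulator per lane
    for base in range(0, len(src), step):
        abs_diff = [abs(src[base + i] - ref[base + i]) for i in range(step)]
        for lane in range(lanes):
            group = abs_diff[4 * lane:4 * lane + 4]
            acc[lane] += sum(d * o for d, o in zip(group, ones))
    return sum(acc)  # final horizontal add across lanes
```

Both functions return the same value for any valid block. The second one only makes visible that the widening and accumulation happen in a single step per group of four pixels, which is where such an instruction can save work compared with separate widen-and-add steps.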

For high bit-depth, Arm has optimized the SAD and SAD4D Neon paths and contributed new Neon paths for Hadamard transforms, minmax, averaging, SATD, block error and intra-frame block predictor functions. Arm also ported its standard bit-depth sub-pixel variance optimizations to the high bit-depth paths.

A full list of contributions by Konstantinos Margaritis (VectorCamp), Jonathan Wright (Arm), Salomé Thirot (Arm), George Steed (Arm), Gerda Zsejke More (Arm), and their teams can be found on the libvpx log page [10].

To illustrate the performance uplift of these Neon optimizations, and the performance and TCO benefits of running VP9 on Arm servers, we conducted extensive experiments on AWS EC2 instances.

System Configurations

We measured performance on AWS C6g, C7g and C6a 16xlarge instances, as listed in Table 1. C6g instances are based on AWS Graviton2, which uses Arm Neoverse N1 cores, while C7g instances are based on AWS Graviton3, which uses Arm Neoverse V1 cores. C6a instances are based on AMD Milan processors. All instances run Ubuntu 22.04 with GCC 13.1. We compared the Original VP9 branch, commit 2eb934d from 2021 [6], with the Optimized branch, commit 60ee1b1 from 2023 [7]. The test videos are the 8-bit and 10-bit 4K Bosphorus sequences downloaded from Ultravideo [3] and the 8-bit and 10-bit 1080P 4Ever sequences downloaded from the HEVC DASH data set [4].

AWS Instance	Architecture	vCPUs	DRAM (GB)	On-demand Price
C6g.16xlarge	Arm	64	128	$2.176/h
C7g.16xlarge	Arm	64	128	$2.312/h
C6a.16xlarge	x86	64	128	$2.448/h

Table 1: AWS VM Configuration and Pricing information

Performance Evaluation

We conducted two sets of experiments. The first illustrates the performance improvement on Arm from our optimizations, and the second shows the performance advantage of Arm over competitive offerings. All results presented below are full-socket performance with the same number of CPU cores (64 vCPUs) in use. In all the figures, --cpu-used is the VP9 parameter that sets the trade-off between encoding speed and video quality. A lower value provides higher video quality and is typically used in Video on Demand (VoD) scenarios, whereas a higher value provides higher encoding speed at some cost to video quality or rate control accuracy and is typically used in live-streaming scenarios.
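As a rough illustration of how --cpu-used enters an encode, the sketch below builds a vpxenc command line for the two settings used in the figures. The exact command lines used for the measurements are not given in this post, so the deadline mode, the thread count, the 0 to 9 range check, and the file names are all assumptions. The helper name build_vpxenc_command is hypothetical.

```python
import shlex


def build_vpxenc_command(input_path, output_path, cpu_used, threads=64, deadline="good"):
    """Return a vpxenc argument list for a VP9 encode (illustrative only).

    Assumptions: the input is a Y4M file, so no explicit size or frame-rate
    flags are passed, and cpu_used is limited to 0..9 for this sketch.
    """
    if not 0 <= cpu_used <= 9:
        raise ValueError("this sketch only accepts --cpu-used values from 0 to 9")
    if deadline not in ("good", "rt"):
        raise ValueError("deadline must be 'good' or 'rt'")
    return [
        "vpxenc",
        "--codec=vp9",
        f"--{deadline}",            # quality deadline: good quality or realtime
        f"--cpu-used={cpu_used}",   # lower: better quality, higher: faster
        f"--threads={threads}",
        "-o", output_path,
        input_path,
    ]


# Hypothetical file names; 4 leans toward VoD, 8 toward live streaming.
for speed in (4, 8):
    cmd = build_vpxenc_command("input.y4m", f"out_cpu{speed}.webm", speed)
    print(shlex.join(cmd))
```

Keeping the command as a list makes it safe to pass to subprocess.run without shell quoting issues, and the printed form is only for inspection.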

Performance Improvement on Arm

To measure the performance improvement of the latest VP9 commit, we benchmarked 8-bit 1080P and 4K videos. As shown in Figure 1, for 8-bit 1080P video, the Optimized version increases FPS over the Original version by 37.2% on C6g and by 37.8% on C7g. Similarly, as Figure 2 shows, FPS increases by 21.3% on C6g and 15.3% on C7g for 8-bit 4K input.

Figure 1: Performance improvement for 8-bit 1080P video

Figure 2: Performance improvement for 8-bit 4K video

For 10-bit inputs, the performance boost is more significant. For 10-bit 1080P video, FPS increases by 70.7% and 61.6% on C6g and C7g, respectively, as Figure 3 shows. For 10-bit 4K video, FPS increases from 5.12 to 12.39, or by 142.0%, on C6g, and from 8.95 to 18.61, or by 108.0%, on C7g.

Figure 3: Performance improvement for 10-bit videos

Overall, across the cases above, the performance uplift we observed ranges from 15.3% to 142.0%. The Neon optimizations contributed to VP9 by Arm and its partners therefore boost performance significantly.
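The uplift percentages follow directly from the measured frame rates. As a small worked example, the snippet below recomputes the 10-bit 4K uplift from the FPS values quoted earlier. The helper name uplift_percent is hypothetical.

```python
def uplift_percent(baseline_fps, optimized_fps):
    """Relative FPS gain of the optimized build over the baseline, in percent."""
    if baseline_fps <= 0:
        raise ValueError("baseline FPS must be positive")
    return (optimized_fps / baseline_fps - 1.0) * 100.0


# 10-bit 4K FPS values (Original, Optimized) as quoted in the text.
measurements = {
    "C6g": (5.12, 12.39),
    "C7g": (8.95, 18.61),
}

for instance, (original, optimized) in measurements.items():
    print(f"{instance}: {uplift_percent(original, optimized):.1f}% higher FPS")
```

With these rounded inputs the result is about 142.0% for C6g and about 107.9% for C7g. The small gap to the reported 108.0% is presumably due to rounding in the published FPS values.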

Competitive Analysis

To compare full-socket performance on Arm and x86, we report FPS on C6a (AMD Milan) and C7g (AWS Graviton3, Arm Neoverse V1).

For 8-bit 1080P video, Arm-based C7g achieves 30.6% higher FPS than x86-based C6a when --cpu-used is 4, and 64.0% higher FPS when --cpu-used is 8, as Figure 4 shows. For 8-bit 4K video (Figure 5), FPS on C7g is higher than on C6a by 43.1% and 50.0% when --cpu-used is 4 and 8, respectively.

Figure 4: Performance comparison for 8-bit 1080P video

Figure 5: Performance comparison for 8-bit 4K video

For 10-bit videos, the advantage is similarly large. Figure 6 shows that Arm-based C7g provides up to 63.2% and 42.4% higher FPS than x86-based C6a for 1080P and 4K videos, respectively.

Figure 6: Competitive analysis for 10-bit videos

Overall, across the cases above, the performance benefits we observed on Arm-based C7g EC2 instances range from 30.6% to 64.0% compared with x86-based C6a.
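These FPS advantages can be combined with the on-demand prices in Table 1 to get a rough feel for the cost side. The sketch below is a back-of-the-envelope calculation rather than a reported result. It assumes encoding cost scales linearly with instance-hours, and that the listed prices apply, although they vary by region and over time. The function name is hypothetical.

```python
# On-demand prices in USD per hour, taken from Table 1.
PRICE_PER_HOUR = {
    "c7g.16xlarge": 2.312,
    "c6a.16xlarge": 2.448,
}


def frames_per_dollar_advantage(fps_advantage_percent, price_a, price_b):
    """Percent more frames per dollar on instance A than on instance B.

    fps_advantage_percent is how much higher A's FPS is than B's.
    """
    if price_a <= 0 or price_b <= 0:
        raise ValueError("prices must be positive")
    speed_ratio = 1.0 + fps_advantage_percent / 100.0
    return (speed_ratio * price_b / price_a - 1.0) * 100.0


c7g = PRICE_PER_HOUR["c7g.16xlarge"]
c6a = PRICE_PER_HOUR["c6a.16xlarge"]

# Lower and upper ends of the observed FPS advantage of C7g over C6a.
for fps_gain in (30.6, 64.0):
    gain = frames_per_dollar_advantage(fps_gain, c7g, c6a)
    print(f"{fps_gain:.1f}% higher FPS -> about {gain:.1f}% more frames per dollar")
```

Under these assumptions, the lower hourly price of C7g adds to its throughput advantage, giving roughly 38% to 74% more encoded frames per dollar than C6a.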

Summary

Based on our evaluation, deploying VP9 on Arm platforms is promising. With continuous optimization over the past two years, VP9 achieves a significant performance uplift of up to 142.0% on Arm instances compared with the original branch, which lacks these Neon optimizations. Compared with x86-based C6a, the latest VP9 codec provides up to 64.0% higher FPS on Arm-based C7g.

References

[1] The 6th Annual Bitmovin Video Developer Report, Shaping the future of video 2022/2023, https://bitmovin.com/wp-content/uploads/2022/12/bitmovin-6th-video-developer-report-2022-2023.pdf
[2] VP9 Codec: The Complete Guide to Google’s Open Source Video Codec, https://bitmovin.com/vp9-codec-status-quo/
[3] Ultravideo, https://ultravideo.fi/#testsequences
[4] Ultra High Definition HEVC DASH Data Set, https://download.tsi.telecom-paristech.fr/gpac/dataset/dash/uhd/
[5] Arm NEON, https://www.arm.com/technologies/neon
[6] VP9 Source Code, Original version, https://chromium.googlesource.com/webm/libvpx/+/2eb934d9c1fb4a460e3f03c8578b7b4f4f195784
[7] VP9 Source Code, Optimized version, https://chromium.googlesource.com/webm/libvpx/+/60ee1b149bd3de8e53857ff5a9f7f19d9398144e
[8] VectorCamp, https://www.vectorcamp.gr
[9] DotProduct by Arm, https://developer.arm.com/documentation/102651/a/Use-case--improving-VP9-performance
[10] Libvpx log, https://chromium.googlesource.com/webm/libvpx/+log
