Video streaming is a key cloud application that attracts intense attention across the industry. The COVID-19 global pandemic has had a clear impact on video industry trends and challenges. For instance, according to a research report from Bitmovin, live streaming at scale and low latency have become the top two areas for innovation, while video quality remains a top concern.
To adapt to this trend and meet new customer requirements, such as lower latency, higher resolution, and higher bit depth, popular video codecs have evolved from H.264 (x264) to H.265 (x265), VP9, and AV1. In particular, VP9 has been widely deployed in production, where it accounts for 17% of live-streaming and 10% of VoD encoding. Compared with H.264, VP9 is believed to provide better compression efficiency, enabling higher video quality at lower bit rates, which is particularly beneficial for streaming services and users with limited bandwidth.
To help customers deploy VP9 on increasingly popular Arm-based cloud servers, Arm, together with its partner VectorCamp, has contributed numerous open-source optimizations and achieved significant performance improvements. These improvements, made between 2021 and 2023, mainly come from leveraging Arm Neon technology, an advanced SIMD architecture extension for Arm processors.
Specifically, VectorCamp has contributed Neon implementations of the high bit-depth functions, starting with all the FDCT/FHT functions for sizes from 4x4 up to 32x32, and refactoring and simplifying the DCT code and its helper functions in the process. In addition, VectorCamp provided optimized versions of the temporal filter, color quantization, and diamond-search SAD, as well as non-SDOT implementations of the high bit-depth variance, subtract block, and SAD functions.
For standard bit-depth, Arm has contributed a series of Armv8.4-A DotProd Neon optimizations to accelerate the convolution, sum-of-absolute-differences (SAD), and variance functions. Arm has also improved the existing Armv8.0-A Neon implementations of SAD, SAD4D, variance, and sub-pixel variance, plus a host of helper functions for multi-vector reduction and matrix transposition. For Armv8.6-A, Arm contributed I8MM implementations of the convolution algorithms. Arm also added new Neon code for the mean-squared-error (MSE), block error, and intra-frame block predictor functions.
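As a rough illustration of why a dot-product instruction suits SAD, the sketch below emulates the idea in plain Python: take byte-wise absolute differences over 16 pixels, multiply-accumulate each group of four bytes against a vector of ones into four accumulator lanes, and add the lanes together at the end. This is a simplified model under stated assumptions, not the libvpx code; the function names and the 16-byte chunking are chosen here purely for illustration.

```python
def sad_scalar(src, ref):
    """Reference SAD: sum of absolute differences over two pixel sequences."""
    return sum(abs(s - r) for s, r in zip(src, ref))


def sad_dot_product_style(src, ref):
    """Emulate a dot-product based SAD over 16-byte chunks (illustrative only)."""
    if len(src) != len(ref) or len(src) % 16 != 0:
        raise ValueError("inputs must have equal length that is a multiple of 16")
    ones = [1] * 16       # all-ones vector used as the second dot-product operand
    lanes = [0, 0, 0, 0]  # four accumulator lanes, one per group of 4 bytes
    for base in range(0, len(src), 16):
        # Byte-wise absolute difference of 16 pixels.
        diff = [abs(src[base + i] - ref[base + i]) for i in range(16)]
        # Each lane accumulates the dot product of 4 adjacent bytes with ones.
        for lane in range(4):
            for k in range(4):
                idx = 4 * lane + k
                lanes[lane] += diff[idx] * ones[idx]
    return sum(lanes)     # final horizontal add across the lanes


# Deterministic sample data standing in for one row of 64 8-bit pixels.
src = [(7 * i) % 256 for i in range(64)]
ref = [(13 * i + 5) % 256 for i in range(64)]
print(sad_scalar(src, ref), sad_dot_product_style(src, ref))
```

Both functions return the same value. The second form shows how the widening and accumulation steps can be folded into a single multiply-accumulate per group of four bytes.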
For high bit-depth, Arm has optimized the SAD and SAD4D Neon paths and contributed new Neon paths for the Hadamard transform, minmax, averaging, SATD, block error, and intra-frame block predictor functions. Arm also ported its standard bit-depth sub-pixel variance optimizations to the high bit-depth paths.
A full list of contributions by Konstantinos Margaritis (VectorCamp), Jonathan Wright (Arm), Salomé Thirot (Arm), George Steed (Arm), Gerda Zsejke More (Arm), and their teams can be found on the libvpx log page.
To illustrate the performance uplift from these Neon optimizations, and the performance and TCO benefits of running VP9 on Arm servers, we conducted extensive experiments on AWS EC2 instances.
We measured performance on AWS C6g, C7g, and C6a 16xlarge instances, as listed in Table 1. C6g instances are based on AWS Graviton2, which uses Arm Neoverse N1 cores, while C7g instances are based on AWS Graviton3, which uses Arm Neoverse V1 cores. All instances run Ubuntu 22.04 and GCC 13.1. We compared the Original VP9 branch (commit 2eb934d, from 2021) with the Optimized branch (commit 60ee1b1, from 2023). The test videos are the 8-bit and 10-bit 4K Bosphorus sequences downloaded from Ultravideo and the 8-bit and 10-bit 1080P 4Ever sequences downloaded from the HEVC DASH data set.
Table 1: AWS VM Configuration and Pricing information
We conducted two sets of experiments. The first illustrates the performance improvement on Arm from our optimizations, and the second shows the performance advantage of Arm over competitive offerings. All results presented below are full-socket performance, with the same number of vCPUs (64) used on every instance. In all the figures, --cpu-used is the VP9 encoder parameter that trades encoding speed against video quality. A lower value provides higher video quality and is typically used in Video on Demand (VoD) scenarios, whereas a higher value provides higher encoding speed at the expense of some video quality or rate-control accuracy and is typically used for live streaming.
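The exact encoder command lines are not listed here, so the following is only a sketch of how the two operating points might be expressed with the vpxenc command-line tool. The helper name, the file names, the thread count, and the pairing of --good with VoD and --rt with live streaming are all assumptions made for illustration. Only --cpu-used and its values 4 and 8 come from the measurements reported in this post.

```python
def build_vpxenc_command(input_path, output_path, cpu_used, live=False, threads=64):
    """Build a vpxenc argument list for a VP9 encode (hypothetical helper)."""
    # Assumption: realtime deadline for live streaming, good-quality deadline for VoD.
    deadline = "--rt" if live else "--good"
    return [
        "vpxenc",
        "--codec=vp9",
        deadline,
        f"--cpu-used={cpu_used}",   # lower = higher quality, higher = faster
        f"--threads={threads}",     # assumption: one thread per vCPU
        "-o", output_path,
        input_path,
    ]


# Hypothetical file names; substitute real Y4M inputs.
vod_cmd = build_vpxenc_command("input_1080p.y4m", "vod.webm", cpu_used=4)
live_cmd = build_vpxenc_command("input_1080p.y4m", "live.webm", cpu_used=8, live=True)
print(" ".join(vod_cmd))
print(" ".join(live_cmd))
```

The argument lists can be passed to subprocess.run on a machine where vpxenc is installed. Keeping the command as a list avoids shell-quoting issues with file names.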
To measure the performance improvement of the latest VP9 commit, we benchmarked 8-bit 1080P and 4K videos. As shown in Figure 1, for 8-bit 1080P video, the Optimized version increases FPS over the Original version by 37.2% on C6g and by 37.8% on C7g. Similarly, as Figure 2 shows, FPS increases by 21.3% on C6g and 15.3% on C7g for 8-bit 4K input.
Figure 1: Performance Improvement for 8-Bit 1080P
Figure 2: Performance Improvement for 8-Bit 4K
For 10-bit inputs, the performance boost is more significant. For instance, for 10-bit 1080P video, FPS increases by 70.7% on C6g and 61.6% on C7g, as Figure 3 shows. For 10-bit 4K video, FPS increases from 5.12 to 12.39, or by 142.0%, on C6g, and from 8.95 to 18.61, or by 108.0%, on C7g.
Figure 3: Performance Improvement for 10-Bit Videos
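The uplift percentages quoted for 10-bit 4K video follow directly from the FPS values given in the text. The short sketch below recomputes them. The function name is chosen here for illustration. Because the quoted FPS values are already rounded to two decimals, the recomputed percentages can differ from the quoted ones in the last digit.

```python
def uplift_percent(baseline_fps, optimized_fps):
    """Relative FPS increase of the optimized build over the baseline, in percent."""
    if baseline_fps <= 0:
        raise ValueError("baseline_fps must be positive")
    return (optimized_fps / baseline_fps - 1.0) * 100.0


# 10-bit 4K FPS values (Original -> Optimized) as quoted in the text.
c6g_uplift = uplift_percent(5.12, 12.39)
c7g_uplift = uplift_percent(8.95, 18.61)
print(f"C6g: {c6g_uplift:.1f}%  C7g: {c7g_uplift:.1f}%")
```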
Overall, across all the cases above, the observed performance uplift ranges from 15.3% to 142.0%. This shows that the Neon optimizations contributed to VP9 by Arm and its partners boost performance significantly.
To compare full-socket performance on Arm and x86, we showcase the FPS on C6a (AMD Milan) and C7g (Arm Neoverse V1).
For 8-bit 1080P video, the Arm-based C7g achieves 30.6% higher FPS than the x86-based C6a when --cpu-used is 4, and 64.0% higher FPS when --cpu-used is 8, as Figure 4 shows. For 8-bit 4K video (Figure 5), FPS on C7g is higher than on C6a by 43.1% and 50.0% when --cpu-used is 4 and 8, respectively.
Figure 4: Performance comparison for 8-Bit 1080P Video
Figure 5: Performance comparison for 8-Bit 4K Video
For 10-bit videos, Arm shows a similarly large advantage. For instance, Figure 6 shows that the Arm-based C7g provides up to 63.2% and 42.4% higher FPS than the x86-based C6a for 1080P and 4K videos, respectively.
Figure 6: Competitive Analysis for 10-Bit Videos
Overall, across all the cases above, the performance benefit observed on Arm-based C7g EC2 instances ranges from 30.6% to 64.0% compared with x86-based C6a.
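A higher FPS figure can also be read as a shorter encode time for the same number of frames. The sketch below converts the quoted FPS advantages into an approximate wall-clock reduction, assuming throughput stays constant over the whole encode. The function name is illustrative, and the conversion is simple arithmetic rather than an additional measurement.

```python
def encode_time_reduction_percent(fps_gain_percent):
    """Percent less wall-clock time for a fixed frame count, given an FPS gain in percent."""
    speed_ratio = 1.0 + fps_gain_percent / 100.0  # e.g. 64.0% higher FPS -> 1.64x
    return (1.0 - 1.0 / speed_ratio) * 100.0


# Lower and upper ends of the C7g-over-C6a range quoted in the text.
low_end = encode_time_reduction_percent(30.6)
high_end = encode_time_reduction_percent(64.0)
print(f"{low_end:.1f}% to {high_end:.1f}% less encode time")
```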
Based on our evaluation, deploying VP9 on Arm platforms is very promising. Thanks to continuous optimization of VP9 over the past two years, the Optimized branch achieves a significant performance uplift, by up to 142.0% on Arm instances, compared with the Original branch, which predates these Neon optimizations. Compared with the x86-based C6a, the latest VP9 codec provides up to 64.0% higher FPS on the Arm-based C7g.
The 6th Annual Bitmovin Video Developer Report, Shaping the future of video 2022/2023, https://bitmovin.com/wp-content/uploads/2022/12/bitmovin-6th-video-developer-report-2022-2023.pdf
VP9 Codec: The Complete Guide to Google’s Open Source Video Codec, https://bitmovin.com/vp9-codec-status-quo/
Ultravideo, https://ultravideo.fi/#testsequences
Ultra High Definition HEVC DASH Data Set, https://download.tsi.telecom-paristech.fr/gpac/dataset/dash/uhd/
Arm NEON, https://www.arm.com/technologies/neon
VP9 Source Code, Original version, https://chromium.googlesource.com/webm/libvpx/+/2eb934d9c1fb4a460e3f03c8578b7b4f4f195784
VP9 Source Code, Optimized version, https://chromium.googlesource.com/webm/libvpx/+/60ee1b149bd3de8e53857ff5a9f7f19d9398144e
VectorCamp, https://www.vectorcamp.gr
DotProduct by Arm, https://developer.arm.com/documentation/102651/a/Use-case--improving-VP9-performance
Libvpx log, https://chromium.googlesource.com/webm/libvpx/+log